* [RFC PATCH 0/2] Improve reliability of CPU hotplug
@ 2012-01-11 10:11 Mel Gorman
  2012-01-11 10:11 ` [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Mel Gorman
  2012-01-11 10:11 ` [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Mel Gorman
  0 siblings, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2012-01-11 10:11 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel, LKML
  Cc: Andrew Morton, Peter Zijlstra, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen,
	Mel Gorman

Recent stress tests doing CPU online/offline in a loop revealed at
least two separate bugs. They result in CPUs either being deadlocked on
a spinlock or the machine halting entirely. My reproduction case used
a large number of simultaneous kernel compiles on an 8-core machine
while CPUs were continually being brought online and offline in a
loop.

This small series includes two patches that allow hotplug stress tests
to pass for me when applied to 3.2.  It does not claim to solve
all CPU hotplug problems; for example, the test configuration did
not have PREEMPT enabled.  There is no harm in eliminating at least
these bugs though.

Patch 1 looks at a sysfs dirent problem whereby under stress a dentry
	lock is taken twice. This is a sysfs-specific fix but a
	dcache-related fix also exists as an RFC.

Patch 2 notes that the page allocator is sending IPIs without calling
	get_online_cpus() to protect the online cpumask from changes.
	In low memory situations, this allows an IPI to be sent to a
	CPU going offline. This patch fixes drain_all_pages() and then
	changes the page allocator to only drain local lists with
	preemption disabled instead of sending an IPI, on the grounds
	that the IPI is costly while having only a marginal benefit.

 fs/sysfs/dir.c  |    4 ++--
 mm/page_alloc.c |   16 ++++++++++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex
  2012-01-11 10:11 [RFC PATCH 0/2] Improve reliability of CPU hotplug Mel Gorman
@ 2012-01-11 10:11 ` Mel Gorman
  2012-01-11 17:11   ` Eric W. Biederman
  2012-01-11 10:11 ` [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2012-01-11 10:11 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel, LKML
  Cc: Andrew Morton, Peter Zijlstra, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen,
	Mel Gorman

While running a CPU hotplug stress test under memory pressure, a
spinlock lockup was detected due to a dentry lock being recursively
taken.  Exactly when this happens varies considerably and the bug is
difficult to trigger.

[  482.345588] BUG: spinlock lockup on CPU#2, udevd/4400
[  482.345590]  lock: ffff8803075be0d0, .magic: dead4ead, .owner: udevd/5689, .owner_cpu: 0
[  482.345592] Pid: 4400, comm: udevd Not tainted 3.2.0-vanilla #1
[  482.345592] Call Trace:
[  482.345595]  [<ffffffff811e4ffd>] spin_dump+0x88/0x8d
[  482.345597]  [<ffffffff811e5186>] do_raw_spin_lock+0xd6/0xf9
[  482.345599]  [<ffffffff813454e1>] _raw_spin_lock+0x39/0x3d
[  482.345601]  [<ffffffff811396b6>] ? shrink_dcache_parent+0x77/0x28c
[  482.345603]  [<ffffffff811396b6>] shrink_dcache_parent+0x77/0x28c
[  482.345605]  [<ffffffff811373a9>] ? have_submounts+0x13e/0x1bd
[  482.345607]  [<ffffffff811858f8>] sysfs_dentry_revalidate+0xaa/0xbe
[  482.345608]  [<ffffffff8112e6bd>] do_lookup+0x263/0x2fc
[  482.345610]  [<ffffffff8119c99b>] ? security_inode_permission+0x1e/0x20
[  482.345612]  [<ffffffff8112f2c9>] link_path_walk+0x1e2/0x763
[  482.345614]  [<ffffffff8112fcf2>] path_lookupat+0x5c/0x61a
[  482.345616]  [<ffffffff810f479c>] ? might_fault+0x89/0x8d
[  482.345618]  [<ffffffff810f4753>] ? might_fault+0x40/0x8d
[  482.345619]  [<ffffffff811302da>] do_path_lookup+0x2a/0xa8
[  482.345621]  [<ffffffff811329dd>] user_path_at_empty+0x5d/0x97
[  482.345623]  [<ffffffff8107441b>] ? trace_hardirqs_off+0xd/0xf
[  482.345625]  [<ffffffff81345bcf>] ? _raw_spin_unlock_irqrestore+0x44/0x5a
[  482.345627]  [<ffffffff81132a28>] user_path_at+0x11/0x13
[  482.345629]  [<ffffffff81128af0>] vfs_fstatat+0x44/0x71
[  482.345631]  [<ffffffff81128b7b>] vfs_lstat+0x1e/0x20
[  482.345632]  [<ffffffff81128b9c>] sys_newlstat+0x1f/0x40
[  482.345634]  [<ffffffff81075944>] ? trace_hardirqs_on_caller+0x12d/0x164
[  482.345636]  [<ffffffff811e04fe>] ?  trace_hardirqs_on_thunk+0x3a/0x3f
[  482.345638]  [<ffffffff8107441b>] ? trace_hardirqs_off+0xd/0xf
[  482.345640]  [<ffffffff8134d002>] system_call_fastpath+0x16/0x1b
[  482.515004]  [<ffffffff8107441b>] ? trace_hardirqs_off+0xd/0xf
[  482.520870]  [<ffffffff8134d002>] system_call_fastpath+0x16/0x1b

At this point, CPU hotplug stops and other processes get stuck in a
similar deadlock waiting for 5689 to unlock. RCU reports stalls but
they are collateral damage.

The deadlocked processes have sysfs_dentry_revalidate() in
common. Miklos Szeredi explained at https://lkml.org/lkml/2012/1/9/114
that the deadlock happens within dcache if two processes call
shrink_dcache_parent() on the same dentry.
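
A rough sketch of the colliding call paths, reconstructed from the
trace above (the exact interleaving is an assumption):

	/*
	 * udevd/5689                      udevd/4400
	 *
	 * do_lookup()
	 *  sysfs_dentry_revalidate()
	 *   shrink_dcache_parent(dentry)
	 *    <takes dentry->d_lock,       do_lookup()
	 *     walks the subtree>           sysfs_dentry_revalidate()
	 *                                   shrink_dcache_parent(dentry)
	 *                                    spin_lock(&dentry->d_lock)
	 *                                    <spins forever; the spinlock
	 *                                     debug code reports the
	 *                                     lockup shown above>
	 */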

In Miklos's case, the problem is with the bonding driver but during
CPU online or offline, a number of dentries are being created and
deleted and this deadlock is also being hit. Looking at sysfs, there
is a global sysfs_mutex that protects the sysfs directory tree from
concurrent reclaims. Almost all operations involving directory inodes
and dentries take place under the sysfs_mutex - linking, unlinking,
path searching, lookup, renames and readdir. d_invalidate is slightly
different. It is mostly under the mutex but if the dentry has to be
removed from the dcache, the mutex is dropped.

Whereas Miklos' patch changes dcache, this patch changes sysfs to
consistently hold the mutex for dentry-related operations. Once
applied, this particular bug with CPU hotadd/hotremove no longer
occurs.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/sysfs/dir.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 7fdf6a7..acaf21d 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -279,8 +279,8 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd)
 	if (strcmp(dentry->d_name.name, sd->s_name) != 0)
 		goto out_bad;
 
-	mutex_unlock(&sysfs_mutex);
 out_valid:
+	mutex_unlock(&sysfs_mutex);
 	return 1;
 out_bad:
 	/* Remove the dentry from the dcache hashes.
@@ -294,7 +294,6 @@ out_bad:
 	 * to the dcache hashes.
 	 */
 	is_dir = (sysfs_type(sd) == SYSFS_DIR);
-	mutex_unlock(&sysfs_mutex);
 	if (is_dir) {
 		/* If we have submounts we must allow the vfs caches
 		 * to lie about the state of the filesystem to prevent
@@ -305,6 +304,7 @@ out_bad:
 		shrink_dcache_parent(dentry);
 	}
 	d_drop(dentry);
+	mutex_unlock(&sysfs_mutex);
 	return 0;
 }
 
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-11 10:11 [RFC PATCH 0/2] Improve reliability of CPU hotplug Mel Gorman
  2012-01-11 10:11 ` [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Mel Gorman
@ 2012-01-11 10:11 ` Mel Gorman
  2012-01-12 14:51   ` Gilad Ben-Yossef
  2012-01-12 15:18   ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2012-01-11 10:11 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel, LKML
  Cc: Andrew Morton, Peter Zijlstra, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen,
	Mel Gorman

While running a CPU hotplug stress test under memory pressure, it
was observed that the machine would halt with no messages logged to
console. This is difficult to trigger and required a machine with 8
cores and plenty of memory.

Part of the problem is the page allocator is sending IPIs using
on_each_cpu() without calling get_online_cpus() to prevent changes
to the online cpumask. This allows IPIs to be sent to CPUs that
are going offline or offline already. At least one bug report has
been seen on ppc64 against a 3.0 era kernel that looked like a CPU
receiving interrupts while being offlined.

This patch starts by adding a call to get_online_cpus() to
drain_all_pages() to make it safe versus CPU hotplug.

In the context of the page allocator, this causes a problem. It is
possible that kthreadd blocks on the cpu_hotplug mutex while another
process already holding the mutex is blocked waiting for kthreadd
to make forward progress, leading to deadlock. Additionally, it is
important that cpu_hotplug mutex does not become a new hot lock while
under pressure. There is also the consideration that CPU hotplug
expects that get_online_cpus() is not called frequently as it can
lead to livelock in exceptional circumstances (see comment above
cpu_hotplug_begin()).
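
Sketched out, the cycle looks something like this (an assumed
reconstruction of the scenario described above, not an observed trace):

	/*
	 * kthreadd:   allocates memory for a new thread, enters direct
	 *             reclaim, calls drain_all_pages() which now takes
	 *             get_online_cpus() -> blocks on the cpu_hotplug mutex
	 *
	 * cpu_down(): holds the cpu_hotplug mutex via cpu_hotplug_begin()
	 *             and waits for kthreadd to create a thread on its
	 *             behalf -> neither side makes forward progress
	 */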

Rather than making it safe to call get_online_cpus() from the page
allocator, this patch simply removes the page allocator call to
drain_all_pages(). To avoid impacting high-order allocation success
rates, it still drains the local per-cpu lists for high-order
allocations that failed. As a side effect, this reduces the number
of IPIs sent during low memory situations.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b8ba3a..b6df6fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg)
  */
 void drain_all_pages(void)
 {
+	get_online_cpus();
 	on_each_cpu(drain_local_pages, NULL, 1);
+	put_online_cpus();
 }
 
 #ifdef CONFIG_HIBERNATION
@@ -1982,11 +1984,17 @@ retry:
 					migratetype);
 
 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, there is a
+	 * possibility that it is because the necessary buddies have been
+	 * freed to the per-cpu list. Drain the local list and try again.
+	 * drain_all_pages is not used because it is unsafe to call
+	 * get_online_cpus from this context as it is possible that kthreadd
+	 * would block during thread creation and the cost of sending storms
+	 * of IPIs in low memory conditions is quite high.
 	 */
-	if (!page && !drained) {
-		drain_all_pages();
+	if (!page && order && !drained) {
+		drain_pages(get_cpu());
+		put_cpu();
 		drained = true;
 		goto retry;
 	}
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex
  2012-01-11 10:11 ` [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Mel Gorman
@ 2012-01-11 17:11   ` Eric W. Biederman
  2012-01-11 18:07     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2012-01-11 17:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Peter Zijlstra,
	Srivatsa S. Bhat, Russell King - ARM Linux, Gilad Ben-Yossef,
	Paul E. McKenney, Miklos Szeredi, Greg KH, Gong Chen

Mel Gorman <mgorman@suse.de> writes:

> While running a CPU hotplug stress test under memory pressure, a
> spinlock lockup was detected due to a dentry lock being recursively
> taken.  Exactly when this happens varies considerably and the bug is
> difficult to trigger.
>
> [  482.345588] BUG: spinlock lockup on CPU#2, udevd/4400
> [  482.345590]  lock: ffff8803075be0d0, .magic: dead4ead, .owner: udevd/5689, .owner_cpu: 0
> [  482.345592] Pid: 4400, comm: udevd Not tainted 3.2.0-vanilla #1
> [  482.345592] Call Trace:
> [  482.345595]  [<ffffffff811e4ffd>] spin_dump+0x88/0x8d
> [  482.345597]  [<ffffffff811e5186>] do_raw_spin_lock+0xd6/0xf9
> [  482.345599]  [<ffffffff813454e1>] _raw_spin_lock+0x39/0x3d
> [  482.345601]  [<ffffffff811396b6>] ? shrink_dcache_parent+0x77/0x28c
> [  482.345603]  [<ffffffff811396b6>] shrink_dcache_parent+0x77/0x28c
> [  482.345605]  [<ffffffff811373a9>] ? have_submounts+0x13e/0x1bd
> [  482.345607]  [<ffffffff811858f8>] sysfs_dentry_revalidate+0xaa/0xbe
> [  482.345608]  [<ffffffff8112e6bd>] do_lookup+0x263/0x2fc
> [  482.345610]  [<ffffffff8119c99b>] ? security_inode_permission+0x1e/0x20
> [  482.345612]  [<ffffffff8112f2c9>] link_path_walk+0x1e2/0x763
> [  482.345614]  [<ffffffff8112fcf2>] path_lookupat+0x5c/0x61a
> [  482.345616]  [<ffffffff810f479c>] ? might_fault+0x89/0x8d
> [  482.345618]  [<ffffffff810f4753>] ? might_fault+0x40/0x8d
> [  482.345619]  [<ffffffff811302da>] do_path_lookup+0x2a/0xa8
> [  482.345621]  [<ffffffff811329dd>] user_path_at_empty+0x5d/0x97
> [  482.345623]  [<ffffffff8107441b>] ? trace_hardirqs_off+0xd/0xf
> [  482.345625]  [<ffffffff81345bcf>] ? _raw_spin_unlock_irqrestore+0x44/0x5a
> [  482.345627]  [<ffffffff81132a28>] user_path_at+0x11/0x13
> [  482.345629]  [<ffffffff81128af0>] vfs_fstatat+0x44/0x71
> [  482.345631]  [<ffffffff81128b7b>] vfs_lstat+0x1e/0x20
> [  482.345632]  [<ffffffff81128b9c>] sys_newlstat+0x1f/0x40
> [  482.345634]  [<ffffffff81075944>] ? trace_hardirqs_on_caller+0x12d/0x164
> [  482.345636]  [<ffffffff811e04fe>] ?  trace_hardirqs_on_thunk+0x3a/0x3f
> [  482.345638]  [<ffffffff8107441b>] ? trace_hardirqs_off+0xd/0xf
> [  482.345640]  [<ffffffff8134d002>] system_call_fastpath+0x16/0x1b
> [  482.515004]  [<ffffffff8107441b>] ? trace_hardirqs_off+0xd/0xf
> [  482.520870]  [<ffffffff8134d002>] system_call_fastpath+0x16/0x1b
>
> At this point, CPU hotplug stops and other processes get stuck in a
> similar deadlock waiting for 5689 to unlock. RCU reports stalls but
> they are collateral damage.
>
> The deadlocked processes have sysfs_dentry_revalidate() in
> common. Miklos Szeredi explained at https://lkml.org/lkml/2012/1/9/114
> that the deadlock happens within dcache if two processes call
> shrink_dcache_parent() on the same dentry.
>
> In Miklos's case, the problem is with the bonding driver but during
> CPU online or offline, a number of dentries are being created and
> deleted and this deadlock is also being hit. Looking at sysfs, there
> is a global sysfs_mutex that protects the sysfs directory tree from
> concurrent reclaims. Almost all operations involving directory inodes
> and dentries take place under the sysfs_mutex - linking, unlinking,
> path searching, lookup, renames and readdir. d_invalidate is slightly
> different. It is mostly under the mutex but if the dentry has to be
> removed from the dcache, the mutex is dropped.

The sysfs_mutex protects the sysfs data structures not the vfs.

> Whereas Miklos' patch changes dcache, this patch changes sysfs to
> consistently hold the mutex for dentry-related operations. Once
> applied, this particular bug with CPU hotadd/hotremove no longer
> occurs.

After taking a quick skim over the code to reacquaint myself with
it, it appears that the usage in sysfs is idiomatic.  That is, sysfs
uses shrink_dcache_parent without a lock and in a context where
the right race could trigger this deadlock.

And in particular I expect you could trigger the same deadlock in
proc, nfs, and gfs2 if you can get the timing right.

I don't think adding a work-around for the bug in shrink_dcache_parent
is going to do anything except hide the bug in the VFS, and
unnecessarily increase the sysfs_mutex hold times.

I may be blind but I don't see a reason at this point to rush out an
incomplete work-around for the bug in shrink_dcache_parent instead of
actually fixing shrink_dcache_parent.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex
  2012-01-11 17:11   ` Eric W. Biederman
@ 2012-01-11 18:07     ` Mel Gorman
  2012-01-11 19:02       ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2012-01-11 18:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Peter Zijlstra,
	Srivatsa S. Bhat, Russell King - ARM Linux, Gilad Ben-Yossef,
	Paul E. McKenney, Miklos Szeredi, Greg KH, Gong Chen

On Wed, Jan 11, 2012 at 09:11:27AM -0800, Eric W. Biederman wrote:
> > In Miklos's case, the problem is with the bonding driver but during
> > CPU online or offline, a number of dentries are being created and
> > deleted and this deadlock is also being hit. Looking at sysfs, there
> > is a global sysfs_mutex that protects the sysfs directory tree from
> > concurrent reclaims. Almost all operations involving directory inodes
> > and dentries take place under the sysfs_mutex - linking, unlinking,
> > path searching, lookup, renames and readdir. d_invalidate is slightly
> > different. It is mostly under the mutex but if the dentry has to be
> > removed from the dcache, the mutex is dropped.
> 
> The sysfs_mutex protects the sysfs data structures not the vfs.
> 

Ok.

> > Whereas Miklos' patch changes dcache, this patch changes sysfs to
> > consistently hold the mutex for dentry-related operations. Once
> > applied, this particular bug with CPU hotadd/hotremove no longer
> > occurs.
> 
> After taking a quick skim over the code to reacquaint myself with
> it, it appears that the usage in sysfs is idiomatic.  That is, sysfs
> uses shrink_dcache_parent without a lock and in a context where
> the right race could trigger this deadlock.
> 

Yes.

> And in particular I expect you could trigger the same deadlock in
> proc, nfs, and gfs2 if you can get the timing right.
> 

Agreed. When the dcache-specific fix was being discussed on an external
bugzilla, this came up. It's probably easiest to race in sysfs because
it's possible to create/delete directories faster than is possible
for proc, nfs or gfs2.

> I don't think adding a work-around for the bug in shrink_dcache_parent
> is going to do anything except hide the bug in the VFS, and
> unnecessarily increase the sysfs_mutex hold times.
> 

Ok.

> I may be blind but I don't see a reason at this point to rush out an
> incomplete work-around for the bug in shrink_dcache_parent instead of
> actually fixing shrink_dcache_parent.
> 

Since I wrote this patch, the dcache specific fix was finished, merged
and I expect it'll make it to stable. Assuming that happens, this patch
will no longer be required.

Thanks Eric.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex
  2012-01-11 18:07     ` Mel Gorman
@ 2012-01-11 19:02       ` Eric W. Biederman
  0 siblings, 0 replies; 21+ messages in thread
From: Eric W. Biederman @ 2012-01-11 19:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Peter Zijlstra,
	Srivatsa S. Bhat, Russell King - ARM Linux, Gilad Ben-Yossef,
	Paul E. McKenney, Miklos Szeredi, Greg KH, Gong Chen

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Jan 11, 2012 at 09:11:27AM -0800, Eric W. Biederman wrote:
>> > In Miklos's case, the problem is with the bonding driver but during
>> > CPU online or offline, a number of dentries are being created and
>> > deleted and this deadlock is also being hit. Looking at sysfs, there
>> > is a global sysfs_mutex that protects the sysfs directory tree from
>> > concurrent reclaims. Almost all operations involving directory inodes
>> > and dentries take place under the sysfs_mutex - linking, unlinking,
>> > path searching, lookup, renames and readdir. d_invalidate is slightly
>> > different. It is mostly under the mutex but if the dentry has to be
>> > removed from the dcache, the mutex is dropped.
>> 
>> The sysfs_mutex protects the sysfs data structures not the vfs.
>> 
>
> Ok.
>
>> > Whereas Miklos' patch changes dcache, this patch changes sysfs to
>> > consistently hold the mutex for dentry-related operations. Once
>> > applied, this particular bug with CPU hotadd/hotremove no longer
>> > occurs.
>> 
>> After taking a quick skim over the code to reacquaint myself with
>> it, it appears that the usage in sysfs is idiomatic.  That is, sysfs
>> uses shrink_dcache_parent without a lock and in a context where
>> the right race could trigger this deadlock.
>> 
>
> Yes.
>
>> And in particular I expect you could trigger the same deadlock in
>> proc, nfs, and gfs2 if you can get the timing right.
>> 
>
> Agreed. When the dcache-specific fix was being discussed on an external
> bugzilla, this came up. It's probably easiest to race in sysfs because
> it's possible to create/delete directories faster than is possible
> for proc, nfs or gfs2.

I expect we see the race in sysfs because of uevents that get triggered
on hotplug.  So a lot is occurring around the time of the race.  You can
get to shrink_dcache_parent with fork/exit in proc which is a lot easier
to trigger.  But usually in fork/exec you don't have the dentries cached...

> Since I wrote this patch, the dcache specific fix was finished, merged
> and I expect it'll make it to stable. Assuming that happens, this patch
> will no longer be required.

Sounds good.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-11 10:11 ` [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Mel Gorman
@ 2012-01-12 14:51   ` Gilad Ben-Yossef
  2012-01-12 15:08     ` Peter Zijlstra
  2012-01-12 15:08     ` Gilad Ben-Yossef
  2012-01-12 15:18   ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-12 14:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Peter Zijlstra,
	Srivatsa S. Bhat, Russell King - ARM Linux, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Wed, Jan 11, 2012 at 12:11 PM, Mel Gorman <mgorman@suse.de> wrote:
<SNIP>
> Rather than making it safe to call get_online_cpus() from the page
> allocator, this patch simply removes the page allocator call to
> drain_all_pages(). To avoid impacting high-order allocation success
> rates, it still drains the local per-cpu lists for high-order
> allocations that failed. As a side effect, this reduces the number
> of IPIs sent during low memory situations.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/page_alloc.c |   16 ++++++++++++----
>  1 files changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2b8ba3a..b6df6fc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg)
>  */
>  void drain_all_pages(void)
>  {
> +       get_online_cpus();
>        on_each_cpu(drain_local_pages, NULL, 1);
> +       put_online_cpus();
>  }
>
>  #ifdef CONFIG_HIBERNATION
> @@ -1982,11 +1984,17 @@ retry:
>                                        migratetype);
>
>        /*
> -        * If an allocation failed after direct reclaim, it could be because
> -        * pages are pinned on the per-cpu lists. Drain them and try again
> +        * If a high-order allocation failed after direct reclaim, there is a
> +        * possibility that it is because the necessary buddies have been
> +        * freed to the per-cpu list. Drain the local list and try again.
> +        * drain_all_pages is not used because it is unsafe to call
> +        * get_online_cpus from this context as it is possible that kthreadd
> +        * would block during thread creation and the cost of sending storms
> +        * of IPIs in low memory conditions is quite high.
>         */
> -       if (!page && !drained) {
> -               drain_all_pages();
> +       if (!page && order && !drained) {
> +               drain_pages(get_cpu());
> +               put_cpu();
>                drained = true;
>                goto retry;
>        }
> --
> 1.7.3.4
>

I very much like the judo-like quality of relying on the fact that in
memory pressure conditions most of the cpus will end up in the direct
reclaim path to drain them all without IPIs.

What I can't figure out is why we don't need a get/put_online_cpus()
pair around each and every call to on_each_cpu() everywhere?  And if
we do, perhaps making it a part of on_each_cpu() is the way to go?

Something like:

diff --git a/kernel/smp.c b/kernel/smp.c
index f66a1b2..cfa3882 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void *info, int wait)
 {
 	unsigned long flags;

+	BUG_ON(in_atomic());
+
+	get_online_cpus();
 	preempt_disable();
 	smp_call_function(func, info, wait);
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
 	preempt_enable();
+	put_online_cpus();
 }
 EXPORT_SYMBOL(on_each_cpu);

Does that make sense?

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 14:51   ` Gilad Ben-Yossef
@ 2012-01-12 15:08     ` Peter Zijlstra
  2012-01-12 15:13       ` Gilad Ben-Yossef
  2012-01-12 15:08     ` Gilad Ben-Yossef
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2012-01-12 15:08 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Mel Gorman, Linux-MM, Linux-FSDevel, LKML, Andrew Morton,
	Srivatsa S. Bhat, Russell King - ARM Linux, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, 2012-01-12 at 16:51 +0200, Gilad Ben-Yossef wrote:
> What I can't figure out is why we don't need a get/put_online_cpus()
> pair around each and every call to on_each_cpu() everywhere?  And if
> we do, perhaps making it a part of on_each_cpu() is the way to go?
> 
> Something like:
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f66a1b2..cfa3882 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void *info, int wait)
>  {
>         unsigned long flags;
> 
> +       BUG_ON(in_atomic());
> +
> +       get_online_cpus();
>         preempt_disable();

Your preempt_disable() here serializes against hotplug..

>         smp_call_function(func, info, wait);
>         local_irq_save(flags);
>         func(info);
>         local_irq_restore(flags);
>         preempt_enable();
> +       put_online_cpus();
>  }
>  EXPORT_SYMBOL(on_each_cpu); 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 14:51   ` Gilad Ben-Yossef
  2012-01-12 15:08     ` Peter Zijlstra
@ 2012-01-12 15:08     ` Gilad Ben-Yossef
  1 sibling, 0 replies; 21+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-12 15:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Peter Zijlstra,
	Srivatsa S. Bhat, Russell King - ARM Linux, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, Jan 12, 2012 at 4:51 PM, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> On Wed, Jan 11, 2012 at 12:11 PM, Mel Gorman <mgorman@suse.de> wrote:
> <SNIP>
>> Rather than making it safe to call get_online_cpus() from the page
>> allocator, this patch simply removes the page allocator call to
>> drain_all_pages(). To avoid impacting high-order allocation success
>> rates, it still drains the local per-cpu lists for high-order
>> allocations that failed. As a side effect, this reduces the number
>> of IPIs sent during low memory situations.
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> ---
>>  mm/page_alloc.c |   16 ++++++++++++----
>>  1 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 2b8ba3a..b6df6fc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg)
>>  */
>>  void drain_all_pages(void)
>>  {
>> +       get_online_cpus();
>>        on_each_cpu(drain_local_pages, NULL, 1);
>> +       put_online_cpus();
>>  }
>>
>>  #ifdef CONFIG_HIBERNATION
>> @@ -1982,11 +1984,17 @@ retry:
>>                                        migratetype);
>>
>>        /*
>> -        * If an allocation failed after direct reclaim, it could be because
>> -        * pages are pinned on the per-cpu lists. Drain them and try again
>> +        * If a high-order allocation failed after direct reclaim, there is a
>> +        * possibility that it is because the necessary buddies have been
>> +        * freed to the per-cpu list. Drain the local list and try again.
>> +        * drain_all_pages is not used because it is unsafe to call
>> +        * get_online_cpus from this context as it is possible that kthreadd
>> +        * would block during thread creation and the cost of sending storms
>> +        * of IPIs in low memory conditions is quite high.
>>         */
>> -       if (!page && !drained) {
>> -               drain_all_pages();
>> +       if (!page && order && !drained) {
>> +               drain_pages(get_cpu());
>> +               put_cpu();
>>                drained = true;
>>                goto retry;
>>        }
>> --
>> 1.7.3.4
>>
>
> I very much like the judo-like quality of relying on the fact that in
> memory pressure conditions most of the cpus will end up in the direct
> reclaim path to drain them all without IPIs.
>
> What I can't figure out is why we don't need a get/put_online_cpus()
> pair around each and every call to on_each_cpu() everywhere?  And if
> we do, perhaps making it a part of on_each_cpu() is the way to go?
>
> Something like:
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f66a1b2..cfa3882 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void *info, int wait)
>  {
>        unsigned long flags;
>
> +       BUG_ON(in_atomic());
> +
> +       get_online_cpus();
>        preempt_disable();
>        smp_call_function(func, info, wait);
>        local_irq_save(flags);
>        func(info);
>        local_irq_restore(flags);
>        preempt_enable();
> +       put_online_cpus();
>  }
>  EXPORT_SYMBOL(on_each_cpu);
>
> Does that make sense?

Well, it dies on boot (after adding the needed include file), so it's
obviously wrong, but I'm still guessing other users of on_each_cpu
will need a get/put_online_cpus() wrapper too.

Maybe -

on_each_cpu(...)
{
  get_online_cpus();
  __on_each_cpu(...);
  put_online_cpus();
}

We'll need to audit all callers.

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 15:08     ` Peter Zijlstra
@ 2012-01-12 15:13       ` Gilad Ben-Yossef
  0 siblings, 0 replies; 21+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-12 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linux-MM, Linux-FSDevel, LKML, Andrew Morton,
	Srivatsa S. Bhat, Russell King - ARM Linux, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, Jan 12, 2012 at 5:08 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2012-01-12 at 16:51 +0200, Gilad Ben-Yossef wrote:
>> What I can't figure out is why we don't need a get/put_online_cpus()
>> pair around each and every call to on_each_cpu() everywhere?  And if
>> we do, perhaps making it a part of on_each_cpu() is the way to go?
>>
>> Something like:
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index f66a1b2..cfa3882 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void *info, int wait)
>>  {
>>         unsigned long flags;
>>
>> +       BUG_ON(in_atomic());
>> +
>> +       get_online_cpus();
>>         preempt_disable();
>
> Your preempt_disable() here serializes against hotplug..

I'm probably daft, but why didn't it work for the page allocator then?

Mel's description reads: "Part of the problem is the page allocator
is sending IPIs using on_each_cpu() without calling get_online_cpus()
to prevent changes to the online cpumask."

on_each_cpu() disables preemption in master as well...

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-11 10:11 ` [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Mel Gorman
  2012-01-12 14:51   ` Gilad Ben-Yossef
@ 2012-01-12 15:18   ` Peter Zijlstra
  2012-01-12 15:37     ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2012-01-12 15:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote:
> At least one bug report has
> been seen on ppc64 against a 3.0 era kernel that looked like a CPU
> receiving interrupts while being offlined.

Got details on that Mel? The preempt_disable() in on_each_cpu() should
serialize against the stop_machine() crap in unplug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 15:18   ` Peter Zijlstra
@ 2012-01-12 15:37     ` Mel Gorman
  2012-01-12 15:52       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2012-01-12 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote:
> On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote:
> > At least one bug report has
> > been seen on ppc64 against a 3.0 era kernel that looked like a CPU
> > receiving interrupts while being offlined.
> 
> Got details on that Mel? The preempt_disable() in on_each_cpu() should
> serialize against the stop_machine() crap in unplug.

I might have added 2 and 2 together and got 5.

The stack trace clearly was while sending IPIs in on_each_cpu() and
always when under memory pressure and stuck in direct reclaim. This was
on !PREEMPT kernels where preempt_disable() is a no-op. That is why I
thought get_online_cpus() would be necessary.
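
For reference, this is roughly what the 3.2-era definitions reduce to
(paraphrased from include/linux/preempt.h; details are approximate):

	#ifdef CONFIG_PREEMPT_COUNT
	#define preempt_disable() \
	do { \
		inc_preempt_count(); \
		barrier(); \
	} while (0)
	#else
	/* !PREEMPT: nothing here stops a concurrent cpu_down() */
	#define preempt_disable()	do { } while (0)
	#endif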

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 15:37     ` Mel Gorman
@ 2012-01-12 15:52       ` Peter Zijlstra
  2012-01-12 17:18         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2012-01-12 15:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote:
> On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote:
> > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote:
> > > At least one bug report has
> > > been seen on ppc64 against a 3.0 era kernel that looked like a CPU
> > > receiving interrupts while being offlined.
> > 
> > Got details on that Mel? The preempt_disable() in on_each_cpu() should
> > serialize against the stop_machine() crap in unplug.
> 
> I might have added 2 and 2 together and got 5.
> 
> The stack trace clearly was while sending IPIs in on_each_cpu() and
> always when under memory pressure and stuck in direct reclaim. This was
> on !PREEMPT kernels where preempt_disable() is a no-op. That is why I
> thought get_online_cpus() would be necessary.

For non-preempt the required scheduling of stop_machine() will have to
wait even longer. Still there might be something funny, some of the
hotplug notifiers are run before the stop_machine thing does its thing
so there might be some fun interaction.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 15:52       ` Peter Zijlstra
@ 2012-01-12 17:18         ` Mel Gorman
  2012-01-12 19:14           ` Gilad Ben-Yossef
                             ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Mel Gorman @ 2012-01-12 17:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-FSDevel, LKML, Andrew Morton, Srivatsa S. Bhat,
	Russell King - ARM Linux, Gilad Ben-Yossef, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote:
> > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote:
> > > > At least one bug report has
> > > > been seen on ppc64 against a 3.0 era kernel that looked like a CPU
> > > > receiving interrupts while being offlined.
> > > 
> > > Got details on that Mel? The preempt_disable() in on_each_cpu() should
> > > serialize against the stop_machine() crap in unplug.
> > 
> > I might have added 2 and 2 together and got 5.
> > 
> > The stack trace clearly was while sending IPIs in on_each_cpu() and
> > always when under memory pressure and stuck in direct reclaim. This was
> > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I
> > thought get_online_cpus() would be necessary.
> 
> For non-preempt the required scheduling of stop_machine() will have to
> wait even longer. Still there might be something funny, some of the
> hotplug notifiers are run before the stop_machine thing does its thing
> so there might be some fun interaction.

Ok, how about this as a replacement patch?

---8<---
From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context

While running a CPU hotplug stress test under memory pressure, it
was observed that the machine would halt with no messages logged
to console. This is difficult to trigger and required a machine
with 8 cores and plenty of memory. In at least one case on ppc64,
the warning in include/linux/cpumask.h:107 triggered implying that
IPIs are being sent to offline CPUs in some cases.

A suspicious part of the problem is that the page allocator is sending
IPIs using on_each_cpu() without calling get_online_cpus() to prevent
changes to the online cpumask. It is depending on preemption being
disabled to protect it which is a no-op on !PREEMPT kernels. This means
that a thread can be reading the mask in smp_call_function_many() when
an attempt is made to take a CPU offline. The expectation is that this
is not a problem because the reader of the online mask will prevent
the stop_machine() used during CPU hotplug from making forward
progress, but depending on that is unhelpful.

On the other side, the mask can also be read while the CPU is being
brought online. In this case it is the responsibility of the
architecture that the CPU is able to receive and handle interrupts
before being marked active but that does not mean they always get it
right.

Sending excessive IPIs from the page allocator is a bad idea. In low
memory situations, a large number of processes can drain the per-cpu
lists at the same time, in quick succession and on many CPUs which is
pointless. In light of this and the unspecific CPU hotplug concerns,
this patch removes the call drain_all_pages() after failing direct
reclaim. To avoid impacting high-order allocation success rates,
it still drains the local per-cpu lists for high-order allocations
that failed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b8ba3a..63ea182 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1982,11 +1982,13 @@ retry:
 					migratetype);
 
 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, there is a
+	 * possibility that it is because the necessary buddies have been
+	 * freed to the per-cpu list. Drain the local list and try again.
 	 */
-	if (!page && !drained) {
-		drain_all_pages();
+	if (!page && order && !drained) {
+		drain_pages(get_cpu());
+		put_cpu();
 		drained = true;
 		goto retry;
 	}

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 17:18         ` Mel Gorman
@ 2012-01-12 19:14           ` Gilad Ben-Yossef
  2012-01-13 20:58           ` Milton Miller
  2012-01-13 20:58           ` Milton Miller
  2 siblings, 0 replies; 21+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-12 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Linux-MM, Linux-FSDevel, LKML, Andrew Morton,
	Srivatsa S. Bhat, Russell King - ARM Linux, Paul E. McKenney,
	Miklos Szeredi, Eric W. Biederman, Greg KH, Gong Chen

On Thu, Jan 12, 2012 at 7:18 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote:
>> On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote:
>> > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote:
>> > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote:
>> > > > At least one bug report has
>> > > > been seen on ppc64 against a 3.0 era kernel that looked like a CPU
>> > > > receiving interrupts while being offlined.
>> > >
>> > > Got details on that Mel? The preempt_disable() in on_each_cpu() should
>> > > serialize against the stop_machine() crap in unplug.
>> >
>> > I might have added 2 and 2 together and got 5.
>> >
>> > The stack trace clearly was while sending IPIs in on_each_cpu() and
>> > always when under memory pressure and stuck in direct reclaim. This was
>> > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I
>> > thought get_online_cpus() would be necessary.
>>
>> For non-preempt the required scheduling of stop_machine() will have to
>> wait even longer. Still there might be something funny, some of the
>> hotplug notifiers are run before the stop_machine thing does its thing
>> so there might be some fun interaction.
>
> Ok, how about this as a replacement patch?
>
> ---8<---
> From: Mel Gorman <mgorman@suse.de>
> Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
>
> While running a CPU hotplug stress test under memory pressure, it
> was observed that the machine would halt with no messages logged
> to console. This is difficult to trigger and required a machine
> with 8 cores and plenty of memory. In at least one case on ppc64,
> the warning in include/linux/cpumask.h:107 triggered implying that
> IPIs are being sent to offline CPUs in some cases.
>
> A suspicious part of the problem is that the page allocator is sending
> IPIs using on_each_cpu() without calling get_online_cpus() to prevent
> changes to the online cpumask. It is depending on preemption being
> disabled to protect it which is a no-op on !PREEMPT kernels. This means
> that a thread can be reading the mask in smp_call_function_many() when
> an attempt is made to take a CPU offline. The expectation is that this
> is not a problem because the reader of the online mask will prevent
> the stop_machine() used during CPU hotplug from making forward
> progress, but depending on that is unhelpful.
>
> On the other side, the mask can also be read while the CPU is being
> brought online. In this case it is the responsibility of the
> architecture that the CPU is able to receive and handle interrupts
> before being marked active but that does not mean they always get it
> right.
>
> Sending excessive IPIs from the page allocator is a bad idea. In low
> memory situations, a large number of processes can drain the per-cpu
> lists at the same time, in quick succession and on many CPUs which is
> pointless. In light of this and the unspecific CPU hotplug concerns,
> this patch removes the call drain_all_pages() after failing direct
> reclaim. To avoid impacting high-order allocation success rates,
> it still drains the local per-cpu lists for high-order allocations
> that failed.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/page_alloc.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2b8ba3a..63ea182 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1982,11 +1982,13 @@ retry:
>                                        migratetype);
>
>        /*
> -        * If an allocation failed after direct reclaim, it could be because
> -        * pages are pinned on the per-cpu lists. Drain them and try again
> +        * If a high-order allocation failed after direct reclaim, there is a
> +        * possibility that it is because the necessary buddies have been
> +        * freed to the per-cpu list. Drain the local list and try again.
>         */
> -       if (!page && !drained) {
> -               drain_all_pages();
> +       if (!page && order && !drained) {
> +               drain_pages(get_cpu());
> +               put_cpu();
>                drained = true;
>                goto retry;
>        }



I like the patch, think it is better than the current code and want it to go in.

I also think there are still some problems with IPIs somewhere that
cause some corruption when a lot of IPIs are sent, and that the patch
simply lowered the very big number of IPIs that are sent via the
direct reclaim code path, so the problem is hidden, not solved, by
this patch.

I've seen something related when trying to test the IPI reduction
patches. Interestingly enough it was not related to CPU hotplug at all -
When a lot of IPIs are being sent, I sometimes got an assert from low
level platform code that I'm trying to send IPIs with an empty mask
although the mask was NOT empty. I didn't manage to debug it then but
I did manage to recreate it quite easily.

I will see if I can recreate it with recent master and report.

-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 17:18         ` Mel Gorman
  2012-01-12 19:14           ` Gilad Ben-Yossef
@ 2012-01-13 20:58           ` Milton Miller
  2012-01-15 13:53             ` Gilad Ben-Yossef
  2012-01-13 20:58           ` Milton Miller
  2 siblings, 1 reply; 21+ messages in thread
From: Milton Miller @ 2012-01-13 20:58 UTC (permalink / raw)
  To: Gilad Ben-Yossef, Mel Gorman; +Cc: linux-kernel, Peter Zijlstra

On Thu, 12 Jan 2012 about 14:14:57 -0500 (EST), Gilad Ben-Yossef wrote:
> I also think there are still some problems with IPIs somewhere that
> cause some corruption when a lot of IPIs are sent, and that the patch
> simply lowered the very big number of IPIs that are sent via the
> direct reclaim code path, so the problem is hidden, not solved, by
> this patch.
> 
> I've seen something related when trying to test the IPI reduction
> patches. Interestingly enough it was not related to CPU hotplug at all -

> When a lot of IPIs are being sent, I sometimes got an assert from low
> level platform code that I'm trying to send IPIs with an empty mask
> although the mask was NOT empty. I didn't manage to debug it then but
> I did manage to recreate it quite easily.

That is a known scenario and an expected race; the check must be removed
from the platform code.

> 
> I will see if I can recreate it with recent master and report.
> 

The code in kernel/smp checks the mask, and looks for opportunities to
convert smp_call_function_many to call_function_single.  But it now
tolerates callers passing a mask that other cpus are changing.  So
while it tries to make sure the mask has > 1 cpu, it's not guaranteed.
But that is not what causes the assert to fire.  

There are a few opportunities for the architecture code to detect an
empty mask.  The first case is a cpu started processing the list due
to an interrupt generated by a cpu other than the requester between
the time the requester put its work element in the list and when it
executed the architecture code to send the interrupt.  A second
case is a work element is deleted by a third cpu as its ref count
goes to zero and then that element gets reused by the owning cpu,
so we process part of the request list twice.

The solution is to tell the architecture code it's not an error for
the mask to be empty.
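
As a hypothetical timeline of the first case (illustration only):

	/*
	 * requester                     other cpus in the mask
	 *
	 * lock list, add element,
	 * set refs, unlock list
	 *                               take an IPI for an *earlier*
	 *                               request, walk the list, find the
	 *                               new element, run its func and
	 *                               clear their mask bits
	 * arch code iterates over the
	 * element's mask: now empty,
	 * empty-mask assert fires
	 */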

Here is a walk-through of how smp_call_function_many works.  Each cpu
has a queue element in per-cpu space that can be placed on the queue.
When a cpu calls smp_call_function_many, it does a quick check to see
if it could be converted to smp_call_function_single, but at this point
other cpus can still be changing the mask (we only guarantee cpus that
remain in the mask from the start of the call through the list copy
will be called).   Once it decides multiple cpus are needed, it first
waits for its per-cpu data to become free (if the previous caller said
not to wait, it could still be in use).  After it's free it copies
the mask supplied by the caller to its per-cpu data work element.
At that point it removes itself from the mask then counts the bits.
If the count is 0 it is done, but otherwise it obtains the lock,
adds its work to the front of the global list, and stores the count
of cpus.  Starting at this point other cpus could start doing the
work requested (even though they have not received an interrupt yet,
nor have we unlocked the list manipulation lock).  The requesting
cpu proceeds to unlock the list and then calls the architecture code
to request all cpus in the mask be notified there is work for them.
The low level code does its processing and then iterates over the
list sending IPIs to each cpu that is still in the mask in the work
element.  If the caller requested to wait for all (other) cpus to
perform the work before returning, then the requesting cpu waits for
its per-cpu-data work request element to be unlocked by the last cpu
who does the work, after it has removed it from the list.
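
A condensed sketch of that requester-side flow (paraphrased from a
3.2-era kernel/smp.c; names are approximate, and the single-cpu fast
path and memory barriers are omitted):

	void smp_call_function_many(const struct cpumask *mask,
				    smp_call_func_t func, void *info,
				    bool wait)
	{
		struct call_function_data *data = &__get_cpu_var(cfd_data);
		unsigned long flags;
		int refs, this_cpu = smp_processor_id();

		/* wait for any previous, possibly non-waited, use */
		csd_lock(&data->csd);

		data->csd.func = func;
		data->csd.info = info;
		cpumask_and(data->cpumask, mask, cpu_online_mask);
		cpumask_clear_cpu(this_cpu, data->cpumask);
		refs = cpumask_weight(data->cpumask);
		if (!refs) {
			csd_unlock(&data->csd);
			return;
		}

		raw_spin_lock_irqsave(&call_function.lock, flags);
		list_add_rcu(&data->csd.list, &call_function.queue);
		atomic_set(&data->refs, refs);
		raw_spin_unlock_irqrestore(&call_function.lock, flags);

		/* other cpus may already be running func at this point */
		arch_send_call_function_ipi_mask(data->cpumask);

		if (wait)
			csd_lock_wait(&data->csd);
	}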

Whenever a cpu gets an IPI for generic_call_function_interrupt, it
starts at the top of the list and works its way down.  If a given
element has the cpu's bit set in the cpu mask in the work element,
then there will be work to do in the future.  It then proceeds to check
references, to make sure the element has been placed on the list and is
allowed to be processed.  If not, we are guaranteed the requester has
not unlocked the list and therefore has not called the architecture
notifier, so the cpu will get another interrupt in the future and
proceeds to the next element.   If the bit is set and references
are set, then the code will execute the requested function with the
supplied data argument.  This function must not enable interrupts!
(There is a defensive check, but that will only trigger if we take
another call to smp_call_function_interrupt.)  After the call, the
cpu removes itself from the cpu mask and decrements the ref count.
If it removes the last reference, it grabs the list manipulation lock,
removes the element with list_del_rcu, unlocks the list (flushing the
list manipulation writes), and then unlocks the element, allowing
the requesting cpu to reuse the element.  It then goes on to the
next element.

If the cpu suddenly becomes slow here and the element-owning cpu
sees the unlock and is able to reuse the element, our traversal of
next will lead us to the start of the list and we will check some
entries an additional time.  This means we can see entries that were
added to the list after we started processing the list before we get
the interrupt.  Eventually we wrap back to the head of the list and
return from the interrupt.

It would be easy to write some debugging assertions for the list,
for instance:  After grabbing the lock, walk the list and mark which
cpus' per_cpu data is in the list.  Then still under the lock verify
(1) if refs is set the element is on the list (2) refs is <= the
number of cpus in the cpumask (3) any element with refs set is
under csd_lock.   Things that are valid transient states: (1)
refs < bits in mask (short transient), (2) refs = 0 but on list
(waiting for lock to remove) (3) off list but csd_lock held (either
short transient or owning cpu preparing or waiting to put it
on the list).

For added safety under the lock we could count the traversals; it should
never take more than one above nr_cpu_ids (or num_possible_cpus()
for that matter) to get back to the list head.
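
An untested sketch of such an assertion (hypothetical; assumes the
3.2-era call_function queue and csd fields described above, and would
live in kernel/smp.c where those statics are visible):

	static void debug_check_call_function_queue(void)
	{
		struct call_single_data *csd;
		unsigned long flags;
		int walked = 0;

		raw_spin_lock_irqsave(&call_function.lock, flags);
		list_for_each_entry(csd, &call_function.queue, list) {
			struct call_function_data *data =
				container_of(csd, struct call_function_data, csd);
			int refs = atomic_read(&data->refs);

			/* refs set implies the owning cpu still holds the
			 * element's csd lock */
			WARN_ON(refs && !(csd->flags & CSD_FLAG_LOCK));
			/* refs may never exceed the cpus left in the mask */
			WARN_ON(refs > cpumask_weight(data->cpumask));
			/* more than one pass over every possible element
			 * means the list wrapped: corruption */
			WARN_ON(++walked > nr_cpu_ids + 1);
		}
		raw_spin_unlock_irqrestore(&call_function.lock, flags);
	}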

We could also put an assert-type check before taking a cpu
dead that would check in all per_cpu copies that our cpu bit
is not set in any cpu's element.  It should never be set
even transiently if we are offline.

milton

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-12 17:18         ` Mel Gorman
  2012-01-12 19:14           ` Gilad Ben-Yossef
  2012-01-13 20:58           ` Milton Miller
@ 2012-01-13 20:58           ` Milton Miller
  2012-01-19 16:20             ` Mel Gorman
  2 siblings, 1 reply; 21+ messages in thread
From: Milton Miller @ 2012-01-13 20:58 UTC (permalink / raw)
  To: Mel Gorman, Gilad Ben-Yossef; +Cc: linux-kernel, Peter Zijlstra

On Thu Jan 12 2012 about 12:18:59 EST, Mel Gorman wrote:
> On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote:
> > On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote:
> > > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote:
> > > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote:
> > > > > At least one bug report has
> > > > > been seen on ppc64 against a 3.0 era kernel that looked like a CPU
> > > > > receiving interrupts while being offlined.
> > > >
> > > > Got details on that Mel? The preempt_disable() in on_each_cpu() should
> > > > serialize against the stop_machine() crap in unplug.
> > >
> > > I might have added 2 and 2 together and got 5.
> > >
> > > The stack trace clearly was while sending IPIs in on_each_cpu() and
> > > always when under memory pressure and stuck in direct reclaim. This was
> > > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I
> > > thought get_online_cpus() would be necessary.

Well, stop_machine has to be selected by the scheduler, so we have to
get back and call schedule() at some point to switch to that thread.
.. unless it is the one allocating memory.

> >
> > For non-preempt the required scheduling of stop_machine() will have to
> > wait even longer. Still there might be something funny, some of the
> > hotplug notifiers are run before the stop_machine thing does its thing
> > so there might be some fun interaction.
> 
> Ok, how about this as a replacement patch?
> 
> ---8<---
> From: Mel Gorman <mgorman@xxxxxxx>
> Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
> 
> While running a CPU hotplug stress test under memory pressure, it
> was observed that the machine would halt with no messages logged
> to console. This is difficult to trigger and required a machine
> with 8 cores and plenty of memory. In at least one case on ppc64,
> the warning in include/linux/cpumask.h:107 triggered, implying that
> IPIs are being sent to offline CPUs in some cases.

That is
        WARN_ON_ONCE(cpu >= nr_cpumask_bits);

That has nothing to do with cpus going offline!

nr_cpumask_bits is set during boot based on cpu_possible_mask.  If you
see that triggered it's a direct bug in the caller.  Either it's looking
at random memory in a NR_CPUS loop or it's assuming that there is
another cpu in a mask and not checking for cpumask_next returning
nr_cpumask_bits.

Again it has nothing to do with hotplug (unless it's assuming there are
n online cpus in a loop instead of looking at the return value of
the function).
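
To illustrate the bug class being described (a sketch only;
do_work_on() is a made-up stand-in for whatever the caller does with
the cpu number), the return value of cpumask_next() must be checked
against nr_cpu_ids before being used as an index:

	int next = cpumask_next(cpu, mask);

	if (next >= nr_cpu_ids)
		return;		/* no further cpu set in the mask */
	do_work_on(next);	/* only now is next a valid cpu index */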


> 
> A suspicious part of the problem is that the page allocator is sending
> IPIs using on_each_cpu() without calling get_online_cpus() to prevent
> changes to the online cpumask. It is depending on preemption being
> disabled to protect it, which is a no-op on !PREEMPT kernels. This means
> that a thread can be reading the mask in smp_call_function_many() when
> an attempt is made to take a CPU offline. The expectation is that this
> is not a problem as the stop_machine() used during CPU hotplug should
> be able to prevent any problems, as the reader of the online mask will
> prevent stop_machine from making forward progress, but it's unhelpful.

And without CONFIG_PREEMPT, we won't be able to schedule away from the
current task over to the stop_machine (migration/NN) thread.

> 
> On the other side, the mask can also be read while the CPU is being
> brought online. In this case it is the responsibility of the
> architecture that the CPU is able to receive and handle interrupts
> before being marked active but that does not mean they always get it
> right.

Yes.  See my other reply for some things we can do to help find bugs
with smp_call_function_many (and on_each_cpu).

> 
> Sending excessive IPIs from the page allocator is a bad idea. In low
> memory situations, a large number of processes can drain the per-cpu
> lists at the same time, in quick succession and on many CPUs which is
> pointless. In light of this and the unspecific CPU hotplug concerns,
> this patch removes the call to drain_all_pages() after failing direct
> reclaim. To avoid impacting high-order allocation success rates,
> it still drains the local per-cpu lists for high-order allocations
> that failed.

"There is some bug somewhere.  This seems like a big slow pain I
don't think this is likely to have much impact but if I am wrong we
will just OOM early."  vs Gilad's "Lets reduce the pain of this slow
path by doing just the required work".

Let's find the real bug.

milton

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-13 20:58           ` Milton Miller
@ 2012-01-15 13:53             ` Gilad Ben-Yossef
  0 siblings, 0 replies; 21+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-15 13:53 UTC (permalink / raw)
  To: Milton Miller; +Cc: Mel Gorman, linux-kernel, Peter Zijlstra

On Fri, Jan 13, 2012 at 10:58 PM, Milton Miller <miltonm@bga.com> wrote:
> On Thu, 12 Jan 2012 about 14:14:57 -0500 (EST), Gilad Ben-Yossef wrote:
>> I also think there are still some problems with IPIs somewhere that
>> cause some corruption when a lot of IPIs are sent, and that
>> the patch simply lowered the very big number of IPIs sent
>> via the direct reclaim code path, so the problem
>> is hidden rather than solved by this patch.
>>
>> I've seen something related when trying to test the IPI reduction
>> patches.  Interestingly enough, it was not related to CPU hotplug at all -
>
>> When a lot of IPIs are being sent, I sometimes got an assert from low
>> level platform code that I'm trying to send IPIs with an empty mask
>> although the mask was NOT empty. I didn't manage to debug it then but
>> I did manage to recreate it quite easily.
>
> That is a known scenario and an expected race; the check must be removed
> from the platform code.

OK.

>
>>
>> I will see if I can recreate it with recent master and report.
>>
>
> The code in kernel/smp checks the mask, and looks for opportunities to
> convert smp_call_function_many to smp_call_function_single.  But it now
> tolerates callers passing a mask that other cpus are changing.  So
> while it tries to make sure the mask has > 1 cpu, it's not guaranteed.

Yes, that much I figured out already, even counted on this behavior.

> But that is not what causes the assert to fire.

Yes, my test case involves on_each_cpu with hotplug disabled.

>
> There are a few opportunities for the architecture code to detect an
> empty mask.  The first case is a cpu starting to process the list due
> to an interrupt generated by a cpu other than the requester, between
> the time the requester put its work element on the list and when it
> executed the architecture code to send the interrupt.  A second
> case is a work element being deleted by a third cpu as its ref count
> goes to zero, after which that element gets reused by the owning cpu,
> so we process part of the request list twice.
>

OK, that makes sense.

> The solution is to tell the architecture code it's not an error for
> the mask to be empty.

I believe this patch is in order then:

diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index cce91bf..00b68a3 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -106,7 +106,10 @@ void default_send_IPI_mask_logical(const struct cpumask *cpumask, int vector)
 	unsigned long mask = cpumask_bits(cpumask)[0];
 	unsigned long flags;

-	if (WARN_ONCE(!mask, "empty IPI mask"))
+	if (!mask)
+		/* The target CPUs must have already processed the
+		 * work items due to a concurrent IPI
+		 */
 		return;

 	local_irq_save(flags);

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>


> Here is a walk though of how smp_call_function_many works.
<SNIP>
> the interrupt.  Eventually we wrap back to the head of the list and
> return from the interrupt.

Would you mind if I pasted the above into Documentation/ipi.txt
with your name on it?

It could save the next potential bloke trying to get his head around
IPIs a lot of head scratching...


Also, it means I have no idea about the original problem with hotplug :-)

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-13 20:58           ` Milton Miller
@ 2012-01-19 16:20             ` Mel Gorman
  2012-01-19 21:46               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2012-01-19 16:20 UTC (permalink / raw)
  To: Milton Miller; +Cc: Gilad Ben-Yossef, linux-kernel, Peter Zijlstra

On Fri, Jan 13, 2012 at 02:58:39PM -0600, Milton Miller wrote:
> > > > The stack trace clearly was while sending IPIs in on_each_cpu() and
> > > > always when under memory pressure and stuck in direct reclaim. This was
> > > > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I
> > > > thought get_online_cpus() would be necessary.
> 
> Well, stop_machine has to be selected by the scheduler, so we have to
> get back and call schedule() at some point to switch to that thread.
> .. unless it is the one allocating memory.
> 
> > >
> > > For non-preempt the required scheduling of stop_machine() will have to
> > > wait even longer. Still there might be something funny, some of the
> > > hotplug notifiers are run before the stop_machine thing does its thing
> > > so there might be some fun interaction.
> > 
> > Ok, how about this as a replacement patch?
> > 
> > ---8<---
> > From: Mel Gorman <mgorman@xxxxxxx>
> > Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
> > 
> > While running a CPU hotplug stress test under memory pressure, it
> > was observed that the machine would halt with no messages logged
> > to console. This is difficult to trigger and required a machine
> > with 8 cores and plenty of memory. In at least one case on ppc64,
> > the warning in include/linux/cpumask.h:107 triggered, implying that
> > IPIs are being sent to offline CPUs in some cases.
> 
> That is
>         WARN_ON_ONCE(cpu >= nr_cpumask_bits);
> 
> That has nothing to do with cpus going offline!
> 

You're right. This particular warning was caused by xmon starting
up while it was handling another exception. There was not enough
information in the bug to tell exactly what caused the original
exception. A CPU was being offlined at the time and the system was
under memory pressure but they are not necessarily related. I'll drop
this from the changelog.

> nr_cpumask_bits is set during boot based on cpu_possible_mask.  If you
> see that triggered it's a direct bug in the caller.  Either it's looking
> at random memory in a NR_CPUS loop or it's assuming that there is
> another cpu in a mask and not checking for cpumask_next returning
> nr_cpumask_bits.
> 
> Again it has nothing to do with hotplug (unless it's assuming there are
> n online cpus in a loop instead of looking at the return value of
> the function).
> 

xmon does some funny things around num_online_cpus() but tracking
down the internals of xmon is not useful. A more serious problem had
already triggered if xmon was starting at all.

> > A suspicious part of the problem is that the page allocator is sending
> > IPIs using on_each_cpu() without calling get_online_cpus() to prevent
> > changes to the online cpumask. It is depending on preemption being
> > disabled to protect it, which is a no-op on !PREEMPT kernels. This means
> > that a thread can be reading the mask in smp_call_function_many() when
> > an attempt is made to take a CPU offline. The expectation is that this
> > is not a problem as the stop_machine() used during CPU hotplug should
> > be able to prevent any problems, as the reader of the online mask will
> > prevent stop_machine from making forward progress, but it's unhelpful.
> 
> And without CONFIG_PREEMPT, we won't be able to schedule away from the
> current task over to the stop_machine (migration/NN) thread.
> 

On a different x86-64 machine with an Intel-specific MCE, I have
also noted that the value of num_online_cpus() can change while
stop_machine() is running. This is sensitive to timing and part of
the problem seems to be due to cmci_rediscover() running without the
CPU hotplug mutex held. This is not related to the IPI mess and is
unrelated to memory pressure but is just to note that CPU hotplug in
general can be fragile in parts.

> > On the other side, the mask can also be read while the CPU is being
> > brought online. In this case it is the responsibility of the
> > architecture that the CPU is able to receive and handle interrupts
> > before being marked active but that does not mean they always get it
> > right.
> 
> Yes.  See my other reply for some things we can do to help find bugs
> with smp_call_function_many (and on_each_cpu).
> 

Thanks for that.

> > Sending excessive IPIs from the page allocator is a bad idea. In low
> > memory situations, a large number of processes can drain the per-cpu
> > lists at the same time, in quick succession and on many CPUs which is
> > pointless. In light of this and the unspecific CPU hotplug concerns,
> > this patch removes the call to drain_all_pages() after failing direct
> > reclaim. To avoid impacting high-order allocation success rates,
> > it still drains the local per-cpu lists for high-order allocations
> > that failed.
> 
> "There is some bug somewhere.  This seems like a big slow pain I
> don't think this is likely to have much impact but if I am wrong we
> will just OOM early."  vs Gilad's "Lets reduce the pain of this slow
> path by doing just the required work".
> 
> Let's find the real bug.
> 

Even when the underlying bug related to IPIs is isolated and fixed,
the patch still makes sense. As noted in the changelog of the latest
version, high-order allocations still drain the local list to
minimise the impact. For order-0 allocations failing in this situation,
we must already be at the min watermark, obviously. For an IPI to
help avoid an OOM for an order-0 allocation, we would need enough
pages on the per-cpu lists to meet the watermark, with no other
allocation/freeing activity on those CPUs, and the process trying
to allocate memory never getting scheduled on another CPU. This is a
very improbable combination of events, which is why I don't think the
patch makes any difference to going OOM early.
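
For reference, a hedged sketch of the behaviour described above (not
the patch as posted): after direct reclaim fails for a high-order
request, drain only the local per-cpu lists instead of IPIing every
cpu, using the 3.2-era drain_pages() helper:

	if (!page && order > 0) {
		unsigned int cpu = get_cpu();	/* disables preemption */

		drain_pages(cpu);		/* this cpu's pcp lists only */
		put_cpu();
	}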

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-19 16:20             ` Mel Gorman
@ 2012-01-19 21:46               ` Srivatsa S. Bhat
  2012-01-20  8:48                 ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Srivatsa S. Bhat @ 2012-01-19 21:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Milton Miller, Gilad Ben-Yossef, linux-kernel, Peter Zijlstra,
	linux-mm, linux-fsdevel, akpm, Russell King - ARM Linux,
	Paul E. McKenney, mszeredi, ebiederm, Greg Kroah-Hartman,
	gong.chen, Tony Luck, Borislav Petkov, tglx, mingo, hpa, x86,
	linux-edac, Andi Kleen

[Reinstating the original Cc list]

On 01/19/2012 09:50 PM, Mel Gorman wrote:

> On a different x86-64 machine with an Intel-specific MCE, I have
> also noted that the value of num_online_cpus() can change while
> stop_machine() is running.


That is expected and intentional, right?  Meaning, it is during the
stop_machine() operation itself that a CPU is actually taken offline,
and at the same time it is removed from the cpu_online_mask.

On Intel boxes, essentially, the following gets executed on the dying
CPU, as set up by the stop_machine stuff.

__cpu_disable()
    native_cpu_disable()
        cpu_disable_common()
            remove_cpu_from_maps()
                set_cpu_online(cpu, false)
			^^^^^^
So, set_cpu_online will remove this CPU from the cpu_online_mask.
And all this runs while still under the stop machine context.
And this is exactly what we want right?

> This is sensitive to timing and part of
> the problem seems to be due to cmci_rediscover() running without the
> CPU hotplug mutex held. This is not related to the IPI mess and is
> unrelated to memory pressure but is just to note that CPU hotplug in
> general can be fragile in parts.
> 


For the cmci_rediscover() part, I feel a simple get/put_online_cpus()
around it should work.

Something like the following patch? (It is untested at the moment, but
I will run it later and see if it works well).

I would like the opinion of the MCE/Intel maintainers as to whether this
is a proper fix or whether something else would be better.

----
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] x86/intel mce: Fix race with CPU hotplug in cmci_rediscover()

cmci_rediscover() is invoked upon the CPU_POST_DEAD notification, when
the cpu_hotplug lock is no longer held. And cmci_rediscover() iterates
over all the online cpus. So this can race with an ongoing CPU hotplug
operation. Fix this by wrapping the iteration code within the pair
get_online_cpus() / put_online_cpus().

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kernel/cpu/mcheck/mce_intel.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 38e49bc..1c30397 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -10,6 +10,7 @@
 #include <linux/interrupt.h>
 #include <linux/percpu.h>
 #include <linux/sched.h>
+#include <linux/cpu.h>
 #include <asm/apic.h>
 #include <asm/processor.h>
 #include <asm/msr.h>
@@ -179,6 +180,7 @@ void cmci_rediscover(int dying)
 		return;
 	cpumask_copy(old, &current->cpus_allowed);
 
+	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		if (cpu == dying)
 			continue;
@@ -188,6 +190,7 @@ void cmci_rediscover(int dying)
 		if (cmci_supported(&banks))
 			cmci_discover(banks, 0);
 	}
+	put_online_cpus();
 
 	set_cpus_allowed_ptr(current, old);
 	free_cpumask_var(old);



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context
  2012-01-19 21:46               ` Srivatsa S. Bhat
@ 2012-01-20  8:48                 ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2012-01-20  8:48 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Milton Miller, Gilad Ben-Yossef, linux-kernel, Peter Zijlstra,
	linux-mm, linux-fsdevel, akpm, Russell King - ARM Linux,
	Paul E. McKenney, mszeredi, ebiederm, Greg Kroah-Hartman,
	gong.chen, Tony Luck, Borislav Petkov, tglx, mingo, hpa, x86,
	linux-edac, Andi Kleen

On Fri, Jan 20, 2012 at 03:16:58AM +0530, Srivatsa S. Bhat wrote:
> [Reinstating the original Cc list]
> 
> On 01/19/2012 09:50 PM, Mel Gorman wrote:
> 
> > On a different x86-64 machine with an Intel-specific MCE, I have
> > also noted that the value of num_online_cpus() can change while
> > stop_machine() is running.
> 
> 
> That is expected and intentional, right?  Meaning, it is during the
> stop_machine() operation itself that a CPU is actually taken offline,
> and at the same time it is removed from the cpu_online_mask.
> 

It's intentional sometimes and not others. Sometimes the machine
halts and stays there.

> On Intel boxes, essentially, the following gets executed on the dying
> CPU, as set up by the stop_machine stuff.
> 
> __cpu_disable()
>     native_cpu_disable()
>         cpu_disable_common()
>             remove_cpu_from_maps()
>                 set_cpu_online(cpu, false)
> 			^^^^^^
> So, set_cpu_online will remove this CPU from the cpu_online_mask.
> And all this runs while still under the stop machine context.
> And this is exactly what we want right?
> 

We don't want it to halt in stop_machine forever, waiting on
acknowledgements that are never received, until the NMI handler fires.

> > This is sensitive to timing and part of
> > the problem seems to be due to cmci_rediscover() running without the
> > CPU hotplug mutex held. This is not related to the IPI mess and is
> > unrelated to memory pressure but is just to note that CPU hotplug in
> > general can be fragile in parts.
> > 
> 
> 
> For the cmci_rediscover() part, I feel a simple get/put_online_cpus()
> around it should work.
> 

Yeah, that's the first thing I tried too. It doesn't work though.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-01-20  8:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-11 10:11 [RFC PATCH 0/2] Improve reliability of CPU hotplug Mel Gorman
2012-01-11 10:11 ` [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Mel Gorman
2012-01-11 17:11   ` Eric W. Biederman
2012-01-11 18:07     ` Mel Gorman
2012-01-11 19:02       ` Eric W. Biederman
2012-01-11 10:11 ` [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Mel Gorman
2012-01-12 14:51   ` Gilad Ben-Yossef
2012-01-12 15:08     ` Peter Zijlstra
2012-01-12 15:13       ` Gilad Ben-Yossef
2012-01-12 15:08     ` Gilad Ben-Yossef
2012-01-12 15:18   ` Peter Zijlstra
2012-01-12 15:37     ` Mel Gorman
2012-01-12 15:52       ` Peter Zijlstra
2012-01-12 17:18         ` Mel Gorman
2012-01-12 19:14           ` Gilad Ben-Yossef
2012-01-13 20:58           ` Milton Miller
2012-01-15 13:53             ` Gilad Ben-Yossef
2012-01-13 20:58           ` Milton Miller
2012-01-19 16:20             ` Mel Gorman
2012-01-19 21:46               ` Srivatsa S. Bhat
2012-01-20  8:48                 ` Mel Gorman
