* [PATCH] Fix the race between smp_call_function and CPU booting
@ 2012-03-12  9:27 Liu, Chuansheng
  2012-03-12  9:32 ` Peter Zijlstra
  2012-03-12  9:58 ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-12  9:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Yanmin Zhang, tglx, peterz

From: liu chuansheng <chuansheng.liu@intel.com>
Subject: [PATCH] Fix the race between smp_call_function and CPU booting

When the system is waking up from the suspend state, smp_call_function() is sometimes called and can deadlock, especially on platforms that have only two CPUs.

CPU0:                                           CPU1:
pm_suspend -->
suspend_devices_and_enter -->
enable_nonboot_cpus -->
_cpu_up -->
__cpu_up -->
native_cpu_up
                                                start_secondary
                                                        -- set cpu online
                                                        -- waiting for the active state
new thread call:
smp_call_function -->
smp_call_function_many -->
smp_call_function_single -->
   -- csd_lock
   -- generic_exec_single -->
   -- arch_send_call_function_single_ipi
   -- csd_lock_wait

At this point both CPUs are blocked. Normally CPU0 would mark CPU1 active after the _cpu_up() call finishes, but CPU1 cannot handle the IPI because the corresponding irq is still disabled; it is only enabled after the wait for the active state completes.

The solution is to send the smp call to active CPUs instead of online CPUs.

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
---
 kernel/smp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index fb67dfa..e67674b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -573,7 +573,7 @@ EXPORT_SYMBOL(smp_call_function_many);
 int smp_call_function(smp_call_func_t func, void *info, int wait)
 {
        preempt_disable();
-       smp_call_function_many(cpu_online_mask, func, info, wait);
+       smp_call_function_many(cpu_active_mask, func, info, wait);
        preempt_enable();
 
        return 0;
--
1.7.0.4




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-12  9:27 [PATCH] Fix the race between smp_call_function and CPU booting Liu, Chuansheng
@ 2012-03-12  9:32 ` Peter Zijlstra
  2012-03-13  6:43   ` Liu, Chuansheng
  2012-03-12  9:58 ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-12  9:32 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Mon, 2012-03-12 at 09:27 +0000, Liu, Chuansheng wrote:
> From: liu chuansheng <chuansheng.liu@intel.com>
> Subject: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> When system is waking up from suspend state, sometimes the
> smp_call_function is called which will cause deadlock specially on the
> platform which just has two CPUs.
> 
> CPU0:                                           CPU1:
> pm_suspend -->
> suspend_devices_and_enter -->
> enable_nonboot_cpus -->
> _cpu_up -->
> __cpu_up -->
> native_cpu_up
>                                                 start_secondary
>                                                         -- set cpu online
>                                                         -- waiting for the active state new thread call:
> smp_call_function -->
> smp_call_function_many -->
> smp_call_function_single -->
>    -- csd_lock
>    -- generic_exec_single -->
>    -- arch_send_call_function_single_ipi
>    -- csd_lock_wait
> 
> At this time, both CPUs are blocked. Normally the CPU0 will set the
> CPU1 with active state after finished the _cpu_up calling, but CPU1
> can not handle the IPI due to the corresponding irq is still disabled,
> which will be enabled after waiting for the active state.
> 
> The solution is just to send smp call to active cpus instead of online
> cpus.

Wouldn't http://lkml.org/lkml/2011/12/15/255 also solve that?


* Re: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-12  9:27 [PATCH] Fix the race between smp_call_function and CPU booting Liu, Chuansheng
  2012-03-12  9:32 ` Peter Zijlstra
@ 2012-03-12  9:58 ` Peter Zijlstra
  2012-03-13  6:46   ` Liu, Chuansheng
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-12  9:58 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Mon, 2012-03-12 at 09:27 +0000, Liu, Chuansheng wrote:
> The solution is just to send smp call to active cpus instead of online
> cpus.

BTW, that solution is broken because !active cpus can still be online
and actually running stuff, so sending IPIs to them is perfectly fine
and actually required.




* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-12  9:32 ` Peter Zijlstra
@ 2012-03-13  6:43   ` Liu, Chuansheng
  0 siblings, 0 replies; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-13  6:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Monday, March 12, 2012 5:33 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: Re: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Mon, 2012-03-12 at 09:27 +0000, Liu, Chuansheng wrote:
> > From: liu chuansheng <chuansheng.liu@intel.com>
> > Subject: [PATCH] Fix the race between smp_call_function and CPU
> > booting
> >
> > When system is waking up from suspend state, sometimes the
> > smp_call_function is called which will cause deadlock specially on the
> > platform which just has two CPUs.
> >
> > CPU0:                                           CPU1:
> > pm_suspend -->
> > suspend_devices_and_enter -->
> > enable_nonboot_cpus -->
> > _cpu_up -->
> > __cpu_up -->
> > native_cpu_up
> >                                                 start_secondary
> >                                                         -- set
> cpu online
> >                                                         --
> waiting for the active state new thread call:
> > smp_call_function -->
> > smp_call_function_many -->
> > smp_call_function_single -->
> >    -- csd_lock
> >    -- generic_exec_single -->
> >    -- arch_send_call_function_single_ipi
> >    -- csd_lock_wait
> >
> > At this time, both CPUs are blocked. Normally the CPU0 will set the
> > CPU1 with active state after finished the _cpu_up calling, but CPU1
> > can not handle the IPI due to the corresponding irq is still disabled,
> > which will be enabled after waiting for the active state.
> >
> > The solution is just to send smp call to active cpus instead of online
> > cpus.
> 
> Wouldn't http://lkml.org/lkml/2011/12/15/255 also solve that?
Yes, the patch at http://lkml.org/lkml/2011/12/15/255 also solves it. But it moves the setting of the active state into start_secondary via the cpu_starting cpu_notifier.
If so, an even simpler solution is to call set_cpu_active() directly in start_secondary(); then there is no need to register the cpu_online/cpu_starting cpu_notifier in migration_init() at all.



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-12  9:58 ` Peter Zijlstra
@ 2012-03-13  6:46   ` Liu, Chuansheng
  2012-03-13 15:57     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-13  6:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Monday, March 12, 2012 5:58 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: Re: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Mon, 2012-03-12 at 09:27 +0000, Liu, Chuansheng wrote:
> > The solution is just to send smp call to active cpus instead of online
> > cpus.
> 
> BTW, that solution is broken because !active cpus can still be online and
> actually running stuff, so sending IPIs to them is perfectly fine and actually
> required.
> 
Based on the current code base, the IPI can be handled only after the CPU is active; before the CPU is active, sending any IPI is meaningless.


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-13  6:46   ` Liu, Chuansheng
@ 2012-03-13 15:57     ` Peter Zijlstra
  2012-03-14  6:27       ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-13 15:57 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Tue, 2012-03-13 at 06:46 +0000, Liu, Chuansheng wrote:
> 
> > -----Original Message-----
> > From: Peter Zijlstra [mailto:peterz@infradead.org]
> > Sent: Monday, March 12, 2012 5:58 PM
> > To: Liu, Chuansheng
> > Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> > Subject: Re: [PATCH] Fix the race between smp_call_function and CPU booting
> > 
> > On Mon, 2012-03-12 at 09:27 +0000, Liu, Chuansheng wrote:
> > > The solution is just to send smp call to active cpus instead of online
> > > cpus.
> > 
> > BTW, that solution is broken because !active cpus can still be online and
> > actually running stuff, so sending IPIs to them is perfectly fine and actually
> > required.
> > 
> Based on current code base, the IPI can be handled only after the cpu is active, before CPU is active,
> Any sending IPI is meanless.

Not so in the unplug case: we clear active long before we actually go
offline and rebuild the sched_domains.


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-13 15:57     ` Peter Zijlstra
@ 2012-03-14  6:27       ` Liu, Chuansheng
  2012-03-14  9:43         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-14  6:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Tuesday, March 13, 2012 11:58 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Tue, 2012-03-13 at 06:46 +0000, Liu, Chuansheng wrote:
> >
> > > -----Original Message-----
> > > From: Peter Zijlstra [mailto:peterz@infradead.org]
> > > Sent: Monday, March 12, 2012 5:58 PM
> > > To: Liu, Chuansheng
> > > Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> > > Subject: Re: [PATCH] Fix the race between smp_call_function and CPU
> > > booting
> > >
> > > On Mon, 2012-03-12 at 09:27 +0000, Liu, Chuansheng wrote:
> > > > The solution is just to send smp call to active cpus instead of
> > > > online cpus.
> > >
> > > BTW, that solution is broken because !active cpus can still be
> > > online and actually running stuff, so sending IPIs to them is
> > > perfectly fine and actually required.
> > >
> > Based on current code base, the IPI can be handled only after the cpu
> > is active, before CPU is active, Any sending IPI is meanless.
> 
> Not so on the unplug case, we clear active long before we actually go offline
> and rebuild sched_domains.
In the unplug case, after the CPU is set !active, we do not need IPI handling for that
CPU before it is set offline. I did not find any impact from limiting smp_call_function()
to CPUs that are active.

I did a stress test that starts two different scripts concurrently:
1/ an onoff_line script like the one below:
while true
do
 echo 0 > /sys/devices/system/cpu/cpu1/online
 echo 1 > /sys/devices/system/cpu/cpu1/online
done
2/ a simple sysfs interface that triggers smp_call_function():
test_set()
{
  smp_call_function(...);
}

A script writes to this interface in a loop, triggering the call every 500 ms.


The result is:
1/ without any patch, the deadlock issue is very easy to reproduce;

2/ with your patch http://lkml.org/lkml/2011/12/15/255, the issue below is always hit, and the system hangs there.
I think it is because the booted CPU1 is set active too early, while the online bit is not yet set.
[  721.759736] cpu_down
[  721.822193] LCS test smp_call_function
[  721.864892] CPU 1 is now offline
[  721.868270] SMP alternatives: switching to UP code
[  721.886925] _cpu_up
[  721.892222] SMP alternatives: switching to SMP code
[  721.906420] Booting Node 0 Processor 1 APIC 0x1
[  721.921177] Initializing CPU#1
[  721.981898] ------------[ cut here ]------------
[  721.989553] WARNING: at /root/r3_ics/hardware/intel/linux-2.6/arch/x86/kernel/smp.c:118 native_smp_send_reschedule+0x50/0x60()
[  722.000923] Hardware name: Medfield
[  722.004401] Modules linked in: atomisp lm3554 mt9m114 mt9e013 videobuf_vmalloc videobuf_core mac80211 cfg80211 compat btwilink st_drv
[  722.016408] Pid: 18865, comm: workqueue_trust Not tainted 3.0.8-137166-g2639a16-dirty #1
[  722.024486] Call Trace:
[  722.026939]  [<c1252287>] warn_slowpath_common+0x77/0x130
[  722.032321]  [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[  722.038314]  [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[  722.044316]  [<c1252362>] warn_slowpath_null+0x22/0x30
[  722.049445]  [<c121df70>] native_smp_send_reschedule+0x50/0x60
[  722.055268]  [<c124bacf>] try_to_wake_up+0x17f/0x390
[  722.060225]  [<c124bd34>] wake_up_process+0x14/0x20
[  722.065091]  [<c1277107>] kthread_stop+0x37/0x100
[  722.069789]  [<c126f5e0>] destroy_worker+0x50/0x90
[  722.074573]  [<c18c1b4d>] trustee_thread+0x3e3/0x4bf
[  722.079524]  [<c1277410>] ? wake_up_bit+0x90/0x90
[  722.084224]  [<c18c176a>] ? wait_trustee_state+0x91/0x91
[  722.089520]  [<c1276fc4>] kthread+0x74/0x80
[  722.093694]  [<c1276f50>] ? __init_kthread_worker+0x30/0x30
[  722.099264]  [<c18c7cfa>] kernel_thread_helper+0x6/0x10
[  722.104474] ---[ end trace fa5bcc15ece677c6 ]---

3/ With my patch, the system has run for 1 hour without hitting the issue.
I will keep the stress test running for a much longer time.

Any other good solution? If anything is wrong, please correct me, thanks.



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-14  6:27       ` Liu, Chuansheng
@ 2012-03-14  9:43         ` Peter Zijlstra
  2012-03-15  0:11           ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-14  9:43 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Wed, 2012-03-14 at 06:27 +0000, Liu, Chuansheng wrote:
> On the unplug case, after set the CPU to !active, we do not need IPI
> handling for the corresponding
> CPU before it is set to offline. I did not find any impact that
> limiting the smp_call_function
> just after CPU is active. 

Have a look at Alpha: its flush_tlb_mm() can use smp_call_function(),
and in the !active,online case you very much still need to TLB flush that
cpu.

The fact that it works on a limited use case on x86 doesn't say anything
much at all.


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-14  9:43         ` Peter Zijlstra
@ 2012-03-15  0:11           ` Liu, Chuansheng
  2012-03-15 10:46             ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-15  0:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Wednesday, March 14, 2012 5:43 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Wed, 2012-03-14 at 06:27 +0000, Liu, Chuansheng wrote:
> > On the unplug case, after set the CPU to !active, we do not need IPI
> > handling for the corresponding CPU before it is set to offline. I did
> > not find any impact that limiting the smp_call_function just after CPU
> > is active.
> 
> Have a look at Alpha, it's flush_tlb_mm() can use smp_call_function(), in
> the !active,online case you very much still need to tlb flush that cpu.
> 
> The fact that it works on a limited use case on x86 doesn't say anything much
> at all.
Thanks for pointing that out. Do you have any better solution for this issue?
As the stress test result shows, moving the active setting to before the online setting
breaks something as well.


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-15  0:11           ` Liu, Chuansheng
@ 2012-03-15 10:46             ` Peter Zijlstra
  2012-03-16  6:24               ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-15 10:46 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Thu, 2012-03-15 at 00:11 +0000, Liu, Chuansheng wrote:
> 
> > -----Original Message-----
> > From: Peter Zijlstra [mailto:peterz@infradead.org]
> > Sent: Wednesday, March 14, 2012 5:43 PM
> > To: Liu, Chuansheng
> > Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> > Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> > 
> > On Wed, 2012-03-14 at 06:27 +0000, Liu, Chuansheng wrote:
> > > On the unplug case, after set the CPU to !active, we do not need IPI
> > > handling for the corresponding CPU before it is set to offline. I did
> > > not find any impact that limiting the smp_call_function just after CPU
> > > is active.
> > 
> > Have a look at Alpha, it's flush_tlb_mm() can use smp_call_function(), in
> > the !active,online case you very much still need to tlb flush that cpu.
> > 
> > The fact that it works on a limited use case on x86 doesn't say anything much
> > at all.

> Thanks your pointing out, do you have any other perfect solution for this issue?
> As for the stress test result, advancing the setting active before setting online broken
> something either.

I'm not sure I understand.. are you saying that commit
5fbd036b552f633abb394a319f7c62a5c86a9cd7 in tip/master broke something?


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-15 10:46             ` Peter Zijlstra
@ 2012-03-16  6:24               ` Liu, Chuansheng
  2012-03-16  9:49                 ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-16  6:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Thursday, March 15, 2012 6:47 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Thu, 2012-03-15 at 00:11 +0000, Liu, Chuansheng wrote:
> >
> > > -----Original Message-----
> > > From: Peter Zijlstra [mailto:peterz@infradead.org]
> > > Sent: Wednesday, March 14, 2012 5:43 PM
> > > To: Liu, Chuansheng
> > > Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> > > Subject: RE: [PATCH] Fix the race between smp_call_function and CPU
> > > booting
> > >
> > > On Wed, 2012-03-14 at 06:27 +0000, Liu, Chuansheng wrote:
> > > > On the unplug case, after set the CPU to !active, we do not need
> > > > IPI handling for the corresponding CPU before it is set to
> > > > offline. I did not find any impact that limiting the
> > > > smp_call_function just after CPU is active.
> > >
> > > Have a look at Alpha, it's flush_tlb_mm() can use
> > > smp_call_function(), in the !active,online case you very much still need to
> tlb flush that cpu.
> > >
> > > The fact that it works on a limited use case on x86 doesn't say
> > > anything much at all.
> 
> > Thanks your pointing out, do you have any other perfect solution for this
> issue?
> > As for the stress test result, advancing the setting active before
> > setting online broken something either.
> 
> I'm not sure I understand.. are you saying that commit
> 5fbd036b552f633abb394a319f7c62a5c86a9cd7 in tip/master broke
> something?
Yes. I did the stress test, and commit 5fbd036b552f633abb394a319f7c62a5c86a9cd7
broke something, as the traces below show:
[  721.989553] WARNING: at /root/r3_ics/hardware/intel/linux-2.6/arch/x86/kernel/smp.c:118 native_smp_send_reschedule+0x50/0x60()
[  722.000923] Hardware name: Medfield
[  722.004401] Modules linked in: atomisp lm3554 mt9m114 mt9e013 videobuf_vmalloc videobuf_core mac80211 cfg80211 compat btwilink st_drv
[  722.016408] Pid: 18865, comm: workqueue_trust Not tainted 3.0.8-137166-g2639a16-dirty #1
[  722.024486] Call Trace:
[  722.026939]  [<c1252287>] warn_slowpath_common+0x77/0x130
[  722.032321]  [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[  722.038314]  [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[  722.044316]  [<c1252362>] warn_slowpath_null+0x22/0x30
[  722.049445]  [<c121df70>] native_smp_send_reschedule+0x50/0x60
[  722.055268]  [<c124bacf>] try_to_wake_up+0x17f/0x390
[  722.060225]  [<c124bd34>] wake_up_process+0x14/0x20
[  722.065091]  [<c1277107>] kthread_stop+0x37/0x100
[  722.069789]  [<c126f5e0>] destroy_worker+0x50/0x90
[  722.074573]  [<c18c1b4d>] trustee_thread+0x3e3/0x4bf
[  722.079524]  [<c1277410>] ? wake_up_bit+0x90/0x90
[  722.084224]  [<c18c176a>] ? wait_trustee_state+0x91/0x91
[  722.089520]  [<c1276fc4>] kthread+0x74/0x80
[  722.093694]  [<c1276f50>] ? __init_kthread_worker+0x30/0x30
[  722.099264]  [<c18c7cfa>] kernel_thread_helper+0x6/0x10
[  722.104474] ---[ end trace fa5bcc15ece677c6 ]---


Based on your patch, I made a small modification; what do you think of it?
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index cdeb727..d616ed5 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -295,13 +295,6 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
         */
        percpu_timer_setup();
 
-       while (!cpu_active(cpu))
-               cpu_relax();
-
-       /*
-        * cpu_active bit is set, so it's safe to enalbe interrupts
-        * now.
-        */
        local_irq_enable();
        local_fiq_enable();
 
diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
index c871a2c..0123c63 100644
--- a/arch/hexagon/kernel/smp.c
+++ b/arch/hexagon/kernel/smp.c
@@ -179,8 +179,6 @@ void __cpuinit start_secondary(void)
        printk(KERN_INFO "%s cpu %d\n", __func__, current_thread_info()->cpu);
 
        set_cpu_online(cpu, true);
-       while (!cpumask_test_cpu(cpu, cpu_active_mask))
-               cpu_relax();
        local_irq_enable();
 
        cpu_idle();
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 2398ce6..b0e28c4 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -550,12 +550,6 @@ int __cpuinit start_secondary(void *cpuvoid)
        S390_lowcore.restart_psw.addr =
                PSW_ADDR_AMODE | (unsigned long) psw_restart_int_handler;
        __ctl_set_bit(0, 28); /* Enable lowcore protection */
-       /*
-        * Wait until the cpu which brought this one up marked it
-        * active before enabling interrupts.
-        */
-       while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
-               cpu_relax();
        local_irq_enable();
        /* cpu_idle will call schedule for us */
        cpu_idle();
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 66d250c..58f7816 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -291,19 +291,6 @@ notrace static void __cpuinit start_secondary(void *unused)
        per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
        x86_platform.nmi_init();
 
-       /*
-        * Wait until the cpu which brought this one up marked it
-        * online before enabling interrupts. If we don't do that then
-        * we can end up waking up the softirq thread before this cpu
-        * reached the active state, which makes the scheduler unhappy
-        * and schedule the softirq thread on the wrong cpu. This is
-        * only observable with forced threaded interrupts, but in
-        * theory it could also happen w/o them. It's just way harder
-        * to achieve.
-        */
-       while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
-               cpu_relax();
-
        /* enable local interrupts */
        local_irq_enable();
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2060c6e..6bf1fd3 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -640,8 +640,10 @@ void set_cpu_present(unsigned int cpu, bool present)
 
 void set_cpu_online(unsigned int cpu, bool online)
 {
-       if (online)
+       if (online) {
                cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
+               cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
+       }
        else
                cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b342f57..ef97881 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5381,7 +5381,6 @@ static int __cpuinit sched_cpu_active(struct notifier_block *nfb,
                                      unsigned long action, void *hcpu)
 {
        switch (action & ~CPU_TASKS_FROZEN) {
-       case CPU_ONLINE:
        case CPU_DOWN_FAILED:
                set_cpu_active((long)hcpu, true);
                return NOTIFY_OK;



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-16  6:24               ` Liu, Chuansheng
@ 2012-03-16  9:49                 ` Peter Zijlstra
  2012-03-19  0:58                   ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-16  9:49 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Fri, 2012-03-16 at 06:24 +0000, Liu, Chuansheng wrote:

> Based on your patch, I did a little modification, how do you think of that?

> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -640,8 +640,10 @@ void set_cpu_present(unsigned int cpu, bool present)
>  
>  void set_cpu_online(unsigned int cpu, bool online)
>  {
> -       if (online)
> +       if (online) {
>                 cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
> +               cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
> +       }
>         else
>                 cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
>  }

I'm thinking this lacks rationale. What was wrong? Why does this fix it?
If you can't answer that, stop tinkering and start thinking.


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-16  9:49                 ` Peter Zijlstra
@ 2012-03-19  0:58                   ` Liu, Chuansheng
  2012-03-19 10:03                     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-19  0:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Friday, March 16, 2012 5:49 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Fri, 2012-03-16 at 06:24 +0000, Liu, Chuansheng wrote:
> 
> > Based on your patch, I did a little modification, how do you think of that?
> 
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -640,8 +640,10 @@ void set_cpu_present(unsigned int cpu, bool
> > present)
> >
> >  void set_cpu_online(unsigned int cpu, bool online)  {
> > -       if (online)
> > +       if (online) {
> >                 cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
> > +               cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
> > +       }
> >         else
> >                 cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
> > }
> 
> I'm thinking this lacks rationale. What was wrong, Why does this fix it.
> If you can't answer that stop tinkering and start thinking.
Your patch moves the setting of the active bit to before the online setting, and that causes
a warning. I just moved the setting of the active bit to after the online setting; that fixes
the warning, and it also fixes the race between smp_call_function and CPU booting.

I have tested it by some stress tests.



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-19  0:58                   ` Liu, Chuansheng
@ 2012-03-19 10:03                     ` Peter Zijlstra
  2012-03-20  0:22                       ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-19 10:03 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Mon, 2012-03-19 at 00:58 +0000, Liu, Chuansheng wrote:
> Your patch advance the setting active bit before online setting, that will cause
> an warning error, 

WHY!?



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-19 10:03                     ` Peter Zijlstra
@ 2012-03-20  0:22                       ` Liu, Chuansheng
  2012-03-20 12:17                         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-20  0:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx, Liu, Chuansheng



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Monday, March 19, 2012 6:03 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Mon, 2012-03-19 at 00:58 +0000, Liu, Chuansheng wrote:
> > Your patch advance the setting active bit before online setting, that
> > will cause an warning error,
> 
> WHY!?
I have done stress tests based on your patches. The following warning is very easy to
reproduce; let me paste my result again. Thanks for taking some time to have a look.

I did a stress test that starts two different scripts concurrently:
1/ an onoff_line script like the one below:
while true
do
 echo 0 > /sys/devices/system/cpu/cpu1/online
 echo 1 > /sys/devices/system/cpu/cpu1/online
done
2/ a simple sysfs interface that triggers smp_call_function():
test_set()
{
  smp_call_function(...);
}

A script writes to this interface in a loop, triggering the call every 500 ms.


The result is:
1/ without any patch, the deadlock issue is very easy to reproduce;

2/ with your patch http://lkml.org/lkml/2011/12/15/255, the issue below is always hit, and the system hangs there.
I think it is because the booted CPU1 is set active too early, while the online bit is not yet set.
[  721.759736] cpu_down
[  721.822193] LCS test smp_call_function
[  721.864892] CPU 1 is now offline
[  721.868270] SMP alternatives: switching to UP code
[  721.886925] _cpu_up
[  721.892222] SMP alternatives: switching to SMP code
[  721.906420] Booting Node 0 Processor 1 APIC 0x1
[  721.921177] Initializing CPU#1
[  721.981898] ------------[ cut here ]------------
[  721.989553] WARNING: at /root/r3_ics/hardware/intel/linux-2.6/arch/x86/kernel/smp.c:118 native_smp_send_reschedule+0x50/0x60()
[  722.000923] Hardware name: Medfield
[  722.004401] Modules linked in: atomisp lm3554 mt9m114 mt9e013 videobuf_vmalloc videobuf_core mac80211 cfg80211 compat btwilink st_drv
[  722.016408] Pid: 18865, comm: workqueue_trust Not tainted 3.0.8-137166-g2639a16-dirty #1
[  722.024486] Call Trace:
[  722.026939]  [<c1252287>] warn_slowpath_common+0x77/0x130
[  722.032321]  [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[  722.038314]  [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[  722.044316]  [<c1252362>] warn_slowpath_null+0x22/0x30
[  722.049445]  [<c121df70>] native_smp_send_reschedule+0x50/0x60
[  722.055268]  [<c124bacf>] try_to_wake_up+0x17f/0x390
[  722.060225]  [<c124bd34>] wake_up_process+0x14/0x20
[  722.065091]  [<c1277107>] kthread_stop+0x37/0x100
[  722.069789]  [<c126f5e0>] destroy_worker+0x50/0x90
[  722.074573]  [<c18c1b4d>] trustee_thread+0x3e3/0x4bf
[  722.079524]  [<c1277410>] ? wake_up_bit+0x90/0x90
[  722.084224]  [<c18c176a>] ? wait_trustee_state+0x91/0x91
[  722.089520]  [<c1276fc4>] kthread+0x74/0x80
[  722.093694]  [<c1276f50>] ? __init_kthread_worker+0x30/0x30
[  722.099264]  [<c18c7cfa>] kernel_thread_helper+0x6/0x10
[  722.104474] ---[ end trace fa5bcc15ece677c6 ]---

3/ With my patch, the system has been running for 1 hour without hitting the issue.
I will keep the stress test running for a long time.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-20  0:22                       ` Liu, Chuansheng
@ 2012-03-20 12:17                         ` Peter Zijlstra
  2012-03-20 13:14                           ` Srivatsa S. Bhat
  2012-03-21  5:59                           ` Liu, Chuansheng
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-20 12:17 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Tue, 2012-03-20 at 00:22 +0000, Liu, Chuansheng wrote:
> Thanks to give some time to have a look.
> 
Does the below on top of current tip/master work?
---
 include/linux/cpuset.h |    6 +---
 kernel/cpuset.c        |   20 +++------------
 kernel/sched/core.c    |   62 +++++++++++++++++++++++++++++++++++------------
 3 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..e0ffaf0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -22,7 +22,7 @@ extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -144,10 +144,8 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
 	cpumask_copy(mask, cpu_possible_mask);
 }
 
-static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
+static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
 {
-	do_set_cpus_allowed(p, cpu_possible_mask);
-	return cpumask_any(cpu_active_mask);
 }
 
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a09ac2b..c9837b7 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2195,7 +2195,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	mutex_unlock(&callback_mutex);
 }
 
-int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
+void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	const struct cpuset *cs;
 	int cpu;
@@ -2219,22 +2219,10 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 	 * changes in tsk_cs()->cpus_allowed. Otherwise we can temporary
 	 * set any mask even if it is not right from task_cs() pov,
 	 * the pending set_cpus_allowed_ptr() will fix things.
+	 *
+	 * select_fallback_rq() will fix things ups and set cpu_possible_mask
+	 * if required.
 	 */
-
-	cpu = cpumask_any_and(&tsk->cpus_allowed, cpu_active_mask);
-	if (cpu >= nr_cpu_ids) {
-		/*
-		 * Either tsk->cpus_allowed is wrong (see above) or it
-		 * is actually empty. The latter case is only possible
-		 * if we are racing with remove_tasks_in_empty_cpuset().
-		 * Like above we can temporary set any mask and rely on
-		 * set_cpus_allowed_ptr() as synchronization point.
-		 */
-		do_set_cpus_allowed(tsk, cpu_possible_mask);
-		cpu = cpumask_any(cpu_active_mask);
-	}
-
-	return cpu;
 }
 
 void cpuset_init_current_mems_allowed(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e37c9af..04d3f7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1263,29 +1263,59 @@ EXPORT_SYMBOL_GPL(kick_process);
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
-	int dest_cpu;
 	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
+	enum { cpuset, possible, fail } state = cpuset;
+	int dest_cpu;
 
 	/* Look for allowed, online CPU in same node. */
-	for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
+	for_each_cpu_mask(dest_cpu, *nodemask) {
+		if (!cpu_online(dest_cpu))
+			continue;
+		if (!cpu_active(dest_cpu))
+			continue;
 		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
 			return dest_cpu;
+	}
 
-	/* Any allowed, online CPU? */
-	dest_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpu_active_mask);
-	if (dest_cpu < nr_cpu_ids)
-		return dest_cpu;
+	for (;;) {
+		/* Any allowed, online CPU? */
+		for_each_cpu_mask(dest_cpu, *tsk_cpus_allowed(p)) {
+			if (!cpu_online(dest_cpu))
+				continue;
+			if (!cpu_active(dest_cpu))
+				continue;
+			goto out;
+		}
 
-	/* No more Mr. Nice Guy. */
-	dest_cpu = cpuset_cpus_allowed_fallback(p);
-	/*
-	 * Don't tell them about moving exiting tasks or
-	 * kernel threads (both mm NULL), since they never
-	 * leave kernel.
-	 */
-	if (p->mm && printk_ratelimit()) {
-		printk_sched("process %d (%s) no longer affine to cpu%d\n",
-				task_pid_nr(p), p->comm, cpu);
+		switch (state) {
+		case cpuset:
+			/* No more Mr. Nice Guy. */
+			cpuset_cpus_allowed_fallback(p);
+			state = possible;
+			break;
+
+		case possible:
+			do_set_cpus_allowed(p, cpu_possible_mask);
+			state = fail;
+			break;
+
+		case fail:
+			BUG();
+			break;
+		}
+	}
+
+out:
+	if (state != cpuset) {
+		/*
+		 * Don't tell them about moving exiting tasks or
+		 * kernel threads (both mm NULL), since they never
+		 * leave kernel.
+		 */
+		if (p->mm && printk_ratelimit()) {
+			printk_sched("process %d (%s) no longer affine to cpu%d\n",
+					task_pid_nr(p), p->comm, cpu);
+		}
 	}
 
 	return dest_cpu;



* Re: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-20 12:17                         ` Peter Zijlstra
@ 2012-03-20 13:14                           ` Srivatsa S. Bhat
  2012-03-20 13:41                             ` Peter Zijlstra
  2012-03-21  5:59                           ` Liu, Chuansheng
  1 sibling, 1 reply; 29+ messages in thread
From: Srivatsa S. Bhat @ 2012-03-20 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liu, Chuansheng, linux-kernel, Yanmin Zhang, tglx,
	Rafael J. Wysocki, Linux PM mailing list


[ Adding a few more Cc's since this affects suspend/resume. ]

On 03/20/2012 05:47 PM, Peter Zijlstra wrote:

> On Tue, 2012-03-20 at 00:22 +0000, Liu, Chuansheng wrote:
>> Thanks to give some time to have a look.
>>
> Does the below on top of current tip/master work?


I don't think this patch would change anything; at least, it wouldn't get
rid of the warning that Liu reported. He is running his stress tests on a
machine which has only 2 CPUs, so effectively we are hotplugging only CPU1
(since CPU0 can't be taken offline, on Intel boxes).

Also, CPU1 is removed from cpu_active_mask during CPU_DOWN_PREPARE time itself,
and migrate_tasks() comes much later (during CPU_DYING). And in any case,
dest_cpu will never be CPU1, because it is the CPU that is going down. So it
*has* to be CPU0 anyway.

So, I don't think changes to select_fallback_rq() to make it more careful is
going to make any difference in the particular scenario that Liu is testing.

That said, even I don't know what the root cause of the warning is.. :-(

Regards,
Srivatsa S. Bhat

>>
> ---
>  include/linux/cpuset.h |    6 +---
>  kernel/cpuset.c        |   20 +++------------
>  kernel/sched/core.c    |   62 +++++++++++++++++++++++++++++++++++------------
>  3 files changed, 52 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index e9eaec5..e0ffaf0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -22,7 +22,7 @@ extern int cpuset_init(void);
>  extern void cpuset_init_smp(void);
>  extern void cpuset_update_active_cpus(void);
>  extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
> -extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
> +extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
>  extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>  #define cpuset_current_mems_allowed (current->mems_allowed)
>  void cpuset_init_current_mems_allowed(void);
> @@ -144,10 +144,8 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
>  	cpumask_copy(mask, cpu_possible_mask);
>  }
> 
> -static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> +static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
>  {
> -	do_set_cpus_allowed(p, cpu_possible_mask);
> -	return cpumask_any(cpu_active_mask);
>  }
> 
>  static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index a09ac2b..c9837b7 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2195,7 +2195,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  	mutex_unlock(&callback_mutex);
>  }
> 
> -int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> +void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
>  	const struct cpuset *cs;
>  	int cpu;
> @@ -2219,22 +2219,10 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  	 * changes in tsk_cs()->cpus_allowed. Otherwise we can temporary
>  	 * set any mask even if it is not right from task_cs() pov,
>  	 * the pending set_cpus_allowed_ptr() will fix things.
> +	 *
> +	 * select_fallback_rq() will fix things ups and set cpu_possible_mask
> +	 * if required.
>  	 */
> -
> -	cpu = cpumask_any_and(&tsk->cpus_allowed, cpu_active_mask);
> -	if (cpu >= nr_cpu_ids) {
> -		/*
> -		 * Either tsk->cpus_allowed is wrong (see above) or it
> -		 * is actually empty. The latter case is only possible
> -		 * if we are racing with remove_tasks_in_empty_cpuset().
> -		 * Like above we can temporary set any mask and rely on
> -		 * set_cpus_allowed_ptr() as synchronization point.
> -		 */
> -		do_set_cpus_allowed(tsk, cpu_possible_mask);
> -		cpu = cpumask_any(cpu_active_mask);
> -	}
> -
> -	return cpu;
>  }
> 
>  void cpuset_init_current_mems_allowed(void)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e37c9af..04d3f7f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1263,29 +1263,59 @@ EXPORT_SYMBOL_GPL(kick_process);
>   */
>  static int select_fallback_rq(int cpu, struct task_struct *p)
>  {
> -	int dest_cpu;
>  	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
> +	enum { cpuset, possible, fail } state = cpuset;
> +	int dest_cpu;
> 
>  	/* Look for allowed, online CPU in same node. */
> -	for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
> +	for_each_cpu_mask(dest_cpu, *nodemask) {
> +		if (!cpu_online(dest_cpu))
> +			continue;
> +		if (!cpu_active(dest_cpu))
> +			continue;
>  		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
>  			return dest_cpu;
> +	}
> 
> -	/* Any allowed, online CPU? */
> -	dest_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpu_active_mask);
> -	if (dest_cpu < nr_cpu_ids)
> -		return dest_cpu;
> +	for (;;) {
> +		/* Any allowed, online CPU? */
> +		for_each_cpu_mask(dest_cpu, *tsk_cpus_allowed(p)) {
> +			if (!cpu_online(dest_cpu))
> +				continue;
> +			if (!cpu_active(dest_cpu))
> +				continue;
> +			goto out;
> +		}
> 
> -	/* No more Mr. Nice Guy. */
> -	dest_cpu = cpuset_cpus_allowed_fallback(p);
> -	/*
> -	 * Don't tell them about moving exiting tasks or
> -	 * kernel threads (both mm NULL), since they never
> -	 * leave kernel.
> -	 */
> -	if (p->mm && printk_ratelimit()) {
> -		printk_sched("process %d (%s) no longer affine to cpu%d\n",
> -				task_pid_nr(p), p->comm, cpu);
> +		switch (state) {
> +		case cpuset:
> +			/* No more Mr. Nice Guy. */
> +			cpuset_cpus_allowed_fallback(p);
> +			state = possible;
> +			break;
> +
> +		case possible:
> +			do_set_cpus_allowed(p, cpu_possible_mask);
> +			state = fail;
> +			break;
> +
> +		case fail:
> +			BUG();
> +			break;
> +		}
> +	}
> +
> +out:
> +	if (state != cpuset) {
> +		/*
> +		 * Don't tell them about moving exiting tasks or
> +		 * kernel threads (both mm NULL), since they never
> +		 * leave kernel.
> +		 */
> +		if (p->mm && printk_ratelimit()) {
> +			printk_sched("process %d (%s) no longer affine to cpu%d\n",
> +					task_pid_nr(p), p->comm, cpu);
> +		}
>  	}
> 
>  	return dest_cpu;
> 




* Re: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-20 13:14                           ` Srivatsa S. Bhat
@ 2012-03-20 13:41                             ` Peter Zijlstra
  2012-03-20 15:02                               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-20 13:41 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Liu, Chuansheng, linux-kernel, Yanmin Zhang, tglx,
	Rafael J. Wysocki, Linux PM mailing list

On Tue, 2012-03-20 at 18:44 +0530, Srivatsa S. Bhat wrote:
> 
> 
> I don't think this patch would change anything, atleast it wouldn't get
> rid of the warning that Liu reported. Because, he is running his stress
> tests on a machine which has only 2 CPUs. So effectively, we are hotplugging
> only CPU1 (since CPU0 can't be taken offline, on Intel boxes).
> 
> Also, CPU1 is removed from cpu_active_mask during CPU_DOWN_PREPARE time itself,
> and migrate_tasks() comes much later (during CPU_DYING). And in any case,
> dest_cpu will never be CPU1, because it is the CPU that is going down. So it
> *has* to be CPU0 anyway.
> 
> So, I don't think changes to select_fallback_rq() to make it more careful is
> going to make any difference in the particular scenario that Liu is testing.
> 
> That said, even I don't know what the root cause of the warning is.. :-(

It's a race in cpu-up: we set active before online. When we do a wakeup,
select_task_rq() will see !cpu_online(), and we then call
select_fallback_rq() to compute a new cpu. select_fallback_rq() computes
the new cpu against cpu_active (which is set) and can thus return cpu 1,
even though it is still offline.

So we queue the task on cpu 1 and send a reschedule ipi, at which point
we'll get the reported warning.

My change modifies select_fallback_rq() to require online && active.


* Re: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-20 13:41                             ` Peter Zijlstra
@ 2012-03-20 15:02                               ` Srivatsa S. Bhat
  0 siblings, 0 replies; 29+ messages in thread
From: Srivatsa S. Bhat @ 2012-03-20 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liu, Chuansheng, linux-kernel, Yanmin Zhang, tglx,
	Rafael J. Wysocki, Linux PM mailing list, Srivatsa S. Bhat

On 03/20/2012 07:11 PM, Peter Zijlstra wrote:

> On Tue, 2012-03-20 at 18:44 +0530, Srivatsa S. Bhat wrote:
>>
>>
>> I don't think this patch would change anything, atleast it wouldn't get
>> rid of the warning that Liu reported. Because, he is running his stress
>> tests on a machine which has only 2 CPUs. So effectively, we are hotplugging
>> only CPU1 (since CPU0 can't be taken offline, on Intel boxes).
>>
>> Also, CPU1 is removed from cpu_active_mask during CPU_DOWN_PREPARE time itself,
>> and migrate_tasks() comes much later (during CPU_DYING). And in any case,
>> dest_cpu will never be CPU1, because it is the CPU that is going down. So it
>> *has* to be CPU0 anyway.
>>
>> So, I don't think changes to select_fallback_rq() to make it more careful is
>> going to make any difference in the particular scenario that Liu is testing.
>>
>> That said, even I don't know what the root cause of the warning is.. :-(
> 
> Its a race in cpu-up, we set active before online, when we do a wakeup
> select_task_rq() will see !cpu_online(), we then call
> select_fallback_rq() to compute a new cpu, select_fallback_rq() computes
> a new cpu against cpu_active (which is set) and can thus return cpu 1,
> even though it is still offline.
> 
> So we queue the task on cpu 1 and send a reschedule ipi, at which point
> we'll get the reported warning.
> 
> My change modifies select_fallback_rq() to require online && active.
> 


Ok, that makes sense.. Thanks a lot for the explanation!

Regards,
Srivatsa S. Bhat



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-20 12:17                         ` Peter Zijlstra
  2012-03-20 13:14                           ` Srivatsa S. Bhat
@ 2012-03-21  5:59                           ` Liu, Chuansheng
  2012-03-21  9:12                             ` Peter Zijlstra
  2012-03-21 12:25                             ` Peter Zijlstra
  1 sibling, 2 replies; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-21  5:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx

> Does the below on top of current tip/master work?
Still the same warning problem, and the system hangs there.
Could you consider the suggestion of setting the cpu active right after setting it online?

Pasting both of your patches as I am testing them:
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 11b9630..56eb3b4
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -285,19 +285,6 @@ notrace static void __cpuinit start_secondary(void *unused)
 	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
 	x86_platform.nmi_init();
 
-	/*
-	 * Wait until the cpu which brought this one up marked it
-	 * online before enabling interrupts. If we don't do that then
-	 * we can end up waking up the softirq thread before this cpu
-	 * reached the active state, which makes the scheduler unhappy
-	 * and schedule the softirq thread on the wrong cpu. This is
-	 * only observable with forced threaded interrupts, but in
-	 * theory it could also happen w/o them. It's just way harder
-	 * to achieve.
-	 */
-	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
-		cpu_relax();
-
 	/* enable local interrupts */
 	local_irq_enable();
 

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..f3190db
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -22,7 +22,7 @@ extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -144,7 +144,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
 	cpumask_copy(mask, cpu_possible_mask);
 }
 
-static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
+static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
 {
 	do_set_cpus_allowed(p, cpu_possible_mask);
 	return cpumask_any(cpu_active_mask);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
old mode 100644
new mode 100755
index 9c9b754..76ba3b8
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2182,7 +2182,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	mutex_unlock(&callback_mutex);
 }
 
-int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
+void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	const struct cpuset *cs;
 	int cpu;
@@ -2206,22 +2206,12 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 	 * changes in tsk_cs()->cpus_allowed. Otherwise we can temporary
 	 * set any mask even if it is not right from task_cs() pov,
 	 * the pending set_cpus_allowed_ptr() will fix things.
-	 */
-
-	cpu = cpumask_any_and(&tsk->cpus_allowed, cpu_active_mask);
-	if (cpu >= nr_cpu_ids) {
-		/*
-		 * Either tsk->cpus_allowed is wrong (see above) or it
-		 * is actually empty. The latter case is only possible
-		 * if we are racing with remove_tasks_in_empty_cpuset().
-		 * Like above we can temporary set any mask and rely on
-		 * set_cpus_allowed_ptr() as synchronization point.
+	 *
+	 * select_fallback_rq() will fix things ups and set cpu_possible_mask
+	 * if required.
 		 */
-		do_set_cpus_allowed(tsk, cpu_possible_mask);
-		cpu = cpumask_any(cpu_active_mask);
-	}
 
-	return cpu;
+
 }
 
 void cpuset_init_current_mems_allowed(void)
diff --git a/kernel/sched.c b/kernel/sched.c
index d1c6969..bb3377d
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2374,29 +2374,65 @@ EXPORT_SYMBOL_GPL(kick_process);
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
-	int dest_cpu;
 	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
+	enum { cpuset, possible, fail } state = cpuset;
+	int dest_cpu;
 
 	/* Look for allowed, online CPU in same node. */
-	for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
-		if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
+	for_each_cpu_mask(dest_cpu, *nodemask) {
+		if (!cpu_online(dest_cpu))
+			continue;
+		if (!cpu_active(dest_cpu))
+			continue;
+		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
 			return dest_cpu;
+	}
+
 
 	/* Any allowed, online CPU? */
 	dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
 	if (dest_cpu < nr_cpu_ids)
 		return dest_cpu;
+	for (;;) {
+		/* Any allowed, online CPU? */
+		for_each_cpu_mask(dest_cpu, *tsk_cpus_allowed(p)) {
+			if (!cpu_online(dest_cpu))
+				continue;
+			if (!cpu_active(dest_cpu))
+				continue;
+			goto out;
+		}
+		
+
+		switch (state) {
+		case cpuset:
+			/* No more Mr. Nice Guy. */
+			cpuset_cpus_allowed_fallback(p);
+			state = possible;
+			break;
+	
+		case possible:
+			do_set_cpus_allowed(p, cpu_possible_mask);
+			state = fail;
+			break;
+	
+		case fail:
+			BUG();
+			break;
+		}
+	}
+out:
+	if (state != cpuset) {
+			/*
+			 * Don't tell them about moving exiting tasks or
+			 * kernel threads (both mm NULL), since they never
+			 * leave kernel.
+			 */
+			if (p->mm && printk_ratelimit()) {
+				printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
+						task_pid_nr(p), p->comm, cpu);
+			}
 
-	/* No more Mr. Nice Guy. */
-	dest_cpu = cpuset_cpus_allowed_fallback(p);
-	/*
-	 * Don't tell them about moving exiting tasks or
-	 * kernel threads (both mm NULL), since they never
-	 * leave kernel.
-	 */
-	if (p->mm && printk_ratelimit()) {
-		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
-				task_pid_nr(p), p->comm, cpu);
 	}
 
 	return dest_cpu;
@@ -6489,7 +6525,7 @@ static int __cpuinit sched_cpu_active(struct notifier_block *nfb,
 				      unsigned long action, void *hcpu)
 {
 	switch (action & ~CPU_TASKS_FROZEN) {
-	case CPU_ONLINE:
+	case CPU_STARTING:
 	case CPU_DOWN_FAILED:
 		set_cpu_active((long)hcpu, true);
 		return NOTIFY_OK;


> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Tuesday, March 20, 2012 8:17 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Yanmin Zhang; tglx@linutronix.de
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
> 
> On Tue, 2012-03-20 at 00:22 +0000, Liu, Chuansheng wrote:
> > Thanks to give some time to have a look.
> >
> Does the below on top of current tip/master work?
> >
> ---
>  include/linux/cpuset.h |    6 +---
>  kernel/cpuset.c        |   20 +++------------
>  kernel/sched/core.c    |   62
> +++++++++++++++++++++++++++++++++++------------
>  3 files changed, 52 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index
> e9eaec5..e0ffaf0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -22,7 +22,7 @@ extern int cpuset_init(void);  extern void
> cpuset_init_smp(void);  extern void cpuset_update_active_cpus(void);
> extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
> -extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
> +extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
>  extern nodemask_t cpuset_mems_allowed(struct task_struct *p);  #define
> cpuset_current_mems_allowed (current->mems_allowed)  void
> cpuset_init_current_mems_allowed(void);
> @@ -144,10 +144,8 @@ static inline void cpuset_cpus_allowed(struct
> task_struct *p,
>  	cpumask_copy(mask, cpu_possible_mask);  }
> 
> -static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> +static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
>  {
> -	do_set_cpus_allowed(p, cpu_possible_mask);
> -	return cpumask_any(cpu_active_mask);
>  }
> 
>  static inline nodemask_t cpuset_mems_allowed(struct task_struct *p) diff
> --git a/kernel/cpuset.c b/kernel/cpuset.c index a09ac2b..c9837b7 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2195,7 +2195,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk,
> struct cpumask *pmask)
>  	mutex_unlock(&callback_mutex);
>  }
> 
> -int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> +void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
>  	const struct cpuset *cs;
>  	int cpu;
> @@ -2219,22 +2219,10 @@ int cpuset_cpus_allowed_fallback(struct
> task_struct *tsk)
>  	 * changes in tsk_cs()->cpus_allowed. Otherwise we can temporary
>  	 * set any mask even if it is not right from task_cs() pov,
>  	 * the pending set_cpus_allowed_ptr() will fix things.
> +	 *
> +	 * select_fallback_rq() will fix things ups and set cpu_possible_mask
> +	 * if required.
>  	 */
> -
> -	cpu = cpumask_any_and(&tsk->cpus_allowed, cpu_active_mask);
> -	if (cpu >= nr_cpu_ids) {
> -		/*
> -		 * Either tsk->cpus_allowed is wrong (see above) or it
> -		 * is actually empty. The latter case is only possible
> -		 * if we are racing with remove_tasks_in_empty_cpuset().
> -		 * Like above we can temporary set any mask and rely on
> -		 * set_cpus_allowed_ptr() as synchronization point.
> -		 */
> -		do_set_cpus_allowed(tsk, cpu_possible_mask);
> -		cpu = cpumask_any(cpu_active_mask);
> -	}
> -
> -	return cpu;
>  }
> 
>  void cpuset_init_current_mems_allowed(void)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e37c9af..04d3f7f
> 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1263,29 +1263,59 @@ EXPORT_SYMBOL_GPL(kick_process);
>   */
>  static int select_fallback_rq(int cpu, struct task_struct *p)  {
> -	int dest_cpu;
>  	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
> +	enum { cpuset, possible, fail } state = cpuset;
> +	int dest_cpu;
> 
>  	/* Look for allowed, online CPU in same node. */
> -	for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
> +	for_each_cpu_mask(dest_cpu, *nodemask) {
> +		if (!cpu_online(dest_cpu))
> +			continue;
> +		if (!cpu_active(dest_cpu))
> +			continue;
>  		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
>  			return dest_cpu;
> +	}
> 
> -	/* Any allowed, online CPU? */
> -	dest_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpu_active_mask);
> -	if (dest_cpu < nr_cpu_ids)
> -		return dest_cpu;
> +	for (;;) {
> +		/* Any allowed, online CPU? */
> +		for_each_cpu_mask(dest_cpu, *tsk_cpus_allowed(p)) {
> +			if (!cpu_online(dest_cpu))
> +				continue;
> +			if (!cpu_active(dest_cpu))
> +				continue;
> +			goto out;
> +		}
> 
> -	/* No more Mr. Nice Guy. */
> -	dest_cpu = cpuset_cpus_allowed_fallback(p);
> -	/*
> -	 * Don't tell them about moving exiting tasks or
> -	 * kernel threads (both mm NULL), since they never
> -	 * leave kernel.
> -	 */
> -	if (p->mm && printk_ratelimit()) {
> -		printk_sched("process %d (%s) no longer affine to cpu%d\n",
> -				task_pid_nr(p), p->comm, cpu);
> +		switch (state) {
> +		case cpuset:
> +			/* No more Mr. Nice Guy. */
> +			cpuset_cpus_allowed_fallback(p);
> +			state = possible;
> +			break;
> +
> +		case possible:
> +			do_set_cpus_allowed(p, cpu_possible_mask);
> +			state = fail;
> +			break;
> +
> +		case fail:
> +			BUG();
> +			break;
> +		}
> +	}
> +
> +out:
> +	if (state != cpuset) {
> +		/*
> +		 * Don't tell them about moving exiting tasks or
> +		 * kernel threads (both mm NULL), since they never
> +		 * leave kernel.
> +		 */
> +		if (p->mm && printk_ratelimit()) {
> +			printk_sched("process %d (%s) no longer affine to cpu%d\n",
> +					task_pid_nr(p), p->comm, cpu);
> +		}
>  	}
> 
>  	return dest_cpu;



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-21  5:59                           ` Liu, Chuansheng
@ 2012-03-21  9:12                             ` Peter Zijlstra
  2012-03-21 12:25                             ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-21  9:12 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx

On Wed, 2012-03-21 at 05:59 +0000, Liu, Chuansheng wrote:
> Could you consider the advice that setting the active just after set
> online?

Not until I understand why the current stuff explodes; also see
fd8a7de17: setting active after online can re-introduce that problem.

Whichever way, I need to clean up some archs, since there's all kinds of
broken in there.


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-21  5:59                           ` Liu, Chuansheng
  2012-03-21  9:12                             ` Peter Zijlstra
@ 2012-03-21 12:25                             ` Peter Zijlstra
  2012-03-22  0:59                               ` Liu, Chuansheng
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-21 12:25 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx


Can you send me your exact kernel patch (the sysfs knob you use to
trigger this) and test scripts?


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-21 12:25                             ` Peter Zijlstra
@ 2012-03-22  0:59                               ` Liu, Chuansheng
  2012-03-23 10:25                                 ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-22  0:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx

Below is the complete test patch. Just run simpl_loop_onoffline.sh and simpl_loop_smpcalling.sh concurrently
to reproduce the race.
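A minimal launcher for the pair could look like this (the `run_stress` helper is illustrative; the script names are the ones the patch in this message adds):

```shell
# Start both stress loops in parallel and wait for them; the race
# typically shows up as the reschedule-IPI warning or a hang.
run_stress() {
    sh "$1" &
    pid1=$!
    sh "$2" &
    pid2=$!
    wait "$pid1" "$pid2"
}
# e.g.: run_stress scripts/simpl_loop_onoffline.sh scripts/simpl_loop_smpcalling.sh
```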

From: liu chuansheng <chuansheng.liu@intel.com>
Subject: [PATCH] Test scripts to verify the race between CPU booting and smp_call

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
---
 drivers/misc/Makefile            |    1 +
 drivers/misc/smp_call_test.c     |   61 ++++++++++++++++++++++++++++++++++++++
 scripts/simpl_loop_onoffline.sh  |    5 +++
 scripts/simpl_loop_smpcalling.sh |    5 +++
 4 files changed, 72 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/smp_call_test.c
 create mode 100644 scripts/simpl_loop_onoffline.sh
 create mode 100644 scripts/simpl_loop_smpcalling.sh

diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 571ffd1..003f883 100755
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_MID_I2S_DEV)     += mid_i2s_dev/
 obj-$(CONFIG_EMMC_IPANIC)      += emmc_ipanic.o
 obj-$(CONFIG_SCU_LOGGING)      += intel_fabric_logging.o
 obj-$(CONFIG_UUID)     += uuid.o
+obj-y                          += smp_call_test.o
diff --git a/drivers/misc/smp_call_test.c b/drivers/misc/smp_call_test.c
new file mode 100644
index 0000000..95a14fc
--- /dev/null
+++ b/drivers/misc/smp_call_test.c
@@ -0,0 +1,61 @@
+/*
+ * drivers/misc/smp_call_test.c
+ *
+ * Author: chuansheng.liu@intel.com
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/device.h>
+#include <linux/types.h>
+#include <linux/delay.h>
+#include <linux/sched.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/proc_fs.h>
+
+static int smp_call_dbg_get(void *data, u64 *val)
+{
+       return 0;
+}
+
+static void smp_call_func(void *info)
+{
+       return;
+}
+
+static int smp_call_dbg_set(void *data, u64 val)
+{
+       smp_call_function(smp_call_func, NULL, 1);
+       printk(KERN_ERR "test smp_call_function\n");
+       return 0;
+}
+
+
+DEFINE_SIMPLE_ATTRIBUTE(smp_call_dbg_fops, smp_call_dbg_get, smp_call_dbg_set, "%llu\n");
+
+
+int __init smp_call_test_init(void)
+{
+       debugfs_create_file("smp_call_test", 0644, NULL, NULL, &smp_call_dbg_fops);
+       return 0;
+}
+
+module_init(smp_call_test_init);
+
diff --git a/scripts/simpl_loop_onoffline.sh b/scripts/simpl_loop_onoffline.sh
new file mode 100644
index 0000000..c2c640d
--- /dev/null
+++ b/scripts/simpl_loop_onoffline.sh
@@ -0,0 +1,5 @@
+while true
+do
+ echo 0 > /sys/devices/system/cpu/cpu1/online
+ echo 1 > /sys/devices/system/cpu/cpu1/online
+done
diff --git a/scripts/simpl_loop_smpcalling.sh b/scripts/simpl_loop_smpcalling.sh
new file mode 100644
index 0000000..37cef74
--- /dev/null
+++ b/scripts/simpl_loop_smpcalling.sh
@@ -0,0 +1,5 @@
+while true
+do
+echo 1 > /sys/kernel/debug/smp_call_test
+usleep 500000
+done
-- 
1.7.0.4



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-22  0:59                               ` Liu, Chuansheng
@ 2012-03-23 10:25                                 ` Peter Zijlstra
  2012-03-23 11:32                                   ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-23 10:25 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx, Srivatsa S. Bhat

On Thu, 2012-03-22 at 00:59 +0000, Liu, Chuansheng wrote:
> Below is the complete test patch. Just run the simpl_loop_onoffline.sh
> and simpl_loop_smpcalling.sh concurrently
> to reproduce the race.
> 
What I did was:

for i in {2..23} ; do echo 0 > /sys/devices/system/cpu/cpu${i}/online ; done
while usleep 50000 ; do echo 1 > /debug/smp_call_test ; done &
while :; do echo $i > /sys/devices/system/cpu/cpu1/online ; let i^=1 ; done

and this has been running for well over 30 minutes.. is there anything
else one needs to do?




* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-23 10:25                                 ` Peter Zijlstra
@ 2012-03-23 11:32                                   ` Liu, Chuansheng
  2012-03-23 12:02                                     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-23 11:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx, Srivatsa S. Bhat

In fact, I started two scripts running:
1/ One script:
echo 0 > /sys/devices/system/cpuX/online
echo 1 > /sys/devices/system/cpuX/online
re-running the above commands in a loop.

2/ Another script:
echo 1 > /debug/smp_call_test
usleep 50000
re-running the above commands in a loop.

This race can easily be reproduced within several minutes.
To simplify your test to match mine (just two CPUs), you can set all the
other non-boot CPUs offline first and leave just one non-boot CPU.




* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-23 11:32                                   ` Liu, Chuansheng
@ 2012-03-23 12:02                                     ` Peter Zijlstra
  2012-03-23 12:06                                       ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-23 12:02 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx, Srivatsa S. Bhat

On Fri, 2012-03-23 at 11:32 +0000, Liu, Chuansheng wrote:
> In fact, I started two scripts running:
> 1/ One script:
> echo 0 > /sys/devices/system/cpuX/online
> echo 1 > /sys/devices/system/cpuX/online
> Rerunning the above commands in loop
> 
> 2/Another script:
> echo 1 > /debug/smp_call_test
> usleep 50000
> Rerunning the above command in loop
> 
> This race issue can be easy to be reproduced in several minutes;
> For simplify your test as mine(just two CPUs), you can set other non-booting CPUs as offline
> at first and just leave one non-booting CPU.

So this is exactly what I did and it ran for 30+ minutes without fail. I
found I forgot to log the serial output so I just re-ran this to make
sure. 10+ minutes and not a single WARN in the console output.

If I pop my change to select_fallback_rq() I can indeed trigger this:

------------[ cut here ]------------
WARNING: at /usr/src/linux-2.6/arch/x86/kernel/smp.c:120
native_smp_send_reschedule+0x5b/0x60()
Hardware name: X8DTN
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 1542, comm: abrtd Not tainted 3.3.0-01725-gd6eb054-dirty #63
Call Trace:
 <IRQ>  [<ffffffff810775df>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8107763a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8105f79b>] native_smp_send_reschedule+0x5b/0x60
 [<ffffffff810aa67a>] try_to_wake_up+0x1fa/0x2c0
 [<ffffffff810acaec>] ? sched_slice.isra.38+0x5c/0x90
 [<ffffffff810aa795>] wake_up_process+0x15/0x20
 [<ffffffff81085c6e>] process_timeout+0xe/0x10
 [<ffffffff81086cb3>] run_timer_softirq+0x143/0x460
 [<ffffffff81384a94>] ? timerqueue_add+0x74/0xc0
 [<ffffffff81085c60>] ? usleep_range+0x50/0x50
 [<ffffffff8107e81d>] __do_softirq+0xbd/0x290
 [<ffffffff810c5e64>] ? clockevents_program_event+0x74/0x100
 [<ffffffff810c72d4>] ? tick_program_event+0x24/0x30
 [<ffffffff8194ba4c>] call_softirq+0x1c/0x30
 [<ffffffff810433d5>] do_softirq+0x55/0x90
 [<ffffffff8107ed2e>] irq_exit+0x9e/0xe0
 [<ffffffff8194c07e>] smp_apic_timer_interrupt+0x6e/0x99
 [<ffffffff8194b107>] apic_timer_interrupt+0x67/0x70
 <EOI> 
---[ end trace d2b2cbf78c1ddd2e ]---


But let me re-run with the select_fallback_rq() change and let it run
for several hours while I go play outside.. 


* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-23 12:02                                     ` Peter Zijlstra
@ 2012-03-23 12:06                                       ` Liu, Chuansheng
  2012-03-23 12:13                                         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-23 12:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx, Srivatsa S. Bhat

Sorry, I had not looked at your scripts in detail.

If so, did I get something wrong when merging your change to
select_fallback_rq()? I am using 3.0.8.

I indeed see the warning even with the select_fallback_rq() change.




* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-23 12:06                                       ` Liu, Chuansheng
@ 2012-03-23 12:13                                         ` Peter Zijlstra
  2012-03-23 12:21                                           ` Liu, Chuansheng
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-03-23 12:13 UTC (permalink / raw)
  To: Liu, Chuansheng; +Cc: linux-kernel, Yanmin Zhang, tglx, Srivatsa S. Bhat

On Fri, 2012-03-23 at 12:06 +0000, Liu, Chuansheng wrote:
> If so, did I get something wrong when merging your change to select_fallback_rq()?
> I am using 3.0.8.
> 
> I indeed see the warning even with the select_fallback_rq() change.

Please use Linus' current tree or tip/master with the below patch. I've
no idea what 3.0.8 looks like, and the important thing is to make sure
Linus' tree (which will become 3.4) works correctly.

After that we can prod at -stable muck.

I've included my patch to select_fallback_rq() again.

---
Subject: sched: Fix select_fallback_rq vs cpu_active/cpu_online
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Mar 20 15:57:01 CET 2012

Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
supposed to finally sort the cpu_active mess, instead uncovered more.

Since CPU_STARTING is run before setting the cpu online, there's a
(small) window where the cpu has active,!online.

If during this time there's a wakeup of a task that used to reside on
that cpu select_task_rq() will use select_fallback_rq() to compute an
alternative cpu to run on since we find !online.

select_fallback_rq() however will compute the new cpu against
cpu_active, this means that it can return the same cpu it started out
with, the !online one, since that cpu is in fact marked active.

This results in us trying to schedule a task on an offline cpu and
triggering a WARN in the IPI code.

The solution proposed by Chuansheng Lui of setting cpu_active in
set_cpu_online() is buggy: firstly, not all archs actually use
set_cpu_online(); secondly, not all archs call set_cpu_online() with
IRQs disabled. This means we would introduce either the same race or
the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
wrong CPU") -- albeit much narrower.

[ By setting online first and active later we have a window of
  online,!active, fresh and bound kthreads have task_cpu() of 0 and
  since cpu0 isn't in tsk_cpus_allowed() we end up in
  select_fallback_rq() which excludes !active, resulting in a reset
  of ->cpus_allowed and the thread running all over the place. ]

The solution is to re-work select_fallback_rq() to require active
_and_ online. This makes the active,!online case work as expected,
OTOH archs running CPU_STARTING after setting online are now
vulnerable to the issue from fd8a7de17 -- these are alpha and
blackfin.

Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Reported-by: Chuansheng Lui <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
---
 include/linux/cpuset.h |    6 +---
 kernel/cpuset.c        |   20 +++------------
 kernel/sched/core.c    |   64 +++++++++++++++++++++++++++++++++++--------------
 3 files changed, 53 insertions(+), 37 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -22,7 +22,7 @@ extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -144,10 +144,8 @@ static inline void cpuset_cpus_allowed(s
 	cpumask_copy(mask, cpu_possible_mask);
 }
 
-static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
+static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
 {
-	do_set_cpus_allowed(p, cpu_possible_mask);
-	return cpumask_any(cpu_active_mask);
 }
 
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2195,7 +2195,7 @@ void cpuset_cpus_allowed(struct task_str
 	mutex_unlock(&callback_mutex);
 }
 
-int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
+void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	const struct cpuset *cs;
 	int cpu;
@@ -2219,22 +2219,10 @@ int cpuset_cpus_allowed_fallback(struct
 	 * changes in tsk_cs()->cpus_allowed. Otherwise we can temporary
 	 * set any mask even if it is not right from task_cs() pov,
 	 * the pending set_cpus_allowed_ptr() will fix things.
+	 *
+	 * select_fallback_rq() will fix things up and set cpu_possible_mask
+	 * if required.
 	 */
-
-	cpu = cpumask_any_and(&tsk->cpus_allowed, cpu_active_mask);
-	if (cpu >= nr_cpu_ids) {
-		/*
-		 * Either tsk->cpus_allowed is wrong (see above) or it
-		 * is actually empty. The latter case is only possible
-		 * if we are racing with remove_tasks_in_empty_cpuset().
-		 * Like above we can temporary set any mask and rely on
-		 * set_cpus_allowed_ptr() as synchronization point.
-		 */
-		do_set_cpus_allowed(tsk, cpu_possible_mask);
-		cpu = cpumask_any(cpu_active_mask);
-	}
-
-	return cpu;
 }
 
 void cpuset_init_current_mems_allowed(void)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1263,29 +1263,59 @@ EXPORT_SYMBOL_GPL(kick_process);
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
-	int dest_cpu;
 	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
+	enum { cpuset, possible, fail } state = cpuset;
+	int dest_cpu;
 
 	/* Look for allowed, online CPU in same node. */
-	for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
+	for_each_cpu_mask(dest_cpu, *nodemask) {
+		if (!cpu_online(dest_cpu))
+			continue;
+		if (!cpu_active(dest_cpu))
+			continue;
 		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
 			return dest_cpu;
+	}
+
+	for (;;) {
+		/* Any allowed, online CPU? */
+		for_each_cpu_mask(dest_cpu, *tsk_cpus_allowed(p)) {
+			if (!cpu_online(dest_cpu))
+				continue;
+			if (!cpu_active(dest_cpu))
+				continue;
+			goto out;
+		}
+
+		switch (state) {
+		case cpuset:
+			/* No more Mr. Nice Guy. */
+			cpuset_cpus_allowed_fallback(p);
+			state = possible;
+			break;
+
+		case possible:
+			do_set_cpus_allowed(p, cpu_possible_mask);
+			state = fail;
+			break;
+
+		case fail:
+			BUG();
+			break;
+		}
+	}
 
-	/* Any allowed, online CPU? */
-	dest_cpu = cpumask_any_and(tsk_cpus_allowed(p), cpu_active_mask);
-	if (dest_cpu < nr_cpu_ids)
-		return dest_cpu;
-
-	/* No more Mr. Nice Guy. */
-	dest_cpu = cpuset_cpus_allowed_fallback(p);
-	/*
-	 * Don't tell them about moving exiting tasks or
-	 * kernel threads (both mm NULL), since they never
-	 * leave kernel.
-	 */
-	if (p->mm && printk_ratelimit()) {
-		printk_sched("process %d (%s) no longer affine to cpu%d\n",
-				task_pid_nr(p), p->comm, cpu);
+out:
+	if (state != cpuset) {
+		/*
+		 * Don't tell them about moving exiting tasks or
+		 * kernel threads (both mm NULL), since they never
+		 * leave kernel.
+		 */
+		if (p->mm && printk_ratelimit()) {
+			printk_sched("process %d (%s) no longer affine to cpu%d\n",
+					task_pid_nr(p), p->comm, cpu);
+		}
 	}
 
 	return dest_cpu;



* RE: [PATCH] Fix the race between smp_call_function and CPU booting
  2012-03-23 12:13                                         ` Peter Zijlstra
@ 2012-03-23 12:21                                           ` Liu, Chuansheng
  0 siblings, 0 replies; 29+ messages in thread
From: Liu, Chuansheng @ 2012-03-23 12:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Yanmin Zhang, tglx, Srivatsa S. Bhat

OK, waiting for your test result.
And if possible, please correct my name: Chuansheng Lui ==> Chuansheng Liu.
Thanks.




end of thread, other threads:[~2012-03-23 12:21 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-12  9:27 [PATCH] Fix the race between smp_call_function and CPU booting Liu, Chuansheng
2012-03-12  9:32 ` Peter Zijlstra
2012-03-13  6:43   ` Liu, Chuansheng
2012-03-12  9:58 ` Peter Zijlstra
2012-03-13  6:46   ` Liu, Chuansheng
2012-03-13 15:57     ` Peter Zijlstra
2012-03-14  6:27       ` Liu, Chuansheng
2012-03-14  9:43         ` Peter Zijlstra
2012-03-15  0:11           ` Liu, Chuansheng
2012-03-15 10:46             ` Peter Zijlstra
2012-03-16  6:24               ` Liu, Chuansheng
2012-03-16  9:49                 ` Peter Zijlstra
2012-03-19  0:58                   ` Liu, Chuansheng
2012-03-19 10:03                     ` Peter Zijlstra
2012-03-20  0:22                       ` Liu, Chuansheng
2012-03-20 12:17                         ` Peter Zijlstra
2012-03-20 13:14                           ` Srivatsa S. Bhat
2012-03-20 13:41                             ` Peter Zijlstra
2012-03-20 15:02                               ` Srivatsa S. Bhat
2012-03-21  5:59                           ` Liu, Chuansheng
2012-03-21  9:12                             ` Peter Zijlstra
2012-03-21 12:25                             ` Peter Zijlstra
2012-03-22  0:59                               ` Liu, Chuansheng
2012-03-23 10:25                                 ` Peter Zijlstra
2012-03-23 11:32                                   ` Liu, Chuansheng
2012-03-23 12:02                                     ` Peter Zijlstra
2012-03-23 12:06                                       ` Liu, Chuansheng
2012-03-23 12:13                                         ` Peter Zijlstra
2012-03-23 12:21                                           ` Liu, Chuansheng
