linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* kernel BUG at kernel/smpboot.c:134!
@ 2013-04-05 21:43 Dave Hansen
  2013-04-06  7:12 ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2013-04-05 21:43 UTC (permalink / raw)
  To: Srivatsa S. Bhat, linux-kernel, Thomas Gleixner

Hey Thomas,

I seem to be running in to smpboot_thread_fn()'s

	BUG_ON(td->cpu != smp_processor_id());

pretty regularly, both at boot and if I boot with maxcpus=x and then
online the CPUs from sysfs after boot.  It's a 160-logical-cpu system,
so it's quite a beast.  I _seem_ to be hitting it more often at higher
cpu counts, but it doesn't trigger on bringing up a particular CPU as
far as I can tell.

This is on a pull of mainline from today, e0a77f263.  Any ideas?

> [  790.223270] ------------[ cut here ]------------
> [  790.223966] kernel BUG at kernel/smpboot.c:134!
> [  790.224739] invalid opcode: 0000 [#1] SMP 
> [  790.225671] Modules linked in:
> [  790.226428] CPU 81 
> [  790.226909] Pid: 3909, comm: migration/135 Tainted: G        W    3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [  790.228775] RIP: 0010:[<ffffffff8110bee8>]  [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
> [  790.230205] RSP: 0018:ffff88bfef9c1e08  EFLAGS: 00010202
> [  790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
> [  790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
> [  790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
> [  790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
> [  790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
> [  790.234004] FS:  0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
> [  790.234918] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
> [  790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
> [  790.238321] Stack:
> [  790.238882]  ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
> [  790.245415]  ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
> [  790.250755]  0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
> [  790.253365] Call Trace:
> [  790.254121]  [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
> [  790.255428]  [<ffffffff810ff1df>] kthread+0xef/0x100
> [  790.256071]  [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
> [  790.256697]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [  790.257325]  [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
> [  790.258233]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [  790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48 
> [  790.276178] RIP  [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
> [  790.276735]  RSP <ffff88bfef9c1e08>
> [  790.278348] ---[ end trace 84baa2bee1434240 ]---



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-05 21:43 kernel BUG at kernel/smpboot.c:134! Dave Hansen
@ 2013-04-06  7:12 ` Srivatsa S. Bhat
  2013-04-06  8:31   ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-06  7:12 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, Thomas Gleixner, Dave Jones, dhillf

Hi Dave,

On 04/06/2013 03:13 AM, Dave Hansen wrote:
> Hey Thomas,
> 
> I seem to be running in to smpboot_thread_fn()'s
> 
> 	BUG_ON(td->cpu != smp_processor_id());
> 
> pretty regularly, both at boot and if I boot with maxcpus=x and then
> online the CPUs from sysfs after boot.  It's a 160-logical-cpu system,
> so it's quite a beast.  I _seem_ to be hitting it more often at higher
> cpu counts, but it doesn't trigger on bringing up a particular CPU as
> far as I can tell.
> 
> This is on a pull of mainline from today, e0a77f263.  Any ideas?
>

Dave Jones had reported a similar problem some time back and Hillf had
proposed a fix. I guess it slipped through the cracks and never went
upstream.

Here is the link: https://lkml.org/lkml/2013/1/19/1

Can you please try it and see if it improves anything?

Regards,
Srivatsa S. Bhat
 
>> [  790.223270] ------------[ cut here ]------------
>> [  790.223966] kernel BUG at kernel/smpboot.c:134!
>> [  790.224739] invalid opcode: 0000 [#1] SMP 
>> [  790.225671] Modules linked in:
>> [  790.226428] CPU 81 
>> [  790.226909] Pid: 3909, comm: migration/135 Tainted: G        W    3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
>> [  790.228775] RIP: 0010:[<ffffffff8110bee8>]  [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
>> [  790.230205] RSP: 0018:ffff88bfef9c1e08  EFLAGS: 00010202
>> [  790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
>> [  790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
>> [  790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
>> [  790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
>> [  790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
>> [  790.234004] FS:  0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
>> [  790.234918] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [  790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
>> [  790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [  790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
>> [  790.238321] Stack:
>> [  790.238882]  ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
>> [  790.245415]  ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
>> [  790.250755]  0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
>> [  790.253365] Call Trace:
>> [  790.254121]  [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
>> [  790.255428]  [<ffffffff810ff1df>] kthread+0xef/0x100
>> [  790.256071]  [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
>> [  790.256697]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
>> [  790.257325]  [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
>> [  790.258233]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
>> [  790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48 
>> [  790.276178] RIP  [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
>> [  790.276735]  RSP <ffff88bfef9c1e08>
>> [  790.278348] ---[ end trace 84baa2bee1434240 ]---
> 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-06  7:12 ` Srivatsa S. Bhat
@ 2013-04-06  8:31   ` Thomas Gleixner
  2013-04-07  9:20     ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-06  8:31 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: Dave Hansen, LKML, Dave Jones, dhillf, Peter Zijlstra

On Sat, 6 Apr 2013, Srivatsa S. Bhat wrote:

> Hi Dave,
> 
> On 04/06/2013 03:13 AM, Dave Hansen wrote:
> > Hey Thomas,
> > 
> > I seem to be running in to smpboot_thread_fn()'s
> > 
> > 	BUG_ON(td->cpu != smp_processor_id());

That should be WARN_ON of course. Stupid me.
 
> > pretty regularly, both at boot and if I boot with maxcpus=x and then
> > online the CPUs from sysfs after boot.  It's a 160-logical-cpu system,
> > so it's quite a beast.  I _seem_ to be hitting it more often at higher
> > cpu counts, but it doesn't trigger on bringing up a particular CPU as
> > far as I can tell.
> > 
> > This is on a pull of mainline from today, e0a77f263.  Any ideas?
> >
> 
> Dave Jones had reported a similar problem some time back and Hillf had
> proposed a fix. I guess it slipped through the cracks and never went
> upstream.
> 
> Here is the link: https://lkml.org/lkml/2013/1/19/1

This is Hillfs proposed patch:

> --- a/kernel/kthread.c	Sat Jan 19 13:03:52 2013
> +++ b/kernel/kthread.c	Sat Jan 19 13:17:54 2013
> @@ -306,6 +306,7 @@ struct task_struct *kthread_create_on_cp
>   	  return p;
> 	  set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
> 	  to_kthread(p)->cpu = cpu;
> +	  __kthread_bind(p, cpu);
> 	  /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> 	  kthread_park(p);
> 	  return p;
 
That's bogus. Simply because when we create the thread then the thread
status is HP_THREAD_NONE and the path with the BUG_ON is only entered
with status == HP_THREAD_ACTIVE:

	if (ht->park && td->status == HP_THREAD_ACTIVE) {

So in Dave's case the thread was already created and has entered
active state.

> >> [  790.226909] Pid: 3909, comm: migration/135 Tainted: G        W    3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB

Hmm, it's the migration thread which trips over this. Oh joy!

Dave, does the issue reproduce with function tracing enabled? For a
first shot it's probably enough to filter on smpboot_* functions plus
sched_switch and sched_wakeup events.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-06  8:31   ` Thomas Gleixner
@ 2013-04-07  9:20     ` Thomas Gleixner
  2013-04-07  9:50       ` Borislav Petkov
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-07  9:20 UTC (permalink / raw)
  To: Srivatsa S. Bhat; +Cc: Dave Hansen, LKML, Dave Jones, dhillf, Peter Zijlstra

On Sat, 6 Apr 2013, Thomas Gleixner wrote:
> This is Hillfs proposed patch:
> 
> > --- a/kernel/kthread.c	Sat Jan 19 13:03:52 2013
> > +++ b/kernel/kthread.c	Sat Jan 19 13:17:54 2013
> > @@ -306,6 +306,7 @@ struct task_struct *kthread_create_on_cp
> >   	  return p;
> > 	  set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
> > 	  to_kthread(p)->cpu = cpu;
> > +	  __kthread_bind(p, cpu);
> > 	  /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> > 	  kthread_park(p);
> > 	  return p;
>  
> That's bogus. Simply because when we create the thread then the thread

And it's even more bogus because the cpu to which we would bind in
kthread_create_on_cpu() is not yet online.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-07  9:20     ` Thomas Gleixner
@ 2013-04-07  9:50       ` Borislav Petkov
  2013-04-08  9:24         ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2013-04-07  9:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones, dhillf, Peter Zijlstra

On Sun, Apr 07, 2013 at 11:20:10AM +0200, Thomas Gleixner wrote:
> And it's even more bogus because the cpu to which we would bind in
> kthread_create_on_cpu() is not yet online.

In case you guys are wondering about reproducibility, I saw the same
BUG_ON yesterday on a 32-bit atom smp qemu guest too (don't ask why?
:-))

And yes, this was without kvm (software emulation only in qemu).

[    0.395000] ------------[ cut here ]------------
[    0.395000] kernel BUG at kernel/smpboot.c:134!
[    0.395000] invalid opcode: 0000 [#1] PREEMPT SMP 
[    0.395000] Modules linked in:
[    0.395000] Pid: 13, comm: ksoftirqd/1 Not tainted 3.9.0-rc5+ #31 Bochs Bochs
[    0.395000] EIP: 0060:[<c106542c>] EFLAGS: 00000202 CPU: 0
[    0.395000] EIP is at smpboot_thread_fn+0x20c/0x260
[    0.395000] EAX: 00000000 EBX: f60ba430 ECX: 00000000 EDX: 00000001
[    0.395000] ESI: f60024f0 EDI: c1736480 EBP: f60c1f24 ESP: f60c1f04
[    0.395000]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[    0.395000] CR0: 8005003b CR2: ffe17000 CR3: 01943000 CR4: 000006e0
[    0.395000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    0.395000] DR6: 00000000 DR7: 00000000
[    0.395000] Process ksoftirqd/1 (pid: 13, ti=f60c0000 task=f60ba430 task.ti=f60c0000)
[    0.395000] Stack:
[    0.395000]  f60ba430 00000001 00000000 f60c1f24 c105c49d f6083e7c f60024f0 c1065220
[    0.395000]  f60c1fac c105c745 00000001 00000001 f60024f0 00000000 c1510303 dead4ead
[    0.395000]  ffffffff ffffffff c1b510ac c1c28450 00000000 c167e791 00000000 c151744f
[    0.395000] Call Trace:
[    0.395000]  [<c105c49d>] ? __kthread_parkme+0x2d/0x70
[    0.395000]  [<c1065220>] ? lg_lock_init+0x20/0x20
[    0.395000]  [<c105c745>] kthread+0x95/0xa0
[    0.395000]  [<c1510303>] ? dump_pagetable+0xa/0x101
[    0.395000]  [<c151744f>] ? wait_for_common+0xcf/0x110
[    0.395000]  [<c1060000>] ? __remove_hrtimer+0x90/0xa0
[    0.395000]  [<c151a477>] ret_from_kernel_thread+0x1b/0x28
[    0.395000]  [<c105c6b0>] ? __init_kthread_worker+0x60/0x60
[    0.395000] Code: 0f a3 02 19 d2 85 d2 0f 95 c2 0f b6 d2 ff d1 89 f0 e8 69 25 0b 00 8d 64 24 14 31 c0 5b 5e 5f 5d c3 e8 d9 2b 4b 00 e9 68 
fe ff ff <0f> 0b 66 90 e8 cb 2b 4b 00 e9 ac fe ff ff 8d b6 00 00 00 00 e8
[    0.395000] EIP: [<c106542c>] smpboot_thread_fn+0x20c/0x260 SS:ESP 0068:f60c1f04
[    0.403466] ---[ end trace f5696bc4e08aa88f ]---

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-07  9:50       ` Borislav Petkov
@ 2013-04-08  9:24         ` Thomas Gleixner
  2013-04-08 11:55           ` Borislav Petkov
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-08  9:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones, dhillf, Peter Zijlstra

On Sun, 7 Apr 2013, Borislav Petkov wrote:

> On Sun, Apr 07, 2013 at 11:20:10AM +0200, Thomas Gleixner wrote:
> > And it's even more bogus because the cpu to which we would bind in
> > kthread_create_on_cpu() is not yet online.
> 
> In case you guys are wondering about reproducibility, I saw the same
> BUG_ON yesterday on a 32-bit atom smp qemu guest too (don't ask why?
> :-))

I don't have to ask. It's caused by massive brain failure on my side.

Does the patch below fix the issue ?

Thanks,

	tglx
---
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -131,8 +131,14 @@ static void __kthread_parkme(struct kthr
 		schedule();
 		__set_current_state(TASK_INTERRUPTIBLE);
 	}
-	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);
+	clear_bit(KTHREAD_IS_PARKED, &self->flags);
+
+	/* Rebind ourself to the target cpu */
+	if (test_bit(KTHREAD_IS_PER_CPU, &self->flags)) {
+		set_cpus_allowed_ptr(currrent, cpumask_of(self->cpu));
+		current->flags |= PF_THREAD_BOUND;
+	}
 }
 
 void kthread_parkme(void)
@@ -344,11 +350,8 @@ void kthread_unpark(struct task_struct *
 		 * park before that happens we'd see the IS_PARKED bit
 		 * which might be about to be cleared.
 		 */
-		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
-				__kthread_bind(k, kthread->cpu);
+		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
 			wake_up_process(k);
-		}
 	}
 	put_task_struct(k);
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-08  9:24         ` Thomas Gleixner
@ 2013-04-08 11:55           ` Borislav Petkov
  2013-04-08 12:17             ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2013-04-08 11:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones, dhillf, Peter Zijlstra

On Mon, Apr 08, 2013 at 11:24:14AM +0200, Thomas Gleixner wrote:
> On Sun, 7 Apr 2013, Borislav Petkov wrote:
> 
> > On Sun, Apr 07, 2013 at 11:20:10AM +0200, Thomas Gleixner wrote:
> > > And it's even more bogus because the cpu to which we would bind in
> > > kthread_create_on_cpu() is not yet online.
> > 
> > In case you guys are wondering about reproducibility, I saw the same
> > BUG_ON yesterday on a 32-bit atom smp qemu guest too (don't ask why?
> > :-))
> 
> I don't have to ask.

Yeah, I actually meant "Don't ask why I'm testing 32-bit atom smp qemu
guests." :-)

> ---
> Index: linux-2.6/kernel/kthread.c
> ===================================================================
> --- linux-2.6.orig/kernel/kthread.c
> +++ linux-2.6/kernel/kthread.c
> @@ -131,8 +131,14 @@ static void __kthread_parkme(struct kthr
>  		schedule();
>  		__set_current_state(TASK_INTERRUPTIBLE);
>  	}
> -	clear_bit(KTHREAD_IS_PARKED, &self->flags);
>  	__set_current_state(TASK_RUNNING);
> +	clear_bit(KTHREAD_IS_PARKED, &self->flags);
> +
> +	/* Rebind ourself to the target cpu */
> +	if (test_bit(KTHREAD_IS_PER_CPU, &self->flags)) {
> +		set_cpus_allowed_ptr(currrent, cpumask_of(self->cpu));

"currrent" is a typo, I fixed it up to the obvious "current" so that it
builds.

> +		current->flags |= PF_THREAD_BOUND;
> +	}
>  }
>  
>  void kthread_parkme(void)
> @@ -344,11 +350,8 @@ void kthread_unpark(struct task_struct *
>  		 * park before that happens we'd see the IS_PARKED bit
>  		 * which might be about to be cleared.
>  		 */
> -		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> -			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> -				__kthread_bind(k, kthread->cpu);
> +		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
>  			wake_up_process(k);
> -		}
>  	}
>  	put_task_struct(k);

However, it needs more love, see below.

The thing is, I had to do a couple of runs to actually trigger it again
which means, it does not fire *always*. It seems that increasing the
core count for the qemu guest increases the hit rate. Anyway, this is
the whole qemu command:

$ qemu-system-i386 -gdb tcp::1234 -cpu n270,+movbe -hda /home/boris/kvm/debian/sid-i386.img -name "Debian i386:1235" -boot menu=off,order=c -m 2048 -localtime -net nic -net user,hostfwd=tcp::1235-:22 -usbdevice tablet -kernel /w/kernel/linux-2.6/arch/x86/boot/bzImage -append "root=/dev/sda1 debug ignore_loglevel console=ttyS0,115200 console=tty0" -serial file:/home/boris/kvm/test-i386.log -snapshot -smp 4

and it needs a patched qemu with MOVBE support.

Yep, 4 cores on a single-core atom N270. This doesn't really exist as a
hw platform but it seems to boot. With your patch applied, I get:

[    0.000000] Linux version 3.9.0-rc5+ (boris@pd) (gcc version 4.7.2 (Debian 4.7.2-5) ) #5 SMP PREEMPT Mon Apr 8 13:46:14 CEST 2013
[    0.000000] Atom PSE erratum detected, BIOS microcode update recommended
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007fffdfff] usable
[    0.000000] BIOS-e820: [mem 0x000000007fffe000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] debug: ignoring loglevel setting.
[    0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
[    0.000000] SMBIOS 2.4 present.
[    0.000000] DMI: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] e820: last_pfn = 0x7fffe max_arch_pfn = 0x100000
[    0.000000] MTRR default type: write-back
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-FFFFF write-protect
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 080000000 mask 080000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[    0.000000] initial memory mapped: [mem 0x00000000-0x023fffff]
[    0.000000] Base memory trampoline at [c009b000] 9b000 size 16384
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000]  [mem 0x00000000-0x000fffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x37000000-0x373fffff]
[    0.000000]  [mem 0x37000000-0x373fffff] page 4k
[    0.000000] BRK [0x020b3000, 0x020b3fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x30000000-0x36ffffff]
[    0.000000]  [mem 0x30000000-0x36ffffff] page 4k
[    0.000000] BRK [0x020b4000, 0x020b4fff] PGTABLE
[    0.000000] BRK [0x020b5000, 0x020b5fff] PGTABLE
[    0.000000] BRK [0x020b6000, 0x020b6fff] PGTABLE
[    0.000000] BRK [0x020b7000, 0x020b7fff] PGTABLE
[    0.000000] init_memory_mapping: [mem 0x00100000-0x2fffffff]
[    0.000000]  [mem 0x00100000-0x2fffffff] page 4k
[    0.000000] init_memory_mapping: [mem 0x37400000-0x377fdfff]
[    0.000000]  [mem 0x37400000-0x377fdfff] page 4k
[    0.000000] ACPI: RSDP 000fd8d0 00014 (v00 BOCHS )
[    0.000000] ACPI: RSDT 7fffe380 00034 (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: FACP 7fffff80 00074 (v01 BOCHS  BXPCFACP 00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 7fffe3c0 011A9 (v01   BXPC   BXDSDT 00000001 INTL 20100528)
[    0.000000] ACPI: FACS 7fffff40 00040
[    0.000000] ACPI: SSDT 7ffff6e0 00858 (v01 BOCHS  BXPCSSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: APIC 7ffff5b0 00090 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
[    0.000000] ACPI: HPET 7ffff570 00038 (v01 BOCHS  BXPCHPET 00000001 BXPC 00000001)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] 1160MB HIGHMEM available.
[    0.000000] 887MB LOWMEM available.
[    0.000000]   mapped low ram: 0 - 377fe000
[    0.000000]   low ram: 0 - 377fe000
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   Normal   [mem 0x01000000-0x377fdfff]
[    0.000000]   HighMem  [mem 0x377fe000-0x7fffdfff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009efff]
[    0.000000]   node   0: [mem 0x00100000-0x7fffdfff]
[    0.000000] On node 0 totalpages: 524188
[    0.000000]   DMA zone: 32 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 3998 pages, LIFO batch:0
[    0.000000]   Normal zone: 1744 pages used for memmap
[    0.000000]   Normal zone: 223230 pages, LIFO batch:31
[    0.000000]   HighMem zone: 2320 pages used for memmap
[    0.000000]   HighMem zone: 296960 pages, LIFO batch:31
[    0.000000] Using APIC driver default
[    0.000000] ACPI: PM-Timer IO Port: 0xb008
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 0 already used, trying 4
[    0.000000] IOAPIC[0]: apic_id 4, version 17, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ5 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] ACPI: IRQ10 used by override.
[    0.000000] ACPI: IRQ11 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.000000] nr_irqs_gsi: 40
[    0.000000] PM: Registered nosave memory: 000000000009f000 - 00000000000a0000
[    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000f0000
[    0.000000] PM: Registered nosave memory: 00000000000f0000 - 0000000000100000
[    0.000000] e820: [mem 0x80000000-0xfffbffff] available for PCI devices
[    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:4 nr_node_ids:1
[    0.000000] PERCPU: Embedded 334 pages/cpu @f6178000 s1343936 r0 d24128 u1368064
[    0.000000] pcpu-alloc: s1343936 r0 d24128 u1368064 alloc=334*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 522412
[    0.000000] Kernel command line: root=/dev/sda1 debug ignore_loglevel console=ttyS0,115200 console=tty0
[    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[    0.000000] __ex_table already sorted, skipping sort
[    0.000000] Initializing CPU#0
[    0.000000] Initializing HighMem for node 0 (000377fe:0007fffe)
[    0.000000] Memory: 2056060k/2097144k available (5202k kernel code, 40692k reserved, 2447k data, 1756k init, 1187840k highmem)
[    0.000000] virtual kernel memory layout:
[    0.000000]     fixmap  : 0xfff17000 - 0xfffff000   ( 928 kB)
[    0.000000]     pkmap   : 0xff800000 - 0xffc00000   (4096 kB)
[    0.000000]     vmalloc : 0xf7ffe000 - 0xff7fe000   ( 120 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xf77fe000   ( 887 MB)
[    0.000000]       .init : 0xc1779000 - 0xc1930000   (1756 kB)
[    0.000000]       .data : 0xc1514bfa - 0xc1778940   (2447 kB)
[    0.000000]       .text : 0xc1000000 - 0xc1514bfa   (5202 kB)
[    0.000000] Checking if this processor honours the WP bit even in supervisor mode...Ok.
[    0.000000] SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] Preemptible hierarchical RCU implementation.
[    0.000000] 	RCU dyntick-idle grace-period acceleration is enabled.
[    0.000000] 	RCU lockdep checking is enabled.
[    0.000000] 	Additional per-CPU info printed with stalls.
[    0.000000] 	RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=4.
[    0.000000] NR_IRQS:2304 nr_irqs:712 16
[    0.000000] CPU 0 irqstacks, hard=f5c08000 soft=f5c0a000
[    0.000000] Console: colour VGA+ 80x25
[    0.000000] console [tty0] enabled
[    0.000000] console [ttyS0] enabled
[    0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[    0.000000] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.000000] ... MAX_LOCK_DEPTH:          48
[    0.000000] ... MAX_LOCKDEP_KEYS:        8191
[    0.000000] ... CLASSHASH_SIZE:          4096
[    0.000000] ... MAX_LOCKDEP_ENTRIES:     16384
[    0.000000] ... MAX_LOCKDEP_CHAINS:      32768
[    0.000000] ... CHAINHASH_SIZE:          16384
[    0.000000]  memory used by lock dependency info: 3823 kB
[    0.000000]  per task-struct memory footprint: 1920 bytes
[    0.000000] ------------------------
[    0.000000] | Locking API testsuite:
[    0.000000] ----------------------------------------------------------------------------
[    0.000000]                                  | spin |wlock |rlock |mutex | wsem | rsem |
[    0.000000]   --------------------------------------------------------------------------
[    0.000000]                      A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]                  A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]              A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]              A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]          A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]          A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]          A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]                     double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]                   initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]                  bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[    0.000000]   --------------------------------------------------------------------------
[    0.000000]               recursive read-lock:             |  ok  |             |  ok  |
[    0.000000]            recursive read-lock #2:             |  ok  |             |  ok  |
[    0.000000]             mixed read-write-lock:             |  ok  |             |  ok  |
[    0.000000]             mixed write-read-lock:             |  ok  |             |  ok  |
[    0.000000]   --------------------------------------------------------------------------
[    0.000000]      hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[    0.000000]      soft-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
[    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
[    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
[    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
[    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
[    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/132:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/132:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/213:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/213:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/231:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/231:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/312:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/312:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #1/321:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #1/321:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #2/123:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #2/123:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #2/132:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #2/132:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #2/213:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #2/213:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #2/231:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #2/231:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #2/312:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #2/312:  ok  |  ok  |  ok  |
[    0.000000]     hard-safe-A + unsafe-B #2/321:  ok  |  ok  |  ok  |
[    0.000000]     soft-safe-A + unsafe-B #2/321:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq lock-inversion/123:  ok  |  ok  |  ok  |
[    0.000000]       soft-irq lock-inversion/123:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq lock-inversion/132:  ok  |  ok  |  ok  |
[    0.000000]       soft-irq lock-inversion/132:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq lock-inversion/213:  ok  |  ok  |  ok  |
[    0.000000]       soft-irq lock-inversion/213:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq lock-inversion/231:  ok  |  ok  |  ok  |
[    0.000000]       soft-irq lock-inversion/231:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq lock-inversion/312:  ok  |  ok  |  ok  |
[    0.000000]       soft-irq lock-inversion/312:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq lock-inversion/321:  ok  |  ok  |  ok  |
[    0.000000]       soft-irq lock-inversion/321:  ok  |  ok  |  ok  |
[    0.000000]       hard-irq read-recursion/123:  ok  |
[    0.000000]       soft-irq read-recursion/123:  ok  |
[    0.000000]       hard-irq read-recursion/132:  ok  |
[    0.000000]       soft-irq read-recursion/132:  ok  |
[    0.000000]       hard-irq read-recursion/213:  ok  |
[    0.000000]       soft-irq read-recursion/213:  ok  |
[    0.000000]       hard-irq read-recursion/231:  ok  |
[    0.000000]       soft-irq read-recursion/231:  ok  |
[    0.000000]       hard-irq read-recursion/312:  ok  |
[    0.000000]       soft-irq read-recursion/312:  ok  |
[    0.000000]       hard-irq read-recursion/321:  ok  |
[    0.000000]       soft-irq read-recursion/321:  ok  |
[    0.000000] -------------------------------------------------------
[    0.000000] Good, all 218 testcases passed! |
[    0.000000] ---------------------------------
[    0.000000] hpet clockevent registered
[    0.000000] tsc: Fast TSC calibration failed
[    0.000000] tsc: Unable to calibrate against PIT
[    0.000000] tsc: using HPET reference calibration
[    0.000000] tsc: Detected 4013.467 MHz processor
[    0.016000] Calibrating delay loop (skipped), value calculated using timer frequency.. 8026.93 BogoMIPS (lpj=4013467)
[    0.019168] pid_max: default: 32768 minimum: 301
[    0.031115] Mount-cache hash table entries: 512
[    0.108223] Atom PSE erratum detected, BIOS microcode update recommended
[    0.116305] mce: CPU supports 10 MCE banks
[    0.127569] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.127569] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.127569] tlb_flushall_shift: 6
[    0.142361] Freeing SMP alternatives: 20k freed
[    0.152465] ACPI: Core revision 20130117
[    0.406397] ACPI: All ACPI Tables successfully acquired
[    0.419468] ftrace: allocating 19508 entries in 39 pages
[    0.810447] Enabling APIC mode:  Flat.  Using 1 I/O APICs
[    0.841000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.852146] smpboot: CPU0: Intel(R) Atom(TM) CPU N270   @ 1.60GHz (fam: 06, model: 1c, stepping: 02)
[    0.862000] APIC calibration not consistent with PM-Timer: 111ms instead of 100ms
[    0.862000] APIC delta adjusted to PM-Timer: 6249993 (6937930)
[    0.868380] Performance Events: unsupported p6 CPU model 28 no PMU driver, software events only.
[    0.932068] SMP alternatives: lockdep: fixing up alternatives
[    0.933327] CPU 1 irqstacks, hard=f5cf4000 soft=f5cf6000
[    0.934073] smpboot: Booting Node   0, Processors  #1[    0.016000] Initializing CPU#1
[    0.016000] Atom PSE erratum detected, BIOS microcode update recommended

[    1.156192] SMP alternatives: lockdep: fixing up alternatives
[    1.161141] CPU 2 irqstacks, hard=f5d08000 soft=f5d0a000
[    1.162733]  #2[    0.016000] Initializing CPU#2
[    0.016000] Atom PSE erratum detected, BIOS microcode update recommended
[    0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[    0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.

[    1.522125] SMP alternatives: lockdep: fixing up alternatives
[    1.525834] CPU 3 irqstacks, hard=f5d28000 soft=f5d2a000
[    1.527101]  #3 OK
[    0.016000] Initializing CPU#3
[    0.016000] Atom PSE erratum detected, BIOS microcode update recommended
[    0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[    0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.
[    1.635184] Brought up 4 CPUs
[    1.636126] smpboot: Total of 4 processors activated (28540.37 BogoMIPS)
[    1.639142] ------------[ cut here ]------------
[    1.640000] WARNING: at kernel/sched/core.c:963 set_task_cpu+0x12c/0x280()
[    1.640000] Hardware name: Bochs
[    1.640000] Modules linked in:
[    1.640000] Pid: 8, comm: migration/0 Not tainted 3.9.0-rc5+ #5
[    1.640000] Call Trace:
[    1.640000]  [<c1035633>] warn_slowpath_common+0x73/0xb0
[    1.640000]  [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[    1.640000]  [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[    1.640000]  [<c1035693>] warn_slowpath_null+0x23/0x30
[    1.640000]  [<c106a6cc>] set_task_cpu+0x12c/0x280
[    1.640000]  [<c1068e2b>] ? update_rq_clock+0x3b/0x70
[    1.640000]  [<c1068ee1>] ? dequeue_task+0x81/0xc0
[    1.640000]  [<c106a922>] __migrate_task+0x102/0x140
[    1.640000]  [<c106a960>] ? __migrate_task+0x140/0x140
[    1.640000]  [<c106a982>] migration_cpu_stop+0x22/0x30
[    1.640000]  [<c10a0a88>] cpu_stopper_thread+0x68/0x130
[    1.640000]  [<c106ea11>] ? sched_clock_cpu+0xf1/0x180
[    1.640000]  [<c106a20d>] ? sub_preempt_count+0x4d/0xb0
[    1.640000]  [<c106eaee>] ? local_clock+0x4e/0x60
[    1.640000]  [<c108cf4a>] ? trace_hardirqs_off_caller+0x1a/0x110
[    1.640000]  [<c108d22d>] ? put_lock_stats.isra.17+0xd/0x30
[    1.640000]  [<c1512b25>] ? _raw_spin_unlock_irqrestore+0x55/0x70
[    1.640000]  [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
[    1.640000]  [<c1063f06>] smpboot_thread_fn+0x176/0x260
[    1.640000]  [<c1063d90>] ? lg_lock_init+0x20/0x20
[    1.640000]  [<c105b305>] kthread+0x95/0xa0
[    1.640000]  [<c1510303>] ? schedule_hrtimeout_range_clock+0x163/0x170
[    1.640000]  [<c151077f>] ? wait_for_common+0xcf/0x110
[    1.640000]  [<c1060000>] ? hrtimer_run_pending+0x110/0x110
[    1.640000]  [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
[    1.640000]  [<c105b270>] ? __kthread_parkme+0xa0/0xa0
[    1.640000] 
[    1.640000] =================================
[    1.640000] [ INFO: inconsistent lock state ]
[    1.640000] 3.9.0-rc5+ #5 Not tainted
[    1.640000] ---------------------------------
[    1.640000] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[    1.640000] migration/0/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
[    1.640000]  (&p->pi_lock){?.-...}, at: [<c106a86e>] __migrate_task+0x4e/0x140
[    1.640000] {IN-HARDIRQ-W} state was registered at:
[    1.640000]   [<c109123d>] __lock_acquire+0x77d/0x1910
[    1.640000]   [<c10929c3>] lock_acquire+0x83/0x1b0
[    1.640000]   [<c15121f5>] _raw_spin_lock_irqsave+0x55/0x90
[    1.640000]   [<c106c635>] try_to_wake_up+0x25/0x270
[    1.640000]   [<c106c8bf>] wake_up_process+0x1f/0x40
[    1.640000]   [<c105e7ed>] hrtimer_wakeup+0x1d/0x30
[    1.640000]   [<c105ebfd>] __run_hrtimer+0x7d/0x3c0
[    1.640000]   [<c1060077>] hrtimer_run_queues+0x77/0x170
[    1.640000]   [<c104787d>] run_local_timers+0xd/0x20
[    1.640000]   [<c10478c3>] update_process_times+0x33/0x70
[    1.640000]   [<c108911a>] tick_periodic+0x2a/0x80
[    1.640000]   [<c10891ee>] tick_handle_periodic+0x1e/0x80
[    1.640000]   [<c1513e95>] smp_apic_timer_interrupt+0x55/0x87
[    1.640000]   [<c1513352>] apic_timer_interrupt+0x32/0x38
[    1.640000]   [<c103e6f6>] irq_exit+0x86/0xa0
[    1.640000]   [<c1513e9a>] smp_apic_timer_interrupt+0x5a/0x87
[    1.640000]   [<c1513352>] apic_timer_interrupt+0x32/0x38
[    1.640000]   [<c105f686>] __hrtimer_start_range_ns+0x1b6/0x560
[    1.640000]   [<c105fa87>] hrtimer_start_range_ns+0x27/0x30
[    1.640000]   [<c1510251>] schedule_hrtimeout_range_clock+0xb1/0x170
[    1.640000]   [<c1510328>] schedule_hrtimeout_range+0x18/0x20
[    1.640000]   [<c1510341>] schedule_hrtimeout+0x11/0x20
[    1.640000]   [<c106ab40>] wait_task_inactive+0x1b0/0x1d0
[    1.640000]   [<c105b620>] kthread_bind+0x20/0x70
[    1.640000]   [<c10530d3>] create_worker+0x1d3/0x200
[    1.640000]   [<c17892be>] init_workqueues+0x235/0x401
[    1.640000]   [<c1000202>] do_one_initcall+0x112/0x160
[    1.640000]   [<c17799e7>] kernel_init_freeable+0x58/0x184
[    1.640000]   [<c14fd0e1>] kernel_init+0x11/0xe0
[    1.640000]   [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
[    1.640000] irq event stamp: 410
[    1.640000] hardirqs last  enabled at (409): [<c1512b67>] _raw_spin_unlock_irq+0x27/0x50
[    1.640000] hardirqs last disabled at (410): [<c106a971>] migration_cpu_stop+0x11/0x30
[    1.640000] softirqs last  enabled at (394): [<c103e3a6>] __do_softirq+0x136/0x340
[    1.640000] softirqs last disabled at (389): [<c103e6f6>] irq_exit+0x86/0xa0
[    1.640000] 
[    1.640000] other info that might help us debug this:
[    1.640000]  Possible unsafe locking scenario:
[    1.640000] 
[    1.640000]        CPU0
[    1.640000]        ----
[    1.640000]   lock(&p->pi_lock);
[    1.640000]   <Interrupt>
[    1.640000]     lock(&p->pi_lock);
[    1.640000] 
[    1.640000]  *** DEADLOCK ***
[    1.640000] 
[    1.640000] 4 locks held by migration/0/8:
[    1.640000]  #0:  (&p->pi_lock){?.-...}, at: [<c106a86e>] __migrate_task+0x4e/0x140
[    1.640000]  #1:  (&rq->lock){-.-...}, at: [<c106a8d7>] __migrate_task+0xb7/0x140
[    1.640000]  #2:  (&rq->lock/1){......}, at: [<c106a8e3>] __migrate_task+0xc3/0x140
[    1.640000]  #3:  (nonblocking_pool.lock){......}, at: [<c12fd41b>] extract_buf+0x3b/0x140
[    1.640000] 
[    1.640000] stack backtrace:
[    1.640000] Pid: 8, comm: migration/0 Not tainted 3.9.0-rc5+ #5
[    1.640000] Call Trace:
[    1.640000]  [<c150b5c4>] print_usage_bug.part.34+0x239/0x242
[    1.640000]  [<c108ee53>] mark_lock+0x4e3/0x5f0
[    1.640000]  [<c106eaee>] ? local_clock+0x4e/0x60
[    1.640000]  [<c108ded0>] ? print_shortest_lock_dependencies+0x1c0/0x1c0
[    1.640000]  [<c108efd5>] mark_held_locks+0x75/0xe0
[    1.640000]  [<c1001dd5>] ? math_state_restore+0x85/0x160
[    1.640000]  [<c1002b00>] ? do_spurious_interrupt_bug+0x40/0x40
[    1.640000]  [<c108f0de>] trace_hardirqs_on_caller+0x9e/0x1b0
[    1.640000]  [<c108f1fc>] trace_hardirqs_on+0xc/0x10
[    1.640000]  [<c1001dd5>] math_state_restore+0x85/0x160
[    1.640000]  [<c1002b00>] ? do_spurious_interrupt_bug+0x40/0x40
[    1.640000]  [<c1002b15>] do_device_not_available+0x15/0x30
[    1.640000]  [<c1513593>] error_code+0x5f/0x64
[    1.640000]  [<c1002b00>] ? do_spurious_interrupt_bug+0x40/0x40
[    1.640000]  [<c14fbee3>] ? sha_transform+0x33/0x1090
[    1.640000]  [<c12fd41b>] ? extract_buf+0x3b/0x140
[    1.640000]  [<c12fd443>] extract_buf+0x63/0x140
[    1.640000]  [<c1512b01>] ? _raw_spin_unlock_irqrestore+0x31/0x70
[    1.640000]  [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
[    1.640000]  [<c1065316>] ? __wake_up+0x46/0x60
[    1.640000]  [<c12fd2d4>] ? account+0xa4/0x1b0
[    1.640000]  [<c1003968>] ? dump_trace+0x78/0xf0
[    1.640000]  [<c12fe64b>] extract_entropy+0x7b/0x1b0
[    1.640000]  [<c12fe919>] ? get_random_bytes+0x29/0x30
[    1.640000]  [<c1004c00>] ? show_trace+0x20/0x30
[    1.640000]  [<c12fe919>] get_random_bytes+0x29/0x30
[    1.640000]  [<c103530f>] init_oops_id+0x3f/0x50
[    1.640000]  [<c1035591>] print_oops_end_marker+0x11/0x40
[    1.640000]  [<c1035638>] warn_slowpath_common+0x78/0xb0
[    1.640000]  [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[    1.640000]  [<c106a6cc>] ? set_task_cpu+0x12c/0x280
[    1.640000]  [<c1035693>] warn_slowpath_null+0x23/0x30
[    1.640000]  [<c106a6cc>] set_task_cpu+0x12c/0x280
[    1.640000]  [<c1068e2b>] ? update_rq_clock+0x3b/0x70
[    1.640000]  [<c1068ee1>] ? dequeue_task+0x81/0xc0
[    1.640000]  [<c106a922>] __migrate_task+0x102/0x140
[    1.640000]  [<c106a960>] ? __migrate_task+0x140/0x140
[    1.640000]  [<c106a982>] migration_cpu_stop+0x22/0x30
[    1.640000]  [<c10a0a88>] cpu_stopper_thread+0x68/0x130
[    1.640000]  [<c106ea11>] ? sched_clock_cpu+0xf1/0x180
[    1.640000]  [<c106a20d>] ? sub_preempt_count+0x4d/0xb0
[    1.640000]  [<c106eaee>] ? local_clock+0x4e/0x60
[    1.640000]  [<c108cf4a>] ? trace_hardirqs_off_caller+0x1a/0x110
[    1.640000]  [<c108d22d>] ? put_lock_stats.isra.17+0xd/0x30
[    1.640000]  [<c1512b25>] ? _raw_spin_unlock_irqrestore+0x55/0x70
[    1.640000]  [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
[    1.640000]  [<c1063f06>] smpboot_thread_fn+0x176/0x260
[    1.640000]  [<c1063d90>] ? lg_lock_init+0x20/0x20
[    1.640000]  [<c105b305>] kthread+0x95/0xa0
[    1.640000]  [<c1510303>] ? schedule_hrtimeout_range_clock+0x163/0x170
[    1.640000]  [<c151077f>] ? wait_for_common+0xcf/0x110
[    1.640000]  [<c1060000>] ? hrtimer_run_pending+0x110/0x110
[    1.640000]  [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
[    1.640000]  [<c105b270>] ? __kthread_parkme+0xa0/0xa0

HTH and thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: kernel BUG at kernel/smpboot.c:134!
  2013-04-08 11:55           ` Borislav Petkov
@ 2013-04-08 12:17             ` Thomas Gleixner
  2013-04-09 14:38               ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-08 12:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones, dhillf, Peter Zijlstra

On Mon, 8 Apr 2013, Borislav Petkov wrote:
> On Mon, Apr 08, 2013 at 11:24:14AM +0200, Thomas Gleixner wrote:
> > +	/* Rebind ourself to the target cpu */
> > +	if (test_bit(KTHREAD_IS_PER_CPU, &self->flags)) {
> > +		set_cpus_allowed_ptr(currrent, cpumask_of(self->cpu));
> 
> "currrent" is a typo, I fixed it up to the obvious "current" so that it
> builds.

Bah.
 
> [    1.640000] WARNING: at kernel/sched/core.c:963 set_task_cpu+0x12c/0x280()
> [    1.640000] Hardware name: Bochs
> [    1.640000] Modules linked in:
> [    1.640000] Pid: 8, comm: migration/0 Not tainted 3.9.0-rc5+ #5

Hmm, crap migration thread trying to migrate itself via stomp
machine. Not really a brilliant idea.

/me goes back to drawing board.

> [    1.640000] Call Trace:
> [    1.640000]  [<c1035633>] warn_slowpath_common+0x73/0xb0
> [    1.640000]  [<c106a6cc>] ? set_task_cpu+0x12c/0x280
> [    1.640000]  [<c106a6cc>] ? set_task_cpu+0x12c/0x280
> [    1.640000]  [<c1035693>] warn_slowpath_null+0x23/0x30
> [    1.640000]  [<c106a6cc>] set_task_cpu+0x12c/0x280
> [    1.640000]  [<c1068e2b>] ? update_rq_clock+0x3b/0x70
> [    1.640000]  [<c1068ee1>] ? dequeue_task+0x81/0xc0
> [    1.640000]  [<c106a922>] __migrate_task+0x102/0x140
> [    1.640000]  [<c106a960>] ? __migrate_task+0x140/0x140
> [    1.640000]  [<c106a982>] migration_cpu_stop+0x22/0x30
> [    1.640000]  [<c10a0a88>] cpu_stopper_thread+0x68/0x130
> [    1.640000]  [<c106ea11>] ? sched_clock_cpu+0xf1/0x180
> [    1.640000]  [<c106a20d>] ? sub_preempt_count+0x4d/0xb0
> [    1.640000]  [<c106eaee>] ? local_clock+0x4e/0x60
> [    1.640000]  [<c108cf4a>] ? trace_hardirqs_off_caller+0x1a/0x110
> [    1.640000]  [<c108d22d>] ? put_lock_stats.isra.17+0xd/0x30
> [    1.640000]  [<c1512b25>] ? _raw_spin_unlock_irqrestore+0x55/0x70
> [    1.640000]  [<c1512b0b>] ? _raw_spin_unlock_irqrestore+0x3b/0x70
> [    1.640000]  [<c1063f06>] smpboot_thread_fn+0x176/0x260
> [    1.640000]  [<c1063d90>] ? lg_lock_init+0x20/0x20
> [    1.640000]  [<c105b305>] kthread+0x95/0xa0
> [    1.640000]  [<c1510303>] ? schedule_hrtimeout_range_clock+0x163/0x170
> [    1.640000]  [<c151077f>] ? wait_for_common+0xcf/0x110
> [    1.640000]  [<c1060000>] ? hrtimer_run_pending+0x110/0x110
> [    1.640000]  [<c15137b7>] ret_from_kernel_thread+0x1b/0x28
> [    1.640000]  [<c105b270>] ? __kthread_parkme+0xa0/0xa0

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-08 12:17             ` Thomas Gleixner
@ 2013-04-09 14:38               ` Thomas Gleixner
  2013-04-09 15:55                 ` Dave Hansen
                                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-09 14:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK  
  bind_to(CPU2)		BUG_ON(wrong CPU)						    

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The only rock solid solution to this problem is to prevent wakeups in
park state which are not from unpark(). That way we can guarantee that
the association of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitely for the unpark wakeup.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Hansen <dave@sr71.net>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 fs/proc/array.c       |    1 +
 include/linux/sched.h |    5 +++--
 kernel/kthread.c      |   38 +++++++++++++++++++++-----------------
 3 files changed, 25 insertions(+), 19 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_arr
 	"x (dead)",		/*  64 */
 	"K (wakekill)",		/* 128 */
 	"W (waking)",		/* 256 */
+	"P (parked)",		/* 512 */
 };
 
 static inline const char *get_task_state(struct task_struct *tsk)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
 #define TASK_DEAD		64
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
-#define TASK_STATE_MAX		512
+#define TASK_PARKED		512
+#define TASK_STATE_MAX		1024
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t
 
 static void __kthread_parkme(struct kthread *self)
 {
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_PARKED);
 	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
 		schedule();
-		__set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);
@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
 	return NULL;
 }
 
+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+	clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+	/*
+	 * We clear the IS_PARKED bit here as we don't wait
+	 * until the task has left the park code. So if we'd
+	 * park before that happens we'd see the IS_PARKED bit
+	 * which might be about to be cleared.
+	 */
+	if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+			__kthread_bind(k, kthread->cpu);
+		wake_up_state(k, TASK_PARKED);
+	}
+}
+
 /**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
 {
 	struct kthread *kthread = task_get_live_kthread(k);
 
-	if (kthread) {
-		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
-		/*
-		 * We clear the IS_PARKED bit here as we don't wait
-		 * until the task has left the park code. So if we'd
-		 * park before that happens we'd see the IS_PARKED bit
-		 * which might be about to be cleared.
-		 */
-		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
-				__kthread_bind(k, kthread->cpu);
-			wake_up_process(k);
-		}
-	}
+	if (kthread)
+		__kthread_unpark(k, kthread);
 	put_task_struct(k);
 }
 
@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
 	trace_sched_kthread_stop(k);
 	if (kthread) {
 		set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
-		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+		__kthread_unpark(k, kthread);
 		wake_up_process(k);
 		wait_for_completion(&kthread->exited);
 	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 14:38               ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
@ 2013-04-09 15:55                 ` Dave Hansen
  2013-04-09 18:43                   ` Thomas Gleixner
  2013-04-10 14:03                   ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
  2013-04-11 19:16                 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-09 15:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Hey Thomas,

I don't think the patch helped my case.  Looks like the same BUG_ON().

I accidentally booted with possible_cpus=10 instead of 160.  I wasn't
able to trigger this in that case, even repeatedly on/offlining them.
But, once I booted with possible_cpus=160, it triggered in a jiffy.

Two oopses below (bottom one has cpu numbers):

> [  467.106219] ------------[ cut here ]------------
> [  467.106400] kernel BUG at kernel/smpboot.c:134!
> [  467.106556] invalid opcode: 0000 [#1] SMP 
> [  467.106831] Modules linked in:
> [  467.107039] CPU 0 
> [  467.107109] Pid: 3095, comm: migration/115 Tainted: G        W    3.9.0-rc6-00020-g84ee980-dirty #132 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [  467.107507] RIP: 0010:[<ffffffff8110bed8>]  [<ffffffff8110bed8>] smpboot_thread_fn+0x258/0x280
> [  467.107820] RSP: 0018:ffff887ff0561e08  EFLAGS: 00010202
> [  467.107980] RAX: 0000000000000000 RBX: ffff887ff04ef010 RCX: 000000000000b888
> [  467.108142] RDX: ffff887ff0561fd8 RSI: ffff881ffda00000 RDI: 0000000000000073
> [  467.108303] RBP: ffff887ff0561e38 R08: 0000000000000001 R09: 0000000000000000
> [  467.108465] R10: 0000000000000018 R11: 0000000000000000 R12: ffff887ff053c5c0
> [  467.108629] R13: ffffffff81e587a0 R14: ffff887ff053c5c0 R15: 0000000000000000
> [  467.108791] FS:  0000000000000000(0000) GS:ffff881ffda00000(0000) knlGS:0000000000000000
> [  467.109037] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  467.109194] CR2: 000000000117c278 CR3: 0000000001e0b000 CR4: 00000000000007f0
> [  467.109357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  467.109519] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  467.109684] Process migration/115 (pid: 3095, threadinfo ffff887ff0560000, task ffff887ff053c5c0)
> [  467.109930] Stack:
> [  467.110075]  ffff887ff0561e38 0000000000000000 ffff881fe60adcc0 ffff887ff0561ec0
> [  467.110580]  ffff887ff04ef010 ffffffff8110bc80 ffff887ff0561f48 ffffffff810ff1df
> [  467.111075]  0000000000000001 ffff881f00000073 ffff887ff04ef010 ffff887f00000001
> [  467.111568] Call Trace:
> [  467.111726]  [<ffffffff8110bc80>] ? __smpboot_create_thread+0x180/0x180
> [  467.111893]  [<ffffffff810ff1df>] kthread+0xef/0x100
> [  467.112057]  [<ffffffff8110e340>] ? complete+0x30/0x80
> [  467.112216]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [  467.112386]  [<ffffffff819db99c>] ret_from_fork+0x7c/0xb0
> [  467.112548]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [  467.112708] Code: ef 3d 01 01 48 89 df e8 c7 af 16 00 48 83 05 97 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 c0 ef 3d 01 01 <0f> 0b 48 83 05 c6 ef 3d 01 01 48 83 05 86 ef 3d 01 01 0f 0b 48 
> [  467.117014] RIP  [<ffffffff8110bed8>] smpboot_thread_fn+0x258/0x280
> [  467.117233]  RSP <ffff887ff0561e08>
> [  467.117414] ---[ end trace d851dfb0bce51ca2 ]---

Here's the same oops, but with the line numbers munged because I added
some printks:

> [  161.551788] smpboot_thread_fn():
> [  161.551807] td->cpu: 132
> [  161.551808] smp_processor_id(): 121
> [  161.551811] comm: migration/%u
> [  161.551840] ------------[ cut here ]------------
> [  161.551939] kernel BUG at kernel/smpboot.c:149!
> [  161.552030] invalid opcode: 0000 [#1] SMP 
> [  161.552255] Modules linked in:
> [  161.552397] CPU 121 
> [  161.552474] Pid: 2957, comm: migration/132 Tainted: G        W    3.9.0-rc6-00020-g84ee980-dirty #136 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [  161.552655] RIP: 0010:[<ffffffff8110bf29>]  [<ffffffff8110bf29>] smpboot_thread_fn+0x409/0x560
> [  161.552852] RSP: 0018:ffff88bff0403de8  EFLAGS: 00010202
> [  161.552935] RAX: 0000000000000079 RBX: ffff88bff02ac070 RCX: 0000000000000006
> [  161.553025] RDX: 0000000000000007 RSI: 0000000000000007 RDI: ffff889ffec0d190
> [  161.553115] RBP: ffff88bff0403e38 R08: 0000000000000001 R09: 0000000000000001
> [  161.553204] R10: 0000000000000000 R11: 0000000000000b09 R12: ffff88bff04745c0
> [  161.553319] R13: ffffffff81e587a0 R14: ffffffff8110bb20 R15: ffff88bff04745c0
> [  161.553411] FS:  0000000000000000(0000) GS:ffff889ffec00000(0000) knlGS:0000000000000000
> [  161.553534] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  161.553619] CR2: 00007f0c4155c6d0 CR3: 0000000001e0b000 CR4: 00000000000007e0
> [  161.553709] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  161.553799] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  161.553889] Process migration/132 (pid: 2957, threadinfo ffff88bff0402000, task ffff88bff04745c0)
> [  161.554156] Stack:
> [  161.554312]  ffffffff8110bb20 ffff88bff04745c0 ffff88bff0403e08 0000000000000000
> [  161.554839]  ffff88bff0403e38 ffff881fef323cc0 ffff88bff0403ec0 ffff88bff02ac070
> [  161.555370]  ffffffff8110bb20 0000000000000000 ffff88bff0403f48 ffffffff810ff08f
> [  161.555891] Call Trace:
> [  161.556055]  [<ffffffff8110bb20>] ? __smpboot_create_thread+0x180/0x180
> [  161.556230]  [<ffffffff8110bb20>] ? __smpboot_create_thread+0x180/0x180
> [  161.556409]  [<ffffffff810ff08f>] kthread+0xef/0x100
> [  161.556590]  [<ffffffff819d5154>] ? wait_for_completion+0x124/0x180
> [  161.556761]  [<ffffffff810fefa0>] ? __init_kthread_worker+0x80/0x80
> [  161.556982]  [<ffffffff819e59dc>] ret_from_fork+0x7c/0xb0
> [  161.557148]  [<ffffffff810fefa0>] ? __init_kthread_worker+0x80/0x80
> [  161.557316] Code: 05 e4 f1 3d 01 01 e8 2b cf 8b 00 48 83 05 df f1 3d 01 01 65 8b 04 25 64 b0 00 00 39 03 0f 84 0c fd ff ff 48 83 05 cf f1 3d 01 01 <0f> 0b 48 83 05 cd f1 3d 01 01 0f 1f 44 00 00 b9 8b 00 00 00 48 
> [  161.561934] RIP  [<ffffffff8110bf29>] smpboot_thread_fn+0x409/0x560
> [  161.562171]  RSP <ffff88bff0403de8>
> [  161.562352] ---[ end trace 6a3b5261afedf7da ]---


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 15:55                 ` Dave Hansen
@ 2013-04-09 18:43                   ` Thomas Gleixner
  2013-04-09 19:30                     ` Thomas Gleixner
  2013-04-10 14:03                   ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
  1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-09 18:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Dave,

On Tue, 9 Apr 2013, Dave Hansen wrote:

> Hey Thomas,
> 
> I don't think the patch helped my case.  Looks like the same BUG_ON().
> 
> I accidentally booted with possible_cpus=10 instead of 160.  I wasn't
> able to trigger this in that case, even repeatedly on/offlining them.
> But, once I booted with possible_cpus=160, it triggered in a jiffy.

Darn. We should have merged Paul McKenneys NR_CPUS=0 patch last year
and we would have avoided the whole shebang.
 
> > [  161.551788] smpboot_thread_fn():
> > [  161.551807] td->cpu: 132
> > [  161.551808] smp_processor_id(): 121
> > [  161.551811] comm: migration/%u
> > [  161.551840] ------------[ cut here ]------------
> > [  161.551939] kernel BUG at kernel/smpboot.c:149!

Any chance, that you get a trace for that?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 18:43                   ` Thomas Gleixner
@ 2013-04-09 19:30                     ` Thomas Gleixner
  2013-04-09 20:38                       ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-09 19:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> Dave,
> 
> On Tue, 9 Apr 2013, Dave Hansen wrote:
> 
> > Hey Thomas,
> > 
> > I don't think the patch helped my case.  Looks like the same BUG_ON().
> > 
> > I accidentally booted with possible_cpus=10 instead of 160.  I wasn't
> > able to trigger this in that case, even repeatedly on/offlining them.
> > But, once I booted with possible_cpus=160, it triggered in a jiffy.
> 
> Darn. We should have merged Paul McKenneys NR_CPUS=0 patch last year
> and we would have avoided the whole shebang.
>  
> > > [  161.551788] smpboot_thread_fn():
> > > [  161.551807] td->cpu: 132
> > > [  161.551808] smp_processor_id(): 121
> > > [  161.551811] comm: migration/%u
> > > [  161.551840] ------------[ cut here ]------------
> > > [  161.551939] kernel BUG at kernel/smpboot.c:149!
> 
> Any chance, that you get a trace for that?

Thought more about it and found, that the stupid binding only works
when the task is really descheduled. So there is a small window left,
which could lead to this. Revised patch below.

Anyway a trace for that issue would be appreciated nevertheless.

Thanks,

	tglx
---
Subject: kthread: Prevent unpark race which puts threads on the wrong cpu
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 09 Apr 2013 09:33:34 +0200

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK  
  bind_to(CPU2)		BUG_ON(wrong CPU)						    

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The only rock solid solution to this problem is to prevent wakeups in
park mode which are not from unpark(). That way we can guarantee that
the association of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitely for the unpark wakeup.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Hansen <dave@sr71.net>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 fs/proc/array.c       |    1 
 include/linux/sched.h |    5 ++--
 kernel/kthread.c      |   52 ++++++++++++++++++++++++++------------------------
 3 files changed, 32 insertions(+), 26 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_arr
 	"x (dead)",		/*  64 */
 	"K (wakekill)",		/* 128 */
 	"W (waking)",		/* 256 */
+	"P (parked)",		/* 512 */
 };
 
 static inline const char *get_task_state(struct task_struct *tsk)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
 #define TASK_DEAD		64
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
-#define TASK_STATE_MAX		512
+#define TASK_PARKED		512
+#define TASK_STATE_MAX		1024
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t
 
 static void __kthread_parkme(struct kthread *self)
 {
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_PARKED);
 	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
 		schedule();
-		__set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);
@@ -256,8 +256,13 @@ struct task_struct *kthread_create_on_no
 }
 EXPORT_SYMBOL(kthread_create_on_node);
 
-static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
 {
+	/* Must have done schedule() in kthread() before we set_task_cpu */
+	if (!wait_task_inactive(p, state)) {
+		WARN_ON(1);
+		return;
+	}
 	/* It's safe because the task is inactive. */
 	do_set_cpus_allowed(p, cpumask_of(cpu));
 	p->flags |= PF_THREAD_BOUND;
@@ -274,12 +279,7 @@ static void __kthread_bind(struct task_s
  */
 void kthread_bind(struct task_struct *p, unsigned int cpu)
 {
-	/* Must have done schedule() in kthread() before we set_task_cpu */
-	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
-		WARN_ON(1);
-		return;
-	}
-	__kthread_bind(p, cpu);
+	__kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(kthread_bind);
 
@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
 	return NULL;
 }
 
+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+	clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+	/*
+	 * We clear the IS_PARKED bit here as we don't wait
+	 * until the task has left the park code. So if we'd
+	 * park before that happens we'd see the IS_PARKED bit
+	 * which might be about to be cleared.
+	 */
+	if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+			__kthread_bind(k, kthread->cpu, TASK_PARKED);
+		wake_up_state(k, TASK_PARKED);
+	}
+}
+
 /**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
 {
 	struct kthread *kthread = task_get_live_kthread(k);
 
-	if (kthread) {
-		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
-		/*
-		 * We clear the IS_PARKED bit here as we don't wait
-		 * until the task has left the park code. So if we'd
-		 * park before that happens we'd see the IS_PARKED bit
-		 * which might be about to be cleared.
-		 */
-		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
-				__kthread_bind(k, kthread->cpu);
-			wake_up_process(k);
-		}
-	}
+	if (kthread)
+		__kthread_unpark(k, kthread);
 	put_task_struct(k);
 }
 
@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
 	trace_sched_kthread_stop(k);
 	if (kthread) {
 		set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
-		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+		__kthread_unpark(k, kthread);
 		wake_up_process(k);
 		wait_for_completion(&kthread->exited);
 	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 19:30                     ` Thomas Gleixner
@ 2013-04-09 20:38                       ` Dave Hansen
  2013-04-09 20:54                         ` Dave Hansen
  2013-04-10  8:29                         ` Thomas Gleixner
  0 siblings, 2 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-09 20:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> Thought more about it and found, that the stupid binding only works
> when the task is really descheduled. So there is a small window left,
> which could lead to this. Revised patch below.
> 
> Anyway a trace for that issue would be appreciated nevertheless.

Here you go:

	http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz

It oopsed in exit.c:

https://picasaweb.google.com/lh/photo/7v_Xua9I29Rar3bBdNlLu9MTjNZETYmyPJy0liipFm0?feat=directlink



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 20:38                       ` Dave Hansen
@ 2013-04-09 20:54                         ` Dave Hansen
  2013-04-10  8:29                         ` Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-09 20:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Borislav Petkov, Srivatsa S. Bhat, LKML,
	Dave Jones, dhillf, Peter Zijlstra, Ingo Molnar

On 04/09/2013 01:38 PM, Dave Hansen wrote:
> It oopsed in exit.c:
> 
> https://picasaweb.google.com/lh/photo/7v_Xua9I29Rar3bBdNlLu9MTjNZETYmyPJy0liipFm0?feat=directlink

This was just secondary fallout after the first BUG_ON().  This exit.c
thing isn't a new issue.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 20:38                       ` Dave Hansen
  2013-04-09 20:54                         ` Dave Hansen
@ 2013-04-10  8:29                         ` Thomas Gleixner
  2013-04-10 10:51                           ` Thomas Gleixner
  1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-10  8:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Tue, 9 Apr 2013, Dave Hansen wrote:

> On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> > On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> > Thought more about it and found, that the stupid binding only works
> > when the task is really descheduled. So there is a small window left,
> > which could lead to this. Revised patch below.
> > 
> > Anyway a trace for that issue would be appreciated nevertheless.
> 
> Here you go:
> 
> 	http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz

Hmm. Unfortunately migration/146 is not in the trace.
 
Can you please apply the patch below? That avoids the oops, but might
hang an online operation. Though the machine should stay up and you
should be able to retrieve the trace.

Thanks,

	tglx
---
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -131,7 +131,10 @@ static int smpboot_thread_fn(void *data)
 			continue;
 		}
 
-		BUG_ON(td->cpu != smp_processor_id());
+		if (td->cpu != smp_processor_id()) {
+			tracing_off();
+			schedule();
+		}
 
 		/* Check for state change setup */
 		switch (td->status) {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-10  8:29                         ` Thomas Gleixner
@ 2013-04-10 10:51                           ` Thomas Gleixner
  2013-04-10 19:41                             ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-10 10:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Wed, 10 Apr 2013, Thomas Gleixner wrote:
> On Tue, 9 Apr 2013, Dave Hansen wrote:
> 
> > On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> > > On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> > > Thought more about it and found, that the stupid binding only works
> > > when the task is really descheduled. So there is a small window left,
> > > which could lead to this. Revised patch below.
> > > 
> > > Anyway a trace for that issue would be appreciated nevertheless.
> > 
> > Here you go:
> > 
> > 	http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz
> 
> Hmm. Unfortunately migration/146 is not in the trace.
>  
> Can you please apply the patch below? That avoids the oops, but might
> hang an online operation. Though the machine should stay up and you
> should be able to retrieve the trace.
> 
> Thanks,
> 
> 	tglx
> ---
> Index: linux-2.6/kernel/smpboot.c
> ===================================================================
> --- linux-2.6.orig/kernel/smpboot.c
> +++ linux-2.6/kernel/smpboot.c
> @@ -131,7 +131,10 @@ static int smpboot_thread_fn(void *data)
>  			continue;
>  		}
>  
> -		BUG_ON(td->cpu != smp_processor_id());
> +		if (td->cpu != smp_processor_id()) {
> +			tracing_off();
> +			schedule();

Bah, that wants a continue. Revised patch below.

> +		}
>  
>  		/* Check for state change setup */
>  		switch (td->status) {
 
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -131,7 +131,11 @@ static int smpboot_thread_fn(void *data)
 			continue;
 		}
 
-		BUG_ON(td->cpu != smp_processor_id());
+		if (td->cpu != smp_processor_id()) {
+			tracing_off();
+			schedule();
+			continue;
+		}
 
 		/* Check for state change setup */
 		switch (td->status) {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
  2013-04-09 15:55                 ` Dave Hansen
  2013-04-09 18:43                   ` Thomas Gleixner
@ 2013-04-10 14:03                   ` Srivatsa S. Bhat
  2013-04-11  8:10                     ` Thomas Gleixner
  1 sibling, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-10 14:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Hi Dave,

On 04/09/2013 09:25 PM, Dave Hansen wrote:
> Hey Thomas,
> 
> I don't think the patch helped my case.  Looks like the same BUG_ON().
> 
> I accidentally booted with possible_cpus=10 instead of 160.  I wasn't
> able to trigger this in that case, even repeatedly on/offlining them.
> But, once I booted with possible_cpus=160, it triggered in a jiffy.
> 
> Two oopses below (bottom one has cpu numbers):
>

So here is a little bit of a hunch + guess work ;) Could you kindly test
this patch on top of mainline and let me know if it helps?

Regards,
Srivatsa S. Bhat

-------------------------------------------------------------------------->

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()

Dave Hansen and Borislav Petkov reported the following crash, which
corresponds the following BUG_ON in smpboot_thread_fn():

BUG_ON(td->cpu != smp_processor_id());

[  790.223270] ------------[ cut here ]------------
[  790.223966] kernel BUG at kernel/smpboot.c:134!
[  790.224739] invalid opcode: 0000 [#1] SMP
[  790.225671] Modules linked in:
[  790.226428] CPU 81
[  790.226909] Pid: 3909, comm: migration/135 Tainted: G        W    3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
[  790.228775] RIP: 0010:[<ffffffff8110bee8>]  [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
[  790.230205] RSP: 0018:ffff88bfef9c1e08  EFLAGS: 00010202
[  790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
[  790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
[  790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
[  790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
[  790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
[  790.234004] FS:  0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
[  790.234918] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
[  790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
[  790.238321] Stack:
[  790.238882]  ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
[  790.245415]  ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
[  790.250755]  0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
[  790.253365] Call Trace:
[  790.254121]  [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
[  790.255428]  [<ffffffff810ff1df>] kthread+0xef/0x100
[  790.256071]  [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
[  790.256697]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
[  790.257325]  [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
[  790.258233]  [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
[  790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48
[  790.276178] RIP  [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
[  790.276735]  RSP <ffff88bfef9c1e08>
[  790.278348] ---[ end trace 84baa2bee1434240 ]---


Interestingly, in every single stack trace, the crashing task is the migration
thread. Now, migration thread belongs to the highest priority stop_task sched
class, and this particular sched class is very unique in the way it implements
its internal sched class functions, and I suspect this has a lot of bearing
on how functions like kthread_bind(), wake_up_process() etc react with it
(by looking at how it implements its functions such as select_task_rq(),
enqueue_task(), dequeue_task() etc).

Previous to commit 14e568e (stop_machine: Use smpboot threads), the migration
threads were set to stop_task sched class *after* binding them to the
appropriate cpus. But that commit reversed that sequence inadvertently.
So revert back to the old sequence to fix this issue.

But note that __kthread_bind() can wake up the task if the task is an RT
task. So it can be called only when the CPU (to which we want to bind the task)
is already online. So add a new function called kthread_unpark_prepare() and
call it in smpboot_unpark_thread(), to bind the kthread to the appropriate CPU.
And in stop-machine, move the sched class initialization to ->pre_unpark().
This way, it can be ensured that for stop-task class tasks (such as migration
thread), the sched class is set *after* binding the task to the CPU (and
also, the kthreads belonging to other sched classes will continue to function
properly). Also, teach sched_set_stop_task() to ignore repeated invocations
with the same task as argument.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Hansen <dave@sr71.net>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   24 ++++++++++++++++++++----
 kernel/sched/core.c     |    3 +++
 kernel/smpboot.c        |    2 ++
 kernel/stop_machine.c   |    7 +------
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 8d81664..7552eb0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -44,6 +44,7 @@ bool kthread_should_park(void);
 bool kthread_freezable_should_stop(bool *was_frozen);
 void *kthread_data(struct task_struct *k);
 int kthread_park(struct task_struct *k);
+void kthread_unpark_prepare(struct task_struct *k);
 void kthread_unpark(struct task_struct *k);
 void kthread_parkme(void);
 
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..e3325cc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -325,6 +325,25 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
 }
 
 /**
+ * kthread_unpark_prepare - prepare to unpark a thread created by
+ * kthread_create().
+ * @k:		thread created by kthread_create().
+ *
+ * If the thread is marked per-cpu, it is bound to the cpu, in
+ * preparation for waking it up in the near future.
+ */
+void kthread_unpark_prepare(struct task_struct *k)
+{
+	struct kthread *kthread = task_get_live_kthread(k);
+
+	if (kthread) {
+		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+			__kthread_bind(k, kthread->cpu);
+	}
+	put_task_struct(k);
+}
+
+/**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
  *
@@ -344,11 +363,8 @@ void kthread_unpark(struct task_struct *k)
 		 * park before that happens we'd see the IS_PARKED bit
 		 * which might be about to be cleared.
 		 */
-		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
-				__kthread_bind(k, kthread->cpu);
+		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
 			wake_up_process(k);
-		}
 	}
 	put_task_struct(k);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..7370fad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -814,6 +814,9 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
 	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
 	struct task_struct *old_stop = cpu_rq(cpu)->stop;
 
+	if (stop == old_stop)
+		return;
+
 	if (stop) {
 		/*
 		 * Make it appear like a SCHED_FIFO task, its something
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 8eaed9a..bd1a9175 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -209,6 +209,8 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cp
 {
 	struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
 
+	kthread_unpark_prepare(tsk);
+
 	if (ht->pre_unpark)
 		ht->pre_unpark(cpu);
 	kthread_unpark(tsk);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..db32f3a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -300,11 +300,6 @@ repeat:
 
 extern void sched_set_stop_task(int cpu, struct task_struct *stop);
 
-static void cpu_stop_create(unsigned int cpu)
-{
-	sched_set_stop_task(cpu, per_cpu(cpu_stopper_task, cpu));
-}
-
 static void cpu_stop_park(unsigned int cpu)
 {
 	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
@@ -323,6 +318,7 @@ static void cpu_stop_unpark(unsigned int cpu)
 {
 	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 
+	sched_set_stop_task(cpu, per_cpu(cpu_stopper_task, cpu));
 	spin_lock_irq(&stopper->lock);
 	stopper->enabled = true;
 	spin_unlock_irq(&stopper->lock);
@@ -333,7 +329,6 @@ static struct smp_hotplug_thread cpu_stop_threads = {
 	.thread_should_run	= cpu_stop_should_run,
 	.thread_fn		= cpu_stopper_thread,
 	.thread_comm		= "migration/%u",
-	.create			= cpu_stop_create,
 	.setup			= cpu_stop_unpark,
 	.park			= cpu_stop_park,
 	.pre_unpark		= cpu_stop_unpark,



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-10 10:51                           ` Thomas Gleixner
@ 2013-04-10 19:41                             ` Dave Hansen
  2013-04-11 10:19                               ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2013-04-10 19:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

I think I got a full trace this time:

	http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz

The last timestamp is pretty close to the timestamp on the console:

[ 2071.033434] smpboot_thread_fn():
[ 2071.033455] smpboot_thread_fn() cpu: 22 159
[ 2071.033470] td->cpu: 22
[ 2071.033475] smp_processor_id(): 21
[ 2071.033486] comm: migration/%u



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
  2013-04-10 14:03                   ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
@ 2013-04-11  8:10                     ` Thomas Gleixner
  2013-04-11 10:19                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11  8:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Wed, 10 Apr 2013, Srivatsa S. Bhat wrote:

> Interestingly, in every single stack trace, the crashing task is the migration
> thread. Now, migration thread belongs to the highest priority stop_task sched
> class, and this particular sched class is very unique in the way it implements
> its internal sched class functions, and I suspect this has a lot of bearing
> on how functions like kthread_bind(), wake_up_process() etc react with it
> (by looking at how it implements its functions such as select_task_rq(),
> enqueue_task(), dequeue_task() etc).

I don't think that's relevant. The migration thread can only be woken
via try_to_wakeup and my previous patch which implements a separate
task state makes sure that it cannot be woken accidentaly by anything
else than unpark.
 
> But note that __kthread_bind() can wake up the task if the task is an RT
> task. So it can be called only when the CPU (to which we want to bind the task)

kthread_bind() does NOT wakeup anything. It merily sets the cpus
allowed ptr without further ado.


Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
  2013-04-11  8:10                     ` Thomas Gleixner
@ 2013-04-11 10:19                       ` Srivatsa S. Bhat
  0 siblings, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 10:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/11/2013 01:40 PM, Thomas Gleixner wrote:
> On Wed, 10 Apr 2013, Srivatsa S. Bhat wrote:
> 
>> Interestingly, in every single stack trace, the crashing task is the migration
>> thread. Now, migration thread belongs to the highest priority stop_task sched
>> class, and this particular sched class is very unique in the way it implements
>> its internal sched class functions, and I suspect this has a lot of bearing
>> on how functions like kthread_bind(), wake_up_process() etc react with it
>> (by looking at how it implements its functions such as select_task_rq(),
>> enqueue_task(), dequeue_task() etc).
> 
> I don't think that's relevant. The migration thread can only be woken
> via try_to_wakeup and my previous patch which implements a separate
> task state makes sure that it cannot be woken accidentaly by anything
> else than unpark.
> 

Hmm, but it got to be simpler than that, no? Given that it used to work fine
before...

>> But note that __kthread_bind() can wake up the task if the task is an RT
>> task. So it can be called only when the CPU (to which we want to bind the task)
> 
> kthread_bind() does NOT wakeup anything. It merily sets the cpus
> allowed ptr without further ado.
> 

Sorry, I was mistaken and was carried away by a bug in the code I was testing.
I had intended to move kthread_bind() to the body of kthread_create_on_cpu()
and place it after the call to kthread_park(), as shown below:

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..b485fc0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 	to_kthread(p)->cpu = cpu;
 	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
 	kthread_park(p);
+
+	wait_task_inactive(p, TASK_INTERRUPTIBLE);
+	__kthread_bind(p, cpu);
 	return p;
 }
 
But by mistake, I had written the code as:

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..b485fc0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 	to_kthread(p)->cpu = cpu;
 	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
 	kthread_park(p);
+
+	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+		__kthread_bind(p, cpu);
 	return p;
 }
 
So, no wonder it never actually bound the task to the CPU. So when I gave this a run,
I saw watchdog threads hitting the same BUG_ON(), and since watchdog threads are
of RT priority, and RT is the only class that implements ->set_cpus_allowed(), I
thought that those threads got woken up due to the bind. But I was mistaken of
course, because I had checked for the wrong return value of wait_task_inactive().

Regards,
Srivatsa S. Bhat


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-10 19:41                             ` Dave Hansen
@ 2013-04-11 10:19                               ` Thomas Gleixner
  2013-04-11 10:48                                 ` Srivatsa S. Bhat
  2013-04-11 18:07                                 ` Dave Hansen
  0 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 10:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Dave,

On Wed, 10 Apr 2013, Dave Hansen wrote:

> I think I got a full trace this time:
> 
> 	http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> 
> The last timestamp is pretty close to the timestamp on the console:
> 
> [ 2071.033434] smpboot_thread_fn():
> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> [ 2071.033470] td->cpu: 22
> [ 2071.033475] smp_processor_id(): 21
> [ 2071.033486] comm: migration/%u

Yes, that's helpful. Though it makes my mind boggle:

    migration/22-4335  [021] d.h.  2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
    migration/22-4335  [021] d...  2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
    migration/21-4323  [021] d...  2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
    migration/22-4335  [021] ....  2071.033422: smpboot_thread_fn <-kthread^M

So the migration thread schedules out with state TASK_PARKED and gets
scheduled back in right away without a wakeup. Srivatsa was about
right, that this might be related to the sched_set_stop_task() call,
but the changelog led completely down the wrong road.

So the issue is:

CPU14				CPU21
create_thread(for CPU22)
 park_thread()	
   wait_for_completion()	park_me()
				 complete()
   sched_set_stop_task()
				 schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU21 and that confuses the hell out of the stop_task
class on that cpu. So as we have to synchronize the state for the
bind call (the issue observed by Borislav) we need to do the same
before issuing sched_set_stop_task(). Delta patch below.

Thanks,

	tglx
---
Index: linux-2.6/include/trace/events/sched.h
===================================================================
--- linux-2.6.orig/include/trace/events/sched.h
+++ linux-2.6/include/trace/events/sched.h
@@ -147,7 +147,7 @@ TRACE_EVENT(sched_switch,
 		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
 				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
 				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
-				{ 128, "W" }) : "R",
+				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
 		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
 		__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
 	}
 	get_task_struct(tsk);
 	*per_cpu_ptr(ht->store, cpu) = tsk;
-	if (ht->create)
-		ht->create(cpu);
+	if (ht->create) {
+		/*
+		 * Make sure that the task has actually scheduled out
+		 * into park position, before calling the create
+		 * callback. At least the migration thread callback
+		 * requires that the task is off the runqueue.
+		 */
+		if (!wait_task_inactive(tsk, TASK_PARKED))
+			WARN_ON(1);
+		else
+			ht->create(cpu);
+	}
 	return 0;
 }
 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 10:19                               ` Thomas Gleixner
@ 2013-04-11 10:48                                 ` Srivatsa S. Bhat
  2013-04-11 11:43                                   ` Srivatsa S. Bhat
  2013-04-11 13:46                                   ` Thomas Gleixner
  2013-04-11 18:07                                 ` Dave Hansen
  1 sibling, 2 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 10:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> Dave,
> 
> On Wed, 10 Apr 2013, Dave Hansen wrote:
> 
>> I think I got a full trace this time:
>>
>> 	http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>>
>> The last timestamp is pretty close to the timestamp on the console:
>>
>> [ 2071.033434] smpboot_thread_fn():
>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
>> [ 2071.033470] td->cpu: 22
>> [ 2071.033475] smp_processor_id(): 21
>> [ 2071.033486] comm: migration/%u
> 
> Yes, that's helpful. Though it makes my mind boggle:
> 
>     migration/22-4335  [021] d.h.  2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
>     migration/22-4335  [021] d...  2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
>     migration/21-4323  [021] d...  2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
>     migration/22-4335  [021] ....  2071.033422: smpboot_thread_fn <-kthread^M
> 
> So the migration thread schedules out with state TASK_PARKED and gets
> scheduled back in right away without a wakeup. Srivatsa was about
> right, that this might be related to the sched_set_stop_task() call,
> but the changelog led completely down the wrong road.
> 
> So the issue is:
> 
> CPU14				CPU21
> create_thread(for CPU22)
>  park_thread()	
>    wait_for_completion()	park_me()
> 				 complete()
>    sched_set_stop_task()
> 				 schedule(TASK_PARKED)
> 
> The sched_set_stop_task() call is issued while the task is on the
> runqueue of CPU21 and that confuses the hell out of the stop_task
> class on that cpu. So as we have to synchronize the state for the
> bind call (the issue observed by Borislav) we need to do the same
> before issuing sched_set_stop_task(). Delta patch below.
> 

In that case, why not just apply this 2 line patch on mainline?
The patch I sent yesterday was more elaborate because I wrongly assumed
that kthread_bind() can cause a wakeup. But now, I feel the patch shown
below should work just fine too. Yeah, it binds the task during creation
as well as during unpark, but that should be ok (see below).

Somehow, I believe we can handle this issue without introducing that
whole TASK_PARKED thing..


diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..74af4a8 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 	to_kthread(p)->cpu = cpu;
 	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
 	kthread_park(p);
+
+	wait_task_inactive(p, TASK_INTERRUPTIBLE);
+	__kthread_bind(p, cpu);
 	return p;
 }
 

The reason why we can't get rid of the bind in the unpark code is because,
the threads are parked during CPU offline *after* calling CPU_DOWN_PREPARE.
And during CPU_DOWN_PREPARE, the scheduler removes the CPU from the cpu_active_mask.
So on any subsequent wakeup of these threads before they are parked, the scheduler
will force migrate them to some other CPU (but alas it wont print this event
because of the p->mm != NULL check in select_fallback_rq()). So during unpark
during the next online operation we need to bind it again. But that's fine, IMHO.

Regards,
Srivatsa S. Bhat


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 10:48                                 ` Srivatsa S. Bhat
@ 2013-04-11 11:43                                   ` Srivatsa S. Bhat
  2013-04-11 11:59                                     ` Srivatsa S. Bhat
                                                       ` (2 more replies)
  2013-04-11 13:46                                   ` Thomas Gleixner
  1 sibling, 3 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 11:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
>> Dave,
>>
>> On Wed, 10 Apr 2013, Dave Hansen wrote:
>>
>>> I think I got a full trace this time:
>>>
>>> 	http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>>>
>>> The last timestamp is pretty close to the timestamp on the console:
>>>
>>> [ 2071.033434] smpboot_thread_fn():
>>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
>>> [ 2071.033470] td->cpu: 22
>>> [ 2071.033475] smp_processor_id(): 21
>>> [ 2071.033486] comm: migration/%u
>>
>> Yes, that's helpful. Though it makes my mind boggle:
>>
>>     migration/22-4335  [021] d.h.  2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
>>     migration/22-4335  [021] d...  2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
>>     migration/21-4323  [021] d...  2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
>>     migration/22-4335  [021] ....  2071.033422: smpboot_thread_fn <-kthread^M
>>
>> So the migration thread schedules out with state TASK_PARKED and gets
>> scheduled back in right away without a wakeup. Srivatsa was about
>> right, that this might be related to the sched_set_stop_task() call,
>> but the changelog led completely down the wrong road.
>>
>> So the issue is:
>>
>> CPU14				CPU21
>> create_thread(for CPU22)
>>  park_thread()	
>>    wait_for_completion()	park_me()
>> 				 complete()
>>    sched_set_stop_task()
>> 				 schedule(TASK_PARKED)
>>
>> The sched_set_stop_task() call is issued while the task is on the
>> runqueue of CPU21 and that confuses the hell out of the stop_task
>> class on that cpu. So as we have to synchronize the state for the
>> bind call (the issue observed by Borislav) we need to do the same
>> before issuing sched_set_stop_task(). Delta patch below.
>>
> 
> In that case, why not just apply this 2 line patch on mainline?
> The patch I sent yesterday was more elaborate because I wrongly assumed
> that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> below should work just fine too. Yeah, it binds the task during creation
> as well as during unpark, but that should be ok (see below).
> 
> Somehow, I believe we can handle this issue without introducing that
> whole TASK_PARKED thing..
> 
> 
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..74af4a8 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>  	to_kthread(p)->cpu = cpu;
>  	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
>  	kthread_park(p);
> +
> +	wait_task_inactive(p, TASK_INTERRUPTIBLE);
> +	__kthread_bind(p, cpu);
>  	return p;
>  }
>  

Oh wait a minute, its simpler than even that I believe! We don't need that
extra kthread_bind() also.

So, here is what is happening, from what I understand:
sched_set_stop_task() will make the newly created migration thread as the
stop task, during ht->create(). But looking into __sched_setscheduler(),
we observe that if p->on_rq is true (where p is the migration thread),
the task is enqueued on the rq by calling ->enqueue_task() of stop_task
sched class, which simply increments the rq->nr_running.

So, on the next call to schedule(), the pick_next_task() in core.c sees
that rq->nr_running is not equal to rq->cfs.h_nr_running and hence invokes
the stop_task sched class' ->pick_next_task(). And that will return the
stop task, which is nothing but this migration thread that we just created.
That's why the migration thread starts running prematurely, even before
it is bound, and even before its corresponding CPU is brought online.

So, if we just wait till p->on_rq is reset to 0 by __schedule(), before
calling ht->create(), the task doesn't get queued to the runqueue with
stop task priority. So that's it - we simply wait till it gets off the
rq and everything should be fine. And during the subsequent online, during
the unpark phase, the task gets bound to the appropriate cpu and only
then woken up, as desired.

So Dave, could you kindly test the below patch on mainline?


diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9558355 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 	to_kthread(p)->cpu = cpu;
 	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
 	kthread_park(p);
+
+	/*
+	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
+	 * migration thread (which belongs to the stop_task sched class)
+	 * doesn't run until the cpu is actually onlined and the thread is
+	 * unparked.
+	 */
+	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+		WARN_ON(1);
 	return p;
 }
 
Regards,
Srivatsa S. Bhat


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 11:43                                   ` Srivatsa S. Bhat
@ 2013-04-11 11:59                                     ` Srivatsa S. Bhat
  2013-04-11 12:51                                     ` Thomas Gleixner
  2013-04-11 12:54                                     ` Thomas Gleixner
  2 siblings, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 11:59 UTC (permalink / raw)
  To: Dave Hansen, Borislav Petkov
  Cc: Thomas Gleixner, LKML, Dave Jones, dhillf, Peter Zijlstra, Ingo Molnar

On 04/11/2013 05:13 PM, Srivatsa S. Bhat wrote:
[...] 
> So Dave, could you kindly test the below patch on mainline?
> 

BTW, you don't need to try out any of the previous patches that I sent,
just this one is good enough. Thanks!

Regards,
Srivatsa S. Bhat

> 
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..9558355 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>  	to_kthread(p)->cpu = cpu;
>  	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
>  	kthread_park(p);
> +
> +	/*
> +	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> +	 * migration thread (which belongs to the stop_task sched class)
> +	 * doesn't run until the cpu is actually onlined and the thread is
> +	 * unparked.
> +	 */
> +	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> +		WARN_ON(1);
>  	return p;
>  }
>  


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 11:43                                   ` Srivatsa S. Bhat
  2013-04-11 11:59                                     ` Srivatsa S. Bhat
@ 2013-04-11 12:51                                     ` Thomas Gleixner
  2013-04-11 12:54                                     ` Thomas Gleixner
  2 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 12:51 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> > On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> >> Dave,
> >>
> >> On Wed, 10 Apr 2013, Dave Hansen wrote:
> >>
> >>> I think I got a full trace this time:
> >>>
> >>> 	http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> >>>
> >>> The last timestamp is pretty close to the timestamp on the console:
> >>>
> >>> [ 2071.033434] smpboot_thread_fn():
> >>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> >>> [ 2071.033470] td->cpu: 22
> >>> [ 2071.033475] smp_processor_id(): 21
> >>> [ 2071.033486] comm: migration/%u
> >>
> >> Yes, that's helpful. Though it makes my mind boggle:
> >>
> >>     migration/22-4335  [021] d.h.  2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
> >>     migration/22-4335  [021] d...  2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
> >>     migration/21-4323  [021] d...  2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
> >>     migration/22-4335  [021] ....  2071.033422: smpboot_thread_fn <-kthread^M
> >>
> >> So the migration thread schedules out with state TASK_PARKED and gets
> >> scheduled back in right away without a wakeup. Srivatsa was about
> >> right, that this might be related to the sched_set_stop_task() call,
> >> but the changelog led completely down the wrong road.
> >>
> >> So the issue is:
> >>
> >> CPU14				CPU21
> >> create_thread(for CPU22)
> >>  park_thread()	
> >>    wait_for_completion()	park_me()
> >> 				 complete()
> >>    sched_set_stop_task()
> >> 				 schedule(TASK_PARKED)
> >>
> >> The sched_set_stop_task() call is issued while the task is on the
> >> runqueue of CPU21 and that confuses the hell out of the stop_task
> >> class on that cpu. So as we have to synchronize the state for the
> >> bind call (the issue observed by Borislav) we need to do the same
> >> before issuing sched_set_stop_task(). Delta patch below.
> >>
> > 
> > In that case, why not just apply this 2 line patch on mainline?
> > The patch I sent yesterday was more elaborate because I wrongly assumed
> > that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> > below should work just fine too. Yeah, it binds the task during creation
> > as well as during unpark, but that should be ok (see below).
> > 
> > Somehow, I believe we can handle this issue without introducing that
> > whole TASK_PARKED thing..
> > 
> > 
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 691dc2e..74af4a8 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> >  	to_kthread(p)->cpu = cpu;
> >  	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> >  	kthread_park(p);
> > +
> > +	wait_task_inactive(p, TASK_INTERRUPTIBLE);
> > +	__kthread_bind(p, cpu);
> >  	return p;
> >  }
> >  
> 
> Oh wait a minute, its simpler than even that I believe! We don't need that
> extra kthread_bind() also.
> 
> So, here is what is happening, from what I understand:
> sched_set_stop_task() will make the newly created migration thread as the
> stop task, during ht->create(). But looking into __sched_setscheduler(),
> we observe that if p->on_rq is true (where p is the migration thread),
> the task is enqueued on the rq by calling ->enqueue_task() of stop_task
> sched class, which simply increments the rq->nr_running.
> 
> So, on the next call to schedule(), the pick_next_task() in core.c sees
> that rq->nr_running is not equal to rq->cfs.h_nr_running and hence invokes
> the stop_task sched class' ->pick_next_task(). And that will return the
> stop task, which is nothing but this migration thread that we just created.
> That's why the migration thread starts running prematurely, even before
> it is bound, and even before its corresponding CPU is brought online.
> 
> So, if we just wait till p->on_rq is reset to 0 by __schedule(), before
> calling ht->create(), the task doesn't get queued to the runqueue with
> stop task priority. So that's it - we simply wait till it gets off the
> rq and everything should be fine. And during the subsequent online, during
> the unpark phase, the task gets bound to the appropriate cpu and only
> then woken up, as desired.

Right, but I want to avoid that wait for the common case. In most of
the cases the newly created thread reaches PARKED and scheduled out
before we do bind or any other stuff.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 11:43                                   ` Srivatsa S. Bhat
  2013-04-11 11:59                                     ` Srivatsa S. Bhat
  2013-04-11 12:51                                     ` Thomas Gleixner
@ 2013-04-11 12:54                                     ` Thomas Gleixner
  2 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 12:54 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:

> On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> > On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> >> Dave,
> >>
> >> On Wed, 10 Apr 2013, Dave Hansen wrote:
> >>
> >>> I think I got a full trace this time:
> >>>
> >>> 	http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> >>>
> >>> The last timestamp is pretty close to the timestamp on the console:
> >>>
> >>> [ 2071.033434] smpboot_thread_fn():
> >>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> >>> [ 2071.033470] td->cpu: 22
> >>> [ 2071.033475] smp_processor_id(): 21
> >>> [ 2071.033486] comm: migration/%u
> >>
> >> Yes, that's helpful. Though it makes my mind boggle:
> >>
> >>     migration/22-4335  [021] d.h.  2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021^M
> >>     migration/22-4335  [021] d...  2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0^M
> >>     migration/21-4323  [021] d...  2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0^M
> >>     migration/22-4335  [021] ....  2071.033422: smpboot_thread_fn <-kthread^M
> >>
> >> So the migration thread schedules out with state TASK_PARKED and gets
> >> scheduled back in right away without a wakeup. Srivatsa was about
> >> right, that this might be related to the sched_set_stop_task() call,
> >> but the changelog led completely down the wrong road.
> >>
> >> So the issue is:
> >>
> >> CPU14				CPU21
> >> create_thread(for CPU22)
> >>  park_thread()	
> >>    wait_for_completion()	park_me()
> >> 				 complete()
> >>    sched_set_stop_task()
> >> 				 schedule(TASK_PARKED)
> >>
> >> The sched_set_stop_task() call is issued while the task is on the
> >> runqueue of CPU21 and that confuses the hell out of the stop_task
> >> class on that cpu. So as we have to synchronize the state for the
> >> bind call (the issue observed by Borislav) we need to do the same
> >> before issuing sched_set_stop_task(). Delta patch below.
> >>
> > 
> > In that case, why not just apply this 2 line patch on mainline?
> > The patch I sent yesterday was more elaborate because I wrongly assumed
> > that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> > below should work just fine too. Yeah, it binds the task during creation
> > as well as during unpark, but that should be ok (see below).
> > 
> > Somehow, I believe we can handle this issue without introducing that
> > whole TASK_PARKED thing..

We need it as Borislav showed.

Thread is created and parked. Now after online we unpark the thread,
but something else wakes it before we reach the unpark code and boom
it's on the wrong cpu. We want that PARKED thing for
robustness. Anything else is just voodoo programming.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 10:48                                 ` Srivatsa S. Bhat
  2013-04-11 11:43                                   ` Srivatsa S. Bhat
@ 2013-04-11 13:46                                   ` Thomas Gleixner
  1 sibling, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 13:46 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> The reason why we can't get rid of the bind in the unpark code is because,
> the threads are parked during CPU offline *after* calling CPU_DOWN_PREPARE.
> And during CPU_DOWN_PREPARE, the scheduler removes the CPU from the cpu_active_mask.
> So on any subsequent wakeup of these threads before they are parked, the scheduler
> will force migrate them to some other CPU (but alas it wont print this event
> because of the p->mm != NULL check in select_fallback_rq()). So during unpark
> during the next online operation we need to bind it again. But that's fine, IMHO.

No, it's not fine. If we do not have TASK_PARKED a wakeup which
happens after onlining the task and races with the unpark code will
cause the task to end up on the wrong CPU. Borislav showed traces
which prove that with ksoftirqd. So no, TASK_PARKED is going to stay.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 10:19                               ` Thomas Gleixner
  2013-04-11 10:48                                 ` Srivatsa S. Bhat
@ 2013-04-11 18:07                                 ` Dave Hansen
  2013-04-11 19:48                                   ` Thomas Gleixner
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2013-04-11 18:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/11/2013 03:19 AM, Thomas Gleixner wrote:
> --- linux-2.6.orig/kernel/smpboot.c
> +++ linux-2.6/kernel/smpboot.c
> @@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
>  	}
>  	get_task_struct(tsk);
>  	*per_cpu_ptr(ht->store, cpu) = tsk;
> -	if (ht->create)
> -		ht->create(cpu);
> +	if (ht->create) {
> +		/*
> +		 * Make sure that the task has actually scheduled out
> +		 * into park position, before calling the create
> +		 * callback. At least the migration thread callback
> +		 * requires that the task is off the runqueue.
> +		 */
> +		if (!wait_task_inactive(tsk, TASK_PARKED))
> +			WARN_ON(1);
> +		else
> +			ht->create(cpu);
> +	}
>  	return 0;
>  }

This one appears to be doing the trick.  I'll run the cpus in an
online/offline loop for a bit and make sure it's stable.  It's passed
several round so far, which is way more than it's done up to this point,
though.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 14:38               ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
  2013-04-09 15:55                 ` Dave Hansen
@ 2013-04-11 19:16                 ` Srivatsa S. Bhat
  2013-04-11 20:47                   ` Thomas Gleixner
  2013-04-12 10:41                 ` Peter Zijlstra
  2013-04-12 12:32                 ` [tip:core/urgent] " tip-bot for Thomas Gleixner
  3 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 19:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
> The smpboot threads rely on the park/unpark mechanism which binds per
> cpu threads on a particular core. Though the functionality is racy:
> 
> CPU0	       	 	CPU1  	     	    CPU2
> unpark(T)				    wake_up_process(T)
>   clear(SHOULD_PARK)	T runs
> 			leave parkme() due to !SHOULD_PARK  
>   bind_to(CPU2)		BUG_ON(wrong CPU)						    
>

OK, I must admit that I had missed noticing that the task was ksoftirqd
and not the migration thread in Boris' trace. And yes, this unexpected
wakeup is a problem for ksoftirqd.
 
> We cannot let the tasks move themself to the target CPU as one of
> those tasks is actually the migration thread itself, which requires
> that it starts running on the target cpu right away.
> 

Of course, we can't use set_cpus_allowed_ptr(), but we can still achieve
the desired bind effect without any race, see below.

> The only rock solid solution to this problem is to prevent wakeups in
> park state which are not from unpark(). That way we can guarantee that
> the association of the task to the target cpu is working correctly.
> 
> Add a new task state (TASK_PARKED) which prevents other wakeups and
> use this state explicitely for the unpark wakeup.
>

Again, I think this is unnecessary. We are good as long as no one other
than the unpark code can kick the kthreads out of the loop in the park
code. Now that I understand the race you explained above, why not just
fix that race itself by reversing the ordering of clear(SHOULD_PARK)
and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
kthread, it will just remain confined to the park code, as intended.

A patch like below should do it IMHO. I guess I'm being a little too
persistent, sorry!


diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9512fc5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 	to_kthread(p)->cpu = cpu;
 	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
 	kthread_park(p);
+
+	/*
+	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
+	 * migration thread (which belongs to the stop_task sched class)
+	 * doesn't run until the cpu is actually onlined and the thread is
+	 * unparked.
+	 */
+	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+		WARN_ON(1);
 	return p;
 }
 
@@ -324,6 +333,17 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
 	return NULL;
 }
 
+static void __kthread_park(struct task_struct *k, struct kthread *kthread)
+{
+	if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+		set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+		if (k != current) {
+			wake_up_process(k);
+			wait_for_completion(&kthread->parked);
+		}
+	}
+}
+
 /**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
@@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
 	struct kthread *kthread = task_get_live_kthread(k);
 
 	if (kthread) {
+		/*
+		 * Per-cpu kthreads such as ksoftirqd can get woken up by
+		 * other events. So after binding the thread, ensure that
+		 * it goes off the CPU atleast once, by parking it again.
+		 * This way, we can ensure that it will run on the correct
+		 * CPU on subsequent wakeup.
+		 */
+		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
+			__kthread_bind(k, kthread->cpu);
+			clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
+			__kthread_park(k, kthread);
+		}
+
 		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+
 		/*
 		 * We clear the IS_PARKED bit here as we don't wait
 		 * until the task has left the park code. So if we'd
 		 * park before that happens we'd see the IS_PARKED bit
 		 * which might be about to be cleared.
 		 */
-		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
-				__kthread_bind(k, kthread->cpu);
+		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
 			wake_up_process(k);
-		}
 	}
 	put_task_struct(k);
 }
@@ -371,13 +402,7 @@ int kthread_park(struct task_struct *k)
 	int ret = -ENOSYS;
 
 	if (kthread) {
-		if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
-			if (k != current) {
-				wake_up_process(k);
-				wait_for_completion(&kthread->parked);
-			}
-		}
+		__kthread_park(k, kthread);
 		ret = 0;
 	}
 	put_task_struct(k);


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 18:07                                 ` Dave Hansen
@ 2013-04-11 19:48                                   ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 19:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On Thu, 11 Apr 2013, Dave Hansen wrote:
> On 04/11/2013 03:19 AM, Thomas Gleixner wrote:
> > --- linux-2.6.orig/kernel/smpboot.c
> > +++ linux-2.6/kernel/smpboot.c
> > @@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
> >  	}
> >  	get_task_struct(tsk);
> >  	*per_cpu_ptr(ht->store, cpu) = tsk;
> > -	if (ht->create)
> > -		ht->create(cpu);
> > +	if (ht->create) {
> > +		/*
> > +		 * Make sure that the task has actually scheduled out
> > +		 * into park position, before calling the create
> > +		 * callback. At least the migration thread callback
> > +		 * requires that the task is off the runqueue.
> > +		 */
> > +		if (!wait_task_inactive(tsk, TASK_PARKED))
> > +			WARN_ON(1);
> > +		else
> > +			ht->create(cpu);
> > +	}
> >  	return 0;
> >  }
> 
> This one appears to be doing the trick.  I'll run the cpus in an
> online/offline loop for a bit and make sure it's stable.  It's passed
> several round so far, which is way more than it's done up to this point,
> though.

I think the most critical part is bootup when the threads are
created. The online/offline one is an issue as well, which is covered
by the patches, but way harder to trigger.

Thanks,

	tglx

 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 19:16                 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
@ 2013-04-11 20:47                   ` Thomas Gleixner
  2013-04-11 21:19                     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 20:47 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Srivatsa,

On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
> > Add a new task state (TASK_PARKED) which prevents other wakeups and
> > use this state explicitely for the unpark wakeup.
> >
> 
> Again, I think this is unnecessary. We are good as long as no one other
> than the unpark code can kick the kthreads out of the loop in the park
> code. Now that I understand the race you explained above, why not just
> fix that race itself by reversing the ordering of clear(SHOULD_PARK)
> and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
> kthread, it will just remain confined to the park code, as intended.

In theory.
 
> A patch like below should do it IMHO. I guess I'm being a little too
> persistent, sorry!

No it's not about being persistent, you're JUST too much into voodoo
programming instead of going for the straight forward and robust
solutions.

Darn, I hate it as much as everyone else to introduce a new task
state, but that state allows us to make guarantees and gives us
semantical clarity. A parked thread is parked and can only be woken up
by the unpark code. That's clear semantics and not a magic construct
which will work in most cases and for the remaining ones (See below)
it will give us problems which are way harder to decode than the ones
we tried to fix with that magic.

> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..9512fc5 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>  	to_kthread(p)->cpu = cpu;
>  	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
>  	kthread_park(p);
> +
> +	/*
> +	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> +	 * migration thread (which belongs to the stop_task sched class)
> +	 * doesn't run until the cpu is actually onlined and the thread is
> +	 * unparked.
> +	 */
> +	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> +		WARN_ON(1);

Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
references outside the creation code. And then we _HOPE_ that nothing
wakes it up _BEFORE_ we do something else.

Aside of that, you are still insisting to enforce that for every per
cpu thread even if the only one which needs that at this point are
thos which have a create() callback (i.e. the migration thread). And
next week you figure out that this is a performance impact on bringing
up large machines....

>  /**
>   * kthread_unpark - unpark a thread created by kthread_create().
>   * @k:		thread created by kthread_create().
> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>  	struct kthread *kthread = task_get_live_kthread(k);
>  
>  	if (kthread) {
> +		/*
> +		 * Per-cpu kthreads such as ksoftirqd can get woken up by
> +		 * other events. So after binding the thread, ensure that
> +		 * it goes off the CPU atleast once, by parking it again.
> +		 * This way, we can ensure that it will run on the correct
> +		 * CPU on subsequent wakeup.
> +		 */
> +		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
> +			__kthread_bind(k, kthread->cpu);
> +			clear_bit(KTHREAD_IS_PARKED, &kthread->flags);

And how is that f*cking different from the previous code?

CPU0	   		CPU1		       CPU2
				       	       wakeup(T) -> run on CPU1 (last cpu)

			switch_to(T)

__kthread_bind(T, CPU2)

clear(KTHREAD_IS_PARKED)

			leave loop due to !KTHREAD_IS_PARKED

			BUG(wrong cpu)  <--- VOODOO FAILURE

kthread_park(T) <-- VOODOO TOO LATE

You can turn around the order of clearing/setting the flags as much as
you want, I'm going to punch an hole in it.

TASK_PARKED is the very obvious and robust solution which fixes _ALL_
of the corner cases, at least as far as I can imagine them. And
robustness rules at least in my world.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 20:47                   ` Thomas Gleixner
@ 2013-04-11 21:19                     ` Srivatsa S. Bhat
  2013-04-12 10:59                       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 21:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
> Srivatsa,
> 
> On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
>> On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
>>> Add a new task state (TASK_PARKED) which prevents other wakeups and
>>> use this state explicitely for the unpark wakeup.
>>>
>>
>> Again, I think this is unnecessary. We are good as long as no one other
>> than the unpark code can kick the kthreads out of the loop in the park
>> code. Now that I understand the race you explained above, why not just
>> fix that race itself by reversing the ordering of clear(SHOULD_PARK)
>> and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
>> kthread, it will just remain confined to the park code, as intended.
> 
> In theory.
> 
>> A patch like below should do it IMHO. I guess I'm being a little too
>> persistent, sorry!
> 
> No it's not about being persistent, you're JUST too much into voodoo
> programming instead of going for the straight forward and robust
> solutions.
> 
> Darn, I hate it as much as everyone else to introduce a new task
> state, but that state allows us to make guarantees and gives us
> semantical clarity. A parked thread is parked and can only be woken up
> by the unpark code. That's clear semantics and not a magic construct
> which will work in most cases and for the remaining ones (See below)
> it will give us problems which are way harder to decode than the ones
> we tried to fix with that magic.
> 
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 691dc2e..9512fc5 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>>  	to_kthread(p)->cpu = cpu;
>>  	/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
>>  	kthread_park(p);
>> +
>> +	/*
>> +	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
>> +	 * migration thread (which belongs to the stop_task sched class)
>> +	 * doesn't run until the cpu is actually onlined and the thread is
>> +	 * unparked.
>> +	 */
>> +	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
>> +		WARN_ON(1);
> 
> Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
> references outside the creation code.

I doubt that. We have not even onlined the CPU, how would any else even
_know_ that we created this kthread??

*per_cpu_ptr(ht->store, cpu) = tsk; is executed _after_ returning from
this function.

The problem with ksoftirqd is very clear - we unpark threads _after_ we
online the CPU. So, in between the 2 steps, somebody on that CPU can call
__do_softirq(), leading to the race you described in your cover-letter.
That's why I tried to fix that race.

> And then we _HOPE_ that nothing
> wakes it up _BEFORE_ we do something else.
> 

Nothing can wake it up, because no one is aware of the newly created
kthread.

> Aside of that, you are still insisting to enforce that for every per
> cpu thread even if the only one which needs that at this point are
> thos which have a create() callback (i.e. the migration thread). And
> next week you figure out that this is a performance impact on bringing
> up large machines....
> 

Making this wait call specific to those kthreads with the ->create callback
won't be that much of a big deal, IMHO. But see below, I'm not going to
insist on going with my suggestions.

>>  /**
>>   * kthread_unpark - unpark a thread created by kthread_create().
>>   * @k:		thread created by kthread_create().
>> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>>  	struct kthread *kthread = task_get_live_kthread(k);
>>  
>>  	if (kthread) {
>> +		/*
>> +		 * Per-cpu kthreads such as ksoftirqd can get woken up by
>> +		 * other events. So after binding the thread, ensure that
>> +		 * it goes off the CPU atleast once, by parking it again.
>> +		 * This way, we can ensure that it will run on the correct
>> +		 * CPU on subsequent wakeup.
>> +		 */
>> +		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
>> +			__kthread_bind(k, kthread->cpu);
>> +			clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
> 
> And how is that f*cking different from the previous code?
> 
> CPU0	   		CPU1		       CPU2
> 				       	       wakeup(T) -> run on CPU1 (last cpu)
> 
> 			switch_to(T)
> 
> __kthread_bind(T, CPU2)
> 
> clear(KTHREAD_IS_PARKED)
> 
> 			leave loop due to !KTHREAD_IS_PARKED

			How?? The task will leave the loop only when we clear
			SHOULD_PARK, not when we clear IS_PARKED. So it won't
			leave the loop here. It will cause the kthread to
			perform a fresh complete() for the waiting kthread_park()
			on CPU0.
> 
> 			BUG(wrong cpu)  <--- VOODOO FAILURE
> 
> kthread_park(T) <-- VOODOO TOO LATE
> 

No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
ensure that the task gets *descheduled* atleast once after we did the
kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
migrate a running kthread (because the kthread could be the migration
thread). So instead, we use kthread_bind() and depend on sleep->wakeup
to put the task on the right CPU.

> You can turn around the order of clearing/setting the flags as much as
> you want, I'm going to punch an hole in it.
> 
> TASK_PARKED is the very obvious and robust solution which fixes _ALL_
> of the corner cases, at least as far as I can imagine them. And
> robustness rules at least in my world.
> 

Yes, I agree that it is robust and has clear semantics. No doubt about
that. So I won't insist on going with my suggestions.

Regards,
Srivatsa S. Bhat



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 14:38               ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
  2013-04-09 15:55                 ` Dave Hansen
  2013-04-11 19:16                 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
@ 2013-04-12 10:41                 ` Peter Zijlstra
  2013-04-12 12:32                 ` [tip:core/urgent] " tip-bot for Thomas Gleixner
  3 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2013-04-12 10:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones,
	dhillf, Ingo Molnar

On Tue, 2013-04-09 at 16:38 +0200, Thomas Gleixner wrote:
> The smpboot threads rely on the park/unpark mechanism which binds per
> cpu threads on a particular core. Though the functionality is racy:
> 
> CPU0                    CPU1                CPU2
> unpark(T)                                   wake_up_process(T)
>   clear(SHOULD_PARK)    T runs
>                         leave parkme() due to !SHOULD_PARK  
>   bind_to(CPU2)         BUG_ON(wrong CPU)                                                   
> 
> We cannot let the tasks move themself to the target CPU as one of
> those tasks is actually the migration thread itself, which requires
> that it starts running on the target cpu right away.
> 
> The only rock solid solution to this problem is to prevent wakeups in
> park state which are not from unpark(). That way we can guarantee that
> the association of the task to the target cpu is working correctly.
> 
> Add a new task state (TASK_PARKED) which prevents other wakeups and
> use this state explicitely for the unpark wakeup.

explicitly

Also, since the task state is visible to userspace and all the parked
tasks are still in the PID space, its a good hint in ps and friends
that these tasks aren't really there for the moment.

> 
> Reported-by: Dave Jones <davej@redhat.com>
> Reported-by: Dave Hansen <dave@sr71.net>
> Reported-by: Borislav Petkov <bp@alien8.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Assuming you're going to merge in the missing trace hunk you posted
further down the thread...

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> ---
>  fs/proc/array.c       |    1 +
>  include/linux/sched.h |    5 +++--
>  kernel/kthread.c      |   38 +++++++++++++++++++++-----------------
>  3 files changed, 25 insertions(+), 19 deletions(-)
> 
> Index: linux-2.6/fs/proc/array.c
> ===================================================================
> --- linux-2.6.orig/fs/proc/array.c
> +++ linux-2.6/fs/proc/array.c
> @@ -143,6 +143,7 @@ static const char * const task_state_arr
>         "x (dead)",             /*  64 */
>         "K (wakekill)",         /* 128 */
>         "W (waking)",           /* 256 */
> +       "P (parked)",           /* 512 */
>  };
>  
>  static inline const char *get_task_state(struct task_struct *tsk)
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
>  #define TASK_DEAD              64
>  #define TASK_WAKEKILL          128
>  #define TASK_WAKING            256
> -#define TASK_STATE_MAX         512
> +#define TASK_PARKED            512
> +#define TASK_STATE_MAX         1024
>  
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
>  
>  extern char ___assert_task_state[1 - 2*!!(
>                 sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
> Index: linux-2.6/kernel/kthread.c
> ===================================================================
> --- linux-2.6.orig/kernel/kthread.c
> +++ linux-2.6/kernel/kthread.c
> @@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t
>  
>  static void __kthread_parkme(struct kthread *self)
>  {
> -       __set_current_state(TASK_INTERRUPTIBLE);
> +       __set_current_state(TASK_PARKED);
>         while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
>                 if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
>                         complete(&self->parked);
>                 schedule();
> -               __set_current_state(TASK_INTERRUPTIBLE);
> +               __set_current_state(TASK_PARKED);
>         }
>         clear_bit(KTHREAD_IS_PARKED, &self->flags);
>         __set_current_state(TASK_RUNNING);
> @@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
>         return NULL;
>  }
>  
> +static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
> +{
> +       clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> +       /*
> +        * We clear the IS_PARKED bit here as we don't wait
> +        * until the task has left the park code. So if we'd
> +        * park before that happens we'd see the IS_PARKED bit
> +        * which might be about to be cleared.
> +        */
> +       if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> +               if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> +                       __kthread_bind(k, kthread->cpu);
> +               wake_up_state(k, TASK_PARKED);
> +       }
> +}
> +
>  /**
>   * kthread_unpark - unpark a thread created by kthread_create().
>   * @k:         thread created by kthread_create().
> @@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
>  {
>         struct kthread *kthread = task_get_live_kthread(k);
>  
> -       if (kthread) {
> -               clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> -               /*
> -                * We clear the IS_PARKED bit here as we don't wait
> -                * until the task has left the park code. So if we'd
> -                * park before that happens we'd see the IS_PARKED bit
> -                * which might be about to be cleared.
> -                */
> -               if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> -                       if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> -                               __kthread_bind(k, kthread->cpu);
> -                       wake_up_process(k);
> -               }
> -       }
> +       if (kthread)
> +               __kthread_unpark(k, kthread);
>         put_task_struct(k);
>  }
>  
> @@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
>         trace_sched_kthread_stop(k);
>         if (kthread) {
>                 set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
> -               clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> +               __kthread_unpark(k, kthread);
>                 wake_up_process(k);
>                 wait_for_completion(&kthread->exited);
>         }



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-11 21:19                     ` Srivatsa S. Bhat
@ 2013-04-12 10:59                       ` Thomas Gleixner
  2013-04-12 11:26                         ` Srivatsa S. Bhat
  2013-04-15 19:49                         ` Dave Hansen
  0 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-12 10:59 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Srivatsa,

On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
> >> +
> >> +	/*
> >> +	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> >> +	 * migration thread (which belongs to the stop_task sched class)
> >> +	 * doesn't run until the cpu is actually onlined and the thread is
> >> +	 * unparked.
> >> +	 */
> >> +	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> >> +		WARN_ON(1);
> > 
> > Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
> > references outside the creation code.
> 
> I doubt that. We have not even onlined the CPU, how would any else even
> _know_ that we created this kthread??

The problem is not only at the thread creation time. We have the same
issue at offline/online and there we have a reference to that very
thread.
 
> >>  /**
> >>   * kthread_unpark - unpark a thread created by kthread_create().
> >>   * @k:		thread created by kthread_create().
> >> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
> >>  	struct kthread *kthread = task_get_live_kthread(k);
> >>  
> >>  	if (kthread) {
> >> +		/*
> >> +		 * Per-cpu kthreads such as ksoftirqd can get woken up by
> >> +		 * other events. So after binding the thread, ensure that
> >> +		 * it goes off the CPU atleast once, by parking it again.
> >> +		 * This way, we can ensure that it will run on the correct
> >> +		 * CPU on subsequent wakeup.
> >> +		 */
> >> +		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
> >> +			__kthread_bind(k, kthread->cpu);
> >> +			clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
> > 
> > And how is that f*cking different from the previous code?
> > 
> > CPU0	   		CPU1		       CPU2
> > 				       	       wakeup(T) -> run on CPU1 (last cpu)
> > 
> > 			switch_to(T)
> > 
> > __kthread_bind(T, CPU2)
> > 
> > clear(KTHREAD_IS_PARKED)
> > 
> > 			leave loop due to !KTHREAD_IS_PARKED
> 
> 			How?? The task will leave the loop only when we clear
> 			SHOULD_PARK, not when we clear IS_PARKED. So it won't
> 			leave the loop here. It will cause the kthread to
> 			perform a fresh complete() for the waiting kthread_park()
> 			on CPU0.

You are right on that, but you tricked me into misreading your
patch. Why? Simply because it is too complex for no reason.

> No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
> ensure that the task gets *descheduled* atleast once after we did the
> kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
> migrate a running kthread (because the kthread could be the migration
> thread). So instead, we use kthread_bind() and depend on sleep->wakeup
> to put the task on the right CPU.

Yeah, it's a nice workaround, though I really prefer a guaranteed well
defined state over this wakeup/sleep/wakeup trickery, which also adds
the additional cost of a wakeup/sleep cycle to the online operation.

> > TASK_PARKED is the very obvious and robust solution which fixes _ALL_
> > of the corner cases, at least as far as I can imagine them. And
> > robustness rules at least in my world.
> > 
> 
> Yes, I agree that it is robust and has clear semantics. No doubt about
> that. So I won't insist on going with my suggestions.

I'm glad, that we can agree on the robust solution :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-12 10:59                       ` Thomas Gleixner
@ 2013-04-12 11:26                         ` Srivatsa S. Bhat
  2013-04-15 19:49                         ` Dave Hansen
  1 sibling, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-12 11:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

Hi Thomas,

On 04/12/2013 04:29 PM, Thomas Gleixner wrote:
> Srivatsa,
> 
> On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
>> On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
>>>> +
>>>> +	/*
>>>> +	 * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
>>>> +	 * migration thread (which belongs to the stop_task sched class)
>>>> +	 * doesn't run until the cpu is actually onlined and the thread is
>>>> +	 * unparked.
>>>> +	 */
>>>> +	if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
>>>> +		WARN_ON(1);
>>>
>>> Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
>>> references outside the creation code.
>>
>> I doubt that. We have not even onlined the CPU, how would any else even
>> _know_ that we created this kthread??
> 
> The problem is not only at the thread creation time. We have the same
> issue at offline/online and there we have a reference to that very
> thread.
> 

Right. So our solutions differ in how that is handled, like this:
Yours: ensures that nobody can wakeup the parked thread, except the unpark
       code.
Mine:  ensures that nobody can make the parked thread leave its park loop
       (even if it is woken up), except the unpark code.

Apart from this, everything else is mostly same - for eg., both the patches
depend on that wait_task_inactive() call, in order to make the migration
thread behave.

Either way, the purpose is served, so I'm fine with your solution.

(One of the reasons why I was confident of coming up with a working solution
without adding a new state was because I've worked on the freezer code before,
and IIRC, we have more or less similar problems there and we manage to deal
with it without having a dedicated TASK_FROZEN state. Anyway, nevermind... )

>>>>  /**
>>>>   * kthread_unpark - unpark a thread created by kthread_create().
>>>>   * @k:		thread created by kthread_create().
>>>> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>>>>  	struct kthread *kthread = task_get_live_kthread(k);
>>>>  
>>>>  	if (kthread) {
>>>> +		/*
>>>> +		 * Per-cpu kthreads such as ksoftirqd can get woken up by
>>>> +		 * other events. So after binding the thread, ensure that
>>>> +		 * it goes off the CPU atleast once, by parking it again.
>>>> +		 * This way, we can ensure that it will run on the correct
>>>> +		 * CPU on subsequent wakeup.
>>>> +		 */
>>>> +		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
>>>> +			__kthread_bind(k, kthread->cpu);
>>>> +			clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
>>>
>>> And how is that f*cking different from the previous code?
>>>
>>> CPU0	   		CPU1		       CPU2
>>> 				       	       wakeup(T) -> run on CPU1 (last cpu)
>>>
>>> 			switch_to(T)
>>>
>>> __kthread_bind(T, CPU2)
>>>
>>> clear(KTHREAD_IS_PARKED)
>>>
>>> 			leave loop due to !KTHREAD_IS_PARKED
>>
>> 			How?? The task will leave the loop only when we clear
>> 			SHOULD_PARK, not when we clear IS_PARKED. So it won't
>> 			leave the loop here. It will cause the kthread to
>> 			perform a fresh complete() for the waiting kthread_park()
>> 			on CPU0.
> 
> You are right on that, but you tricked me into misreading your
> patch. Why? Simply because it is too complex for no reason.
> 

;-)

>> No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
>> ensure that the task gets *descheduled* atleast once after we did the
>> kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
>> migrate a running kthread (because the kthread could be the migration
>> thread). So instead, we use kthread_bind() and depend on sleep->wakeup
>> to put the task on the right CPU.
> 
> Yeah, it's a nice workaround, though I really prefer a guaranteed well
> defined state over this wakeup/sleep/wakeup trickery, which also adds
> the additional cost of a wakeup/sleep cycle to the online operation.
>

Sure, no objections from me!
 
>>> TASK_PARKED is the very obvious and robust solution which fixes _ALL_
>>> of the corner cases, at least as far as I can imagine them. And
>>> robustness rules at least in my world.
>>>
>>
>> Yes, I agree that it is robust and has clear semantics. No doubt about
>> that. So I won't insist on going with my suggestions.
> 
> I'm glad, that we can agree on the robust solution :)
>

I'm glad too :-) Thanks a lot!

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip:core/urgent] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-09 14:38               ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
                                   ` (2 preceding siblings ...)
  2013-04-12 10:41                 ` Peter Zijlstra
@ 2013-04-12 12:32                 ` tip-bot for Thomas Gleixner
  3 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2013-04-12 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, srivatsa.bhat, davej, bp, dave, tglx

Commit-ID:  f2530dc71cf0822f90bb63ea4600caaef33a66bb
Gitweb:     http://git.kernel.org/tip/f2530dc71cf0822f90bb63ea4600caaef33a66bb
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 9 Apr 2013 09:33:34 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 12 Apr 2013 14:18:43 +0200

kthread: Prevent unpark race which puts threads on the wrong cpu

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK  
  bind_to(CPU2)		BUG_ON(wrong CPU)						    

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0	      	     	 CPU1
Bring up CPU2
create_thread(T)
park(T)
 wait_for_completion()
			 parkme()
			 complete()
sched_set_stop_task()
			 schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().

Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 fs/proc/array.c              |  1 +
 include/linux/sched.h        |  5 +++--
 include/trace/events/sched.h |  2 +-
 kernel/kthread.c             | 52 ++++++++++++++++++++++++--------------------
 kernel/smpboot.c             | 14 ++++++++++--
 5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index f7ed9ee..cbd0f1b 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_array[] = {
 	"x (dead)",		/*  64 */
 	"K (wakekill)",		/* 128 */
 	"W (waking)",		/* 256 */
+	"P (parked)",		/* 512 */
 };
 
 static inline const char *get_task_state(struct task_struct *tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..e692a02 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #define TASK_DEAD		64
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
-#define TASK_STATE_MAX		512
+#define TASK_PARKED		512
+#define TASK_STATE_MAX		1024
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 5a8671e..e5586ca 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -147,7 +147,7 @@ TRACE_EVENT(sched_switch,
 		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
 				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
 				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
-				{ 128, "W" }) : "R",
+				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
 		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
 		__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9eb7fed 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *task)
 
 static void __kthread_parkme(struct kthread *self)
 {
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_PARKED);
 	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
 		schedule();
-		__set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);
@@ -256,8 +256,13 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create_on_node);
 
-static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
 {
+	/* Must have done schedule() in kthread() before we set_task_cpu */
+	if (!wait_task_inactive(p, state)) {
+		WARN_ON(1);
+		return;
+	}
 	/* It's safe because the task is inactive. */
 	do_set_cpus_allowed(p, cpumask_of(cpu));
 	p->flags |= PF_THREAD_BOUND;
@@ -274,12 +279,7 @@ static void __kthread_bind(struct task_struct *p, unsigned int cpu)
  */
 void kthread_bind(struct task_struct *p, unsigned int cpu)
 {
-	/* Must have done schedule() in kthread() before we set_task_cpu */
-	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
-		WARN_ON(1);
-		return;
-	}
-	__kthread_bind(p, cpu);
+	__kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(kthread_bind);
 
@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
 	return NULL;
 }
 
+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+	clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+	/*
+	 * We clear the IS_PARKED bit here as we don't wait
+	 * until the task has left the park code. So if we'd
+	 * park before that happens we'd see the IS_PARKED bit
+	 * which might be about to be cleared.
+	 */
+	if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+			__kthread_bind(k, kthread->cpu, TASK_PARKED);
+		wake_up_state(k, TASK_PARKED);
+	}
+}
+
 /**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *k)
 {
 	struct kthread *kthread = task_get_live_kthread(k);
 
-	if (kthread) {
-		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
-		/*
-		 * We clear the IS_PARKED bit here as we don't wait
-		 * until the task has left the park code. So if we'd
-		 * park before that happens we'd see the IS_PARKED bit
-		 * which might be about to be cleared.
-		 */
-		if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
-			if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
-				__kthread_bind(k, kthread->cpu);
-			wake_up_process(k);
-		}
-	}
+	if (kthread)
+		__kthread_unpark(k, kthread);
 	put_task_struct(k);
 }
 
@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
 	trace_sched_kthread_stop(k);
 	if (kthread) {
 		set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
-		clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+		__kthread_unpark(k, kthread);
 		wake_up_process(k);
 		wait_for_completion(&kthread->exited);
 	}
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 8eaed9a..02fc5c9 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
 	}
 	get_task_struct(tsk);
 	*per_cpu_ptr(ht->store, cpu) = tsk;
-	if (ht->create)
-		ht->create(cpu);
+	if (ht->create) {
+		/*
+		 * Make sure that the task has actually scheduled out
+		 * into park position, before calling the create
+		 * callback. At least the migration thread callback
+		 * requires that the task is off the runqueue.
+		 */
+		if (!wait_task_inactive(tsk, TASK_PARKED))
+			WARN_ON(1);
+		else
+			ht->create(cpu);
+	}
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
  2013-04-12 10:59                       ` Thomas Gleixner
  2013-04-12 11:26                         ` Srivatsa S. Bhat
@ 2013-04-15 19:49                         ` Dave Hansen
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-15 19:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Srivatsa S. Bhat, Borislav Petkov, LKML, Dave Jones, dhillf,
	Peter Zijlstra, Ingo Molnar

I just tested out Linus's current tree (bb33db7).  It is quite happy on
the large system which was exciting this issue.  Thanks, Thomas!


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2013-04-15 19:49 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-05 21:43 kernel BUG at kernel/smpboot.c:134! Dave Hansen
2013-04-06  7:12 ` Srivatsa S. Bhat
2013-04-06  8:31   ` Thomas Gleixner
2013-04-07  9:20     ` Thomas Gleixner
2013-04-07  9:50       ` Borislav Petkov
2013-04-08  9:24         ` Thomas Gleixner
2013-04-08 11:55           ` Borislav Petkov
2013-04-08 12:17             ` Thomas Gleixner
2013-04-09 14:38               ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
2013-04-09 15:55                 ` Dave Hansen
2013-04-09 18:43                   ` Thomas Gleixner
2013-04-09 19:30                     ` Thomas Gleixner
2013-04-09 20:38                       ` Dave Hansen
2013-04-09 20:54                         ` Dave Hansen
2013-04-10  8:29                         ` Thomas Gleixner
2013-04-10 10:51                           ` Thomas Gleixner
2013-04-10 19:41                             ` Dave Hansen
2013-04-11 10:19                               ` Thomas Gleixner
2013-04-11 10:48                                 ` Srivatsa S. Bhat
2013-04-11 11:43                                   ` Srivatsa S. Bhat
2013-04-11 11:59                                     ` Srivatsa S. Bhat
2013-04-11 12:51                                     ` Thomas Gleixner
2013-04-11 12:54                                     ` Thomas Gleixner
2013-04-11 13:46                                   ` Thomas Gleixner
2013-04-11 18:07                                 ` Dave Hansen
2013-04-11 19:48                                   ` Thomas Gleixner
2013-04-10 14:03                   ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
2013-04-11  8:10                     ` Thomas Gleixner
2013-04-11 10:19                       ` Srivatsa S. Bhat
2013-04-11 19:16                 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
2013-04-11 20:47                   ` Thomas Gleixner
2013-04-11 21:19                     ` Srivatsa S. Bhat
2013-04-12 10:59                       ` Thomas Gleixner
2013-04-12 11:26                         ` Srivatsa S. Bhat
2013-04-15 19:49                         ` Dave Hansen
2013-04-12 10:41                 ` Peter Zijlstra
2013-04-12 12:32                 ` [tip:core/urgent] " tip-bot for Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).