* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 14:38 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
@ 2013-04-09 15:55 ` Dave Hansen
2013-04-09 18:43 ` Thomas Gleixner
2013-04-10 14:03 ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
2013-04-11 19:16 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
` (2 subsequent siblings)
3 siblings, 2 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-09 15:55 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Hey Thomas,
I don't think the patch helped my case. Looks like the same BUG_ON().
I accidentally booted with possible_cpus=10 instead of 160. I wasn't
able to trigger this in that case, even repeatedly on/offlining them.
But, once I booted with possible_cpus=160, it triggered in a jiffy.
Two oopses below (bottom one has cpu numbers):
> [ 467.106219] ------------[ cut here ]------------
> [ 467.106400] kernel BUG at kernel/smpboot.c:134!
> [ 467.106556] invalid opcode: 0000 [#1] SMP
> [ 467.106831] Modules linked in:
> [ 467.107039] CPU 0
> [ 467.107109] Pid: 3095, comm: migration/115 Tainted: G W 3.9.0-rc6-00020-g84ee980-dirty #132 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [ 467.107507] RIP: 0010:[<ffffffff8110bed8>] [<ffffffff8110bed8>] smpboot_thread_fn+0x258/0x280
> [ 467.107820] RSP: 0018:ffff887ff0561e08 EFLAGS: 00010202
> [ 467.107980] RAX: 0000000000000000 RBX: ffff887ff04ef010 RCX: 000000000000b888
> [ 467.108142] RDX: ffff887ff0561fd8 RSI: ffff881ffda00000 RDI: 0000000000000073
> [ 467.108303] RBP: ffff887ff0561e38 R08: 0000000000000001 R09: 0000000000000000
> [ 467.108465] R10: 0000000000000018 R11: 0000000000000000 R12: ffff887ff053c5c0
> [ 467.108629] R13: ffffffff81e587a0 R14: ffff887ff053c5c0 R15: 0000000000000000
> [ 467.108791] FS: 0000000000000000(0000) GS:ffff881ffda00000(0000) knlGS:0000000000000000
> [ 467.109037] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 467.109194] CR2: 000000000117c278 CR3: 0000000001e0b000 CR4: 00000000000007f0
> [ 467.109357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 467.109519] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 467.109684] Process migration/115 (pid: 3095, threadinfo ffff887ff0560000, task ffff887ff053c5c0)
> [ 467.109930] Stack:
> [ 467.110075] ffff887ff0561e38 0000000000000000 ffff881fe60adcc0 ffff887ff0561ec0
> [ 467.110580] ffff887ff04ef010 ffffffff8110bc80 ffff887ff0561f48 ffffffff810ff1df
> [ 467.111075] 0000000000000001 ffff881f00000073 ffff887ff04ef010 ffff887f00000001
> [ 467.111568] Call Trace:
> [ 467.111726] [<ffffffff8110bc80>] ? __smpboot_create_thread+0x180/0x180
> [ 467.111893] [<ffffffff810ff1df>] kthread+0xef/0x100
> [ 467.112057] [<ffffffff8110e340>] ? complete+0x30/0x80
> [ 467.112216] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [ 467.112386] [<ffffffff819db99c>] ret_from_fork+0x7c/0xb0
> [ 467.112548] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
> [ 467.112708] Code: ef 3d 01 01 48 89 df e8 c7 af 16 00 48 83 05 97 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 c0 ef 3d 01 01 <0f> 0b 48 83 05 c6 ef 3d 01 01 48 83 05 86 ef 3d 01 01 0f 0b 48
> [ 467.117014] RIP [<ffffffff8110bed8>] smpboot_thread_fn+0x258/0x280
> [ 467.117233] RSP <ffff887ff0561e08>
> [ 467.117414] ---[ end trace d851dfb0bce51ca2 ]---
Here's the same oops, but with the line numbers munged because I added
some printks:
> [ 161.551788] smpboot_thread_fn():
> [ 161.551807] td->cpu: 132
> [ 161.551808] smp_processor_id(): 121
> [ 161.551811] comm: migration/%u
> [ 161.551840] ------------[ cut here ]------------
> [ 161.551939] kernel BUG at kernel/smpboot.c:149!
> [ 161.552030] invalid opcode: 0000 [#1] SMP
> [ 161.552255] Modules linked in:
> [ 161.552397] CPU 121
> [ 161.552474] Pid: 2957, comm: migration/132 Tainted: G W 3.9.0-rc6-00020-g84ee980-dirty #136 FUJITSU-SV PRIMEQUEST 1800E2/SB
> [ 161.552655] RIP: 0010:[<ffffffff8110bf29>] [<ffffffff8110bf29>] smpboot_thread_fn+0x409/0x560
> [ 161.552852] RSP: 0018:ffff88bff0403de8 EFLAGS: 00010202
> [ 161.552935] RAX: 0000000000000079 RBX: ffff88bff02ac070 RCX: 0000000000000006
> [ 161.553025] RDX: 0000000000000007 RSI: 0000000000000007 RDI: ffff889ffec0d190
> [ 161.553115] RBP: ffff88bff0403e38 R08: 0000000000000001 R09: 0000000000000001
> [ 161.553204] R10: 0000000000000000 R11: 0000000000000b09 R12: ffff88bff04745c0
> [ 161.553319] R13: ffffffff81e587a0 R14: ffffffff8110bb20 R15: ffff88bff04745c0
> [ 161.553411] FS: 0000000000000000(0000) GS:ffff889ffec00000(0000) knlGS:0000000000000000
> [ 161.553534] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 161.553619] CR2: 00007f0c4155c6d0 CR3: 0000000001e0b000 CR4: 00000000000007e0
> [ 161.553709] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 161.553799] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 161.553889] Process migration/132 (pid: 2957, threadinfo ffff88bff0402000, task ffff88bff04745c0)
> [ 161.554156] Stack:
> [ 161.554312] ffffffff8110bb20 ffff88bff04745c0 ffff88bff0403e08 0000000000000000
> [ 161.554839] ffff88bff0403e38 ffff881fef323cc0 ffff88bff0403ec0 ffff88bff02ac070
> [ 161.555370] ffffffff8110bb20 0000000000000000 ffff88bff0403f48 ffffffff810ff08f
> [ 161.555891] Call Trace:
> [ 161.556055] [<ffffffff8110bb20>] ? __smpboot_create_thread+0x180/0x180
> [ 161.556230] [<ffffffff8110bb20>] ? __smpboot_create_thread+0x180/0x180
> [ 161.556409] [<ffffffff810ff08f>] kthread+0xef/0x100
> [ 161.556590] [<ffffffff819d5154>] ? wait_for_completion+0x124/0x180
> [ 161.556761] [<ffffffff810fefa0>] ? __init_kthread_worker+0x80/0x80
> [ 161.556982] [<ffffffff819e59dc>] ret_from_fork+0x7c/0xb0
> [ 161.557148] [<ffffffff810fefa0>] ? __init_kthread_worker+0x80/0x80
> [ 161.557316] Code: 05 e4 f1 3d 01 01 e8 2b cf 8b 00 48 83 05 df f1 3d 01 01 65 8b 04 25 64 b0 00 00 39 03 0f 84 0c fd ff ff 48 83 05 cf f1 3d 01 01 <0f> 0b 48 83 05 cd f1 3d 01 01 0f 1f 44 00 00 b9 8b 00 00 00 48
> [ 161.561934] RIP [<ffffffff8110bf29>] smpboot_thread_fn+0x409/0x560
> [ 161.562171] RSP <ffff88bff0403de8>
> [ 161.562352] ---[ end trace 6a3b5261afedf7da ]---
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 15:55 ` Dave Hansen
@ 2013-04-09 18:43 ` Thomas Gleixner
2013-04-09 19:30 ` Thomas Gleixner
2013-04-10 14:03 ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-09 18:43 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Dave,
On Tue, 9 Apr 2013, Dave Hansen wrote:
> Hey Thomas,
>
> I don't think the patch helped my case. Looks like the same BUG_ON().
>
> I accidentally booted with possible_cpus=10 instead of 160. I wasn't
> able to trigger this in that case, even repeatedly on/offlining them.
> But, once I booted with possible_cpus=160, it triggered in a jiffy.
Darn. We should have merged Paul McKenney's NR_CPUS=0 patch last year
and we would have avoided the whole shebang.
> > [ 161.551788] smpboot_thread_fn():
> > [ 161.551807] td->cpu: 132
> > [ 161.551808] smp_processor_id(): 121
> > [ 161.551811] comm: migration/%u
> > [ 161.551840] ------------[ cut here ]------------
> > [ 161.551939] kernel BUG at kernel/smpboot.c:149!
Any chance that you could get a trace for that?
Thanks,
tglx
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 18:43 ` Thomas Gleixner
@ 2013-04-09 19:30 ` Thomas Gleixner
2013-04-09 20:38 ` Dave Hansen
0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-09 19:30 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> Dave,
>
> On Tue, 9 Apr 2013, Dave Hansen wrote:
>
> > Hey Thomas,
> >
> > I don't think the patch helped my case. Looks like the same BUG_ON().
> >
> > I accidentally booted with possible_cpus=10 instead of 160. I wasn't
> > able to trigger this in that case, even repeatedly on/offlining them.
> > But, once I booted with possible_cpus=160, it triggered in a jiffy.
>
> Darn. We should have merged Paul McKenney's NR_CPUS=0 patch last year
> and we would have avoided the whole shebang.
>
> > > [ 161.551788] smpboot_thread_fn():
> > > [ 161.551807] td->cpu: 132
> > > [ 161.551808] smp_processor_id(): 121
> > > [ 161.551811] comm: migration/%u
> > > [ 161.551840] ------------[ cut here ]------------
> > > [ 161.551939] kernel BUG at kernel/smpboot.c:149!
>
> Any chance that you could get a trace for that?
Thought more about it and found that the stupid binding only works
when the task is really descheduled. So there is a small window left
which could lead to this. Revised patch below.
Anyway, a trace for that issue would still be appreciated.
Thanks,
tglx
---
Subject: kthread: Prevent unpark race which puts threads on the wrong cpu
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 09 Apr 2013 09:33:34 +0200
The smpboot threads rely on the park/unpark mechanism, which binds per-cpu
threads to a particular core. However, the functionality is racy:
CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)
We cannot let the tasks move themselves to the target CPU, as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.
The only rock solid solution to this problem is to prevent wakeups in
park mode which are not from unpark(). That way we can guarantee that
the association of the task to the target cpu is working correctly.
Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.
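To make the effect of the new state concrete, here is a minimal userspace model in Python (not kernel code; the Task class, the return values, and the simplified wake_up_state() are invented for illustration, though the state bit values mirror the patch):

```python
# Userspace model (NOT kernel code) of state-gated wakeups.
# Assumption: wake_up_state(task, mask) wakes the task only if its
# current sleep state is contained in the mask, which is how the
# TASK_PARKED patch keeps ordinary wakeups away from parked threads.

TASK_RUNNING         = 0x000
TASK_INTERRUPTIBLE   = 0x001
TASK_UNINTERRUPTIBLE = 0x002
TASK_PARKED          = 0x200

# Ordinary wake_up_process() uses this mask:
TASK_NORMAL = TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE

class Task:
    def __init__(self):
        self.state = TASK_RUNNING

    def set_state(self, state):
        self.state = state

def wake_up_state(task, mask):
    """Wake the task only if its sleep state matches the mask."""
    if task.state & mask:
        task.state = TASK_RUNNING
        return True      # wakeup took effect
    return False         # wakeup bounced off

t = Task()
t.set_state(TASK_PARKED)            # __kthread_parkme() after the patch

# A stray wake_up_process() uses TASK_NORMAL and has no effect:
assert wake_up_state(t, TASK_NORMAL) is False
assert t.state == TASK_PARKED

# Only the unpark path's explicit wake_up_state(k, TASK_PARKED) works:
assert wake_up_state(t, TASK_PARKED) is True
assert t.state == TASK_RUNNING
```

With this gating, the window in the race diagram above closes: no wakeup other than unpark() can pull the task out of the parked state before the bind completes.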
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Hansen <dave@sr71.net>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
fs/proc/array.c | 1
include/linux/sched.h | 5 ++--
kernel/kthread.c | 52 ++++++++++++++++++++++++++------------------------
3 files changed, 32 insertions(+), 26 deletions(-)
Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_arr
"x (dead)", /* 64 */
"K (wakekill)", /* 128 */
"W (waking)", /* 256 */
+ "P (parked)", /* 512 */
};
static inline const char *get_task_state(struct task_struct *tsk)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
-#define TASK_STATE_MAX 512
+#define TASK_PARKED 512
+#define TASK_STATE_MAX 1024
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t
static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
@@ -256,8 +256,13 @@ struct task_struct *kthread_create_on_no
}
EXPORT_SYMBOL(kthread_create_on_node);
-static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
{
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, state)) {
+ WARN_ON(1);
+ return;
+ }
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
@@ -274,12 +279,7 @@ static void __kthread_bind(struct task_s
*/
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- __kthread_bind(p, cpu);
+ __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(kthread_bind);
@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
return NULL;
}
+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ /*
+ * We clear the IS_PARKED bit here as we don't wait
+ * until the task has left the park code. So if we'd
+ * park before that happens we'd see the IS_PARKED bit
+ * which might be about to be cleared.
+ */
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ wake_up_state(k, TASK_PARKED);
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
{
struct kthread *kthread = task_get_live_kthread(k);
- if (kthread) {
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- /*
- * We clear the IS_PARKED bit here as we don't wait
- * until the task has left the park code. So if we'd
- * park before that happens we'd see the IS_PARKED bit
- * which might be about to be cleared.
- */
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
- wake_up_process(k);
- }
- }
+ if (kthread)
+ __kthread_unpark(k, kthread);
put_task_struct(k);
}
@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
trace_sched_kthread_stop(k);
if (kthread) {
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ __kthread_unpark(k, kthread);
wake_up_process(k);
wait_for_completion(&kthread->exited);
}
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 19:30 ` Thomas Gleixner
@ 2013-04-09 20:38 ` Dave Hansen
2013-04-09 20:54 ` Dave Hansen
2013-04-10 8:29 ` Thomas Gleixner
0 siblings, 2 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-09 20:38 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> Thought more about it and found that the stupid binding only works
> when the task is really descheduled. So there is a small window left
> which could lead to this. Revised patch below.
>
> Anyway, a trace for that issue would still be appreciated.
Here you go:
http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz
It oopsed in exit.c:
https://picasaweb.google.com/lh/photo/7v_Xua9I29Rar3bBdNlLu9MTjNZETYmyPJy0liipFm0?feat=directlink
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 20:38 ` Dave Hansen
@ 2013-04-09 20:54 ` Dave Hansen
2013-04-10 8:29 ` Thomas Gleixner
1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-09 20:54 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Borislav Petkov, Srivatsa S. Bhat, LKML,
Dave Jones, dhillf, Peter Zijlstra, Ingo Molnar
On 04/09/2013 01:38 PM, Dave Hansen wrote:
> It oopsed in exit.c:
>
> https://picasaweb.google.com/lh/photo/7v_Xua9I29Rar3bBdNlLu9MTjNZETYmyPJy0liipFm0?feat=directlink
This was just secondary fallout after the first BUG_ON(). This exit.c
thing isn't a new issue.
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 20:38 ` Dave Hansen
2013-04-09 20:54 ` Dave Hansen
@ 2013-04-10 8:29 ` Thomas Gleixner
2013-04-10 10:51 ` Thomas Gleixner
1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-10 8:29 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Tue, 9 Apr 2013, Dave Hansen wrote:
> On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> > On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> > Thought more about it and found that the stupid binding only works
> > when the task is really descheduled. So there is a small window left
> > which could lead to this. Revised patch below.
> >
> > Anyway, a trace for that issue would still be appreciated.
>
> Here you go:
>
> http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz
Hmm. Unfortunately migration/146 is not in the trace.
Can you please apply the patch below? That avoids the oops, but might
hang an online operation. The machine should stay up, though, and you
should be able to retrieve the trace.
Thanks,
tglx
---
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -131,7 +131,10 @@ static int smpboot_thread_fn(void *data)
continue;
}
- BUG_ON(td->cpu != smp_processor_id());
+ if (td->cpu != smp_processor_id()) {
+ tracing_off();
+ schedule();
+ }
/* Check for state change setup */
switch (td->status) {
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-10 8:29 ` Thomas Gleixner
@ 2013-04-10 10:51 ` Thomas Gleixner
2013-04-10 19:41 ` Dave Hansen
0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-10 10:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Wed, 10 Apr 2013, Thomas Gleixner wrote:
> On Tue, 9 Apr 2013, Dave Hansen wrote:
>
> > On 04/09/2013 12:30 PM, Thomas Gleixner wrote:
> > > On Tue, 9 Apr 2013, Thomas Gleixner wrote:
> > > Thought more about it and found that the stupid binding only works
> > > when the task is really descheduled. So there is a small window left
> > > which could lead to this. Revised patch below.
> > >
> > > Anyway, a trace for that issue would still be appreciated.
> >
> > Here you go:
> >
> > http://sr71.net/~dave/linux/bigbox.1365539189.txt.gz
>
> Hmm. Unfortunately migration/146 is not in the trace.
>
> Can you please apply the patch below? That avoids the oops, but might
> hang an online operation. The machine should stay up, though, and you
> should be able to retrieve the trace.
>
> Thanks,
>
> tglx
> ---
> Index: linux-2.6/kernel/smpboot.c
> ===================================================================
> --- linux-2.6.orig/kernel/smpboot.c
> +++ linux-2.6/kernel/smpboot.c
> @@ -131,7 +131,10 @@ static int smpboot_thread_fn(void *data)
> continue;
> }
>
> - BUG_ON(td->cpu != smp_processor_id());
> + if (td->cpu != smp_processor_id()) {
> + tracing_off();
> + schedule();
Bah, that wants a continue. Revised patch below.
> + }
>
> /* Check for state change setup */
> switch (td->status) {
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -131,7 +131,11 @@ static int smpboot_thread_fn(void *data)
continue;
}
- BUG_ON(td->cpu != smp_processor_id());
+ if (td->cpu != smp_processor_id()) {
+ tracing_off();
+ schedule();
+ continue;
+ }
/* Check for state change setup */
switch (td->status) {
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-10 10:51 ` Thomas Gleixner
@ 2013-04-10 19:41 ` Dave Hansen
2013-04-11 10:19 ` Thomas Gleixner
0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2013-04-10 19:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
I think I got a full trace this time:
http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
The last timestamp is pretty close to the timestamp on the console:
[ 2071.033434] smpboot_thread_fn():
[ 2071.033455] smpboot_thread_fn() cpu: 22 159
[ 2071.033470] td->cpu: 22
[ 2071.033475] smp_processor_id(): 21
[ 2071.033486] comm: migration/%u
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-10 19:41 ` Dave Hansen
@ 2013-04-11 10:19 ` Thomas Gleixner
2013-04-11 10:48 ` Srivatsa S. Bhat
2013-04-11 18:07 ` Dave Hansen
0 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 10:19 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Dave,
On Wed, 10 Apr 2013, Dave Hansen wrote:
> I think I got a full trace this time:
>
> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>
> The last timestamp is pretty close to the timestamp on the console:
>
> [ 2071.033434] smpboot_thread_fn():
> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> [ 2071.033470] td->cpu: 22
> [ 2071.033475] smp_processor_id(): 21
> [ 2071.033486] comm: migration/%u
Yes, that's helpful. Though it makes my mind boggle:
migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021
migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0
migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0
migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread
So the migration thread schedules out with state TASK_PARKED and gets
scheduled back in right away without a wakeup. Srivatsa was about
right that this might be related to the sched_set_stop_task() call,
but the changelog led completely down the wrong road.
So the issue is:
CPU14 CPU21
create_thread(for CPU22)
park_thread()
wait_for_completion() park_me()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU21, and that confuses the hell out of the stop_task
class on that cpu. So, as we have to synchronize the state for the
bind call (the issue observed by Borislav), we need to do the same
before issuing sched_set_stop_task(). Delta patch below.
Thanks,
tglx
---
Index: linux-2.6/include/trace/events/sched.h
===================================================================
--- linux-2.6.orig/include/trace/events/sched.h
+++ linux-2.6/include/trace/events/sched.h
@@ -147,7 +147,7 @@ TRACE_EVENT(sched_switch,
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
- { 128, "W" }) : "R",
+ { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
Index: linux-2.6/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/kernel/smpboot.c
+++ linux-2.6/kernel/smpboot.c
@@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
}
get_task_struct(tsk);
*per_cpu_ptr(ht->store, cpu) = tsk;
- if (ht->create)
- ht->create(cpu);
+ if (ht->create) {
+ /*
+ * Make sure that the task has actually scheduled out
+ * into park position, before calling the create
+ * callback. At least the migration thread callback
+ * requires that the task is off the runqueue.
+ */
+ if (!wait_task_inactive(tsk, TASK_PARKED))
+ WARN_ON(1);
+ else
+ ht->create(cpu);
+ }
return 0;
}
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 10:19 ` Thomas Gleixner
@ 2013-04-11 10:48 ` Srivatsa S. Bhat
2013-04-11 11:43 ` Srivatsa S. Bhat
2013-04-11 13:46 ` Thomas Gleixner
2013-04-11 18:07 ` Dave Hansen
1 sibling, 2 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 10:48 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> Dave,
>
> On Wed, 10 Apr 2013, Dave Hansen wrote:
>
>> I think I got a full trace this time:
>>
>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>>
>> The last timestamp is pretty close to the timestamp on the console:
>>
>> [ 2071.033434] smpboot_thread_fn():
>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
>> [ 2071.033470] td->cpu: 22
>> [ 2071.033475] smp_processor_id(): 21
>> [ 2071.033486] comm: migration/%u
>
> Yes, that's helpful. Though it makes my mind boggle:
>
> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021
> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0
> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0
> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread
>
> So the migration thread schedules out with state TASK_PARKED and gets
> scheduled back in right away without a wakeup. Srivatsa was about
> right that this might be related to the sched_set_stop_task() call,
> but the changelog led completely down the wrong road.
>
> So the issue is:
>
> CPU14 CPU21
> create_thread(for CPU22)
> park_thread()
> wait_for_completion() park_me()
> complete()
> sched_set_stop_task()
> schedule(TASK_PARKED)
>
> The sched_set_stop_task() call is issued while the task is on the
> runqueue of CPU21, and that confuses the hell out of the stop_task
> class on that cpu. So, as we have to synchronize the state for the
> bind call (the issue observed by Borislav), we need to do the same
> before issuing sched_set_stop_task(). Delta patch below.
>
In that case, why not just apply this two-line patch on mainline?
The patch I sent yesterday was more elaborate because I wrongly assumed
that kthread_bind() can cause a wakeup. But now, I feel the patch shown
below should work just fine too. Yeah, it binds the task during creation
as well as during unpark, but that should be ok (see below).
Somehow, I believe we can handle this issue without introducing that
whole TASK_PARKED thing.
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..74af4a8 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ wait_task_inactive(p, TASK_INTERRUPTIBLE);
+ __kthread_bind(p, cpu);
return p;
}
The reason we can't get rid of the bind in the unpark code is that the
threads are parked during CPU offline *after* calling CPU_DOWN_PREPARE,
and during CPU_DOWN_PREPARE the scheduler removes the CPU from
cpu_active_mask. So on any subsequent wakeup of these threads before
they are parked, the scheduler will force-migrate them to some other
CPU (but, alas, it won't print this event, because of the p->mm != NULL
check in select_fallback_rq()). So during unpark in the next online
operation we need to bind the thread again. But that's fine, IMHO.
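This force-migration can be sketched as a tiny userspace model in Python (not kernel code; the select_fallback_rq() here is a deliberately simplified stand-in for the kernel function, and the CPU numbers and Task class are made up):

```python
# Userspace model (NOT kernel code) of why unpark must re-bind.
# Assumption: a wakeup of a task whose only allowed CPU is inactive
# gets force-migrated to some active CPU, losing its affinity.

cpu_active = {20, 21}          # CPU 22 was removed at CPU_DOWN_PREPARE

class Task:
    def __init__(self, allowed):
        self.cpus_allowed = set(allowed)
        self.cpu = None

def select_fallback_rq(task):
    """Simplified fallback: if no allowed CPU is active, pick any
    active CPU (silently, as for kthreads with p->mm == NULL)."""
    usable = task.cpus_allowed & cpu_active
    if usable:
        return min(usable)
    return min(cpu_active)     # forced migration off the dead CPU

migration_22 = Task(allowed={22})

# A wakeup while CPU 22 is inactive lands the thread elsewhere:
migration_22.cpu = select_fallback_rq(migration_22)
assert migration_22.cpu in {20, 21}

# Hence unpark must re-bind once CPU 22 is active again, before waking:
cpu_active.add(22)
migration_22.cpus_allowed = {22}    # __kthread_bind() at unpark time
migration_22.cpu = select_fallback_rq(migration_22)
assert migration_22.cpu == 22
```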
Regards,
Srivatsa S. Bhat
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 10:48 ` Srivatsa S. Bhat
@ 2013-04-11 11:43 ` Srivatsa S. Bhat
2013-04-11 11:59 ` Srivatsa S. Bhat
` (2 more replies)
2013-04-11 13:46 ` Thomas Gleixner
1 sibling, 3 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 11:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
>> Dave,
>>
>> On Wed, 10 Apr 2013, Dave Hansen wrote:
>>
>>> I think I got a full trace this time:
>>>
>>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
>>>
>>> The last timestamp is pretty close to the timestamp on the console:
>>>
>>> [ 2071.033434] smpboot_thread_fn():
>>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
>>> [ 2071.033470] td->cpu: 22
>>> [ 2071.033475] smp_processor_id(): 21
>>> [ 2071.033486] comm: migration/%u
>>
>> Yes, that's helpful. Though it makes my mind boggle:
>>
>> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021
>> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0
>> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0
>> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread
>>
>> So the migration thread schedules out with state TASK_PARKED and gets
>> scheduled back in right away without a wakeup. Srivatsa was about
>> right that this might be related to the sched_set_stop_task() call,
>> but the changelog led completely down the wrong road.
>>
>> So the issue is:
>>
>> CPU14 CPU21
>> create_thread(for CPU22)
>> park_thread()
>> wait_for_completion() park_me()
>> complete()
>> sched_set_stop_task()
>> schedule(TASK_PARKED)
>>
>> The sched_set_stop_task() call is issued while the task is on the
>> runqueue of CPU21, and that confuses the hell out of the stop_task
>> class on that cpu. So, as we have to synchronize the state for the
>> bind call (the issue observed by Borislav), we need to do the same
>> before issuing sched_set_stop_task(). Delta patch below.
>>
>
> In that case, why not just apply this two-line patch on mainline?
> The patch I sent yesterday was more elaborate because I wrongly assumed
> that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> below should work just fine too. Yeah, it binds the task during creation
> as well as during unpark, but that should be ok (see below).
>
> Somehow, I believe we can handle this issue without introducing that
> whole TASK_PARKED thing.
>
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..74af4a8 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> to_kthread(p)->cpu = cpu;
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> +
> + wait_task_inactive(p, TASK_INTERRUPTIBLE);
> + __kthread_bind(p, cpu);
> return p;
> }
>
Oh, wait a minute, it's even simpler than that, I believe! We don't
need that extra kthread_bind() either.
So, here is what is happening, from what I understand:
sched_set_stop_task() will make the newly created migration thread the
stop task, during ht->create(). But looking into __sched_setscheduler(),
we observe that if p->on_rq is true (where p is the migration thread),
the task is enqueued on the rq by calling ->enqueue_task() of the stop_task
sched class, which simply increments rq->nr_running.
So, on the next call to schedule(), the pick_next_task() in core.c sees
that rq->nr_running is not equal to rq->cfs.h_nr_running and hence invokes
the stop_task sched class' ->pick_next_task(). And that will return the
stop task, which is nothing but this migration thread that we just created.
That's why the migration thread starts running prematurely, even before
it is bound, and even before its corresponding CPU is brought online.
So, if we just wait till p->on_rq is reset to 0 by __schedule(), before
calling ht->create(), the task doesn't get queued to the runqueue with
stop task priority. So that's it - we simply wait till it gets off the
rq and everything should be fine. And during the subsequent online, during
the unpark phase, the task gets bound to the appropriate cpu and only
then woken up, as desired.
So Dave, could you kindly test the below patch on mainline?
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9558355 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ /*
+ * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
+ * migration thread (which belongs to the stop_task sched class)
+ * doesn't run until the cpu is actually onlined and the thread is
+ * unparked.
+ */
+ if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+ WARN_ON(1);
return p;
}
Regards,
Srivatsa S. Bhat
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 11:43 ` Srivatsa S. Bhat
@ 2013-04-11 11:59 ` Srivatsa S. Bhat
2013-04-11 12:51 ` Thomas Gleixner
2013-04-11 12:54 ` Thomas Gleixner
2 siblings, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 11:59 UTC (permalink / raw)
To: Dave Hansen, Borislav Petkov
Cc: Thomas Gleixner, LKML, Dave Jones, dhillf, Peter Zijlstra, Ingo Molnar
On 04/11/2013 05:13 PM, Srivatsa S. Bhat wrote:
[...]
> So Dave, could you kindly test the below patch on mainline?
>
BTW, you don't need to try out any of the previous patches that I sent,
just this one is good enough. Thanks!
Regards,
Srivatsa S. Bhat
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..9558355 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> to_kthread(p)->cpu = cpu;
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> +
> + /*
> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> + * migration thread (which belongs to the stop_task sched class)
> + * doesn't run until the cpu is actually onlined and the thread is
> + * unparked.
> + */
> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> + WARN_ON(1);
> return p;
> }
>
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 11:43 ` Srivatsa S. Bhat
2013-04-11 11:59 ` Srivatsa S. Bhat
@ 2013-04-11 12:51 ` Thomas Gleixner
2013-04-11 12:54 ` Thomas Gleixner
2 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 12:51 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> > On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> >> Dave,
> >>
> >> On Wed, 10 Apr 2013, Dave Hansen wrote:
> >>
> >>> I think I got a full trace this time:
> >>>
> >>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> >>>
> >>> The last timestamp is pretty close to the timestamp on the console:
> >>>
> >>> [ 2071.033434] smpboot_thread_fn():
> >>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> >>> [ 2071.033470] td->cpu: 22
> >>> [ 2071.033475] smp_processor_id(): 21
> >>> [ 2071.033486] comm: migration/%u
> >>
> >> Yes, that's helpful. Though it makes my mind boggle:
> >>
> >> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021
> >> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0
> >> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0
> >> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread
> >>
> >> So the migration thread schedules out with state TASK_PARKED and gets
> >> scheduled back in right away without a wakeup. Srivatsa was about
> >> right, that this might be related to the sched_set_stop_task() call,
> >> but the changelog led completely down the wrong road.
> >>
> >> So the issue is:
> >>
> >> CPU14 CPU21
> >> create_thread(for CPU22)
> >> park_thread()
> >> wait_for_completion() park_me()
> >> complete()
> >> sched_set_stop_task()
> >> schedule(TASK_PARKED)
> >>
> >> The sched_set_stop_task() call is issued while the task is on the
> >> runqueue of CPU21 and that confuses the hell out of the stop_task
> >> class on that cpu. So as we have to synchronize the state for the
> >> bind call (the issue observed by Borislav) we need to do the same
> >> before issuing sched_set_stop_task(). Delta patch below.
> >>
> >
> > In that case, why not just apply this 2 line patch on mainline?
> > The patch I sent yesterday was more elaborate because I wrongly assumed
> > that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> > below should work just fine too. Yeah, it binds the task during creation
> > as well as during unpark, but that should be ok (see below).
> >
> > Somehow, I believe we can handle this issue without introducing that
> > whole TASK_PARKED thing..
> >
> >
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 691dc2e..74af4a8 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> > to_kthread(p)->cpu = cpu;
> > /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> > kthread_park(p);
> > +
> > + wait_task_inactive(p, TASK_INTERRUPTIBLE);
> > + __kthread_bind(p, cpu);
> > return p;
> > }
> >
>
> Oh wait a minute, it's simpler than even that, I believe! We don't need that
> extra kthread_bind() either.
>
> So, here is what is happening, from what I understand:
> sched_set_stop_task() will make the newly created migration thread the
> stop task, during ht->create(). But looking into __sched_setscheduler(),
> we observe that if p->on_rq is true (where p is the migration thread),
> the task is enqueued on the rq by calling ->enqueue_task() of the stop_task
> sched class, which simply increments rq->nr_running.
>
> So, on the next call to schedule(), the pick_next_task() in core.c sees
> that rq->nr_running is not equal to rq->cfs.h_nr_running and hence invokes
> the stop_task sched class' ->pick_next_task(). And that will return the
> stop task, which is nothing but this migration thread that we just created.
> That's why the migration thread starts running prematurely, even before
> it is bound, and even before its corresponding CPU is brought online.
>
> So, if we just wait till p->on_rq is reset to 0 by __schedule(), before
> calling ht->create(), the task doesn't get queued to the runqueue with
> stop task priority. So that's it - we simply wait till it gets off the
> rq and everything should be fine. And during the subsequent online, during
> the unpark phase, the task gets bound to the appropriate cpu and only
> then woken up, as desired.
Right, but I want to avoid that wait for the common case. In most
cases the newly created thread reaches PARKED and is scheduled out
before we do the bind or any other stuff.
Thanks,
tglx
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 11:43 ` Srivatsa S. Bhat
2013-04-11 11:59 ` Srivatsa S. Bhat
2013-04-11 12:51 ` Thomas Gleixner
@ 2013-04-11 12:54 ` Thomas Gleixner
2 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 12:54 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/11/2013 04:18 PM, Srivatsa S. Bhat wrote:
> > On 04/11/2013 03:49 PM, Thomas Gleixner wrote:
> >> Dave,
> >>
> >> On Wed, 10 Apr 2013, Dave Hansen wrote:
> >>
> >>> I think I got a full trace this time:
> >>>
> >>> http://sr71.net/~dave/linux/bigbox-trace.1365621899.txt.gz
> >>>
> >>> The last timestamp is pretty close to the timestamp on the console:
> >>>
> >>> [ 2071.033434] smpboot_thread_fn():
> >>> [ 2071.033455] smpboot_thread_fn() cpu: 22 159
> >>> [ 2071.033470] td->cpu: 22
> >>> [ 2071.033475] smp_processor_id(): 21
> >>> [ 2071.033486] comm: migration/%u
> >>
> >> Yes, that's helpful. Though it makes my mind boggle:
> >>
> >> migration/22-4335 [021] d.h. 2071.020530: sched_wakeup: comm=migration/21 pid=4323 prio=0 success=1 target_cpu=021
> >> migration/22-4335 [021] d... 2071.020541: sched_switch: prev_comm=migration/22 prev_pid=4335 prev_prio=0 prev_state=0x200 ==> next_comm=migration/21 next_pid=4323 next_prio=0
> >> migration/21-4323 [021] d... 2071.026110: sched_switch: prev_comm=migration/21 prev_pid=4323 prev_prio=0 prev_state=S ==> next_comm=migration/22 next_pid=4335 next_prio=0
> >> migration/22-4335 [021] .... 2071.033422: smpboot_thread_fn <-kthread
> >>
> >> So the migration thread schedules out with state TASK_PARKED and gets
> >> scheduled back in right away without a wakeup. Srivatsa was about
> >> right, that this might be related to the sched_set_stop_task() call,
> >> but the changelog led completely down the wrong road.
> >>
> >> So the issue is:
> >>
> >> CPU14 CPU21
> >> create_thread(for CPU22)
> >> park_thread()
> >> wait_for_completion() park_me()
> >> complete()
> >> sched_set_stop_task()
> >> schedule(TASK_PARKED)
> >>
> >> The sched_set_stop_task() call is issued while the task is on the
> >> runqueue of CPU21 and that confuses the hell out of the stop_task
> >> class on that cpu. So as we have to synchronize the state for the
> >> bind call (the issue observed by Borislav) we need to do the same
> >> before issuing sched_set_stop_task(). Delta patch below.
> >>
> >
> > In that case, why not just apply this 2 line patch on mainline?
> > The patch I sent yesterday was more elaborate because I wrongly assumed
> > that kthread_bind() can cause a wakeup. But now, I feel the patch shown
> > below should work just fine too. Yeah, it binds the task during creation
> > as well as during unpark, but that should be ok (see below).
> >
> > Somehow, I believe we can handle this issue without introducing that
> > whole TASK_PARKED thing..
We need it as Borislav showed.
Thread is created and parked. Now after online we unpark the thread,
but something else wakes it before we reach the unpark code and boom
it's on the wrong cpu. We want that PARKED thing for
robustness. Anything else is just voodoo programming.
Thanks,
tglx
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 10:48 ` Srivatsa S. Bhat
2013-04-11 11:43 ` Srivatsa S. Bhat
@ 2013-04-11 13:46 ` Thomas Gleixner
1 sibling, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 13:46 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Thu, 11 Apr 2013, Srivatsa S. Bhat wrote:
> The reason why we can't get rid of the bind in the unpark code is because,
> the threads are parked during CPU offline *after* calling CPU_DOWN_PREPARE.
> And during CPU_DOWN_PREPARE, the scheduler removes the CPU from the cpu_active_mask.
> So on any subsequent wakeup of these threads before they are parked, the scheduler
> will force-migrate them to some other CPU (but alas it won't print this event
> because of the p->mm != NULL check in select_fallback_rq()). So during unpark
> during the next online operation we need to bind it again. But that's fine, IMHO.
No, it's not fine. If we do not have TASK_PARKED a wakeup which
happens after onlining the task and races with the unpark code will
cause the task to end up on the wrong CPU. Borislav showed traces
which prove that with ksoftirqd. So no, TASK_PARKED is going to stay.
Thanks,
tglx
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 10:19 ` Thomas Gleixner
2013-04-11 10:48 ` Srivatsa S. Bhat
@ 2013-04-11 18:07 ` Dave Hansen
2013-04-11 19:48 ` Thomas Gleixner
1 sibling, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2013-04-11 18:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/11/2013 03:19 AM, Thomas Gleixner wrote:
> --- linux-2.6.orig/kernel/smpboot.c
> +++ linux-2.6/kernel/smpboot.c
> @@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
> }
> get_task_struct(tsk);
> *per_cpu_ptr(ht->store, cpu) = tsk;
> - if (ht->create)
> - ht->create(cpu);
> + if (ht->create) {
> + /*
> + * Make sure that the task has actually scheduled out
> + * into park position, before calling the create
> + * callback. At least the migration thread callback
> + * requires that the task is off the runqueue.
> + */
> + if (!wait_task_inactive(tsk, TASK_PARKED))
> + WARN_ON(1);
> + else
> + ht->create(cpu);
> + }
> return 0;
> }
This one appears to be doing the trick. I'll run the cpus in an
online/offline loop for a bit and make sure it's stable. It's passed
several rounds so far, which is way more than it's done up to this point,
though.
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 18:07 ` Dave Hansen
@ 2013-04-11 19:48 ` Thomas Gleixner
0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 19:48 UTC (permalink / raw)
To: Dave Hansen
Cc: Borislav Petkov, Srivatsa S. Bhat, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Thu, 11 Apr 2013, Dave Hansen wrote:
> On 04/11/2013 03:19 AM, Thomas Gleixner wrote:
> > --- linux-2.6.orig/kernel/smpboot.c
> > +++ linux-2.6/kernel/smpboot.c
> > @@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotpl
> > }
> > get_task_struct(tsk);
> > *per_cpu_ptr(ht->store, cpu) = tsk;
> > - if (ht->create)
> > - ht->create(cpu);
> > + if (ht->create) {
> > + /*
> > + * Make sure that the task has actually scheduled out
> > + * into park position, before calling the create
> > + * callback. At least the migration thread callback
> > + * requires that the task is off the runqueue.
> > + */
> > + if (!wait_task_inactive(tsk, TASK_PARKED))
> > + WARN_ON(1);
> > + else
> > + ht->create(cpu);
> > + }
> > return 0;
> > }
>
> This one appears to be doing the trick. I'll run the cpus in an
> online/offline loop for a bit and make sure it's stable. It's passed
> several round so far, which is way more than it's done up to this point,
> though.
I think the most critical part is bootup when the threads are
created. The online/offline one is an issue as well, which is covered
by the patches, but way harder to trigger.
Thanks,
tglx
* [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
2013-04-09 15:55 ` Dave Hansen
2013-04-09 18:43 ` Thomas Gleixner
@ 2013-04-10 14:03 ` Srivatsa S. Bhat
2013-04-11 8:10 ` Thomas Gleixner
1 sibling, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-10 14:03 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Hi Dave,
On 04/09/2013 09:25 PM, Dave Hansen wrote:
> Hey Thomas,
>
> I don't think the patch helped my case. Looks like the same BUG_ON().
>
> I accidentally booted with possible_cpus=10 instead of 160. I wasn't
> able to trigger this in that case, even repeatedly on/offlining them.
> But, once I booted with possible_cpus=160, it triggered in a jiffy.
>
> Two oopses below (bottom one has cpu numbers):
>
So here is a little bit of a hunch + guess work ;) Could you kindly test
this patch on top of mainline and let me know if it helps?
Regards,
Srivatsa S. Bhat
-------------------------------------------------------------------------->
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
Dave Hansen and Borislav Petkov reported the following crash, which
corresponds to the following BUG_ON in smpboot_thread_fn():
BUG_ON(td->cpu != smp_processor_id());
[ 790.223270] ------------[ cut here ]------------
[ 790.223966] kernel BUG at kernel/smpboot.c:134!
[ 790.224739] invalid opcode: 0000 [#1] SMP
[ 790.225671] Modules linked in:
[ 790.226428] CPU 81
[ 790.226909] Pid: 3909, comm: migration/135 Tainted: G W 3.9.0-rc5-00184-gb6a9b7f-dirty #118 FUJITSU-SV PRIMEQUEST 1800E2/SB
[ 790.228775] RIP: 0010:[<ffffffff8110bee8>] [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
[ 790.230205] RSP: 0018:ffff88bfef9c1e08 EFLAGS: 00010202
[ 790.231090] RAX: 0000000000000051 RBX: ffff88bfefb82000 RCX: 000000000000b888
[ 790.231653] RDX: ffff88bfef9c1fd8 RSI: ffff881fff000000 RDI: 0000000000000087
[ 790.232085] RBP: ffff88bfef9c1e38 R08: 0000000000000001 R09: 0000000000000000
[ 790.232850] R10: 0000000000000018 R11: 0000000000000000 R12: ffff88bfec9e22e0
[ 790.233561] R13: ffffffff81e587a0 R14: ffff88bfec9e22e0 R15: 0000000000000000
[ 790.234004] FS: 0000000000000000(0000) GS:ffff881fff000000(0000) knlGS:0000000000000000
[ 790.234918] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 790.235602] CR2: 00007fa89a333c62 CR3: 0000000001e0b000 CR4: 00000000000007e0
[ 790.236110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 790.236584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 790.237329] Process migration/135 (pid: 3909, threadinfo ffff88bfef9c0000, task ffff88bfec9e22e0)
[ 790.238321] Stack:
[ 790.238882] ffff88bfef9c1e38 0000000000000000 ffff88ffef421cc0 ffff88bfef9c1ec0
[ 790.245415] ffff88bfefb82000 ffffffff8110bc90 ffff88bfef9c1f48 ffffffff810ff1df
[ 790.250755] 0000000000000001 0000000000000087 ffff88bfefb82000 0000000000000000
[ 790.253365] Call Trace:
[ 790.254121] [<ffffffff8110bc90>] ? __smpboot_create_thread+0x180/0x180
[ 790.255428] [<ffffffff810ff1df>] kthread+0xef/0x100
[ 790.256071] [<ffffffff819cb1a4>] ? wait_for_completion+0x124/0x180
[ 790.256697] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
[ 790.257325] [<ffffffff819dba9c>] ret_from_fork+0x7c/0xb0
[ 790.258233] [<ffffffff810ff0f0>] ? __init_kthread_worker+0x80/0x80
[ 790.258942] Code: ef 3d 01 01 48 89 df e8 87 b0 16 00 48 83 05 67 ef 3d 01 01 48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 83 05 90 ef 3d 01 01 <0f> 0b 48 83 05 96 ef 3d 01 01 48 83 05 56 ef 3d 01 01 0f 0b 48
[ 790.276178] RIP [<ffffffff8110bee8>] smpboot_thread_fn+0x258/0x280
[ 790.276735] RSP <ffff88bfef9c1e08>
[ 790.278348] ---[ end trace 84baa2bee1434240 ]---
Interestingly, in every single stack trace, the crashing task is the migration
thread. Now, the migration thread belongs to the highest-priority stop_task sched
class, and this particular sched class is unique in the way it implements
its internal sched class functions, and I suspect this has a lot of bearing
on how functions like kthread_bind(), wake_up_process() etc. interact with it
(judging by how it implements functions such as select_task_rq(),
enqueue_task(), dequeue_task() etc.).
Prior to commit 14e568e (stop_machine: Use smpboot threads), the migration
threads were set to the stop_task sched class *after* being bound to the
appropriate cpus. But that commit inadvertently reversed that sequence.
So revert to the old sequence to fix this issue.
But note that __kthread_bind() can wake up the task if the task is an RT
task. So it can be called only when the CPU (to which we want to bind the task)
is already online. So add a new function called kthread_unpark_prepare() and
call it in smpboot_unpark_thread(), to bind the kthread to the appropriate CPU.
And in stop-machine, move the sched class initialization to ->pre_unpark().
This way, it can be ensured that for stop-task class tasks (such as migration
thread), the sched class is set *after* binding the task to the CPU (and
also, the kthreads belonging to other sched classes will continue to function
properly). Also, teach sched_set_stop_task() to ignore repeated invocations
with the same task as argument.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Hansen <dave@sr71.net>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 24 ++++++++++++++++++++----
kernel/sched/core.c | 3 +++
kernel/smpboot.c | 2 ++
kernel/stop_machine.c | 7 +------
5 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 8d81664..7552eb0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -44,6 +44,7 @@ bool kthread_should_park(void);
bool kthread_freezable_should_stop(bool *was_frozen);
void *kthread_data(struct task_struct *k);
int kthread_park(struct task_struct *k);
+void kthread_unpark_prepare(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..e3325cc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -325,6 +325,25 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
}
/**
+ * kthread_unpark_prepare - prepare to unpark a thread created by
+ * kthread_create().
+ * @k: thread created by kthread_create().
+ *
+ * If the thread is marked per-cpu, it is bound to the cpu, in
+ * preparation for waking it up in the near future.
+ */
+void kthread_unpark_prepare(struct task_struct *k)
+{
+ struct kthread *kthread = task_get_live_kthread(k);
+
+ if (kthread) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu);
+ }
+ put_task_struct(k);
+}
+
+/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
*
@@ -344,11 +363,8 @@ void kthread_unpark(struct task_struct *k)
* park before that happens we'd see the IS_PARKED bit
* which might be about to be cleared.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
wake_up_process(k);
- }
}
put_task_struct(k);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..7370fad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -814,6 +814,9 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
struct task_struct *old_stop = cpu_rq(cpu)->stop;
+ if (stop == old_stop)
+ return;
+
if (stop) {
/*
* Make it appear like a SCHED_FIFO task, its something
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 8eaed9a..bd1a9175 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -209,6 +209,8 @@ static void smpboot_unpark_thread(struct smp_hotplug_thread *ht, unsigned int cp
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
+ kthread_unpark_prepare(tsk);
+
if (ht->pre_unpark)
ht->pre_unpark(cpu);
kthread_unpark(tsk);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..db32f3a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -300,11 +300,6 @@ repeat:
extern void sched_set_stop_task(int cpu, struct task_struct *stop);
-static void cpu_stop_create(unsigned int cpu)
-{
- sched_set_stop_task(cpu, per_cpu(cpu_stopper_task, cpu));
-}
-
static void cpu_stop_park(unsigned int cpu)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
@@ -323,6 +318,7 @@ static void cpu_stop_unpark(unsigned int cpu)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
+ sched_set_stop_task(cpu, per_cpu(cpu_stopper_task, cpu));
spin_lock_irq(&stopper->lock);
stopper->enabled = true;
spin_unlock_irq(&stopper->lock);
@@ -333,7 +329,6 @@ static struct smp_hotplug_thread cpu_stop_threads = {
.thread_should_run = cpu_stop_should_run,
.thread_fn = cpu_stopper_thread,
.thread_comm = "migration/%u",
- .create = cpu_stop_create,
.setup = cpu_stop_unpark,
.park = cpu_stop_park,
.pre_unpark = cpu_stop_unpark,
* Re: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
2013-04-10 14:03 ` [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn() Srivatsa S. Bhat
@ 2013-04-11 8:10 ` Thomas Gleixner
2013-04-11 10:19 ` Srivatsa S. Bhat
0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 8:10 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On Wed, 10 Apr 2013, Srivatsa S. Bhat wrote:
> Interestingly, in every single stack trace, the crashing task is the migration
> thread. Now, the migration thread belongs to the highest-priority stop_task sched
> class, and this particular sched class is unique in the way it implements
> its internal sched class functions, and I suspect this has a lot of bearing
> on how functions like kthread_bind(), wake_up_process() etc. interact with it
> (judging by how it implements functions such as select_task_rq(),
> enqueue_task(), dequeue_task() etc.).
I don't think that's relevant. The migration thread can only be woken
via try_to_wakeup and my previous patch which implements a separate
task state makes sure that it cannot be woken accidentaly by anything
else than unpark.
> But note that __kthread_bind() can wake up the task if the task is an RT
> task. So it can be called only when the CPU (to which we want to bind the task)
> kthread_bind() does NOT wake up anything. It merely sets the cpus
> allowed ptr without further ado.
Thanks,
tglx
* Re: [PATCH] CPU hotplug, smpboot: Fix crash in smpboot_thread_fn()
2013-04-11 8:10 ` Thomas Gleixner
@ 2013-04-11 10:19 ` Srivatsa S. Bhat
0 siblings, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 10:19 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Dave Hansen, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/11/2013 01:40 PM, Thomas Gleixner wrote:
> On Wed, 10 Apr 2013, Srivatsa S. Bhat wrote:
>
>> Interestingly, in every single stack trace, the crashing task is the migration
>> thread. Now, the migration thread belongs to the highest-priority stop_task sched
>> class, and this particular sched class is unique in the way it implements
>> its internal sched class functions, and I suspect this has a lot of bearing
>> on how functions like kthread_bind(), wake_up_process() etc. interact with it
>> (judging by how it implements functions such as select_task_rq(),
>> enqueue_task(), dequeue_task() etc.).
>
> I don't think that's relevant. The migration thread can only be woken
> via try_to_wakeup and my previous patch which implements a separate
> task state makes sure that it cannot be woken accidentaly by anything
> else than unpark.
>
Hmm, but it's got to be simpler than that, no? Given that it used to work fine
before...
>> But note that __kthread_bind() can wake up the task if the task is an RT
>> task. So it can be called only when the CPU (to which we want to bind the task)
>
> kthread_bind() does NOT wakeup anything. It merily sets the cpus
> allowed ptr without further ado.
>
Sorry, I was mistaken, misled by a bug in the code I was testing.
I had intended to move kthread_bind() to the body of kthread_create_on_cpu()
and place it after the call to kthread_park(), as shown below:
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..b485fc0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ wait_task_inactive(p, TASK_INTERRUPTIBLE);
+ __kthread_bind(p, cpu);
return p;
}
But by mistake, I had written the code as:
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..b485fc0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+ __kthread_bind(p, cpu);
return p;
}
So, no wonder it never actually bound the task to the CPU. So when I gave this a run,
I saw watchdog threads hitting the same BUG_ON(), and since watchdog threads are
of RT priority, and RT is the only class that implements ->set_cpus_allowed(), I
thought that those threads got woken up due to the bind. But I was mistaken of
course, because I had checked for the wrong return value of wait_task_inactive().
Regards,
Srivatsa S. Bhat
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 14:38 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
2013-04-09 15:55 ` Dave Hansen
@ 2013-04-11 19:16 ` Srivatsa S. Bhat
2013-04-11 20:47 ` Thomas Gleixner
2013-04-12 10:41 ` Peter Zijlstra
2013-04-12 12:32 ` [tip:core/urgent] " tip-bot for Thomas Gleixner
3 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 19:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
> The smpboot threads rely on the park/unpark mechanism which binds per
> cpu threads on a particular core. Though the functionality is racy:
>
> CPU0 CPU1 CPU2
> unpark(T) wake_up_process(T)
> clear(SHOULD_PARK) T runs
> leave parkme() due to !SHOULD_PARK
> bind_to(CPU2) BUG_ON(wrong CPU)
>
OK, I must admit that I hadn't noticed that the task was ksoftirqd
and not the migration thread in Boris' trace. And yes, this unexpected
wakeup is a problem for ksoftirqd.
> We cannot let the tasks move themself to the target CPU as one of
> those tasks is actually the migration thread itself, which requires
> that it starts running on the target cpu right away.
>
Of course, we can't use set_cpus_allowed_ptr(), but we can still achieve
the desired bind effect without any race, see below.
> The only rock solid solution to this problem is to prevent wakeups in
> park state which are not from unpark(). That way we can guarantee that
> the association of the task to the target cpu is working correctly.
>
> Add a new task state (TASK_PARKED) which prevents other wakeups and
> use this state explicitely for the unpark wakeup.
>
Again, I think this is unnecessary. We are good as long as no one other
than the unpark code can kick the kthreads out of the loop in the park
code. Now that I understand the race you explained above, why not just
fix that race itself by reversing the ordering of clear(SHOULD_PARK)
and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
kthread, it will just remain confined to the park code, as intended.
A patch like below should do it IMHO. I guess I'm being a little too
persistent, sorry!
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9512fc5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
to_kthread(p)->cpu = cpu;
/* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
kthread_park(p);
+
+ /*
+ * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
+ * migration thread (which belongs to the stop_task sched class)
+ * doesn't run until the cpu is actually onlined and the thread is
+ * unparked.
+ */
+ if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
+ WARN_ON(1);
return p;
}
@@ -324,6 +333,17 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
return NULL;
}
+static void __kthread_park(struct task_struct *k, struct kthread *kthread)
+{
+ if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ if (k != current) {
+ wake_up_process(k);
+ wait_for_completion(&kthread->parked);
+ }
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
struct kthread *kthread = task_get_live_kthread(k);
if (kthread) {
+ /*
+ * Per-cpu kthreads such as ksoftirqd can get woken up by
+ * other events. So after binding the thread, ensure that
+ * it goes off the CPU atleast once, by parking it again.
+ * This way, we can ensure that it will run on the correct
+ * CPU on subsequent wakeup.
+ */
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
+ __kthread_bind(k, kthread->cpu);
+ clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
+ __kthread_park(k, kthread);
+ }
+
clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+
/*
* We clear the IS_PARKED bit here as we don't wait
* until the task has left the park code. So if we'd
* park before that happens we'd see the IS_PARKED bit
* which might be about to be cleared.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
wake_up_process(k);
- }
}
put_task_struct(k);
}
@@ -371,13 +402,7 @@ int kthread_park(struct task_struct *k)
int ret = -ENOSYS;
if (kthread) {
- if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- if (k != current) {
- wake_up_process(k);
- wait_for_completion(&kthread->parked);
- }
- }
+ __kthread_park(k, kthread);
ret = 0;
}
put_task_struct(k);
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 19:16 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
@ 2013-04-11 20:47 ` Thomas Gleixner
2013-04-11 21:19 ` Srivatsa S. Bhat
0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-11 20:47 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Srivatsa,
On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
> > Add a new task state (TASK_PARKED) which prevents other wakeups and
> > use this state explicitely for the unpark wakeup.
> >
>
> Again, I think this is unnecessary. We are good as long as no one other
> than the unpark code can kick the kthreads out of the loop in the park
> code. Now that I understand the race you explained above, why not just
> fix that race itself by reversing the ordering of clear(SHOULD_PARK)
> and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
> kthread, it will just remain confined to the park code, as intended.
In theory.
> A patch like below should do it IMHO. I guess I'm being a little too
> persistent, sorry!
No, it's not about being persistent; you're JUST too much into voodoo
programming instead of going for the straightforward and robust
solutions.
Darn, I hate introducing a new task state as much as everyone else
does, but that state allows us to make guarantees and gives us
semantic clarity. A parked thread is parked and can only be woken up
by the unpark code. That's clear semantics, not a magic construct
which will work in most cases and, for the remaining ones (see below),
give us problems which are way harder to decode than the ones
we tried to fix with that magic.
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 691dc2e..9512fc5 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
> to_kthread(p)->cpu = cpu;
> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
> kthread_park(p);
> +
> + /*
> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> + * migration thread (which belongs to the stop_task sched class)
> + * doesn't run until the cpu is actually onlined and the thread is
> + * unparked.
> + */
> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> + WARN_ON(1);
Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
references outside the creation code. And then we _HOPE_ that nothing
wakes it up _BEFORE_ we do something else.
Aside from that, you are still insisting on enforcing that for every
per-cpu thread, even though the only ones which need it at this point
are those which have a create() callback (i.e. the migration thread).
And next week you'll figure out that this is a performance impact on
bringing up large machines....
> /**
> * kthread_unpark - unpark a thread created by kthread_create().
> * @k: thread created by kthread_create().
> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
> struct kthread *kthread = task_get_live_kthread(k);
>
> if (kthread) {
> + /*
> + * Per-cpu kthreads such as ksoftirqd can get woken up by
> + * other events. So after binding the thread, ensure that
> + * it goes off the CPU atleast once, by parking it again.
> + * This way, we can ensure that it will run on the correct
> + * CPU on subsequent wakeup.
> + */
> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
> + __kthread_bind(k, kthread->cpu);
> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
And how is that f*cking different from the previous code?
CPU0 CPU1 CPU2
wakeup(T) -> run on CPU1 (last cpu)
switch_to(T)
__kthread_bind(T, CPU2)
clear(KTHREAD_IS_PARKED)
leave loop due to !KTHREAD_IS_PARKED
BUG(wrong cpu) <--- VOODOO FAILURE
kthread_park(T) <-- VOODOO TOO LATE
You can turn around the order of clearing/setting the flags as much as
you want; I'm going to punch a hole in it.
TASK_PARKED is the very obvious and robust solution which fixes _ALL_
of the corner cases, at least as far as I can imagine them. And
robustness rules at least in my world.
Thanks,
tglx
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 20:47 ` Thomas Gleixner
@ 2013-04-11 21:19 ` Srivatsa S. Bhat
2013-04-12 10:59 ` Thomas Gleixner
0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-11 21:19 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
> Srivatsa,
>
> On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
>> On 04/09/2013 08:08 PM, Thomas Gleixner wrote:
>>> Add a new task state (TASK_PARKED) which prevents other wakeups and
>>> use this state explicitely for the unpark wakeup.
>>>
>>
>> Again, I think this is unnecessary. We are good as long as no one other
>> than the unpark code can kick the kthreads out of the loop in the park
>> code. Now that I understand the race you explained above, why not just
>> fix that race itself by reversing the ordering of clear(SHOULD_PARK)
>> and bind_to(CPU2)? That way, even if someone else wakes up the per-cpu
>> kthread, it will just remain confined to the park code, as intended.
>
> In theory.
>
>> A patch like below should do it IMHO. I guess I'm being a little too
>> persistent, sorry!
>
> No, it's not about being persistent; you're JUST too much into voodoo
> programming instead of going for the straightforward and robust
> solutions.
>
> Darn, I hate introducing a new task state as much as everyone else
> does, but that state allows us to make guarantees and gives us
> semantic clarity. A parked thread is parked and can only be woken up
> by the unpark code. That's clear semantics, not a magic construct
> which will work in most cases and, for the remaining ones (see below),
> give us problems which are way harder to decode than the ones
> we tried to fix with that magic.
>
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 691dc2e..9512fc5 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -308,6 +308,15 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
>> to_kthread(p)->cpu = cpu;
>> /* Park the thread to get it out of TASK_UNINTERRUPTIBLE state */
>> kthread_park(p);
>> +
>> + /*
>> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
>> + * migration thread (which belongs to the stop_task sched class)
>> + * doesn't run until the cpu is actually onlined and the thread is
>> + * unparked.
>> + */
>> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
>> + WARN_ON(1);
>
> Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
> references outside the creation code.
I doubt that. We have not even onlined the CPU, so how would anyone else
even _know_ that we created this kthread??
*per_cpu_ptr(ht->store, cpu) = tsk; is executed _after_ returning from
this function.
The problem with ksoftirqd is very clear - we unpark threads _after_ we
online the CPU. So, in between the 2 steps, somebody on that CPU can call
__do_softirq(), leading to the race you described in your cover-letter.
That's why I tried to fix that race.
> And then we _HOPE_ that nothing
> wakes it up _BEFORE_ we do something else.
>
Nothing can wake it up, because no one is aware of the newly created
kthread.
> Aside from that, you are still insisting on enforcing that for every
> per-cpu thread, even though the only ones which need it at this point
> are those which have a create() callback (i.e. the migration thread).
> And next week you'll figure out that this is a performance impact on
> bringing up large machines....
>
Making this wait call specific to those kthreads with the ->create callback
won't be that big a deal, IMHO. But see below, I'm not going to
insist on going with my suggestions.
>> /**
>> * kthread_unpark - unpark a thread created by kthread_create().
>> * @k: thread created by kthread_create().
>> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>> struct kthread *kthread = task_get_live_kthread(k);
>>
>> if (kthread) {
>> + /*
>> + * Per-cpu kthreads such as ksoftirqd can get woken up by
>> + * other events. So after binding the thread, ensure that
>> + * it goes off the CPU atleast once, by parking it again.
>> + * This way, we can ensure that it will run on the correct
>> + * CPU on subsequent wakeup.
>> + */
>> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
>> + __kthread_bind(k, kthread->cpu);
>> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
>
> And how is that f*cking different from the previous code?
>
> CPU0 CPU1 CPU2
> wakeup(T) -> run on CPU1 (last cpu)
>
> switch_to(T)
>
> __kthread_bind(T, CPU2)
>
> clear(KTHREAD_IS_PARKED)
>
> leave loop due to !KTHREAD_IS_PARKED
How?? The task will leave the loop only when we clear
SHOULD_PARK, not when we clear IS_PARKED. So it won't
leave the loop here. It will cause the kthread to
perform a fresh complete() for the waiting kthread_park()
on CPU0.
>
> BUG(wrong cpu) <--- VOODOO FAILURE
>
> kthread_park(T) <-- VOODOO TOO LATE
>
No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
ensure that the task gets *descheduled* at least once after we did the
kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
migrate a running kthread (because the kthread could be the migration
thread). So instead, we use kthread_bind() and depend on sleep->wakeup
to put the task on the right CPU.
> You can turn around the order of clearing/setting the flags as much as
> you want; I'm going to punch a hole in it.
>
> TASK_PARKED is the very obvious and robust solution which fixes _ALL_
> of the corner cases, at least as far as I can imagine them. And
> robustness rules at least in my world.
>
Yes, I agree that it is robust and has clear semantics. No doubt about
that. So I won't insist on going with my suggestions.
Regards,
Srivatsa S. Bhat
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-11 21:19 ` Srivatsa S. Bhat
@ 2013-04-12 10:59 ` Thomas Gleixner
2013-04-12 11:26 ` Srivatsa S. Bhat
2013-04-15 19:49 ` Dave Hansen
0 siblings, 2 replies; 37+ messages in thread
From: Thomas Gleixner @ 2013-04-12 10:59 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Srivatsa,
On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
> On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
> >> +
> >> + /*
> >> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
> >> + * migration thread (which belongs to the stop_task sched class)
> >> + * doesn't run until the cpu is actually onlined and the thread is
> >> + * unparked.
> >> + */
> >> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
> >> + WARN_ON(1);
> >
> > Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
> > references outside the creation code.
>
> I doubt that. We have not even onlined the CPU, so how would anyone else
> even _know_ that we created this kthread??
The problem is not only at the thread creation time. We have the same
issue at offline/online and there we have a reference to that very
thread.
> >> /**
> >> * kthread_unpark - unpark a thread created by kthread_create().
> >> * @k: thread created by kthread_create().
> >> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
> >> struct kthread *kthread = task_get_live_kthread(k);
> >>
> >> if (kthread) {
> >> + /*
> >> + * Per-cpu kthreads such as ksoftirqd can get woken up by
> >> + * other events. So after binding the thread, ensure that
> >> + * it goes off the CPU atleast once, by parking it again.
> >> + * This way, we can ensure that it will run on the correct
> >> + * CPU on subsequent wakeup.
> >> + */
> >> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
> >> + __kthread_bind(k, kthread->cpu);
> >> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
> >
> > And how is that f*cking different from the previous code?
> >
> > CPU0 CPU1 CPU2
> > wakeup(T) -> run on CPU1 (last cpu)
> >
> > switch_to(T)
> >
> > __kthread_bind(T, CPU2)
> >
> > clear(KTHREAD_IS_PARKED)
> >
> > leave loop due to !KTHREAD_IS_PARKED
>
> How?? The task will leave the loop only when we clear
> SHOULD_PARK, not when we clear IS_PARKED. So it won't
> leave the loop here. It will cause the kthread to
> perform a fresh complete() for the waiting kthread_park()
> on CPU0.
You are right on that, but you tricked me into misreading your
patch. Why? Simply because it is too complex for no reason.
> No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
> ensure that the task gets *descheduled* at least once after we did the
> kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
> migrate a running kthread (because the kthread could be the migration
> thread). So instead, we use kthread_bind() and depend on sleep->wakeup
> to put the task on the right CPU.
Yeah, it's a nice workaround, though I really prefer a guaranteed well
defined state over this wakeup/sleep/wakeup trickery, which also adds
the additional cost of a wakeup/sleep cycle to the online operation.
> > TASK_PARKED is the very obvious and robust solution which fixes _ALL_
> > of the corner cases, at least as far as I can imagine them. And
> > robustness rules at least in my world.
> >
>
> Yes, I agree that it is robust and has clear semantics. No doubt about
> that. So I won't insist on going with my suggestions.
I'm glad, that we can agree on the robust solution :)
Thanks,
tglx
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-12 10:59 ` Thomas Gleixner
@ 2013-04-12 11:26 ` Srivatsa S. Bhat
2013-04-15 19:49 ` Dave Hansen
1 sibling, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2013-04-12 11:26 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Dave Hansen, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
Hi Thomas,
On 04/12/2013 04:29 PM, Thomas Gleixner wrote:
> Srivatsa,
>
> On Fri, 12 Apr 2013, Srivatsa S. Bhat wrote:
>> On 04/12/2013 02:17 AM, Thomas Gleixner wrote:
>>>> +
>>>> + /*
>>>> + * Wait for p->on_rq to be reset to 0, to ensure that the per-cpu
>>>> + * migration thread (which belongs to the stop_task sched class)
>>>> + * doesn't run until the cpu is actually onlined and the thread is
>>>> + * unparked.
>>>> + */
>>>> + if (!wait_task_inactive(p, TASK_INTERRUPTIBLE))
>>>> + WARN_ON(1);
>>>
>>> Yay, we rely on TASK_INTERRUPTIBLE state with a task which already has
>>> references outside the creation code.
>>
>> I doubt that. We have not even onlined the CPU, so how would anyone else
>> even _know_ that we created this kthread??
>
> The problem is not only at the thread creation time. We have the same
> issue at offline/online and there we have a reference to that very
> thread.
>
Right. So our solutions differ in how that is handled, like this:
Yours: ensures that nobody can wakeup the parked thread, except the unpark
code.
Mine: ensures that nobody can make the parked thread leave its park loop
(even if it is woken up), except the unpark code.
Apart from this, everything else is mostly the same - e.g., both patches
depend on that wait_task_inactive() call, in order to make the migration
thread behave.
Either way, the purpose is served, so I'm fine with your solution.
(One of the reasons why I was confident of coming up with a working solution
without adding a new state was because I've worked on the freezer code before,
and IIRC, we have more or less similar problems there and we manage to deal
with it without having a dedicated TASK_FROZEN state. Anyway, never mind... )
>>>> /**
>>>> * kthread_unpark - unpark a thread created by kthread_create().
>>>> * @k: thread created by kthread_create().
>>>> @@ -337,18 +357,29 @@ void kthread_unpark(struct task_struct *k)
>>>> struct kthread *kthread = task_get_live_kthread(k);
>>>>
>>>> if (kthread) {
>>>> + /*
>>>> + * Per-cpu kthreads such as ksoftirqd can get woken up by
>>>> + * other events. So after binding the thread, ensure that
>>>> + * it goes off the CPU atleast once, by parking it again.
>>>> + * This way, we can ensure that it will run on the correct
>>>> + * CPU on subsequent wakeup.
>>>> + */
>>>> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) {
>>>> + __kthread_bind(k, kthread->cpu);
>>>> + clear_bit(KTHREAD_IS_PARKED, &kthread->flags);
>>>
>>> And how is that f*cking different from the previous code?
>>>
>>> CPU0 CPU1 CPU2
>>> wakeup(T) -> run on CPU1 (last cpu)
>>>
>>> switch_to(T)
>>>
>>> __kthread_bind(T, CPU2)
>>>
>>> clear(KTHREAD_IS_PARKED)
>>>
>>> leave loop due to !KTHREAD_IS_PARKED
>>
>> How?? The task will leave the loop only when we clear
>> SHOULD_PARK, not when we clear IS_PARKED. So it won't
>> leave the loop here. It will cause the kthread to
>> perform a fresh complete() for the waiting kthread_park()
>> on CPU0.
>
> You are right on that, but you tricked me into misreading your
> patch. Why? Simply because it is too complex for no reason.
>
;-)
>> No, the purpose of clear(IS_PARKED) followed by __kthread_park() is to
>> ensure that the task gets *descheduled* at least once after we did the
>> kthread_bind(). And that's because we can't use set_cpus_allowed_ptr() to
>> migrate a running kthread (because the kthread could be the migration
>> thread). So instead, we use kthread_bind() and depend on sleep->wakeup
>> to put the task on the right CPU.
>
> Yeah, it's a nice workaround, though I really prefer a guaranteed well
> defined state over this wakeup/sleep/wakeup trickery, which also adds
> the additional cost of a wakeup/sleep cycle to the online operation.
>
Sure, no objections from me!
>>> TASK_PARKED is the very obvious and robust solution which fixes _ALL_
>>> of the corner cases, at least as far as I can imagine them. And
>>> robustness rules at least in my world.
>>>
>>
>> Yes, I agree that it is robust and has clear semantics. No doubt about
>> that. So I won't insist on going with my suggestions.
>
> I'm glad, that we can agree on the robust solution :)
>
I'm glad too :-) Thanks a lot!
Regards,
Srivatsa S. Bhat
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-12 10:59 ` Thomas Gleixner
2013-04-12 11:26 ` Srivatsa S. Bhat
@ 2013-04-15 19:49 ` Dave Hansen
1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2013-04-15 19:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Srivatsa S. Bhat, Borislav Petkov, LKML, Dave Jones, dhillf,
Peter Zijlstra, Ingo Molnar
I just tested out Linus's current tree (bb33db7). It is quite happy on
the large system which was triggering this issue. Thanks, Thomas!
* Re: [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 14:38 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
2013-04-09 15:55 ` Dave Hansen
2013-04-11 19:16 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Srivatsa S. Bhat
@ 2013-04-12 10:41 ` Peter Zijlstra
2013-04-12 12:32 ` [tip:core/urgent] " tip-bot for Thomas Gleixner
3 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2013-04-12 10:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Borislav Petkov, Srivatsa S. Bhat, Dave Hansen, LKML, Dave Jones,
dhillf, Ingo Molnar
On Tue, 2013-04-09 at 16:38 +0200, Thomas Gleixner wrote:
> The smpboot threads rely on the park/unpark mechanism which binds per
> cpu threads on a particular core. Though the functionality is racy:
>
> CPU0 CPU1 CPU2
> unpark(T) wake_up_process(T)
> clear(SHOULD_PARK) T runs
> leave parkme() due to !SHOULD_PARK
> bind_to(CPU2) BUG_ON(wrong CPU)
>
> We cannot let the tasks move themselves to the target CPU as one of
> those tasks is actually the migration thread itself, which requires
> that it starts running on the target cpu right away.
>
> The only rock solid solution to this problem is to prevent wakeups in
> park state which are not from unpark(). That way we can guarantee that
> the association of the task to the target cpu is working correctly.
>
> Add a new task state (TASK_PARKED) which prevents other wakeups and
> use this state explicitely for the unpark wakeup.
explicitly
Also, since the task state is visible to userspace and all the parked
tasks are still in the PID space, its a good hint in ps and friends
that these tasks aren't really there for the moment.
>
> Reported-by: Dave Jones <davej@redhat.com>
> Reported-by: Dave Hansen <dave@sr71.net>
> Reported-by: Borislav Petkov <bp@alien8.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Assuming you're going to merge in the missing trace hunk you posted
further down the thread...
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> fs/proc/array.c | 1 +
> include/linux/sched.h | 5 +++--
> kernel/kthread.c | 38 +++++++++++++++++++++-----------------
> 3 files changed, 25 insertions(+), 19 deletions(-)
>
> Index: linux-2.6/fs/proc/array.c
> ===================================================================
> --- linux-2.6.orig/fs/proc/array.c
> +++ linux-2.6/fs/proc/array.c
> @@ -143,6 +143,7 @@ static const char * const task_state_arr
> "x (dead)", /* 64 */
> "K (wakekill)", /* 128 */
> "W (waking)", /* 256 */
> + "P (parked)", /* 512 */
> };
>
> static inline const char *get_task_state(struct task_struct *tsk)
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu
> #define TASK_DEAD 64
> #define TASK_WAKEKILL 128
> #define TASK_WAKING 256
> -#define TASK_STATE_MAX 512
> +#define TASK_PARKED 512
> +#define TASK_STATE_MAX 1024
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
>
> extern char ___assert_task_state[1 - 2*!!(
> sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
> Index: linux-2.6/kernel/kthread.c
> ===================================================================
> --- linux-2.6.orig/kernel/kthread.c
> +++ linux-2.6/kernel/kthread.c
> @@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *t
>
> static void __kthread_parkme(struct kthread *self)
> {
> - __set_current_state(TASK_INTERRUPTIBLE);
> + __set_current_state(TASK_PARKED);
> while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
> if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
> complete(&self->parked);
> schedule();
> - __set_current_state(TASK_INTERRUPTIBLE);
> + __set_current_state(TASK_PARKED);
> }
> clear_bit(KTHREAD_IS_PARKED, &self->flags);
> __set_current_state(TASK_RUNNING);
> @@ -324,6 +324,22 @@ static struct kthread *task_get_live_kth
> return NULL;
> }
>
> +static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
> +{
> + clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> + /*
> + * We clear the IS_PARKED bit here as we don't wait
> + * until the task has left the park code. So if we'd
> + * park before that happens we'd see the IS_PARKED bit
> + * which might be about to be cleared.
> + */
> + if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> + if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> + __kthread_bind(k, kthread->cpu);
> + wake_up_state(k, TASK_PARKED);
> + }
> +}
> +
> /**
> * kthread_unpark - unpark a thread created by kthread_create().
> * @k: thread created by kthread_create().
> @@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *
> {
> struct kthread *kthread = task_get_live_kthread(k);
>
> - if (kthread) {
> - clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> - /*
> - * We clear the IS_PARKED bit here as we don't wait
> - * until the task has left the park code. So if we'd
> - * park before that happens we'd see the IS_PARKED bit
> - * which might be about to be cleared.
> - */
> - if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
> - if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
> - __kthread_bind(k, kthread->cpu);
> - wake_up_process(k);
> - }
> - }
> + if (kthread)
> + __kthread_unpark(k, kthread);
> put_task_struct(k);
> }
>
> @@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
> trace_sched_kthread_stop(k);
> if (kthread) {
> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
> - clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
> + __kthread_unpark(k, kthread);
> wake_up_process(k);
> wait_for_completion(&kthread->exited);
> }
* [tip:core/urgent] kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-09 14:38 ` [PATCH] kthread: Prevent unpark race which puts threads on the wrong cpu Thomas Gleixner
` (2 preceding siblings ...)
2013-04-12 10:41 ` Peter Zijlstra
@ 2013-04-12 12:32 ` tip-bot for Thomas Gleixner
3 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Thomas Gleixner @ 2013-04-12 12:32 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, srivatsa.bhat, davej, bp, dave, tglx
Commit-ID: f2530dc71cf0822f90bb63ea4600caaef33a66bb
Gitweb: http://git.kernel.org/tip/f2530dc71cf0822f90bb63ea4600caaef33a66bb
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 9 Apr 2013 09:33:34 +0200
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 12 Apr 2013 14:18:43 +0200
kthread: Prevent unpark race which puts threads on the wrong cpu
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:
CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)
We cannot let the tasks move themselves to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.
The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.
Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.
Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, it's a good hint in ps
and friends that these tasks aren't really there for the moment.
The migration thread has another related issue.
CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronization before
sched_set_stop_task().
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
fs/proc/array.c | 1 +
include/linux/sched.h | 5 +++--
include/trace/events/sched.h | 2 +-
kernel/kthread.c | 52 ++++++++++++++++++++++++--------------------
kernel/smpboot.c | 14 ++++++++++--
5 files changed, 45 insertions(+), 29 deletions(-)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index f7ed9ee..cbd0f1b 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -143,6 +143,7 @@ static const char * const task_state_array[] = {
"x (dead)", /* 64 */
"K (wakekill)", /* 128 */
"W (waking)", /* 256 */
+ "P (parked)", /* 512 */
};
static inline const char *get_task_state(struct task_struct *tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..e692a02 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -163,9 +163,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
-#define TASK_STATE_MAX 512
+#define TASK_PARKED 512
+#define TASK_STATE_MAX 1024
-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 5a8671e..e5586ca 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -147,7 +147,7 @@ TRACE_EVENT(sched_switch,
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
- { 128, "W" }) : "R",
+ { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 691dc2e..9eb7fed 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -124,12 +124,12 @@ void *kthread_data(struct task_struct *task)
static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
@@ -256,8 +256,13 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
}
EXPORT_SYMBOL(kthread_create_on_node);
-static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
{
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, state)) {
+ WARN_ON(1);
+ return;
+ }
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
@@ -274,12 +279,7 @@ static void __kthread_bind(struct task_struct *p, unsigned int cpu)
*/
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- __kthread_bind(p, cpu);
+ __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(kthread_bind);
@@ -324,6 +324,22 @@ static struct kthread *task_get_live_kthread(struct task_struct *k)
return NULL;
}
+static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
+{
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ /*
+ * We clear the IS_PARKED bit here as we don't wait
+ * until the task has left the park code. So if we'd
+ * park before that happens we'd see the IS_PARKED bit
+ * which might be about to be cleared.
+ */
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ wake_up_state(k, TASK_PARKED);
+ }
+}
+
/**
* kthread_unpark - unpark a thread created by kthread_create().
* @k: thread created by kthread_create().
@@ -336,20 +352,8 @@ void kthread_unpark(struct task_struct *k)
{
struct kthread *kthread = task_get_live_kthread(k);
- if (kthread) {
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- /*
- * We clear the IS_PARKED bit here as we don't wait
- * until the task has left the park code. So if we'd
- * park before that happens we'd see the IS_PARKED bit
- * which might be about to be cleared.
- */
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu);
- wake_up_process(k);
- }
- }
+ if (kthread)
+ __kthread_unpark(k, kthread);
put_task_struct(k);
}
@@ -407,7 +411,7 @@ int kthread_stop(struct task_struct *k)
trace_sched_kthread_stop(k);
if (kthread) {
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ __kthread_unpark(k, kthread);
wake_up_process(k);
wait_for_completion(&kthread->exited);
}
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 8eaed9a..02fc5c9 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -185,8 +185,18 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
}
get_task_struct(tsk);
*per_cpu_ptr(ht->store, cpu) = tsk;
- if (ht->create)
- ht->create(cpu);
+ if (ht->create) {
+ /*
+ * Make sure that the task has actually scheduled out
+ * into park position, before calling the create
+ * callback. At least the migration thread callback
+ * requires that the task is off the runqueue.
+ */
+ if (!wait_task_inactive(tsk, TASK_PARKED))
+ WARN_ON(1);
+ else
+ ht->create(cpu);
+ }
return 0;
}