* 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
@ 2010-07-07  4:46 Mike Galbraith
  2010-07-07  8:03 ` Mike Galbraith
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-07  4:46 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Thomas Gleixner, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 4833 bytes --]

Greetings,

Stress testing, looking to trigger RCU stalls, I've managed to find a
way to repeatably create fireworks.  (got RCU stall, see attached)

1. download ltp-full-20100630.  Needs to be this version because of
testcase bustage in earlier versions, and must be built with gcc > 4.3,
else testcases will segfault due to a gcc bug. 

2. apply patchlet so you can run testcases/realtime/perf/latency/run.sh
at all.

--- pthread_cond_many.c.org	2010-07-05 09:05:59.000000000 +0200
+++ pthread_cond_many.c	2010-07-04 12:12:25.000000000 +0200
@@ -259,7 +259,7 @@ void usage(void)
 
 int parse_args(int c, char *v)
 {
-	int handled;
+	int handled = 1;
         switch (c) {
 		case 'h':
 			usage();

3. add --realtime for no particular reason.

--- run.sh.org	2010-07-06 15:54:58.000000000 +0200
+++ run.sh	2010-07-06 16:37:34.000000000 +0200
@@ -22,7 +22,7 @@ make
 # process to run realtime.  The remainder of the processes (if any)
 # will run non-realtime in any case.
 
-nthread=5000
+nthread=500
 iter=400
 nproc=5
 
@@ -39,7 +39,7 @@ i=0
 i=1
 while test $i -lt $nproc
 do
-        ./pthread_cond_many --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
+        ./pthread_cond_many --realtime --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
         i=`expr $i + 1`
 done
 wait

4. run it.

What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter);
this does not make it to the consoles (poking sysrq-foo doesn't either).
Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
explosion, which does make it to the consoles.

With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
NULL at times, but that, like !pendowner->pi_blocked_on, seems to be
fallout.  The start of the bad juju is always pi_blocked_on != waiter.
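
For reference, the check sequence in kernel/rtmutex.c's
wakeup_next_waiter() looks roughly like this in 2.6.33-rt (a sketch of
just the relevant lines; the full context appears in the patches later
in this thread):

	raw_spin_lock(&pendowner->pi_lock);

	WARN_ON(!pendowner->pi_blocked_on);
	WARN_ON(pendowner->pi_blocked_on != waiter);
	/* the NULL explosion is the dereference here when pi_blocked_on
	   is NULL */
	WARN_ON(pendowner->pi_blocked_on->lock != lock);

	pendowner->pi_blocked_on = NULL;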

[  141.609268] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[  141.609268] IP: [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
[  141.609268] PGD 20e174067 PUD 20e253067 PMD 0 
[  141.609268] Oops: 0000 [#1] PREEMPT SMP 
[  141.609268] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[  141.609268] CPU 0 
[  141.609268] Pid: 8154, comm: pthread_cond_ma Tainted: G        W  2.6.33.6-rt23 #12 MS-7502/MS-7502
[  141.609268] RIP: 0010:[<ffffffff8106856d>]  [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
[  141.609268] RSP: 0018:ffff88020e3cdd78  EFLAGS: 00010097
[  141.609268] RAX: 0000000000000000 RBX: ffff8801e8eba5c0 RCX: 0000000000000000
[  141.609268] RDX: ffff880028200000 RSI: 0000000000000046 RDI: 0000000000000009
[  141.609268] RBP: ffff88020e3cdda8 R08: 0000000000000002 R09: 0000000000000000
[  141.609268] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff81659068
[  141.609268] R13: ffff8801e8ebdb58 R14: 0000000000000000 R15: ffff8801e8ebac08
[  141.609268] FS:  00007f664d539700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
[  141.609268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  141.609268] CR2: 0000000000000058 CR3: 0000000214266000 CR4: 00000000000006f0
[  141.609268] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  141.609268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  141.609268] Process pthread_cond_ma (pid: 8154, threadinfo ffff88020e3cc000, task ffff88020e2a4700)
[  141.609268] Stack:
[  141.609268]  0000000000000000 ffffffff81659068 0000000000000202 0000000000000000
[  141.609268] <0> 0000000000000000 0000000080001fda ffff88020e3cddc8 ffffffff812fec48
[  141.609268] <0> ffffffff81659068 0000000000606300 ffff88020e3cddd8 ffffffff812ff1b9
[  141.609268] Call Trace:
[  141.609268]  [<ffffffff812fec48>] rt_spin_lock_slowunlock+0x43/0x61
[  141.609268]  [<ffffffff812ff1b9>] rt_spin_unlock+0x46/0x48
[  141.609268]  [<ffffffff81067d7f>] do_futex+0x83c/0x935
[  141.609268]  [<ffffffff810c26ce>] ? handle_mm_fault+0x6de/0x6f1
[  141.609268]  [<ffffffff81067e36>] ? do_futex+0x8f3/0x935
[  141.609268]  [<ffffffff81067fba>] sys_futex+0x142/0x154
[  141.609268]  [<ffffffff81020eb0>] ? do_page_fault+0x23e/0x28e
[  141.609268]  [<ffffffff81004aa7>] ? math_state_restore+0x3d/0x3f
[  141.609268]  [<ffffffff81004b08>] ? do_device_not_available+0xe/0x12
[  141.609268]  [<ffffffff81002c5b>] system_call_fastpath+0x16/0x1b
[  141.609268] Code: c7 09 6d 41 81 e8 ac 34 fd ff 4c 39 ab 70 06 00 00 74 11 be 47 02 00 00 48 c7 c7 09 6d 41 81 e8 92 34 fd ff 48 8b 83 70 06 00 00 <4c> 39 60 58 74 11 be 48 02 00 00 48 c7 c7 09 6d 41 81 e8 74 34 
[  141.609268] RIP  [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
[  141.609268]  RSP <ffff88020e3cdd78>
[  141.609268] CR2: 0000000000000058
[  141.609268] ---[ end trace 58805b944e6f93ce ]---
[  141.609268] note: pthread_cond_ma[8154] exited with preempt_count 2

(5. eyeball locks.. -> zzzzt -> report -> eyeball..)

	-Mike

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 15905 bytes --]

[-- Attachment #3: dmesg.gz --]
[-- Type: application/x-gzip, Size: 18656 bytes --]


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07  4:46 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() Mike Galbraith
@ 2010-07-07  8:03 ` Mike Galbraith
  2010-07-07 11:57   ` Thomas Gleixner
  2010-07-07 11:57 ` Thomas Gleixner
  2010-07-07 14:11 ` 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() gowrishankar
  2 siblings, 1 reply; 29+ messages in thread
From: Mike Galbraith @ 2010-07-07  8:03 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Thomas Gleixner, Peter Zijlstra


FWIW, 2.6.31.14-rt21 explodes too (darn).



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07  4:46 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() Mike Galbraith
  2010-07-07  8:03 ` Mike Galbraith
@ 2010-07-07 11:57 ` Thomas Gleixner
  2010-07-07 14:03   ` Darren Hart
                     ` (3 more replies)
  2010-07-07 14:11 ` 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() gowrishankar
  2 siblings, 4 replies; 29+ messages in thread
From: Thomas Gleixner @ 2010-07-07 11:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-rt-users, Peter Zijlstra, Darren Hart

Cc'ing Darren.

On Wed, 7 Jul 2010, Mike Galbraith wrote:

> Greetings,
> 
> Stress testing, looking to trigger RCU stalls, I've managed to find a
> way to repeatably create fireworks.  (got RCU stall, see attached)
> 
> 1. download ltp-full-20100630.  Needs to be this version because of
> testcase bustage in earlier versions, and must be built with gcc > 4.3,
> else testcases will segfault due to a gcc bug. 
> 
> 2. apply patchlet so you can run testcases/realtime/perf/latency/run.sh
> at all.
> 
> --- pthread_cond_many.c.org	2010-07-05 09:05:59.000000000 +0200
> +++ pthread_cond_many.c	2010-07-04 12:12:25.000000000 +0200
> @@ -259,7 +259,7 @@ void usage(void)
>  
>  int parse_args(int c, char *v)
>  {
> -	int handled;
> +	int handled = 1;
>          switch (c) {
>  		case 'h':
>  			usage();
> 
> 3. add --realtime for no particular reason.
> 
> --- run.sh.org	2010-07-06 15:54:58.000000000 +0200
> +++ run.sh	2010-07-06 16:37:34.000000000 +0200
> @@ -22,7 +22,7 @@ make
>  # process to run realtime.  The remainder of the processes (if any)
>  # will run non-realtime in any case.
>  
> -nthread=5000
> +nthread=500
>  iter=400
>  nproc=5
>  
> @@ -39,7 +39,7 @@ i=0
>  i=1
>  while test $i -lt $nproc
>  do
> -        ./pthread_cond_many --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
> +        ./pthread_cond_many --realtime --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
>          i=`expr $i + 1`
>  done
>  wait
> 
> 4. run it.
> 
> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
> this does not make it to consoles (poking sysrq-foo doesn't either).
> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
> explosion, which does make it to consoles.
> 
> With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
> NULL at times, but that, as !pendowner->pi_blocked_on, seems to be
> fallout.  The start of bad juju is always pi_blocked_on != waiter.
> 
> <snip oops trace, quoted in full in the original report>


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07  8:03 ` Mike Galbraith
@ 2010-07-07 11:57   ` Thomas Gleixner
  2010-07-07 12:50     ` Mike Galbraith
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2010-07-07 11:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-rt-users, Peter Zijlstra

On Wed, 7 Jul 2010, Mike Galbraith wrote:

> 
> FWIW, 2.6.31.14-rt21 explodes too (darn).

What about .29-rt ?

Thanks,

	tglx


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 11:57   ` Thomas Gleixner
@ 2010-07-07 12:50     ` Mike Galbraith
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-07 12:50 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-rt-users, Peter Zijlstra

(switches back to primary address.. dang evolution)

On Wed, 2010-07-07 at 13:57 +0200, Thomas Gleixner wrote:
> On Wed, 7 Jul 2010, Mike Galbraith wrote:
> 
> > 
> > FWIW, 2.6.31.14-rt21 explodes too (darn).
> 
> What about .29-rt ?

Goes boom too.

	-Mike



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 11:57 ` Thomas Gleixner
@ 2010-07-07 14:03   ` Darren Hart
  2010-07-07 14:17     ` Mike Galbraith
  2010-07-08 12:05     ` Mike Galbraith
  2010-07-09  2:11   ` Darren Hart
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 29+ messages in thread
From: Darren Hart @ 2010-07-07 14:03 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Mike Galbraith, linux-rt-users, Peter Zijlstra

On 07/07/2010 04:57 AM, Thomas Gleixner wrote:
> Cc'ing Darren.
>
> On Wed, 7 Jul 2010, Mike Galbraith wrote:
>

Hi Mike,

>> Greetings,
>>
>> Stress testing, looking to trigger RCU stalls, I've managed to find a
>> way to repeatably create fireworks.  (got RCU stall, see attached)
>>
>> 1. download ltp-full-20100630.  Needs to be this version because of
>> testcase bustage in earlier versions, and must be built with gcc > 4.3,
>> else testcases will segfault due to a gcc bug.

Interesting, I had not hit any gcc-specific issues with this. Can you
point me to the bug?

>>
>> 2. apply patchlet so you can run testcases/realtime/perf/latency/run.sh
>> at all.
>>
>> --- pthread_cond_many.c.org	2010-07-05 09:05:59.000000000 +0200
>> +++ pthread_cond_many.c	2010-07-04 12:12:25.000000000 +0200
>> @@ -259,7 +259,7 @@ void usage(void)
>>
>>   int parse_args(int c, char *v)
>>   {
>> -	int handled;
>> +	int handled = 1;
>>           switch (c) {
>>   		case 'h':
>>   			usage();
>>
>> 3. add --realtime for no particular reason.
>>
>> --- run.sh.org	2010-07-06 15:54:58.000000000 +0200
>> +++ run.sh	2010-07-06 16:37:34.000000000 +0200
>> @@ -22,7 +22,7 @@ make
>>   # process to run realtime.  The remainder of the processes (if any)
>>   # will run non-realtime in any case.
>>
>> -nthread=5000
>> +nthread=500

Was this just to lighten the load, or was it required to reproduce?

>>   iter=400
>>   nproc=5
>>
>> @@ -39,7 +39,7 @@ i=0
>>   i=1
>>   while test $i -lt $nproc
>>   do
>> -        ./pthread_cond_many --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
>> +        ./pthread_cond_many --realtime --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
>>           i=`expr $i + 1`
>>   done
>>   wait
>>

We'll do an audit and see if any pthread_cond_many patches have been 
dropped, or just fix the above issues if not.


>> 4. run it.
>>

Which architecture?

Glibc version?

I see the kernel version is 2.6.33.6-rt23; have you reproduced this on
earlier kernel versions as well? Any 2.6.31-rt kernel would be a good
data point.

Is this immediately reproducible for you?

I see a possible fault occurring in the stack. If you run with
mlockall(), does the problem go away? (I assume not, but it's an easy
thing to test.)
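
Something along these lines early in the test would do it (a minimal
sketch, not code from the testcase itself):

	#include <sys/mman.h>

	/* pin current and future pages so the test path can't fault */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		perror("mlockall");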

Nothing comes to mind re. the cause quite yet; I'll have to dig into it.

--
Darren


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07  4:46 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() Mike Galbraith
  2010-07-07  8:03 ` Mike Galbraith
  2010-07-07 11:57 ` Thomas Gleixner
@ 2010-07-07 14:11 ` gowrishankar
  2010-07-07 14:31   ` Mike Galbraith
  2 siblings, 1 reply; 29+ messages in thread
From: gowrishankar @ 2010-07-07 14:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-rt-users, Thomas Gleixner, Peter Zijlstra, Darren Hart

On Wednesday 07 July 2010 10:16 AM, Mike Galbraith wrote:
> Greetings,
>
> Stress testing, looking to trigger RCU stalls, I've managed to find a
> way to repeatably create fireworks.  (got RCU stall, see attached)
>
> 1. download ltp-full-20100630.  Needs to be this version because of
> testcase bustage in earlier versions, and must be built with gcc > 4.3,
> else testcases will segfault due to a gcc bug.
>
>    
Hi Mike,
I have seen this segfault, especially with GCC v4.3.4. I am about to
post this patch to LTP:

Signed-off-by: Gowrishankar <gowrishankar.m@in.ibm.com>
---
  testcases/realtime/include/librttest.h |    6 +++---
  1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/testcases/realtime/include/librttest.h b/testcases/realtime/include/librttest.h
index e526ab4..273de6f 100644
--- a/testcases/realtime/include/librttest.h
+++ b/testcases/realtime/include/librttest.h
@@ -118,9 +118,9 @@ static inline int atomic_add(int i, atomic_t *v)
         int __i;
         __i = i;
         asm volatile(
-                       "lock; xaddl %0, %1;"
-                       :"=r"(i)
-                       :"m"(v->counter), "0"(i));
+                       "lock; xaddl %1, %0;"
+                       :"=m"(v->counter)
+                       :"r"(i), "m" (v->counter));
         return i + __i;
  #elif defined(__powerpc__)
  #define ISYNC_ON_SMP   "\n\tisync\n"
--

Please let me know if this patch helps.

Thanks,
Gowri





* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 14:03   ` Darren Hart
@ 2010-07-07 14:17     ` Mike Galbraith
  2010-07-08 12:05     ` Mike Galbraith
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-07 14:17 UTC (permalink / raw)
  To: Darren Hart; +Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra

On Wed, 2010-07-07 at 07:03 -0700, Darren Hart wrote:
> On 07/07/2010 04:57 AM, Thomas Gleixner wrote:
> > Cc'ing Darren.
> >
> > On Wed, 7 Jul 2010, Mike Galbraith wrote:
> >
> 
> Hi Mike,
> 
> >> Greetings,
> >>
> >> Stress testing, looking to trigger RCU stalls, I've managed to find a
> >> way to repeatably create fireworks.  (got RCU stall, see attached)
> >>
> >> 1. download ltp-full-20100630.  Needs to be this version because of
> >> testcase bustage in earlier versions, and must be built with gcc > 4.3,
> >> else testcases will segfault due to a gcc bug.
> 
> Interesting, I had not hit any gcc specific issues with this. Can you 
> point me to the bug?

gcc-4.3 doesn't grok volatile inside a struct.  Workaround below.

Index: ltp-full-20090930/testcases/realtime/include/librttest.h
===================================================================
--- ltp-full-20090930.orig/testcases/realtime/include/librttest.h
+++ ltp-full-20090930/testcases/realtime/include/librttest.h
@@ -88,7 +88,7 @@ struct thread {
 	int flags;
 	int id;
 };
-typedef struct { volatile int counter; } atomic_t;
+typedef volatile struct { volatile int counter; } atomic_t;
 
 // flags for threads
 #define THREAD_START 1

> >> --- run.sh.org	2010-07-06 15:54:58.000000000 +0200
> >> +++ run.sh	2010-07-06 16:37:34.000000000 +0200
> >> @@ -22,7 +22,7 @@ make
> >>   # process to run realtime.  The remainder of the processes (if any)
> >>   # will run non-realtime in any case.
> >>
> >> -nthread=5000
> >> +nthread=500
> 
> Was this just to lighten the load, or was it required to reproduce?

Just to lighten it.

> Which architecture?

x86_64.

> Glibc version?

glibc-2.11.2-2.3

> I see kernel version is: 2.6.33.6-rt23, have you reproduced this on 
> earlier kernel versions as well? Any 2.6.31 rt kernel would be a good 
> data point.

Yes, both 2.6.31.14-rt21 and 2.6.29.6-rt24.

> Is this immediately reproducible for you?

Yes, nearly instant kaboom.

> I see a possible fault occurring in the stack. If you run with 
> mlockall(), does the problem go away? (I assume not, but it's an easy 
> thing to test.)

Nope, -m made no difference.

	-Mike



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 14:11 ` 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() gowrishankar
@ 2010-07-07 14:31   ` Mike Galbraith
  2010-07-07 15:05     ` Darren Hart
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Galbraith @ 2010-07-07 14:31 UTC (permalink / raw)
  To: gowrishankar; +Cc: linux-rt-users, Thomas Gleixner, Peter Zijlstra, Darren Hart

On Wed, 2010-07-07 at 19:41 +0530, gowrishankar wrote:
> Hi Mike,
> I have seen this segfault esp with GCC v4.3.4. I am about to post this 
> patch
> in ltp:
> 
> Signed-off-by: Gowrishankar <gowrishankar.m@in.ibm.com>
> ---
>   testcases/realtime/include/librttest.h |    6 +++---
>   1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/testcases/realtime/include/librttest.h b/testcases/realtime/include/librttest.h
> index e526ab4..273de6f 100644
> --- a/testcases/realtime/include/librttest.h
> +++ b/testcases/realtime/include/librttest.h
> @@ -118,9 +118,9 @@ static inline int atomic_add(int i, atomic_t *v)
>          int __i;
>          __i = i;
>          asm volatile(
> -                       "lock; xaddl %0, %1;"
> -                       :"=r"(i)
> -                       :"m"(v->counter), "0"(i));
> +                       "lock; xaddl %1, %0;"
> +                       :"=m"(v->counter)
> +                       :"r"(i), "m" (v->counter));
>          return i + __i;
>   #elif defined(__powerpc__)
>   #define ISYNC_ON_SMP   "\n\tisync\n"
> --
> 
> Please let me know if this patch helps.

Yup, all better.

	-Mike



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 14:31   ` Mike Galbraith
@ 2010-07-07 15:05     ` Darren Hart
  2010-07-07 17:45       ` Mike Galbraith
  0 siblings, 1 reply; 29+ messages in thread
From: Darren Hart @ 2010-07-07 15:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: gowrishankar, linux-rt-users, Thomas Gleixner, Peter Zijlstra

On 07/07/2010 07:31 AM, Mike Galbraith wrote:
> On Wed, 2010-07-07 at 19:41 +0530, gowrishankar wrote:
>> Hi Mike,
>> I have seen this segfault esp with GCC v4.3.4. I am about to post this
>> patch
>> in ltp:
>>
>> Signed-off-by: Gowrishankar<gowrishankar.m@in.ibm.com>
>> ---
>>    testcases/realtime/include/librttest.h |    6 +++---
>>    1 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/testcases/realtime/include/librttest.h b/testcases/realtime/include/librttest.h
>> index e526ab4..273de6f 100644
>> --- a/testcases/realtime/include/librttest.h
>> +++ b/testcases/realtime/include/librttest.h
>> @@ -118,9 +118,9 @@ static inline int atomic_add(int i, atomic_t *v)
>>           int __i;
>>           __i = i;
>>           asm volatile(
>> -                       "lock; xaddl %0, %1;"
>> -                       :"=r"(i)
>> -                       :"m"(v->counter), "0"(i));
>> +                       "lock; xaddl %1, %0;"
>> +                       :"=m"(v->counter)
>> +                       :"r"(i), "m" (v->counter));
>>           return i + __i;
>>    #elif defined(__powerpc__)
>>    #define ISYNC_ON_SMP   "\n\tisync\n"
>> --
>>
>> Please let me know if this patch helps.
>
> Yup, all better.

So with this, the "volatile struct" patch isn't necessary?

If so, perhaps we would be better off converting librttest.h to
implement the atomic functions using the GCC atomic built-ins, i.e.:

static inline int
atomic_add(atomic_t *addr, int i)
{
	return __sync_add_and_fetch(&addr->counter, i);
}

It's more maintainable than inline asm and better tested than a
roll-your-own implementation. Oddly, I believe the original was
patterned after what is in the Linux kernel...
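
The remaining librttest atomic helpers could be converted the same way,
e.g. (a sketch only, untested; the field name follows librttest.h's
atomic_t, and the helper names are hypothetical):

static inline int
atomic_inc(atomic_t *addr)
{
	/* returns the post-increment value */
	return __sync_add_and_fetch(&addr->counter, 1);
}

static inline int
atomic_dec(atomic_t *addr)
{
	/* returns the post-decrement value */
	return __sync_sub_and_fetch(&addr->counter, 1);
}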

--
Darren



>
> 	-Mike
>


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 15:05     ` Darren Hart
@ 2010-07-07 17:45       ` Mike Galbraith
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-07 17:45 UTC (permalink / raw)
  To: Darren Hart; +Cc: gowrishankar, linux-rt-users, Thomas Gleixner, Peter Zijlstra

On Wed, 2010-07-07 at 08:05 -0700, Darren Hart wrote:
> On 07/07/2010 07:31 AM, Mike Galbraith wrote:

> > Yup, all better.
> 
> So with this, the "volatile struct" patch isn't necessary?

Definitely no longer necessary.. on my box ;-)

	-Mike



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 14:03   ` Darren Hart
  2010-07-07 14:17     ` Mike Galbraith
@ 2010-07-08 12:05     ` Mike Galbraith
  2010-07-08 14:12       ` Darren Hart
  1 sibling, 1 reply; 29+ messages in thread
From: Mike Galbraith @ 2010-07-08 12:05 UTC (permalink / raw)
  To: Darren Hart; +Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra

On Wed, 2010-07-07 at 07:03 -0700, Darren Hart wrote:

> I see kernel version is: 2.6.33.6-rt23, have you reproduced this on 
> earlier kernel versions as well? Any 2.6.31 rt kernel would be a good 
> data point.

FUTEX_CMP/WAIT_REQUEUE_PI first appeared in 2.6.29-rt, and that is also
where the explosion first appears.

	-Mike





* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-08 12:05     ` Mike Galbraith
@ 2010-07-08 14:12       ` Darren Hart
  0 siblings, 0 replies; 29+ messages in thread
From: Darren Hart @ 2010-07-08 14:12 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra

On 07/08/2010 05:05 AM, Mike Galbraith wrote:
> On Wed, 2010-07-07 at 07:03 -0700, Darren Hart wrote:
>
>> I see kernel version is: 2.6.33.6-rt23, have you reproduced this on
>> earlier kernel versions as well? Any 2.6.31 rt kernel would be a good
>> data point.
>
> FUTEX_CMP/WAIT_REQUEUE_PI first appeared in 2.6.29-rt, as does the
> explosion.
>

Right. I haven't had a chance to look into this yet. I will have some 
time today.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 11:57 ` Thomas Gleixner
  2010-07-07 14:03   ` Darren Hart
@ 2010-07-09  2:11   ` Darren Hart
  2010-07-09  4:32     ` Mike Galbraith
       [not found]     ` <4C36CD83.6070809@us.ibm.com>
  2010-07-09 20:05   ` Darren Hart
  2010-07-13  8:03   ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart
  3 siblings, 2 replies; 29+ messages in thread
From: Darren Hart @ 2010-07-09  2:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On 07/07/2010 04:57 AM, Thomas Gleixner wrote:
> Cc'ing Darren.
> 
> On Wed, 7 Jul 2010, Mike Galbraith wrote:
> 
>> Greetings,
>>
>> Stress testing, looking to trigger RCU stalls, I've managed to find a
>> way to repeatably create fireworks.  (got RCU stall, see attached)

<snip>
embarrassing ltp realtime/perf/latency/pthread_cond_many breakage </snip>

>> 4. run it.
>>
>> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
>> this does not make it to consoles (poking sysrq-foo doesn't either).
>> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
>> explosion, which does make it to consoles.

So the WARN_ON sequence is obviously wrong: if it's critical it should
be a BUG(); if not, we shouldn't dereference what we know to be NULL. The
following patch avoids the NULL pointer dereference in the WARN_ON. With
this patch the NULL WARN_ON makes it to the console, and the test runs to
completion with no obvious negative side effects. I'm only posting this for
reference at this point, as while this may be necessary, it isn't the
right "solution".

Some other data points from what time I could spend on this today.

FC12 kernel (2.6.31 based) has Requeue PI support, but does not exhibit
this behavior.

2.6.33.5-rt23 without CONFIG_PREEMPT_RT does NOT exhibit this behavior.

2.6.33.5-rt23 does exhibit this behavior.

The minimal tracing I attempted (a handful of trace_printk()s, run with
the nop plugin) prevented the crash from happening.

There appears to be no correlation between pi_blocked_on being NULL and
the next pointer being NULL (I saw a roughly equivalent mix of NULL and
valid pointers for next when pi_blocked_on was NULL).

Tonight/tomorrow I'll review the rtmutex and futex code to try to fully
understand (again) the usage of pi_blocked_on, and whether we need to
avoid this scenario or handle it "gracefully".

From fa6a6bee6e467d12d3774612c838703acd265ea6 Mon Sep 17 00:00:00 2001
From: Darren Hart <dvhltc@us.ibm.com>
Date: Thu, 8 Jul 2010 19:44:35 -0400
Subject: [PATCH] rtmutex: avoid warnon bug

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
 kernel/rtmutex.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..a2fcaa5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -579,9 +579,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 
 	raw_spin_lock(&pendowner->pi_lock);
 
-	WARN_ON(!pendowner->pi_blocked_on);
 	WARN_ON(pendowner->pi_blocked_on != waiter);
-	WARN_ON(pendowner->pi_blocked_on->lock != lock);
+	if (!WARN_ON(!pendowner->pi_blocked_on))
+		WARN_ON(pendowner->pi_blocked_on->lock != lock);
 
 	pendowner->pi_blocked_on = NULL;
 
-- 
1.6.5.2


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-09  2:11   ` Darren Hart
@ 2010-07-09  4:32     ` Mike Galbraith
       [not found]     ` <4C36CD83.6070809@us.ibm.com>
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-09  4:32 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On Thu, 2010-07-08 at 19:11 -0700, Darren Hart wrote:

> So the WARN_ON sequence is obviously wrong, if it's critical it should
> be a BUG(), if not we shouldn't dereference what we know to be null. The
> following patch avoids the NULL pointer dereference in the WARN_ON. With
> this patch the NULL WARN_ON makes it to the console, and test runs to
> completion with no obvious negative side effects. I'm only posting for
> reference at this point, as while this may be necessary, it isn't the
> right "solution".

I've been slogging through the locking under the assumption that
pi_blocked_on->task pointing to a stranger is very bad juju, but you're
right, the only obviously evil consequence I see is tripping over the
fallout in WARN_ON().

> Tonight/Tomorrow I'll review the rtmutex and futex code to try and fully
> understand (again) the usage of pi_blocked_on and if we need to avoid
> this scenario, or if we need to handle it "gracefully".

I hope you find it, I'm going blind crawling in endless circles :)

	-Mike



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
       [not found]     ` <4C36CD83.6070809@us.ibm.com>
@ 2010-07-09  8:13       ` Mike Galbraith
  2010-07-09 13:58       ` Mike Galbraith
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-09  8:13 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On Fri, 2010-07-09 at 00:19 -0700, Darren Hart wrote:

> I'll keep on looking, but I wanted to share what I've learned in case anyone
> in another timezone is hoping to continue debugging should I fall asleep at
> the helm ;-)

Still making smoke here, no heat or light tho..

Q: why is it ok to call rt_mutex_proxy_unlock() without the wait_lock
held?  Every other path leading to rt_mutex_set_owner() seems to hold
the lock.  Locking it did diddly spit other than shut up my WARN_ON().

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..f4b76d9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -407,7 +407,9 @@ static void free_pi_state(struct futex_pi_state *pi_state)
 		list_del_init(&pi_state->list);
 		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
 
+		raw_spin_lock(&pi_state->pi_mutex.wait_lock);
 		rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner);
+		raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
 	}
 
 	if (current->pi_state_cache)




* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
       [not found]     ` <4C36CD83.6070809@us.ibm.com>
  2010-07-09  8:13       ` Mike Galbraith
@ 2010-07-09 13:58       ` Mike Galbraith
  2010-07-09 14:51         ` Mike Galbraith
  2010-07-09 16:35         ` Darren Hart
  1 sibling, 2 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-09 13:58 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On Fri, 2010-07-09 at 00:19 -0700, Darren Hart wrote:

> Walking through it:
> 
> First the dumps:
> ------------[ cut here ]------------
> WARNING: at kernel/rtmutex.c:583 wakeup_next_waiter+0x1ad/0x220()
> 
> 
> WARN_ON(pendowner->pi_blocked_on != waiter);
> The pi_blocked_on is not NULL, but it isn't the expected waiter either.
> This means that the top waiter selected at the beginning of
> wakeup_next_waiter() is now blocked on a lock with a different waiter
> structure, possibly on a different lock.

pendowner->pi_blocked_on changes while we're in wakeup_next_waiter().
The below fi^Wmade it not do that any more.  We hold the wait_lock for
this lock, but if the wakee blocks on another, what's protecting us?

bandaid-by: /me

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..dd91ede 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -525,6 +525,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	pendowner = waiter->task;
 	waiter->task = NULL;
 
+	raw_spin_lock(&pendowner->pi_lock);
+
 	/*
 	 * Do the wakeup before the ownership change to give any spinning
 	 * waiter grantees a headstart over the other threads that will
@@ -577,8 +579,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	else
 		next = NULL;
 
-	raw_spin_lock(&pendowner->pi_lock);
-
 	WARN_ON(!pendowner->pi_blocked_on);
 	WARN_ON(pendowner->pi_blocked_on != waiter);
 	WARN_ON(pendowner->pi_blocked_on->lock != lock);


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-09 13:58       ` Mike Galbraith
@ 2010-07-09 14:51         ` Mike Galbraith
  2010-07-09 16:35         ` Darren Hart
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-09 14:51 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

This would likely be better, if holding it over the wakeup is ok.

pi-tests still work, as does everything else I've tried.  It doesn't _seem_
like a bad idea to keep the pending owner on a leash until we're done
fiddling with him, but dunno, I've never needed to tinker here.

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..051f386 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -525,6 +525,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	pendowner = waiter->task;
 	waiter->task = NULL;
 
+	raw_spin_unlock(&current->pi_lock);
+	raw_spin_lock(&pendowner->pi_lock);
+
 	/*
 	 * Do the wakeup before the ownership change to give any spinning
 	 * waiter grantees a headstart over the other threads that will
@@ -562,8 +565,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 
 	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
-	raw_spin_unlock(&current->pi_lock);
-
 	/*
 	 * Clear the pi_blocked_on variable and enqueue a possible
 	 * waiter into the pi_waiters list of the pending owner. This
@@ -577,8 +578,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	else
 		next = NULL;
 
-	raw_spin_lock(&pendowner->pi_lock);
-
 	WARN_ON(!pendowner->pi_blocked_on);
 	WARN_ON(pendowner->pi_blocked_on != waiter);
 	WARN_ON(pendowner->pi_blocked_on->lock != lock);


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-09 13:58       ` Mike Galbraith
  2010-07-09 14:51         ` Mike Galbraith
@ 2010-07-09 16:35         ` Darren Hart
  2010-07-09 19:34           ` Mike Galbraith
  1 sibling, 1 reply; 29+ messages in thread
From: Darren Hart @ 2010-07-09 16:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On 07/09/2010 06:58 AM, Mike Galbraith wrote:
> On Fri, 2010-07-09 at 00:19 -0700, Darren Hart wrote:
>
>> Walking through it:
>>
>> First the dumps:
>> ------------[ cut here ]------------
>> WARNING: at kernel/rtmutex.c:583 wakeup_next_waiter+0x1ad/0x220()
>>
>>
>> WARN_ON(pendowner->pi_blocked_on != waiter);
>> The pi_blocked_on is not NULL, but it isn't the expected waiter either.
>> This means that the top waiter selected at the beginning of
>> wakeup_next_waiter() is now blocked on a lock with a different waiter
>> structure, possibly on a different lock.
>
> pendowner->pi_blocked_on changes while we're in wakeup_next_waiter().
> The below fi^Wmade it not do that any more.  We hold the wait_lock for
> this lock, but if the wakee blocks on another, what's protecting us?

If pendowner is blocked on "lock" to begin with (he should be, as his 
waiter struct is in the rtmutex waiters list), then he can't block on 
someone else until he either acquires this one or removes himself as a 
waiter (due to a timeout, for instance); both of these operations 
require holding lock->wait_lock, which is held by the caller of 
wakeup_next_waiter().

Seems more likely that the below forces a missing memory barrier... not 
sure yet though. Good data point.

--
Darren

>
> bandaid-by: /me
>
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 23dd443..dd91ede 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -525,6 +525,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>   	pendowner = waiter->task;
>   	waiter->task = NULL;
>
> +	raw_spin_lock(&pendowner->pi_lock);
> +
>   	/*
>   	 * Do the wakeup before the ownership change to give any spinning
>   	 * waiter grantees a headstart over the other threads that will
> @@ -577,8 +579,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>   	else
>   		next = NULL;
>
> -	raw_spin_lock(&pendowner->pi_lock);
> -
>   	WARN_ON(!pendowner->pi_blocked_on);
>   	WARN_ON(pendowner->pi_blocked_on != waiter);
>   	WARN_ON(pendowner->pi_blocked_on->lock != lock);
>
>


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-09 16:35         ` Darren Hart
@ 2010-07-09 19:34           ` Mike Galbraith
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-09 19:34 UTC (permalink / raw)
  To: Darren Hart
  Cc: Thomas Gleixner, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On Fri, 2010-07-09 at 09:35 -0700, Darren Hart wrote:

> If pendowner is blocked on "lock" to begin with (he should be as his 
> waiter struct in in the rtmutex waiters list) then he can't block on 
> someone else until he either acquires this one or removes himself as a 
> waiter (due to a timeout for instance) - both of these operations 
> require holding lock->wait_lock, which is held by the caller of 
> wakeup_next_waiter().

Hm, right, first thing wakee does is re-grab the wait_lock.

	-Mike



* Re: 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter()
  2010-07-07 11:57 ` Thomas Gleixner
  2010-07-07 14:03   ` Darren Hart
  2010-07-09  2:11   ` Darren Hart
@ 2010-07-09 20:05   ` Darren Hart
  2010-07-13  8:03   ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart
  3 siblings, 0 replies; 29+ messages in thread
From: Darren Hart @ 2010-07-09 20:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, linux-rt-users, Peter Zijlstra, Steven Rostedt,
	gowrishankar

On 07/07/2010 04:57 AM, Thomas Gleixner wrote:
> Cc'ing Darren.
> 
> On Wed, 7 Jul 2010, Mike Galbraith wrote:
> 
>> Greetings,
>>
>> Stress testing, looking to trigger RCU stalls, I've managed to find a
>> way to repeatably create fireworks.  (got RCU stall, see attached)
>> 4. run it.
>>
>> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
>> this does not make it to consoles (poking sysrq-foo doesn't either).
>> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
>> explosion, which does make it to consoles.
>>
>> With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
>> NULL at times, but that, as !pendowner->pi_blocked_on, seems to be
>> fallout.  The start of bad juju is always pi_blocked_on != waiter.
>>
>> <snip oops trace, quoted in full in the original report>


OK, here is the final analysis. After a trace-cmd fix and some rtmutex
"paradigm" insight from Rostedt, the root of the problem came out:

Jul  9 14:26:39 elm9m94 kernel: ------------[ cut here ]------------
Jul  9 14:26:39 elm9m94 kernel: WARNING: at kernel/rtmutex.c:590 wakeup_next_waiter+0x244/0x370()
Jul  9 14:26:39 elm9m94 kernel: Hardware name: IBM eServer BladeCenter HS21 -[7995AC1]-
Jul  9 14:26:39 elm9m94 kernel: Modules linked in: scsi_wait_scan
Jul  9 14:26:39 elm9m94 kernel: Pid: 3870, comm: pthread_cond_ma Not tainted 2.6.33.5-rt23dvh01 #13
Jul  9 14:26:39 elm9m94 kernel: Call Trace:
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8104dc3b>] warn_slowpath_common+0x7b/0xc0
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8104dc94>] warn_slowpath_null+0x14/0x20
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff81083f74>] wakeup_next_waiter+0x244/0x370
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff81921875>] rt_mutex_slowunlock+0x35/0x80
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff819218e9>] rt_mutex_unlock+0x29/0x40
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8108258c>] do_futex+0xa6c/0xae0


# addr2line -e vmlinux ffffffff8108258c
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:833
In futex_unlock_pi() unlocking the pi futex backing the pthread_mutex:

        rt_mutex_unlock(&pi_state->pi_mutex);


and a trace of the pending owner with the changing waiter (pi_blocked_on) from
sched_show_task():

Jul  9 14:26:39 elm9m94 kernel: pthread_cond_ M ffffffff81922537     0  4176   3745 0x00000080
Jul  9 14:26:39 elm9m94 kernel: ffffffff821fc9a8 0000000000000000 ffff88042ded9c08 ffffffff819216c0
Jul  9 14:26:39 elm9m94 kernel: 0000000000000000 ffff88042dbbabc0 ffffffff821fc9c0 ffff88042decee40
Jul  9 14:26:39 elm9m94 kernel: ffff88042decee40 ffff88042decee40 ffff880400000000 ffff88042ded9b70
Jul  9 14:26:39 elm9m94 kernel: Call Trace:
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff819216c0>] ? rt_spin_lock_slowlock+0x190/0x310
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff8192216c>] ? rt_spin_lock+0x7c/0x90
Jul  9 14:26:39 elm9m94 kernel: [<ffffffff810810c8>] ? futex_wait_requeue_pi+0x178/0x360


# addr2line -e vmlinux ffffffff810810c8
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:153

That's inside match_futex(), back up a line:

# addr2line -e vmlinux ffffffff810810c0
/test/dvhart/source/linux-2.6-tip.git/kernel/futex.c:2258

This is futex_wait_requeue_pi():
        /* Queue the futex_q, drop the hb lock, wait for wakeup. */
        futex_wait_queue_me(hb, &q, to);
        
        spin_lock(&hb->lock);
        ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
        spin_unlock(&hb->lock);


Snippets from the trace:

 pthread_cond_ma-4176  [003]   395.357310: bprint:               task_blocks_on_rt_mutex : 4176 blocks on waiter:
 pthread_cond_ma-4176  [003]   395.357310: bprint:               task_blocks_on_rt_mutex :   waiter: 0xffff88042ded9b68
 pthread_cond_ma-4176  [003]   395.357311: bprint:               task_blocks_on_rt_mutex :   waiter->lock: 0xffffffff821fc9a8
 pthread_cond_ma-4176  [003]   395.357311: bprint:               task_blocks_on_rt_mutex :   was: 0xffff88042ded9cc8
                                                                                             ^ should always be NULL
 pthread_cond_ma-4176  [003]   395.357311: bprint:               task_blocks_on_rt_mutex :   oldwaiter->lock: 0xffff8804104d0450


 pthread_cond_ma-3870  [000]   395.357732: bprint:               wakeup_next_waiter : BUG: waiter changed
 pthread_cond_ma-3870  [000]   395.357733: bprint:               wakeup_next_waiter : pendowner pid: 4176
 pthread_cond_ma-3870  [000]   395.357734: bprint:               wakeup_next_waiter : pendowner comm:
 pthread_cond_ma-3870  [000]   395.357734: bprint:               wakeup_next_waiter : waiter->lock: 0xffff8804104d0450
 pthread_cond_ma-3870  [000]   395.357734: bprint:               wakeup_next_waiter : pi_blocked_on->lock: 0xffffffff821fc9a8
 
The one it changed to is the one we see in the task_blocks_on_rt_mutex
above. It has woken from futex_wait_queue_me() after having been
requeued to the pi futex (we know this because it wouldn't have a
pi_blocked_on if it were still on the wait-queue futex). Also, the
futex_unlock_pi() call by 3870 is trying to unlock the previous lock
that 4176 was blocked on.
 
The core of the problem is that the proxy locking blocks a task on a
lock the task knows nothing about. So when it wakes up inside of
futex_wait_requeue_pi(), it immediately tries to block on hb->lock to
check why it woke up. This has the potential to block the task on two
locks (thus overwriting the pi_blocked_on). Any attempt at preventing
this involves taking a lock, ultimately the hb->lock itself. The only
solution I see is to make the hb->locks raw locks (thanks to Steven
Rostedt for the original idea and for batting this around with me in
IRC).
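
Putting the trace together, the interleaving is roughly this (my
reconstruction; function names are from the sources quoted above):

        3870 (waker)                    4176 (waiter)
        ------------                    -------------
        futex_requeue()
          rt_mutex_start_proxy_lock()
            task_blocks_on_rt_mutex()
              pi_blocked_on = waiter A
              (A->lock == pi_state->pi_mutex)
                                        wakes from futex_wait_queue_me()
                                        (timeout or signal), then in
                                        futex_wait_requeue_pi():
                                          spin_lock(&hb->lock) contends
                                            rt_spin_lock_slowlock()
                                              task_blocks_on_rt_mutex()
                                                pi_blocked_on = waiter B
                                                (B->lock == hb->lock)
        futex_unlock_pi()
          rt_mutex_unlock(&pi_state->pi_mutex)
            wakeup_next_waiter()
              top waiter is A, but 4176's pi_blocked_on is now B
              -> WARN_ON(pendowner->pi_blocked_on != waiter)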
 
Given the contention the hb->locks can be under, I'm concerned about
the latency impact this could have; still, I don't see another way to
allow for the proxy locking that the rtmutex semantics require for
requeue_pi to work. I'll work up a patch and do some testing to get a
feel for the impact.
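
For reference, the conversion would be something like this (a sketch
only; struct layout from mainline futex.c):

        struct futex_hash_bucket {
                raw_spinlock_t lock;    /* stays a real spinlock on -rt */
                struct plist_head chain;
        };

plus s/spin_lock(&hb->lock)/raw_spin_lock(&hb->lock)/ and friends
throughout futex.c.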

Ugh.

I'm still hoping to find some clever way to avoid this, going to map out
this state machine again and see if there is a way around it...

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI
  2010-07-07 11:57 ` Thomas Gleixner
                     ` (2 preceding siblings ...)
  2010-07-09 20:05   ` Darren Hart
@ 2010-07-13  8:03   ` Darren Hart
  2010-07-13  9:25     ` Thomas Gleixner
  2010-07-13  9:58     ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Thomas Gleixner
  3 siblings, 2 replies; 29+ messages in thread
From: Darren Hart @ 2010-07-13  8:03 UTC (permalink / raw)
  To: lkml, 
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users

Thanks to Thomas, Steven, and Mike for hashing this over with me. After
an IRC discussion with Thomas, I put the following together. It
resolves the issue for me; Mike, please test and let us know if it
fixes it for you. A couple of points for discussion before we commit
this:

The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly.
Would a new task_pi_blocked_on_valid() method be preferred (in
rtmutex.c)? 
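
Something like this, say (a sketch only, not part of the patch below):

	static inline int task_pi_blocked_on_valid(struct task_struct *task)
	{
		struct rt_mutex_waiter *waiter = task->pi_blocked_on;

		return waiter && (long)waiter != PI_WAKEUP_INPROGRESS;
	}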

The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't
exist before, and now that we've closed this gap, should we just drop it?

I've added a couple BUG_ON()s in futex_wait_requeue_pi() dealing with
the race with requeue and q.lock_ptr. I'd like to leave this for the
time being if nobody strongly objects.

Thanks,

Darren


From 93fd3bb97800ebf5e5c1a6a85937bab93256dd42 Mon Sep 17 00:00:00 2001
From: Darren Hart <dvhltc@us.ibm.com>
Date: Fri, 9 Jul 2010 17:50:23 -0400
Subject: [PATCH 1/2] futex: protect against pi_blocked_on corruption during requeue PI

The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
a scenario where a task can wake up, not knowing it has been enqueued on an
rtmutex. Blocking on the hb->lock can overwrite a valid value in
current->pi_blocked_on, leading to an inconsistent state.

Prevent overwriting pi_blocked_on by serializing on the waiter's pi_lock (a
raw_spinlock) and using the new PI_WAKEUP_INPROGRESS state flag to indicate a
waiter that has been woken by a timeout or signal. This prevents the rtmutex
code from adding the waiter to the rtmutex wait list, returning EAGAIN to
futex_requeue(), which will in turn ignore the waiter during a requeue. Care
is taken to allow current to block on locks even if PI_WAKEUP_INPROGRESS is
set.

During normal wakeup, this results in one less hb->lock protected section. In
the pre-requeue-timeout-or-signal wakeup, this removes the "greedy locking"
behavior; no attempt will be made to acquire the lock.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/futex.c          |   50 +++++++++++++++++++++++++++++++++-------------
 kernel/rtmutex.c        |   45 ++++++++++++++++++++++++++++++++++-------
 kernel/rtmutex_common.h |    1 +
 kernel/sched.c          |    5 +++-
 4 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..c92978d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1336,6 +1336,9 @@ retry_private:
 				requeue_pi_wake_futex(this, &key2, hb2);
 				drop_count++;
 				continue;
+			} else if (ret == -EAGAIN) {
+				/* Waiter woken by timeout or signal. */
+				continue;
 			} else if (ret) {
 				/* -EDEADLK */
 				this->pi_state = NULL;
@@ -2211,9 +2214,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 				 int clockrt, u32 __user *uaddr2)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_hash_bucket *hb, *hb2;
 	struct rt_mutex_waiter rt_waiter;
 	struct rt_mutex *pi_mutex = NULL;
-	struct futex_hash_bucket *hb;
 	union futex_key key2;
 	struct futex_q q;
 	int res, ret;
@@ -2255,18 +2258,33 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
 	futex_wait_queue_me(hb, &q, to);
 
-	spin_lock(&hb->lock);
-	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
-	spin_unlock(&hb->lock);
-	if (ret)
-		goto out_put_keys;
-
 	/*
-	 * In order for us to be here, we know our q.key == key2, and since
-	 * we took the hb->lock above, we also know that futex_requeue() has
-	 * completed and we no longer have to concern ourselves with a wakeup
-	 * race with the atomic proxy lock acquition by the requeue code.
+	 * Avoid races with requeue and trying to block on two mutexes
+	 * (hb->lock and uaddr2's rtmutex) by serializing access to
+	 * pi_blocked_on with pi_lock and setting PI_WAKEUP_INPROGRESS.
+	 */
+	raw_spin_lock(&current->pi_lock);
+	if (current->pi_blocked_on) {
+		raw_spin_unlock(&current->pi_lock);
+	} else {
+		current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS;
+		raw_spin_unlock(&current->pi_lock);
+
+		spin_lock(&hb->lock);
+		ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
+		spin_unlock(&hb->lock);
+		if (ret)
+			goto out_put_keys;
+	}
+
+	/* 
+	 * In order to be here, we have either been requeued, are in the process
+	 * of being requeued, or requeue successfully acquired uaddr2 on our
+	 * behalf.  If pi_blocked_on was non-null above, we may be racing with a
+	 * requeue.  Do not rely on q->lock_ptr to be hb2->lock until after
+	 * blocking on hb->lock or hb2->lock.
 	 */
+	hb2 = hash_futex(&key2);
 
 	/* Check if the requeue code acquired the second futex for us. */
 	if (!q.rt_waiter) {
@@ -2275,10 +2293,12 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 		 * did a lock-steal - fix up the PI-state in that case.
 		 */
 		if (q.pi_state && (q.pi_state->owner != current)) {
-			spin_lock(q.lock_ptr);
+			spin_lock(&hb2->lock);
+			BUG_ON(&hb2->lock != q.lock_ptr);
+
 			ret = fixup_pi_state_owner(uaddr2, &q, current,
 						   fshared);
-			spin_unlock(q.lock_ptr);
+			spin_unlock(&hb2->lock);
 		}
 	} else {
 		/*
@@ -2291,7 +2311,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 
-		spin_lock(q.lock_ptr);
+		spin_lock(&hb2->lock);
+		BUG_ON(&hb2->lock != q.lock_ptr);
+
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..0399108 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * reached or the state of the chain has changed while we
 	 * dropped the locks.
 	 */
-	if (!waiter || !waiter->task)
+	if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
 		goto out_unlock_pi;
 
 	/*
@@ -448,6 +448,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	int chain_walk = 0, res;
 
 	raw_spin_lock(&task->pi_lock);
+
+	/*
+	 * In the case of futex requeue PI, this will be a proxy lock. The task
+	 * will wake unaware that it is enqueued on this lock. Avoid blocking
+	 * on two locks and corrupting pi_blocked_on via the
+	 * PI_WAKEUP_INPROGRESS flag. futex_wait_requeue_pi() sets this when it
+	 * wakes up before requeue (due to a signal or timeout). Do not enqueue
+	 * the task if PI_WAKEUP_INPROGRESS is set.
+	 */
+	if (task != current &&
+	    (long)task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
+		raw_spin_unlock(&task->pi_lock);
+		return -EAGAIN;
+	}
+
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
@@ -459,6 +474,15 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 		top_waiter = rt_mutex_top_waiter(lock);
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
+	/* 
+	 * Tasks can only block on one lock at a time. In the case of futex
+	 * requeue PI, if task == current it may have set PI_WAKEUP_INPROGRESS
+	 * to prevent requeue, but it will still need to acquire locks on its
+	 * way out of futex_wait_requeue_pi().
+ 	 */
+	WARN_ON(task->pi_blocked_on != NULL &&
+	        (task != current || (long)task->pi_blocked_on != PI_WAKEUP_INPROGRESS));
+
 	task->pi_blocked_on = waiter;
 
 	raw_spin_unlock(&task->pi_lock);
@@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
 
 		__rt_mutex_adjust_prio(owner);
-		if (owner->pi_blocked_on)
+		if (owner->pi_blocked_on &&
+		    (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
 			chain_walk = 1;
 		raw_spin_unlock(&owner->pi_lock);
 	}
@@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 
 	raw_spin_lock(&pendowner->pi_lock);
 
-	WARN_ON(!pendowner->pi_blocked_on);
-	WARN_ON(pendowner->pi_blocked_on != waiter);
-	WARN_ON(pendowner->pi_blocked_on->lock != lock);
+ 	if (!WARN_ON(!pendowner->pi_blocked_on) &&
+ 	    !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) {
+  		WARN_ON(pendowner->pi_blocked_on != waiter);
+  		WARN_ON(pendowner->pi_blocked_on->lock != lock);
+  	}
 
 	pendowner->pi_blocked_on = NULL;
 
@@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock,
 		}
 		__rt_mutex_adjust_prio(owner);
 
-		if (owner->pi_blocked_on)
+		if (owner->pi_blocked_on &&
+		    (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
 			chain_walk = 1;
 
 		raw_spin_unlock(&owner->pi_lock);
@@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->list_entry.prio == task->prio) {
+	if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS ||
+	    waiter->list_entry.prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -1527,7 +1556,7 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock,
 				      flags);
 
-	if (ret && !waiter->task) {
+	if (ret == -EDEADLK && !waiter->task) {
 		/*
 		 * Reset the return value. We might have
 		 * returned with -EDEADLK and the owner
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 4df690c..94a856f 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -115,6 +115,7 @@ static inline unsigned long rt_mutex_owner_pending(struct rt_mutex *lock)
 /*
  * PI-futex support (proxy locking functions, etc.):
  */
+#define PI_WAKEUP_INPROGRESS 1
 extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock);
 extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 				       struct task_struct *proxy_owner);
diff --git a/kernel/sched.c b/kernel/sched.c
index aa5dced..9d4337e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -83,6 +83,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
+#include "rtmutex_common.h"
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
 	 */
 	if (unlikely(p == rq->idle)) {
 		WARN_ON(p != rq->curr);
-		WARN_ON(p->pi_blocked_on);
+		WARN_ON(p->pi_blocked_on &&
+		        (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);
 		goto out_unlock;
 	}
 
-- 
1.7.0.4



* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI
  2010-07-13  8:03   ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart
@ 2010-07-13  9:25     ` Thomas Gleixner
  2010-07-13 10:28       ` Thomas Gleixner
  2010-07-13  9:58     ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2010-07-13  9:25 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, Mike Galbraith, linux-rt-users

On Tue, 13 Jul 2010, Darren Hart wrote:

> Thanks to Thomas, Steven, and Mike for hashing this over with me. After
> an IRC discussion with Thomas, I put the following together. It
> resolves the issue for me; Mike, please test and let us know if it
> fixes it for you. A couple of points for discussion before we commit
> this:
> 
> The use of the new state flag, PI_WAKEUP_INPROGRESS, is pretty ugly.
> Would a new task_pi_blocked_on_valid() method be preferred (in
> rtmutex.c)? 
> 
> The new WARN_ON() in task_blocks_on_rt_mutex() is complex. It didn't
> exist before, and now that we've closed this gap, should we just drop it?

We can simplify it to:

	WARN_ON(task->pi_blocked_on &&
		task->pi_blocked_on != PI_WAKEUP_INPROGRESS);

We check for !=current and PI_WAKEUP_INPROGRESS just above.   
 
> I've added a couple BUG_ON()s in futex_wait_requeue_pi() dealing with
> the race with requeue and q.lock_ptr. I'd like to leave this for the
> time being if nobody strongly objects.
> -
>  	/*
> -	 * In order for us to be here, we know our q.key == key2, and since
> -	 * we took the hb->lock above, we also know that futex_requeue() has
> -	 * completed and we no longer have to concern ourselves with a wakeup
> -	 * race with the atomic proxy lock acquition by the requeue code.
> +	 * Avoid races with requeue and trying to block on two mutexes
> +	 * (hb->lock and uaddr2's rtmutex) by serializing access to
> +	 * pi_blocked_on with pi_lock and setting PI_WAKEUP_INPROGRESS.
> +	 */
> +	raw_spin_lock(&current->pi_lock);

  Needs to be raw_spin_lock_irq()

> +	if (current->pi_blocked_on) {
> +		raw_spin_unlock(&current->pi_lock);
> +	} else {
> +		current->pi_blocked_on = (struct rt_mutex_waiter *)PI_WAKEUP_INPROGRESS;

  #define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1)

  perhaps ? That gets rid of all type casts

> +		raw_spin_unlock(&current->pi_lock);
> +
> +		spin_lock(&hb->lock);

  We need to clean up current->pi_blocked_on here. If we succeed in the
  hb->lock fast path then we might leak the PI_WAKEUP_INPROGRESS to user space
  and the next requeue will fail.
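
  I.e. something like this after taking hb->lock (a sketch):

	spin_lock(&hb->lock);

	/* Reset pi_blocked_on before checking for an early requeue wakeup */
	raw_spin_lock_irq(&current->pi_lock);
	current->pi_blocked_on = NULL;
	raw_spin_unlock_irq(&current->pi_lock);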

> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 23dd443..0399108 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
>  	 * reached or the state of the chain has changed while we
>  	 * dropped the locks.
>  	 */
> -	if (!waiter || !waiter->task)
> +	if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
>  		goto out_unlock_pi;

 Why do we need that check ? Either the requeue succeeded, in which
 case task->pi_blocked_on is set to the real waiter, or the wakeup won
 and we are in no lock chain.

 If we ever find a waiter with PI_WAKEUP_INPROGRESS set in
 rt_mutex_adjust_prio_chain() then it's a bug, nothing else.

> @@ -469,7 +493,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
>  		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
>  
>  		__rt_mutex_adjust_prio(owner);
> -		if (owner->pi_blocked_on)
> +		if (owner->pi_blocked_on &&
> +		    (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)

  Again, that can never happen

>  			chain_walk = 1;
>  		raw_spin_unlock(&owner->pi_lock);
>  	}
> @@ -579,9 +604,11 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>  
>  	raw_spin_lock(&pendowner->pi_lock);
>  
> -	WARN_ON(!pendowner->pi_blocked_on);
> -	WARN_ON(pendowner->pi_blocked_on != waiter);
> -	WARN_ON(pendowner->pi_blocked_on->lock != lock);
> + 	if (!WARN_ON(!pendowner->pi_blocked_on) &&
> + 	    !WARN_ON((long)pendowner->pi_blocked_on == PI_WAKEUP_INPROGRESS)) {

  Ditto

> +  		WARN_ON(pendowner->pi_blocked_on != waiter);
> +  		WARN_ON(pendowner->pi_blocked_on->lock != lock);
> +  	}
>  
>  	pendowner->pi_blocked_on = NULL;
>  
> @@ -624,7 +651,8 @@ static void remove_waiter(struct rt_mutex *lock,
>  		}
>  		__rt_mutex_adjust_prio(owner);
>  
> -		if (owner->pi_blocked_on)
> +		if (owner->pi_blocked_on &&
> +		    (long)owner->pi_blocked_on != PI_WAKEUP_INPROGRESS)
>  			chain_walk = 1;

  Same here.			
  
>  		raw_spin_unlock(&owner->pi_lock);
> @@ -658,7 +686,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>  
>  	waiter = task->pi_blocked_on;
> -	if (!waiter || waiter->list_entry.prio == task->prio) {
> +	if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS ||
> +	    waiter->list_entry.prio == task->prio) {

  And here

>  /*
>   * Convert user-nice values [ -20 ... 0 ... 19 ]
>   * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
> @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
>  	 */
>  	if (unlikely(p == rq->idle)) {
>  		WARN_ON(p != rq->curr);
> -		WARN_ON(p->pi_blocked_on);
> +		WARN_ON(p->pi_blocked_on &&
> +		        (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);

  Yuck. Paranoia ? If we ever requeue idle, then .....

I'm going to cleanup the stuff and send out a new patch for Mike
to test.

Thanks,

	tglx



* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI
  2010-07-13  8:03   ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart
  2010-07-13  9:25     ` Thomas Gleixner
@ 2010-07-13  9:58     ` Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2010-07-13  9:58 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, Mike Galbraith, linux-rt-users

On Tue, 13 Jul 2010, Darren Hart wrote:
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..c92978d 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1336,6 +1336,9 @@ retry_private:
>  				requeue_pi_wake_futex(this, &key2, hb2);
>  				drop_count++;
>  				continue;
> +			} else if (ret == -EAGAIN) {
> +				/* Waiter woken by timeout or signal. */

  This leaks the pi_state.
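
  The error path needs to drop it, i.e. something like (sketch):

	} else if (ret == -EAGAIN) {
		/* Waiter woken by timeout or signal. */
		this->pi_state = NULL;
		free_pi_state(pi_state);
		continue;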



* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI
  2010-07-13  9:25     ` Thomas Gleixner
@ 2010-07-13 10:28       ` Thomas Gleixner
  2010-07-13 11:52         ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2010-07-13 10:28 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, Mike Galbraith, linux-rt-users

On Tue, 13 Jul 2010, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Darren Hart wrote:
>
> > --- a/kernel/rtmutex.c
> > +++ b/kernel/rtmutex.c
> > @@ -227,7 +227,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
> >  	 * reached or the state of the chain has changed while we
> >  	 * dropped the locks.
> >  	 */
> > -	if (!waiter || !waiter->task)
> > +	if (!waiter || (long)waiter == PI_WAKEUP_INPROGRESS || !waiter->task)
> >  		goto out_unlock_pi;
> 
>  Why do we need that check ? Either the requeue succeeded, in which
>  case task->pi_blocked_on is set to the real waiter, or the wakeup won
>  and we are in no lock chain.
> 
>  If we ever find a waiter with PI_WAKEUP_INPROGRESS set in
>  rt_mutex_adjust_prio_chain() then it's a bug, nothing else.

Grrr, I'm wrong. If we take hb->lock in the fast path then something
else might try to boost us and trip over this :(
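
Presumably: we took hb->lock in the fast path, so pi_blocked_on is
still PI_WAKEUP_INPROGRESS while we own it, and a contender blocking
on hb->lock then boosts us, roughly (my sketch):

	task_blocks_on_rt_mutex()	/* contender, on hb->lock, owner == us */
	  if (owner->pi_blocked_on)	/* == PI_WAKEUP_INPROGRESS */
		chain_walk = 1;
	  ...
	  rt_mutex_adjust_prio_chain(owner, ...)
	    waiter = task->pi_blocked_on;
	    if (!waiter || !waiter->task)	/* trips over the fake waiter */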

This code causes braindamage. I really wonder whether we need to
remove it according to the "United Nations Convention against Torture
and Other Cruel, Inhuman or Degrading Treatment or Punishment".

> > @@ -6377,7 +6379,8 @@ void task_setprio(struct task_struct *p, int prio)
> >  	 */
> >  	if (unlikely(p == rq->idle)) {
> >  		WARN_ON(p != rq->curr);
> > -		WARN_ON(p->pi_blocked_on);
> > +		WARN_ON(p->pi_blocked_on &&
> > +		        (long)p->pi_blocked_on != PI_WAKEUP_INPROGRESS);
> 
>   Yuck. Paranoia ? If we ever requeue idle, then .....

  At least one which is bogus :)
 
Thanks,

	tglx


* [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2
  2010-07-13 10:28       ` Thomas Gleixner
@ 2010-07-13 11:52         ` Thomas Gleixner
  2010-07-13 15:57           ` Mike Galbraith
                             ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Thomas Gleixner @ 2010-07-13 11:52 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, Mike Galbraith, linux-rt-users

On Tue, 13 Jul 2010, Thomas Gleixner wrote:
> 
> This code causes braindamage. I really wonder whether we need to
> remove it according to the "United Nations Convention against Torture
> and Other Cruel, Inhuman or Degrading Treatment or Punishment".
> 

Ok, finally managed to twist my brain around it. Mike, can you give it
a test ride ?

Thanks,

	tglx

-------->
Subject: futex: Protect against pi_blocked_on corruption during requeue PI
From: Darren Hart <dvhltc@us.ibm.com>
Date: Fri, 9 Jul 2010 17:50:23 -0400

The requeue_pi mechanism introduced proxy locking of the rtmutex. This
creates a scenario where a task can wake up, not knowing it has been
enqueued on an rtmutex. Blocking on the hb->lock can overwrite a
valid value in current->pi_blocked_on, leading to an inconsistent
state.

Prevent overwriting pi_blocked_on by serializing on the waiter's
pi_lock and using the new PI_WAKEUP_INPROGRESS state flag to indicate
a waiter that has been woken by a timeout or signal. This prevents the
rtmutex code from adding the waiter to the rtmutex wait list,
returning EAGAIN to futex_requeue(), which will in turn ignore the
waiter during a requeue. Care is taken to allow current to block on
locks even if PI_WAKEUP_INPROGRESS is set.

During normal wakeup, this results in one less hb->lock protected
section. In the pre-requeue-timeout-or-signal wakeup, this removes the
"greedy locking" behavior; no attempt will be made to acquire the
lock.

[ tglx: take pi_lock with lock_irq(), removed paranoid warning,
  	plugged pi_state and pi_blocked_on leak, adjusted some
  	comments ]

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <4C3C1DCF.9090509@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c          |   50 +++++++++++++++++++++++++++++++++-------------
 kernel/rtmutex.c        |   45 ++++++++++++++++++++++++++++++++++-------
 kernel/rtmutex_common.h |    1 +
 kernel/sched.c          |    5 +++-
 4 files changed, 78 insertions(+), 23 deletions(-)

Index: linux-2.6-tip/kernel/futex.c
===================================================================
--- linux-2.6-tip.orig/kernel/futex.c
+++ linux-2.6-tip/kernel/futex.c
@@ -1336,6 +1336,16 @@ retry_private:
 				requeue_pi_wake_futex(this, &key2, hb2);
 				drop_count++;
 				continue;
+			} else if (ret == -EAGAIN) {
+				/*
+				 * Waiter was woken by timeout or
+				 * signal and has set pi_blocked_on to
+				 * PI_WAKEUP_INPROGRESS before we
+				 * tried to enqueue it on the rtmutex.
+				 */
+				this->pi_state = NULL;
+				free_pi_state(pi_state);
+				continue;
 			} else if (ret) {
 				/* -EDEADLK */
 				this->pi_state = NULL;
@@ -2211,9 +2221,9 @@ static int futex_wait_requeue_pi(u32 __u
 				 int clockrt, u32 __user *uaddr2)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_hash_bucket *hb, *hb2;
 	struct rt_mutex_waiter rt_waiter;
 	struct rt_mutex *pi_mutex = NULL;
-	struct futex_hash_bucket *hb;
 	union futex_key key2;
 	struct futex_q q;
 	int res, ret;
@@ -2255,18 +2265,51 @@ static int futex_wait_requeue_pi(u32 __u
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
 	futex_wait_queue_me(hb, &q, to);
 
-	spin_lock(&hb->lock);
-	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
-	spin_unlock(&hb->lock);
-	if (ret)
-		goto out_put_keys;
+	/*
+	 * Avoid races with requeue and trying to block on two mutexes
+	 * (hb->lock and uaddr2's rtmutex) by serializing access to
+	 * pi_blocked_on with pi_lock.
+	 */
+	raw_spin_lock_irq(&current->pi_lock);
+	if (current->pi_blocked_on) {
+		/* Requeue happened already */
+		raw_spin_unlock_irq(&current->pi_lock);
+	} else {
+		/*
+		 * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS
+		 * prevents a concurrent requeue from enqueuein us on
+		 * the uaddr2 rtmutex. After that we can safely
+		 * acquire (and possibly block on) hb->lock.
+		 */
+		current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
+		raw_spin_unlock_irq(&current->pi_lock);
+
+		spin_lock(&hb->lock);
+
+		/*
+		 * Clean up pi_blocked_on. We might leak it otherwise
+		 * when we succeeded with the hb->lock in the fast
+		 * path.
+		 */
+		raw_spin_lock_irq(&current->pi_lock);
+		current->pi_blocked_on = NULL;
+		raw_spin_unlock_irq(&current->pi_lock);
+
+		ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
+		spin_unlock(&hb->lock);
+		if (ret)
+			goto out_put_keys;
+	}
 
 	/*
-	 * In order for us to be here, we know our q.key == key2, and since
-	 * we took the hb->lock above, we also know that futex_requeue() has
-	 * completed and we no longer have to concern ourselves with a wakeup
-	 * race with the atomic proxy lock acquition by the requeue code.
+	 * In order to be here, we have either been requeued, are in
+	 * the process of being requeued, or requeue successfully
+	 * acquired uaddr2 on our behalf.  If pi_blocked_on was
+	 * non-null above, we may be racing with a requeue.  Do not
+	 * rely on q->lock_ptr to be hb2->lock until after blocking on
+	 * hb->lock or hb2->lock.
 	 */
+	hb2 = hash_futex(&key2);
 
 	/* Check if the requeue code acquired the second futex for us. */
 	if (!q.rt_waiter) {
@@ -2275,10 +2318,12 @@ static int futex_wait_requeue_pi(u32 __u
 		 * did a lock-steal - fix up the PI-state in that case.
 		 */
 		if (q.pi_state && (q.pi_state->owner != current)) {
-			spin_lock(q.lock_ptr);
+			spin_lock(&hb2->lock);
+			BUG_ON(&hb2->lock != q.lock_ptr);
+
 			ret = fixup_pi_state_owner(uaddr2, &q, current,
 						   fshared);
-			spin_unlock(q.lock_ptr);
+			spin_unlock(&hb2->lock);
 		}
 	} else {
 		/*
@@ -2291,7 +2336,9 @@ static int futex_wait_requeue_pi(u32 __u
 		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 
-		spin_lock(q.lock_ptr);
+		spin_lock(&hb2->lock);
+		BUG_ON(&hb2->lock != q.lock_ptr);
+
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
Index: linux-2.6-tip/kernel/rtmutex.c
===================================================================
--- linux-2.6-tip.orig/kernel/rtmutex.c
+++ linux-2.6-tip/kernel/rtmutex.c
@@ -82,6 +82,11 @@ static void fixup_rt_mutex_waiters(struc
 		clear_rt_mutex_waiters(lock);
 }
 
+static inline int rt_mutex_real_waiter(struct rt_mutex_waiter *waiter)
+{
+	return waiter && waiter != PI_WAKEUP_INPROGRESS;
+}
+
 /*
  * We can speed up the acquire/release, if the architecture
  * supports cmpxchg and if there's no debugging state to be set up
@@ -227,7 +232,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * reached or the state of the chain has changed while we
 	 * dropped the locks.
 	 */
-	if (!waiter || !waiter->task)
+	if (!rt_mutex_real_waiter(waiter) || !waiter->task)
 		goto out_unlock_pi;
 
 	/*
@@ -448,6 +453,23 @@ static int task_blocks_on_rt_mutex(struc
 	int chain_walk = 0, res;
 
 	raw_spin_lock(&task->pi_lock);
+
+	/*
+	 * In the case of futex requeue PI, this will be a proxy
+	 * lock. The task will wake unaware that it is enqueued on
+	 * this lock. Avoid blocking on two locks and corrupting
+	 * pi_blocked_on via the PI_WAKEUP_INPROGRESS
+	 * flag. futex_wait_requeue_pi() sets this when it wakes up
+	 * before requeue (due to a signal or timeout). Do not enqueue
+	 * the task if PI_WAKEUP_INPROGRESS is set.
+	 */
+	if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
+		raw_spin_unlock(&task->pi_lock);
+		return -EAGAIN;
+	}
+
+	BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on));
+
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
@@ -469,7 +491,7 @@ static int task_blocks_on_rt_mutex(struc
 		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
 
 		__rt_mutex_adjust_prio(owner);
-		if (owner->pi_blocked_on)
+		if (rt_mutex_real_waiter(owner->pi_blocked_on))
 			chain_walk = 1;
 		raw_spin_unlock(&owner->pi_lock);
 	}
@@ -624,7 +646,7 @@ static void remove_waiter(struct rt_mute
 		}
 		__rt_mutex_adjust_prio(owner);
 
-		if (owner->pi_blocked_on)
+		if (rt_mutex_real_waiter(owner->pi_blocked_on))
 			chain_walk = 1;
 
 		raw_spin_unlock(&owner->pi_lock);
@@ -658,7 +680,8 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->list_entry.prio == task->prio) {
+	if (!rt_mutex_real_waiter(waiter) ||
+	    waiter->list_entry.prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -1527,7 +1550,7 @@ int rt_mutex_start_proxy_lock(struct rt_
 	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock,
 				      flags);
 
-	if (ret && !waiter->task) {
+	if (ret == -EDEADLK && !waiter->task) {
 		/*
 		 * Reset the return value. We might have
 		 * returned with -EDEADLK and the owner
Index: linux-2.6-tip/kernel/rtmutex_common.h
===================================================================
--- linux-2.6-tip.orig/kernel/rtmutex_common.h
+++ linux-2.6-tip/kernel/rtmutex_common.h
@@ -115,6 +115,9 @@ static inline unsigned long rt_mutex_own
 /*
  * PI-futex support (proxy locking functions, etc.):
  */
+
+#define PI_WAKEUP_INPROGRESS	((struct rt_mutex_waiter *) 1)
+
 extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock);
 extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 				       struct task_struct *proxy_owner);


* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2
  2010-07-13 11:52         ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner
@ 2010-07-13 15:57           ` Mike Galbraith
  2010-07-13 18:59           ` Darren Hart
  2010-07-18  8:32           ` Mike Galbraith
  2 siblings, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-13 15:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, lkml,,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, linux-rt-users

On Tue, 2010-07-13 at 13:52 +0200, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Thomas Gleixner wrote:
> > 
> > This code causes braindamage. I really wonder whether we need to
> > remove it according to the "United Nations Convention against Torture
> > and Other Cruel, Inhuman or Degrading Treatment or Punishment".
> > 
> 
> Ok, finally managed to twist my brain around it. Mike, can you give it
> a test ride ?

I'm away from box until Friday evening.  I'll test, but it'll likely be
ancient history by then.

	-Mike




* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2
  2010-07-13 11:52         ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner
  2010-07-13 15:57           ` Mike Galbraith
@ 2010-07-13 18:59           ` Darren Hart
  2010-07-18  8:32           ` Mike Galbraith
  2 siblings, 0 replies; 29+ messages in thread
From: Darren Hart @ 2010-07-13 18:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: lkml, ,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, Mike Galbraith, linux-rt-users

On 07/13/2010 04:52 AM, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Thomas Gleixner wrote:
>>
>> This code causes braindamage. I really wonder whether we need to
>> remove it according to the "United Nations Convention against Torture
>> and Other Cruel, Inhuman or Degrading Treatment or Punishment".
>>
>
> Ok, finally managed to twist my brain around it. Mike, can you give it
> a test ride ?

Since Mike is out, I built this version and ran it a few times. I saw a 
100% reproduction rate previously. I haven't seen any errors in a 
handful of runs (~10) with this patch.

I do still see the hrtimer latency message on the first run, so this is 
likely unrelated to the issue we're addressing:

Jul 13 14:47:59 elm9m94 kernel: hrtimer: interrupt took 123924 ns

As for Thomas's changes, only a couple of nitpicks below:


> @@ -2255,18 +2265,51 @@ static int futex_wait_requeue_pi(u32 __u
>   	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
>   	futex_wait_queue_me(hb,&q, to);
>
> -	spin_lock(&hb->lock);
> -	ret = handle_early_requeue_pi_wakeup(hb,&q,&key2, to);
> -	spin_unlock(&hb->lock);
> -	if (ret)
> -		goto out_put_keys;
> +	/*
> +	 * Avoid races with requeue and trying to block on two mutexes
> +	 * (hb->lock and uaddr2's rtmutex) by serializing access to
> +	 * pi_blocked_on with pi_lock.
> +	 */
> +	raw_spin_lock_irq(&current->pi_lock);
> +	if (current->pi_blocked_on) {
> +		/* Requeue happened already */

This comment isn't quite accurate. The requeue may be in progress, which 
means the q.lock_ptr is not trustworthy as noted below. Consider:

		/*
		 * We have been requeued, or are in the process
		 * of being requeued.
		 */

> +		raw_spin_unlock_irq(&current->pi_lock);
> +	} else {
> +		/*
> +		 * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS
> +		 * prevents a concurrent requeue from enqueuein us on

s/enqueuein/enqueueing/  not that my dictionary thinks either one of
them is a word ;-)

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


* Re: [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2
  2010-07-13 11:52         ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner
  2010-07-13 15:57           ` Mike Galbraith
  2010-07-13 18:59           ` Darren Hart
@ 2010-07-18  8:32           ` Mike Galbraith
  2 siblings, 0 replies; 29+ messages in thread
From: Mike Galbraith @ 2010-07-18  8:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, lkml,,
	Peter Zijlstra, Ingo Molnar, Eric Dumazet, John Kacur,
	Steven Rostedt, linux-rt-users

On Tue, 2010-07-13 at 13:52 +0200, Thomas Gleixner wrote:
> On Tue, 13 Jul 2010, Thomas Gleixner wrote:

> Ok, finally managed to twist my brain around it. Mike, can you give it
> a test ride ?

It's a bit late, but I took it out for a spin, no problem noted.

	-Mike



end of thread

Thread overview: 29+ messages
2010-07-07  4:46 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() Mike Galbraith
2010-07-07  8:03 ` Mike Galbraith
2010-07-07 11:57   ` Thomas Gleixner
2010-07-07 12:50     ` Mike Galbraith
2010-07-07 11:57 ` Thomas Gleixner
2010-07-07 14:03   ` Darren Hart
2010-07-07 14:17     ` Mike Galbraith
2010-07-08 12:05     ` Mike Galbraith
2010-07-08 14:12       ` Darren Hart
2010-07-09  2:11   ` Darren Hart
2010-07-09  4:32     ` Mike Galbraith
     [not found]     ` <4C36CD83.6070809@us.ibm.com>
2010-07-09  8:13       ` Mike Galbraith
2010-07-09 13:58       ` Mike Galbraith
2010-07-09 14:51         ` Mike Galbraith
2010-07-09 16:35         ` Darren Hart
2010-07-09 19:34           ` Mike Galbraith
2010-07-09 20:05   ` Darren Hart
2010-07-13  8:03   ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Darren Hart
2010-07-13  9:25     ` Thomas Gleixner
2010-07-13 10:28       ` Thomas Gleixner
2010-07-13 11:52         ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI -V2 Thomas Gleixner
2010-07-13 15:57           ` Mike Galbraith
2010-07-13 18:59           ` Darren Hart
2010-07-18  8:32           ` Mike Galbraith
2010-07-13  9:58     ` [PATCH][RT] futex: protect against pi_blocked_on corruption during requeue PI Thomas Gleixner
2010-07-07 14:11 ` 2.6.33.[56]-rt23: howto create repeatable explosion in wakeup_next_waiter() gowrishankar
2010-07-07 14:31   ` Mike Galbraith
2010-07-07 15:05     ` Darren Hart
2010-07-07 17:45       ` Mike Galbraith
