linux-kernel.vger.kernel.org archive mirror
* WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
@ 2018-11-27  8:11 Heiko Carstens
  2018-11-28 14:32 ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2018-11-27  8:11 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Ingo Molnar
  Cc: Martin Schwidefsky, linux-kernel, linux-s390

Hello,

with the glibc self-tests I was able to trigger the "this should not
happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
like it is hardly reproducible.

This one happened with commit d146194f31c9 for compiling the kernel.
Config can be re-created with "make ARCH=s390 performance_defconfig".

[  649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
[  649.596946] Kernel panic - not syncing: panic_on_warn set ...
[  649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
[  649.596953] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[  649.596956] Call Trace:
[  649.596963] ([<0000000000113050>] show_stack+0x58/0x70)
[  649.596970]  [<0000000000a62a92>] dump_stack+0x7a/0xa8
[  649.596975]  [<0000000000144012>] panic+0x11a/0x258
[  649.596978]  [<0000000000143e70>] __warn+0xf8/0x118
[  649.596981]  [<0000000000a61c20>] report_bug+0xd8/0x150
[  649.596985]  [<00000000001014ac>] do_report_trap+0xc4/0xe0
[  649.596988]  [<0000000000101680>] illegal_op+0x138/0x150
[  649.596994]  [<0000000000a82bec>] pgm_check_handler+0x1cc/0x220
[  649.596998]  [<00000000001e89b2>] do_futex+0xa9a/0xc50
[  649.597002] ([<00000000001e8b16>] do_futex+0xbfe/0xc50)
[  649.597006]  [<00000000001e8c4c>] sys_futex+0xe4/0x170
[  649.597010]  [<0000000000a82800>] system_call+0xdc/0x2c8

Thanks,
Heiko



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2018-11-27  8:11 WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
@ 2018-11-28 14:32 ` Thomas Gleixner
  2018-11-29 11:23   ` Heiko Carstens
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2018-11-28 14:32 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390

Heiko,

On Tue, 27 Nov 2018, Heiko Carstens wrote:

> with the glibc self-tests I was able to trigger the "this should not
> happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
> like it is hardly reproducible.

Any idea which self-test triggered that?

> This one happened with commit d146194f31c9 for compiling the kernel.
> Config can be re-created with "make ARCH=s390 performance_defconfig".

Which is not really helpful for people who do not own a s390. And no, I
don't want one unless IBM pays the power bill as well :)

> [  649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
> [  649.596946] Kernel panic - not syncing: panic_on_warn set ...
> [  649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1

That's ld64.so.1. Weird, but what do I know about glibc self tests.

I still fail to see how that can happen, but I usually page out the futex
horrors immediately. I'll keep staring at the code...

Thanks,

	tglx




* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2018-11-28 14:32 ` Thomas Gleixner
@ 2018-11-29 11:23   ` Heiko Carstens
  2019-01-21 12:21     ` Heiko Carstens
  2019-01-28 13:44     ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Heiko Carstens @ 2018-11-29 11:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Wed, Nov 28, 2018 at 03:32:45PM +0100, Thomas Gleixner wrote:
> Heiko,
> 
> On Tue, 27 Nov 2018, Heiko Carstens wrote:
> 
> > with the glibc self-tests I was able to trigger the "this should not
> > happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
> > like it is hardly reproducible.
> 
> Any idea which self-test triggered that?
> 
> > This one happened with commit d146194f31c9 for compiling the kernel.
> > Config can be re-created with "make ARCH=s390 performance_defconfig".
> 
> Which is not really helpful for people who do not own a s390. And no, I
> don't want one unless IBM pays the power bill as well :)
> 
> > [  649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
> > [  649.596946] Kernel panic - not syncing: panic_on_warn set ...
> > [  649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
> 
> That's ld64.so.1. Weird, but what do I know about glibc self tests.
> 
> I still fail to see how that can happen, but I usually page out the futex
> horrors immediately. I'll keep staring at the code...

I looked into the system dumps, and if I didn't screw up, then the
command line for both occurrences was

/root/glibc-build/nptl/tst-robustpi8

And indeed, if I run only this test case in an endless loop and do
some parallel work (like kernel compile) it currently seems to be
possible to reproduce the warning:

while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done

within the build directory of glibc (2.28).

See
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robustpi8.c;h=cbea3d6d77abb00be05ec7b466d8339c26dd2efb;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b

which includes this one:

https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robust8.c;h=9c636250d4cb0bcd6d802910e8f9ea31568bb73f;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
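
(For readers without a glibc tree handy: the pattern those tests exercise
is, roughly, a process-shared, robust, priority-inheritance mutex whose
owners exit while still holding it, so both the kernel's robust list
handling and the PI futex code are involved. A minimal standalone sketch
of that pattern -- just an illustration, not the glibc test itself --
could look like this:

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *m;
	int i, ret;

	/* Anonymous shared mapping so the mutex is visible in the children. */
	m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT); /* PI futex */
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);   /* robust list */
	pthread_mutex_init(m, &attr);

	for (i = 0; i < 8; i++) {
		pid_t pid = fork();

		if (pid == 0) {
			/*
			 * Child: take the lock and exit without unlocking.
			 * The kernel's robust list handling marks the futex
			 * word with FUTEX_OWNER_DIED when the task exits.
			 */
			pthread_mutex_lock(m);
			_exit(0);
		}
		waitpid(pid, NULL, 0);

		/* Parent: the next lock attempt reports the dead owner. */
		ret = pthread_mutex_lock(m);
		if (ret == EOWNERDEAD) {
			pthread_mutex_consistent(m);
			ret = 0;
		}
		if (ret) {
			fprintf(stderr, "lock: %s\n", strerror(ret));
			return 1;
		}
		pthread_mutex_unlock(m);
	}
	return 0;
}

Build with "gcc -pthread". The real tests do the same with many mutexes
and competing threads/processes, which is what makes the race window
reachable under load.)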



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2018-11-29 11:23   ` Heiko Carstens
@ 2019-01-21 12:21     ` Heiko Carstens
  2019-01-21 13:12       ` Thomas Gleixner
  2019-01-28 13:44     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-01-21 12:21 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	linux-kernel, linux-s390, Stefan Liebler

Hi Thomas,

[full quote below]

Did you have any time to look into this yet? :)

The warning is still reproducible.

On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> On Wed, Nov 28, 2018 at 03:32:45PM +0100, Thomas Gleixner wrote:
> > Heiko,
> > 
> > On Tue, 27 Nov 2018, Heiko Carstens wrote:
> > 
> > > with the glibc self-tests I was able to trigger the "this should not
> > > happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
> > > like it is hardly reproducible.
> > 
> > Any idea which self-test triggered that?
> > 
> > > This one happened with commit d146194f31c9 for compiling the kernel.
> > > Config can be re-created with "make ARCH=s390 performance_defconfig".
> > 
> > Which is not really helpful for people who do not own a s390. And no, I
> > don't want one unless IBM pays the power bill as well :)
> > 
> > > [  649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
> > > [  649.596946] Kernel panic - not syncing: panic_on_warn set ...
> > > [  649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
> > 
> > That's ld64.so.1. Weird, but what do I know about glibc self tests.
> > 
> > I still fail to see how that can happen, but I usually page out the futex
> > horrors immediately. I'll keep staring at the code...
> 
> I looked into the system dumps, and if I didn't screw up, then the
> command line for both occurrences was
> 
> /root/glibc-build/nptl/tst-robustpi8
> 
> And indeed, if I run only this test case in an endless loop and do
> some parallel work (like kernel compile) it currently seems to be
> possible to reproduce the warning:
> 
> while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> 
> within the build directory of glibc (2.28).
> 
> See
> https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robustpi8.c;h=cbea3d6d77abb00be05ec7b466d8339c26dd2efb;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
> 
> which includes this one:
> 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robust8.c;h=9c636250d4cb0bcd6d802910e8f9ea31568bb73f;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
> 



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-21 12:21     ` Heiko Carstens
@ 2019-01-21 13:12       ` Thomas Gleixner
  2019-01-22 21:14         ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-21 13:12 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Mon, 21 Jan 2019, Heiko Carstens wrote:

> Hi Thomas,
> 
> [full quote below]
> 
> Did you have any time to look into this yet? :)
> 
> The warning is still reproducible.

Yeah, it's on my list of stuff which I need to take care of urgently. In
the next couple of days I hope...

Thanks,

	tglx


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-21 13:12       ` Thomas Gleixner
@ 2019-01-22 21:14         ` Thomas Gleixner
  2019-01-23  9:24           ` Heiko Carstens
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-22 21:14 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Mon, 21 Jan 2019, Thomas Gleixner wrote:
> On Mon, 21 Jan 2019, Heiko Carstens wrote:
> 
> > Hi Thomas,
> > 
> > [full quote below]
> > 
> > Did you have any time to look into this yet? :)
> > 
> > The warning is still reproducible.
> 
> Yeah, it's on my list of stuff which I need to take care of urgently. In
> the next couple of days I hope...

Hmm. Doesn't

     da791a667536 ("futex: Cure exit race")

address that issue?

Thanks,

	tglx


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-22 21:14         ` Thomas Gleixner
@ 2019-01-23  9:24           ` Heiko Carstens
  2019-01-23 12:33             ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-01-23  9:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Tue, Jan 22, 2019 at 10:14:00PM +0100, Thomas Gleixner wrote:
> On Mon, 21 Jan 2019, Thomas Gleixner wrote:
> > On Mon, 21 Jan 2019, Heiko Carstens wrote:
> > 
> > > Hi Thomas,
> > > 
> > > [full quote below]
> > > 
> > > Did you have any time to look into this yet? :)
> > > 
> > > The warning is still reproducible.
> > 
> > Yeah, it's on my list of stuff which I need to take care of urgently. In
> > the next couple of days I hope...
> 
> Hmm. Doesn't
> 
>      da791a667536 ("futex: Cure exit race")
> 
> address that issue?

It doesn't look like it does. One occurrence was the one below when
using commit 7939f8beecf1 (which is post 5.0-rc2) for building the
kernel:

 WARNING: CPU: 14 PID: 23505 at kernel/futex.c:1483 do_futex+0xa9a/0xc50
 Kernel panic - not syncing: panic_on_warn set ...
 CPU: 14 PID: 23505 Comm: ld.so.1 Not tainted 5.0.0-20190116.rc2.git0.7939f8beecf1.300.fc29.s390x+git #1
 Hardware name: IBM 3906 M04 704 (LPAR)
 Call Trace:
 ([<0000000000112e60>] show_stack+0x58/0x70)
  [<0000000000a671fa>] dump_stack+0x7a/0xa8
  [<0000000000143f52>] panic+0x11a/0x2d0
  [<0000000000143db0>] __warn+0xf8/0x118
  [<0000000000a662f8>] report_bug+0xd8/0x150
  [<00000000001014ac>] do_report_trap+0xc4/0xe0
  [<0000000000101680>] illegal_op+0x138/0x150
  [<0000000000a87270>] pgm_check_handler+0x1c8/0x220
  [<00000000001e9aea>] do_futex+0xa9a/0xc50
 ([<00000000001e9c4e>] do_futex+0xbfe/0xc50)
  [<00000000001ea13c>] compat_sys_futex+0xe4/0x170
  [<0000000000a86e84>] system_call+0xd8/0x2c8



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-23  9:24           ` Heiko Carstens
@ 2019-01-23 12:33             ` Thomas Gleixner
  2019-01-23 12:40               ` Heiko Carstens
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-23 12:33 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

Heiko,

On Wed, 23 Jan 2019, Heiko Carstens wrote:
> It doesn't look like it does. One occurrence was the one below when
> using commit 7939f8beecf1 (which is post 5.0-rc2) for building the
> kernel:
>
>  WARNING: CPU: 14 PID: 23505 at kernel/futex.c:1483 do_futex+0xa9a/0xc50
>  Kernel panic - not syncing: panic_on_warn set ...
>  CPU: 14 PID: 23505 Comm: ld.so.1 Not tainted 5.0.0-20190116.rc2.git0.7939f8beecf1.300.fc29.s390x+git #1
>  Hardware name: IBM 3906 M04 704 (LPAR)
>  Call Trace:
>  ([<0000000000112e60>] show_stack+0x58/0x70)
>   [<0000000000a671fa>] dump_stack+0x7a/0xa8
>   [<0000000000143f52>] panic+0x11a/0x2d0
>   [<0000000000143db0>] __warn+0xf8/0x118
>   [<0000000000a662f8>] report_bug+0xd8/0x150
>   [<00000000001014ac>] do_report_trap+0xc4/0xe0
>   [<0000000000101680>] illegal_op+0x138/0x150
>   [<0000000000a87270>] pgm_check_handler+0x1c8/0x220
>   [<00000000001e9aea>] do_futex+0xa9a/0xc50
>  ([<00000000001e9c4e>] do_futex+0xbfe/0xc50)
>   [<00000000001ea13c>] compat_sys_futex+0xe4/0x170
>   [<0000000000a86e84>] system_call+0xd8/0x2c8 

Ok. That's another one which I can't spot yet.

    Is that only on S390?
    What's the test which explodes? Kernel build?

Thanks,

	tglx



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-23 12:33             ` Thomas Gleixner
@ 2019-01-23 12:40               ` Heiko Carstens
  0 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2019-01-23 12:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Wed, Jan 23, 2019 at 01:33:15PM +0100, Thomas Gleixner wrote:
> Heiko,
> 
> On Wed, 23 Jan 2019, Heiko Carstens wrote:
> > It doesn't look like it does. One occurrence was the one below when
> > using commit 7939f8beecf1 (which is post 5.0-rc2) for building the
> > kernel:
> >
> >  WARNING: CPU: 14 PID: 23505 at kernel/futex.c:1483 do_futex+0xa9a/0xc50
> >  Kernel panic - not syncing: panic_on_warn set ...
> >  CPU: 14 PID: 23505 Comm: ld.so.1 Not tainted 5.0.0-20190116.rc2.git0.7939f8beecf1.300.fc29.s390x+git #1
> >  Hardware name: IBM 3906 M04 704 (LPAR)
> >  Call Trace:
> >  ([<0000000000112e60>] show_stack+0x58/0x70)
> >   [<0000000000a671fa>] dump_stack+0x7a/0xa8
> >   [<0000000000143f52>] panic+0x11a/0x2d0
> >   [<0000000000143db0>] __warn+0xf8/0x118
> >   [<0000000000a662f8>] report_bug+0xd8/0x150
> >   [<00000000001014ac>] do_report_trap+0xc4/0xe0
> >   [<0000000000101680>] illegal_op+0x138/0x150
> >   [<0000000000a87270>] pgm_check_handler+0x1c8/0x220
> >   [<00000000001e9aea>] do_futex+0xa9a/0xc50
> >  ([<00000000001e9c4e>] do_futex+0xbfe/0xc50)
> >   [<00000000001ea13c>] compat_sys_futex+0xe4/0x170
> >   [<0000000000a86e84>] system_call+0xd8/0x2c8 
> 
> Ok. That's another one which I can't spot yet.
> 
>     Is that only on S390?
>     What's the test which explodes? Kernel build?

No, it's the tst-robustpi8 glibc 2.28 selftest. Let me quote myself
from earlier in this thread:

"
... tst-robustpi8

And indeed, if I run only this test case in an endless loop and do
some parallel work (like kernel compile) it currently seems to be
possible to reproduce the warning:

while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done

within the build directory of glibc (2.28).

See
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robustpi8.c;h=cbea3d6d77abb00be05ec7b466d8339c26dd2efb;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b

which includes this one:

https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robust8.c;h=9c636250d4cb0bcd6d802910e8f9ea31568bb73f;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
"



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2018-11-29 11:23   ` Heiko Carstens
  2019-01-21 12:21     ` Heiko Carstens
@ 2019-01-28 13:44     ` Peter Zijlstra
  2019-01-28 13:58       ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2019-01-28 13:44 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:

> And indeed, if I run only this test case in an endless loop and do
> some parallel work (like kernel compile) it currently seems to be
> possible to reproduce the warning:
> 
> while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> 
> within the build directory of glibc (2.28).

Right; so that reproduces for me.

After staring at all that for a while; trying to remember how it all
worked (or supposed to work rather), I became suspicious of commit:

  56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")

And indeed, when I revert that; the above reproducer no longer works (as
in, it no longer triggers in minutes and has -- so far -- held up for an
hour+ or so).

That patch in particular allows futex_unlock_pi() to 'start' early:


futex_lock_pi()				futex_unlock_pi()
  lock hb
  queue
  lock wait_lock
  unlock hb
					lock hb
					futex_top_waiter
					get_pi_state
					lock wait_lock
  rt_mutex_proxy_start // fail
  unlock wait_lock
					// acquired wait_lock
					wake_futex_pi()
					rt_mutex_next_owner() // whoops, no waiter
					WARN

  lock hb
  unqueue_me_pi



So reverting that patch should cure things, because then there is no hb
lock break between queue/unqueue and futex_unlock_pi() cannot observe
this half-arsed state.

Now obviously reverting that makes RT unhappy; let me see what the
options are.

(concurrently tglx generated a trace that corroborates)


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-28 13:44     ` Peter Zijlstra
@ 2019-01-28 13:58       ` Peter Zijlstra
  2019-01-28 15:53         ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2019-01-28 13:58 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Ingo Molnar, Martin Schwidefsky, linux-kernel,
	linux-s390, Stefan Liebler

On Mon, Jan 28, 2019 at 02:44:10PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> 
> > And indeed, if I run only this test case in an endless loop and do
> > some parallel work (like kernel compile) it currently seems to be
> > possible to reproduce the warning:
> > 
> > while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> > 
> > within the build directory of glibc (2.28).
> 
> Right; so that reproduces for me.
> 
> After staring at all that for a while; trying to remember how it all
> worked (or supposed to work rather), I became suspicious of commit:
> 
>   56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
> 
> And indeed, when I revert that; the above reproducer no longer works (as
> in, it no longer triggers in minutes and has -- so far -- held up for an
> hour+ or so).
> 
> That patch in particular allows futex_unlock_pi() to 'start' early:
> 
> 
> futex_lock_pi()			futex_unlock_pi()
>   lock hb
>   queue
>   lock wait_lock
>   unlock hb
> 					lock hb
> 					futex_top_waiter
> 					get_pi_state
> 					lock wait_lock
>   rt_mutex_proxy_start // fail
>   unlock wait_lock
> 					// acquired wait_lock
					unlock hb
> 					wake_futex_pi()
> 					rt_mutex_next_owner() // whoops, no waiter
> 					WARN

and simply removing that WARN would allow futex_unlock_pi() to spin on
retry until the futex_lock_pi() CPU comes around to doing the lock hb
below:

>   lock hb
>   unqueue_me_pi

Which seems undesirable from a determinism POV.




* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-28 13:58       ` Peter Zijlstra
@ 2019-01-28 15:53         ` Thomas Gleixner
  2019-01-29  8:49           ` Peter Zijlstra
  2019-01-29  9:01           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-28 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Mon, 28 Jan 2019, Peter Zijlstra wrote:
> On Mon, Jan 28, 2019 at 02:44:10PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> > 
> > > And indeed, if I run only this test case in an endless loop and do
> > > some parallel work (like kernel compile) it currently seems to be
> > > possible to reproduce the warning:
> > > 
> > > while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> > > 
> > > within the build directory of glibc (2.28).
> > 
> > Right; so that reproduces for me.
> > 
> > After staring at all that for a while; trying to remember how it all
> > worked (or supposed to work rather), I became suspicious of commit:
> > 
> >   56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
> > 
> > And indeed, when I revert that; the above reproducer no longer works (as
> > in, it no longer triggers in minutes and has -- so far -- held up for an
> > hour+ or so).

Right, after staring long enough at it, the commit simply forgot to give
__rt_mutex_start_proxy_lock() the same treatment as it gave to
rt_mutex_wait_proxy_lock().

Patch below cures that.

Thanks,

	tglx

8<----------------

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2845,7 +2845,7 @@ static int futex_lock_pi(u32 __user *uad
 		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
-		goto no_block;
+		goto cleanup;
 	}
 
 	rt_mutex_init_waiter(&rt_waiter);
@@ -2870,17 +2870,15 @@ static int futex_lock_pi(u32 __user *uad
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
-
-		spin_lock(q.lock_ptr);
 		goto no_block;
 	}
 
-
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+no_block:
 	spin_lock(q.lock_ptr);
 	/*
 	 * If we failed to acquire the lock (signal/timeout), we must
@@ -2894,7 +2892,7 @@ static int futex_lock_pi(u32 __user *uad
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
-no_block:
+cleanup:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
 	 * haven't already.
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1749,9 +1749,6 @@ int __rt_mutex_start_proxy_lock(struct r
 		ret = 0;
 	}
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
@@ -1778,6 +1775,8 @@ int rt_mutex_start_proxy_lock(struct rt_
 
 	raw_spin_lock_irq(&lock->wait_lock);
 	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	if (unlikely(ret))
+		remove_waiter(lock, waiter);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-28 15:53         ` Thomas Gleixner
@ 2019-01-29  8:49           ` Peter Zijlstra
  2019-01-29 22:15             ` [PATCH] futex: Handle early deadlock return correctly Thomas Gleixner
  2019-01-29  9:01           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2019-01-29  8:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Heiko Carstens, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:

> Right, after staring long enough at it, the commit simply forgot to give
> __rt_mutex_start_proxy_lock() the same treatment as it gave to
> rt_mutex_wait_proxy_lock().
> 
> Patch below cures that.

Yes, that is a very nice solution. Find below an updated patch that
includes a few comments. I found it harder than it should have been to
reconstruct this code.

(also, I flipped the label names)

---
 kernel/futex.c           | 28 ++++++++++++++++++----------
 kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++++-----
 2 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fdd312da0992..9d8411d3142d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	 * and BUG when futex_unlock_pi() interleaves with this.
 	 *
 	 * Therefore acquire wait_lock while holding hb->lock, but drop the
-	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
-	 * serializes against futex_unlock_pi() as that does the exact same
-	 * lock handoff sequence.
+	 * latter before calling __rt_mutex_start_proxy_lock(). This
+	 * interleaves with futex_unlock_pi() -- which does a similar lock
+	 * handoff -- such that the latter can observe the futex_q::pi_state
+	 * before __rt_mutex_start_proxy_lock() is done.
 	 */
 	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 	spin_unlock(q.lock_ptr);
+	/*
+	 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+	 * such that futex_unlock_pi() is guaranteed to observe the waiter when
+	 * it sees the futex_q::pi_state.
+	 */
 	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
 	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
 
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
-
-		spin_lock(q.lock_ptr);
-		goto no_block;
+		goto cleanup;
 	}
 
-
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+cleanup:
 	spin_lock(q.lock_ptr);
 	/*
-	 * If we failed to acquire the lock (signal/timeout), we must
+	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
-	 * wait lists consistent.
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+	 * lists consistent.
 	 *
 	 * In particular; it is important that futex_unlock_pi() can not
 	 * observe this inconsistency.
@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
 		 * observed.
+		 *
+		 * In particular; this forces __rt_mutex_start_proxy() to
+		 * complete such that we're guaranteed to observe the
+		 * rt_waiter. Also see the WARN in wake_futex_pi().
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 581edcc63c26..afaf37d0ac15 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 	rt_mutex_set_owner(lock, NULL);
 }
 
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:		the rt_mutex to take
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ * @task:		the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
 int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
 
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 		ret = 0;
 	}
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
  * @waiter:		the pre-initialized rt_mutex_waiter
  * @task:		the task to prepare
  *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
  * Returns:
  *  0 - task blocked on lock
  *  1 - acquired the lock for task, caller should wake it up
  * <0 - error
  *
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
  */
 int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 
 	raw_spin_lock_irq(&lock->wait_lock);
 	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	if (unlikely(ret))
+		remove_waiter(lock, waiter);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
  * @lock:		the rt_mutex we were woken on
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
  *
  * Unless we acquired the lock; we're still enqueued on the wait-list and can
  * in fact still be granted ownership until we're removed. Therefore we can


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-28 15:53         ` Thomas Gleixner
  2019-01-29  8:49           ` Peter Zijlstra
@ 2019-01-29  9:01           ` Heiko Carstens
  2019-01-29  9:33             ` Peter Zijlstra
  2019-01-29  9:45             ` Thomas Gleixner
  1 sibling, 2 replies; 47+ messages in thread
From: Heiko Carstens @ 2019-01-29  9:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> On Mon, 28 Jan 2019, Peter Zijlstra wrote:
> > On Mon, Jan 28, 2019 at 02:44:10PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> > > 
> > > > And indeed, if I run only this test case in an endless loop and do
> > > > some parallel work (like kernel compile) it currently seems to be
> > > > possible to reproduce the warning:
> > > > 
> > > > while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> > > > 
> > > > within the build directory of glibc (2.28).
> > > 
> > > Right; so that reproduces for me.
> > > 
> > > After staring at all that for a while; trying to remember how it all
> > > worked (or supposed to work rather), I became suspicious of commit:
> > > 
> > >   56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
> > > 
> > > And indeed, when I revert that; the above reproducer no longer works (as
> > > in, it no longer triggers in minutes and has -- so far -- held up for an
> > > hour+ or so).
> 
> Right, after staring long enough at it, the commit simply forgot to give
> __rt_mutex_start_proxy_lock() the same treatment as it gave to
> rt_mutex_wait_proxy_lock().
> 
> Patch below cures that.

With your patch the kernel warning doesn't occur anymore. So if this
is supposed to be the fix feel free to add:

Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>


However now I see every now and then the following failure from the
same test case:

tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.

		/* ESRCH can happen only for non-robust PI mutexes where
		   the owner of the lock died.	*/
		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);

I just verified that this happened also without your patch, I just
didn't see it since I started my tests with panic_on_warn=1 and the
warning triggered always earlier.
So, this seems to be something different.
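
(For reference: the ESRCH that assert guards against comes straight out of
the FUTEX_LOCK_PI syscall when the owner TID stored in the futex word does
not name a live task (cf. attach_to_pi_owner()/handle_exit_race() in
kernel/futex.c). A tiny standalone illustration with a made-up owner TID,
unrelated to the glibc test, just to show where the error code originates:

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static uint32_t futex_word;

int main(void)
{
	/*
	 * Pretend a task with this (made up, certainly nonexistent) TID owns
	 * the lock; a PI futex stores the owner TID in the low bits of the
	 * futex word.
	 */
	futex_word = 0x3fffffff;

	if (syscall(SYS_futex, &futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0) == -1)
		printf("FUTEX_LOCK_PI: %s\n", strerror(errno)); /* ESRCH expected */
	return 0;
}

The glibc lock path ends up doing the equivalent syscall, which is where
the assertion then trips.)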



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29  9:01           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
@ 2019-01-29  9:33             ` Peter Zijlstra
  2019-01-29  9:45             ` Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2019-01-29  9:33 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, Jan 29, 2019 at 10:01:08AM +0100, Heiko Carstens wrote:

> However now I see every now and then the following failure from the
> same test case:
> 
> tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
> 
> 		/* ESRCH can happen only for non-robust PI mutexes where
> 		   the owner of the lock died.	*/
> 		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
> 
> I just verified that this happened also without your patch, I just
> didn't see it since I started my tests with panic_on_warn=1 and the
> warning triggered always earlier.
> So, this seems to be something different.

*groan*... so while the other thing reproduced, I've not seen this
happen and I ran the test for hours.

I'll start it up again, see what happens..


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29  9:01           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
  2019-01-29  9:33             ` Peter Zijlstra
@ 2019-01-29  9:45             ` Thomas Gleixner
  2019-01-29 10:24               ` Heiko Carstens
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-29  9:45 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, 29 Jan 2019, Heiko Carstens wrote:
> On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> > Patch below cures that.
> 
> With your patch the kernel warning doesn't occur anymore. So if this
> is supposed to be the fix feel free to add:

Yes, it's supposed to be the fix.
> 
> However now I see every now and then the following failure from the
> same test case:
> 
> tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
> 
> 		/* ESRCH can happen only for non-robust PI mutexes where
> 		   the owner of the lock died.	*/
> 		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
> 
> I just verified that this happened also without your patch, I just
> didn't see it since I started my tests with panic_on_warn=1 and the
> warning triggered always earlier.
> So, this seems to be something different.

Moo. I ran the test loop all night (simply because I forgot to stop it) and
of course this does not trigger here. Could you try to gather a bit more
information with lightweight tracing?

Something like the below should give us at least a clue. Stop the tst loop
when the assert triggers and then extract the trace.

Thanks,

	tglx

8<--------------

#!/bin/sh

# Substitute with your ld comm string if it starts differently 
C=ld-linux

# Restrict the events to tasks whose comm starts with $C
echo "comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/syscalls/sys_enter_futex/filter
echo "comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/syscalls/sys_exit_futex/filter
echo "comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/sched/sched_process_exit/filter
echo "prev_comm ~ \"$C*\" || next_comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/sched/sched_switch/filter

echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_futex/enable
echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_exit_futex/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_process_exit/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29  9:45             ` Thomas Gleixner
@ 2019-01-29 10:24               ` Heiko Carstens
  2019-01-29 10:35                 ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-01-29 10:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, Jan 29, 2019 at 10:45:44AM +0100, Thomas Gleixner wrote:
> On Tue, 29 Jan 2019, Heiko Carstens wrote:
> > On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> > > Patch below cures that.
> > 
> > With your patch the kernel warning doesn't occur anymore. So if this
> > is supposed to be the fix feel free to add:
> 
> Yes, it's supposed to be the fix.
> > 
> > However now I see every now and then the following failure from the
> > same test case:
> > 
> > tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
> > 
> > 		/* ESRCH can happen only for non-robust PI mutexes where
> > 		   the owner of the lock died.	*/
> > 		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
> > 
> > I just verified that this happened also without your patch, I just
> > didn't see it since I started my tests with panic_on_warn=1 and the
> > warning triggered always earlier.
> > So, this seems to be something different.
> 
> Moo. I ran the test loop all night (simply because I forgot to stop it) and
> of course this does not trigger here. Could you try to gather a bit more
> information with lightweight tracing?

Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
patch. And now I am trying to reproduce with linux-next 20190129 +
your patch and it doesn't trigger. Did I miss a patch which is only in
linux-next which could fix this?



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29 10:24               ` Heiko Carstens
@ 2019-01-29 10:35                 ` Peter Zijlstra
  2019-01-29 13:03                   ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2019-01-29 10:35 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, Jan 29, 2019 at 11:24:09AM +0100, Heiko Carstens wrote:

> Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
> patch. And now I am trying to reproduce with linux-next 20190129 +
> your patch and it doesn't trigger. Did I miss a patch which is only in
> linux-next which could fix this?
> 

I'm forever confused on what patch is where; but -ESRCH makes me think
maybe you lost this one:

---

commit da791a667536bf8322042e38ca85d55a78d3c273
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Dec 10 14:35:14 2018 +0100

    futex: Cure exit race
    
    Stefan reported, that the glibc tst-robustpi4 test case fails
    occasionally. That case creates the following race between
    sys_exit() and sys_futex_lock_pi():
    
     CPU0                           CPU1
    
     sys_exit()                     sys_futex()
      do_exit()                      futex_lock_pi()
       exit_signals(tsk)              No waiters:
        tsk->flags |= PF_EXITING;     *uaddr == 0x00000PID
      mm_release(tsk)                 Set waiter bit
       exit_robust_list(tsk) {        *uaddr = 0x80000PID;
          Set owner died              attach_to_pi_owner() {
        *uaddr = 0xC0000000;           tsk = get_task(PID);
       }                               if (!tsk->flags & PF_EXITING) {
      ...                                attach();
      tsk->flags |= PF_EXITPIDONE;     } else {
                                         if (!(tsk->flags & PF_EXITPIDONE))
                                           return -EAGAIN;
                                         return -ESRCH; <--- FAIL
                                       }
    
    ESRCH is returned all the way to user space, which triggers the glibc test
    case assert. Returning ESRCH unconditionally is wrong here because the user
    space value has been changed by the exiting task to 0xC0000000, i.e. the
    FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
    is a valid state and the kernel has to handle it, i.e. taking the futex.
    
    Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
    is set in the task which 'owns' the futex. If the value has changed, let
    the kernel retry the operation, which includes all regular sanity checks
    and correctly handles the FUTEX_OWNER_DIED case.
    
    If it hasn't changed, then return ESRCH as there is no way to distinguish
    this case from malfunctioning user space. This happens when the exiting
    task did not have a robust list, the robust list was corrupted or the user
    space value in the futex was simply bogus.
    
    Reported-by: Stefan Liebler <stli@linux.ibm.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: Darren Hart <dvhart@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
    Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de

diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 	return ret;
 }
 
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+			    struct task_struct *tsk)
+{
+	u32 uval2;
+
+	/*
+	 * If PF_EXITPIDONE is not yet set, then try again.
+	 */
+	if (tsk && !(tsk->flags & PF_EXITPIDONE))
+		return -EAGAIN;
+
+	/*
+	 * Reread the user space value to handle the following situation:
+	 *
+	 * CPU0				CPU1
+	 *
+	 * sys_exit()			sys_futex()
+	 *  do_exit()			 futex_lock_pi()
+	 *                                futex_lock_pi_atomic()
+	 *   exit_signals(tsk)		    No waiters:
+	 *    tsk->flags |= PF_EXITING;	    *uaddr == 0x00000PID
+	 *  mm_release(tsk)		    Set waiter bit
+	 *   exit_robust_list(tsk) {	    *uaddr = 0x80000PID;
+	 *      Set owner died		    attach_to_pi_owner() {
+	 *    *uaddr = 0xC0000000;	     tsk = get_task(PID);
+	 *   }				     if (!tsk->flags & PF_EXITING) {
+	 *  ...				       attach();
+	 *  tsk->flags |= PF_EXITPIDONE;     } else {
+	 *				       if (!(tsk->flags & PF_EXITPIDONE))
+	 *				         return -EAGAIN;
+	 *				       return -ESRCH; <--- FAIL
+	 *				     }
+	 *
+	 * Returning ESRCH unconditionally is wrong here because the
+	 * user space value has been changed by the exiting task.
+	 *
+	 * The same logic applies to the case where the exiting task is
+	 * already gone.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		return -EFAULT;
+
+	/* If the user space value has changed, try again. */
+	if (uval2 != uval)
+		return -EAGAIN;
+
+	/*
+	 * The exiting task did not have a robust list, the robust list was
+	 * corrupted or the user space value in *uaddr is simply bogus.
+	 * Give up and tell user space.
+	 */
+	return -ESRCH;
+}
+
 /*
  * Lookup the task for the TID provided from user space and attach to
  * it after doing proper sanity checks.
  */
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * We are the first waiter - try to look up the real owner and attach
 	 * the new pi_state to it, but bail out when TID = 0 [1]
+	 *
+	 * The !pid check is paranoid. None of the call sites should end up
+	 * with pid == 0, but better safe than sorry. Let the caller retry
 	 */
 	if (!pid)
-		return -ESRCH;
+		return -EAGAIN;
 	p = find_get_task_by_vpid(pid);
 	if (!p)
-		return -ESRCH;
+		return handle_exit_race(uaddr, uval, NULL);
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+		int ret = handle_exit_race(uaddr, uval, p);
 
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
 	 * We are the first waiter - try to look up the owner based on
 	 * @uval and attach to it.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 * attach to the owner. If that fails, no harm done, we only
 	 * set the FUTEX_WAITERS bit in the user space variable.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, newval, key, ps);
 }
 
 /**


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29 10:35                 ` Peter Zijlstra
@ 2019-01-29 13:03                   ` Thomas Gleixner
  2019-01-29 13:23                     ` Heiko Carstens
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-29 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, 29 Jan 2019, Peter Zijlstra wrote:

> On Tue, Jan 29, 2019 at 11:24:09AM +0100, Heiko Carstens wrote:
> 
> > Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
> > patch. And now I am trying to reproduce with linux-next 20190129 +
> > your patch and it doesn't trigger. Did I miss a patch which is only in
> > linux-next which could fix this?
> > 
> 
> I'm forever confused on what patch is where; but -ESRCH makes me thing
> maybe you lost this one:
> 
> ---
> 
> commit da791a667536bf8322042e38ca85d55a78d3c273
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Mon Dec 10 14:35:14 2018 +0100

That's in Linus tree already ...


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29 13:03                   ` Thomas Gleixner
@ 2019-01-29 13:23                     ` Heiko Carstens
       [not found]                       ` <20190129151058.GG26906@osiris>
  0 siblings, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-01-29 13:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, Jan 29, 2019 at 02:03:01PM +0100, Thomas Gleixner wrote:
> On Tue, 29 Jan 2019, Peter Zijlstra wrote:
> 
> > On Tue, Jan 29, 2019 at 11:24:09AM +0100, Heiko Carstens wrote:
> > 
> > > Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
> > > patch. And now I am trying to reproduce with linux-next 20190129 +
> > > your patch and it doesn't trigger. Did I miss a patch which is only in
> > > linux-next which could fix this?
> > > 
> > 
> > I'm forever confused on what patch is where; but -ESRCH makes me thing
> > maybe you lost this one:
> > 
> > ---
> > 
> > commit da791a667536bf8322042e38ca85d55a78d3c273
> > Author: Thomas Gleixner <tglx@linutronix.de>
> > Date:   Mon Dec 10 14:35:14 2018 +0100
> 
> That's in Linus tree already ...

Yes, for some reason it's hardly reproducible now. It did hit once
after an hour, but (of course) without tracing enabled... still
trying.



* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
       [not found]                       ` <20190129151058.GG26906@osiris>
@ 2019-01-29 17:16                         ` Sebastian Sewior
  2019-01-29 21:45                           ` Thomas Gleixner
       [not found]                           ` <20190130094913.GC5299@osiris>
  0 siblings, 2 replies; 47+ messages in thread
From: Sebastian Sewior @ 2019-01-29 17:16 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler

On 2019-01-29 16:10:58 [+0100], Heiko Carstens wrote:
> Finally... the trace output is quite large at 26 MB... Therefore an
> xz compressed attachment. Hope that's ok.
> 
> The kernel used was linux-next 20190129 + your patch.
|        ld64.so.1-10237 [006] .... 14232.031726: sys_futex(uaddr: 3ff88e80618, op: 7, val: 3ff00000007, utime: 3ff88e7f910, uaddr2: 3ff88e7f910, val3: 3ffc167e8d7)
FUTEX_UNLOCK_PI | SHARED

|        ld64.so.1-10237 [006] .... 14232.031726: sys_futex -> 0x0
…
|        ld64.so.1-10237 [006] .... 14232.051751: sched_process_exit: comm=ld64.so.1 pid=10237 prio=120
…
|        ld64.so.1-10148 [006] .... 14232.061826: sys_futex(uaddr: 3ff88e80618, op: 6, val: 1, utime: 0, uaddr2: 2, val3: 0)
FUTEX_LOCK_PI | SHARED

|        ld64.so.1-10148 [006] .... 14232.061826: sys_futex -> 0xfffffffffffffffd
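
(Decoding the raw values above: op 6 and op 7 are FUTEX_LOCK_PI and
FUTEX_UNLOCK_PI with no FUTEX_PRIVATE_FLAG bit set, i.e. shared futexes,
and the return value 0xfffffffffffffffd is -3, i.e. -ESRCH. A throwaway
helper that does this decoding, purely for illustration:

#include <linux/futex.h>
#include <stdio.h>
#include <string.h>

/*
 * Decode a raw futex op word and syscall return value as they appear in
 * the ftrace output above. Illustration only, not part of any patch.
 */
static void decode(unsigned long long op, long long ret)
{
	unsigned int cmd = op & FUTEX_CMD_MASK;
	const char *name = cmd == FUTEX_LOCK_PI   ? "FUTEX_LOCK_PI" :
			   cmd == FUTEX_UNLOCK_PI ? "FUTEX_UNLOCK_PI" : "?";

	printf("op %llu = %s | %s, ret %lld (%s)\n", op, name,
	       (op & FUTEX_PRIVATE_FLAG) ? "PRIVATE" : "SHARED",
	       ret, ret < 0 ? strerror((int)-ret) : "ok");
}

int main(void)
{
	decode(7, 0);					/* the unlock that succeeds */
	decode(6, (long long)0xfffffffffffffffdULL);	/* the lock: -3 == -ESRCH   */
	return 0;
}
)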

So there must be another task that acquired the lock in userland and
exited after the last in-kernel user unlocked it. This might shed more
light on it:

diff --git a/kernel/futex.c b/kernel/futex.c
index 599da35c2768..aaa782a8a115 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1209,6 +1209,9 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
 	 * corrupted or the user space value in *uaddr is simply bogus.
 	 * Give up and tell user space.
 	 */
+	trace_printk("uval2 vs uval %08x vs %08x (%d)\n", uval2, uval,
+		     tsk ? tsk->pid : -1);
+	__WARN();
 	return -ESRCH;
 }
 
@@ -1233,8 +1236,10 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 	if (!pid)
 		return -EAGAIN;
 	p = find_get_task_by_vpid(pid);
-	if (!p)
+	if (!p) {
+		trace_printk("Missing pid %d\n", pid);
 		return handle_exit_race(uaddr, uval, NULL);
+	}
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);

---

I am not sure, but isn't this the "known" issue where the kernel returns
ESRCH in a valid case and glibc upstream does not recognize it because
it is not a valid/POSIX-defined error code? (I *think* the same is true for
-ENOMEM.) If it is, the following C snippet is a small test case:

---->8------
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdio.h>

static char nothing[4096];

int main(void)
{
	int fd;
	ssize_t wn;
	void *lockm;
	pid_t child;
	pthread_mutex_t *the_lock;
	pthread_mutexattr_t mutexattr;
	int ret;

	fd = open("/dev/shm/futex-test-lock", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		printf("Failed to create lock file: %m\n");
		return 1;
	}
	wn = write(fd, nothing, sizeof(nothing));
	if (wn != sizeof(nothing)) {
		printf("Failed to write to file: %m\n");
		goto out_unlink;
	}

	lockm = mmap(NULL, sizeof(nothing), PROT_READ | PROT_WRITE, MAP_SHARED,
		     fd, 0);
	if (lockm == MAP_FAILED) {
		printf("mmap() failed: %m\n");
		goto out_unlink;
	}
	close(fd);
	unlink("/dev/shm/futex-test-lock");
	the_lock = lockm;

	ret = pthread_mutexattr_init(&mutexattr);
	ret |= pthread_mutexattr_setpshared(&mutexattr, PTHREAD_PROCESS_SHARED);
	ret |= pthread_mutexattr_setprotocol(&mutexattr, PTHREAD_PRIO_INHERIT);

	if (ret) {
		printf("Something went wrong during init\n");
		return 1;
	}

	ret = pthread_mutex_init(the_lock, &mutexattr);
	if (ret) {
		printf("Failed to init the lock\n");
		return 1;
	}
	child = fork();
	if (child < 0) {
		printf("fork(): %m\n");
		return 1;
	}

	if (!child) {
		pthread_mutex_lock(the_lock);
		exit(2);
	}

	sleep(2);
	ret = pthread_mutex_lock(the_lock);

	printf("-> %x\n", ret);
	return 0;

out_unlink:
	unlink("/dev/shm/futex-test-lock");
	return 1;
}

---------------8<-----------------------

strace gives this:
|openat(AT_FDCWD, "/dev/shm/futex-test-lock", O_RDWR|O_CREAT|O_TRUNC, 0644) = 3
|write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
|mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f5e23e37000
|close(3)                                = 0
|unlink("/dev/shm/futex-test-lock")      = 0
…
|clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5e23c1da10) = 25777
|strace: Process 25777 attached
|[pid 25776] nanosleep({tv_sec=2, tv_nsec=0},  <unfinished ...>
|[pid 25777] set_robust_list(0x7f5e23c1da20, 24) = 0
|[pid 25777] exit_group(2)               = ?
|[pid 25777] +++ exited with 2 +++
|<... nanosleep resumed> {tv_sec=1, tv_nsec=999821679}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
|--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=25777, si_uid=1000, si_status=2, si_utime=0, si_stime=0} ---
|restart_syscall(<... resuming interrupted nanosleep ...>) = 0
|futex(0x7f5e23e37000, FUTEX_LOCK_PI, NULL) = -1 ESRCH (No such process)
|pause(^Cstrace: Process 25776 detached

and if I remember correctly, if asserts are not enabled we end up with a
pause loop instead.

Sebastian


* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-29 17:16                         ` Sebastian Sewior
@ 2019-01-29 21:45                           ` Thomas Gleixner
       [not found]                           ` <20190130094913.GC5299@osiris>
  1 sibling, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-29 21:45 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Heiko Carstens, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler


On Tue, 29 Jan 2019, Sebastian Sewior wrote:

> On 2019-01-29 16:10:58 [+0100], Heiko Carstens wrote:
> > Finally... the trace output is quite large at 26 MB... Therefore an
> > xz compressed attachment. Hope that's ok.
> > 
> > The kernel used was linux-next 20190129 + your patch.
> |        ld64.so.1-10237 [006] .... 14232.031726: sys_futex(uaddr: 3ff88e80618, op: 7, val: 3ff00000007, utime: 3ff88e7f910, uaddr2: 3ff88e7f910, val3: 3ffc167e8d7)
> FUTEX_UNLOCK_PI | SHARED
> 
> |        ld64.so.1-10237 [006] .... 14232.031726: sys_futex -> 0x0
> …
> |        ld64.so.1-10237 [006] .... 14232.051751: sched_process_exit: comm=ld64.so.1 pid=10237 prio=120
> …
> |        ld64.so.1-10148 [006] .... 14232.061826: sys_futex(uaddr: 3ff88e80618, op: 6, val: 1, utime: 0, uaddr2: 2, val3: 0)
> FUTEX_LOCK_PI | SHARED
> 
> |        ld64.so.1-10148 [006] .... 14232.061826: sys_futex -> 0xfffffffffffffffd
> 
> So there must be another task that acquired the lock in userland and
> exited after the last in-kernel user unlocked it. This might shed more

Well, that would mean that this very task did not have a valid robust list,
which is very unlikely according to the test case.

We might actually stick a trace point into the robust list code as well.

> light on it:
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 599da35c2768..aaa782a8a115 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1209,6 +1209,9 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
>  	 * corrupted or the user space value in *uaddr is simply bogus.
>  	 * Give up and tell user space.
>  	 */
> +	trace_printk("uval2 vs uval %08x vs %08x (%d)\n", uval2, uval,
> +		     tsk ? tsk->pid : -1);
> +	__WARN();
>  	return -ESRCH;
>  }
>  
> @@ -1233,8 +1236,10 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
>  	if (!pid)
>  		return -EAGAIN;
>  	p = find_get_task_by_vpid(pid);
> -	if (!p)
> +	if (!p) {
> +		trace_printk("Missing pid %d\n", pid);
>  		return handle_exit_race(uaddr, uval, NULL);
> +	}
>  
>  	if (unlikely(p->flags & PF_KTHREAD)) {
>  		put_task_struct(p);

Yep, that should give us some more clue.

> I am not sure, but isn't this the "known" issue where the kernel drops
> ESRCH in a valid case and glibc upstream does not recognize it because
> it is not a valid /POSIX-defined error code? (I *think* same is true for
> -ENOMEM) If it is, the following C snippet is a small tc:

That testcase is not using robust futexes, but yes, it demonstrates that
glibc does not handle all documented error codes. But I don't think it has
anything to do with the problem at hand. Famous last words....
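
(For context, an illustrative sketch - not part of the original mail - of a
robust, PI-enabled mutex setup of the kind the failing glibc self-tests
presumably exercise, and of the one error code, EOWNERDEAD, callers are
expected to handle; whether callers must also cope with ESRCH is exactly the
open question. The program below is made up for illustration:)

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t lock;
	int ret;

	pthread_mutexattr_init(&attr);
	/* PI futex underneath (FUTEX_LOCK_PI/FUTEX_UNLOCK_PI) ... */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	/* ... and registered on the per-thread robust list. */
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(&lock, &attr);

	ret = pthread_mutex_lock(&lock);
	if (ret == EOWNERDEAD) {
		/* Previous owner died; we hold the lock but must repair the
		 * protected state before marking it consistent again. */
		pthread_mutex_consistent(&lock);
		ret = 0;
	}
	if (ret) {
		/* Anything else (ESRCH, ENOMEM, ...) ends up here. */
		fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(ret));
		return 1;
	}
	pthread_mutex_unlock(&lock);
	return 0;
}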

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH] futex: Handle early deadlock return correctly
  2019-01-29  8:49           ` Peter Zijlstra
@ 2019-01-29 22:15             ` Thomas Gleixner
  2019-01-30 12:01               ` Thomas Gleixner
  2019-02-08 12:05               ` [tip:locking/urgent] " tip-bot for Thomas Gleixner
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-29 22:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is no longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_start_proxy_lock() due to
a detected deadlock was not updated accordingly. That allows a concurrent
unlocker to observe inconsistent state which triggers the warning in the
unlock path.

futex_lock_pi()                         futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)				lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                                        // acquired hb->lock
                                        hb_waiter = futex_top_waiter()
                                        lock(rtmutex->wait_lock)
  __rt_mutex_proxy_start()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                                        // acquired wait_lock
                                        wake_futex_pi()
                                        rt_mutex_next_owner()
					  --> returns NULL
                                          --> WARN

  lock(hb->lock)
  unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_start_proxy_lock() as this lets the unlocker observe a waiter in the
hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_start_proxy_lock() in the same way and let
the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
removal for the failure case, so that the other callsites are still
operating correctly.

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.

Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: stable@vger.kernel.org
---
 kernel/futex.c           |   28 ++++++++++++++++++----------
 kernel/locking/rtmutex.c |   37 ++++++++++++++++++++++++++++++++-----
 2 files changed, 50 insertions(+), 15 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uad
 	 * and BUG when futex_unlock_pi() interleaves with this.
 	 *
 	 * Therefore acquire wait_lock while holding hb->lock, but drop the
-	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
-	 * serializes against futex_unlock_pi() as that does the exact same
-	 * lock handoff sequence.
+	 * latter before calling __rt_mutex_start_proxy_lock(). This
+	 * interleaves with futex_unlock_pi() -- which does a similar lock
+	 * handoff -- such that the latter can observe the futex_q::pi_state
+	 * before __rt_mutex_start_proxy_lock() is done.
 	 */
 	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 	spin_unlock(q.lock_ptr);
+	/*
+	 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+	 * such that futex_unlock_pi() is guaranteed to observe the waiter when
+	 * it sees the futex_q::pi_state.
+	 */
 	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
 	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
 
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
-
-		spin_lock(q.lock_ptr);
-		goto no_block;
+		goto cleanup;
 	}
 
-
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+cleanup:
 	spin_lock(q.lock_ptr);
 	/*
-	 * If we failed to acquire the lock (signal/timeout), we must
+	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
-	 * wait lists consistent.
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+	 * lists consistent.
 	 *
 	 * In particular; it is important that futex_unlock_pi() can not
 	 * observe this inconsistency.
@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *u
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
 		 * observed.
+		 *
+		 * In particular; this forces __rt_mutex_start_proxy() to
+		 * complete such that we're guaranteed to observe the
+		 * rt_waiter. Also see the WARN in wake_futex_pi().
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mut
 	rt_mutex_set_owner(lock, NULL);
 }
 
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:		the rt_mutex to take
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ * @task:		the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
 int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
 
+	lockdep_asssert_held(&lock->wait_lock);
+
 	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
 
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct r
 		ret = 0;
 	}
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct r
  * @waiter:		the pre-initialized rt_mutex_waiter
  * @task:		the task to prepare
  *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
  * Returns:
  *  0 - task blocked on lock
  *  1 - acquired the lock for task, caller should wake it up
  * <0 - error
  *
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
  */
 int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_
 
 	raw_spin_lock_irq(&lock->wait_lock);
 	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	if (unlikely(ret))
+		remove_waiter(lock, waiter);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_m
  * @lock:		the rt_mutex we were woken on
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
  *
  * Unless we acquired the lock; we're still enqueued on the wait-list and can
  * in fact still be granted ownership until we're removed. Therefore we can

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] futex: Handle early deadlock return correctly
  2019-01-29 22:15             ` [PATCH] futex: Handle early deadlock return correctly Thomas Gleixner
@ 2019-01-30 12:01               ` Thomas Gleixner
  2019-02-08 12:05               ` [tip:locking/urgent] " tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 12:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Ingo Molnar, Martin Schwidefsky, LKML,
	linux-s390, Stefan Liebler, Sebastian Sewior

On Tue, 29 Jan 2019, Thomas Gleixner wrote:
>  int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
>  			      struct rt_mutex_waiter *waiter,
>  			      struct task_struct *task)
>  {
>  	int ret;
>  
> +	lockdep_asssert_held(&lock->wait_lock);

I'm a moron. I fixed that on the test box and wrote the changelog on the
laptop. Syncing the two together was too much for my futex wreckaged brain
...


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
       [not found]                           ` <20190130094913.GC5299@osiris>
@ 2019-01-30 12:15                             ` Thomas Gleixner
       [not found]                               ` <20190130125955.GD5299@osiris>
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 12:15 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Sebastian Sewior, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Wed, 30 Jan 2019, Heiko Carstens wrote:
> On Tue, Jan 29, 2019 at 06:16:53PM +0100, Sebastian Sewior wrote:
> >  	if (unlikely(p->flags & PF_KTHREAD)) {
> >  		put_task_struct(p);
> 
> Last lines of the trace with your additional patch (full log attached):
> 
>            <...>-50539 [003] ....  2376.398223: sys_futex -> 0x0
>            <...>-50539 [003] ....  2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
>            <...>-50539 [003] ....  2376.398225: attach_to_pi_owner: Missing pid 50734
>            <...>-50539 [003] ....  2376.398226: handle_exit_race: uval2 vs uval 8000c62e vs 8000c62e (-1)

So the user space value is 8000c62e. The FUTEX_WAITERS bit is set and the owner
of the futex is PID 50734, which exited a long time ago:

           <...>-50734 [000] ....  2376.394936: sched_process_exit: comm=ld64.so.1 pid=50734 prio=120

But at least from the kernel view 50734 has released it last:

           <...>-50734 [000] ....  2376.394930: sys_futex(uaddr: 3ffb7700208, op: 7, val: 3ff00000007, utime: 3ffb3ef8910, uaddr2: 3ffb3ef8910, val3: 3ffc0afe987)
           <...>-50539 [003] ....  2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)

Now, if it would have acquired it in userspace again before exiting, then
the robust list exit code should have set the OWNER_DIED bit as well, but
that's not set....
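
(For reference, an illustrative decoding of the PI futex word the analysis
above is based on, using the flag definitions from <linux/futex.h>; the
decode_futex_word() helper is made up for this sketch:)

#include <stdint.h>
#include <stdio.h>
#include <linux/futex.h>	/* FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK */

/* Show how the kernel interprets a PI futex value. */
static void decode_futex_word(uint32_t uval)
{
	printf("owner tid : %u\n", uval & FUTEX_TID_MASK);
	printf("waiters   : %s\n", (uval & FUTEX_WAITERS) ? "yes" : "no");
	printf("owner died: %s\n", (uval & FUTEX_OWNER_DIED) ? "yes" : "no");
}

int main(void)
{
	/* Value from the trace: FUTEX_WAITERS set, owner TID 0xc62e == 50734,
	 * OWNER_DIED clear - which is the suspicious part. */
	decode_futex_word(0x8000c62e);
	return 0;
}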

debug patch for the robust list exit handling below.

Thanks,

	tglx

8<-----------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3440,10 +3440,13 @@ static int handle_futex_death(u32 __user
 {
 	u32 uval, uninitialized_var(nval), mval;
 
+	trace_printk("uaddr: %lx pi: %d\n", (unsigned long) uaddr, pi);
 retry:
 	if (get_user(uval, uaddr))
 		return -1;
 
+	trace_printk("uaddr: %lx uval: %08x\n", (unsigned long) uaddr, uval);
+
 	if ((uval & FUTEX_TID_MASK) == task_pid_vnr(curr)) {
 		/*
 		 * Ok, this dying thread is truly holding a futex
@@ -3480,6 +3483,7 @@ static int handle_futex_death(u32 __user
 		if (!pi && (uval & FUTEX_WAITERS))
 			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
 	}
+	trace_printk("uaddr: %lx success\n", (unsigned long) uaddr);
 	return 0;
 }
 

 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
       [not found]                               ` <20190130125955.GD5299@osiris>
@ 2019-01-30 13:24                                 ` Sebastian Sewior
  2019-01-30 13:29                                   ` Thomas Gleixner
  2019-01-30 13:25                                 ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Sebastian Sewior @ 2019-01-30 13:24 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler

On 2019-01-30 13:59:55 [+0100], Heiko Carstens wrote:
> Last lines of trace below (full log attached):

            <...>-56956 [005] ....   658.931364: handle_futex_death: uaddr: 3ff9e880c58 pi: 1
…
            <...>-56956 [005] ....   658.931369: handle_futex_death: uaddr: 3ff9e8808c0 success
            <...>-56956 [005] ....   658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120

not including 3ff9e880140

            <...>-56956 [005] ....   658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120

            <...>-56496 [001] ....   658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
            <...>-56496 [001] ....   658.932405: attach_to_pi_owner: Missing pid 56956
            <...>-56496 [001] ....   658.932406: handle_exit_race: uval2 vs uval 8000de7c vs 8000de7c (-1)
            <...>-56496 [001] ....   658.932501: sys_futex -> 0xfffffffffffffffd
…
            <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 pi: 1
            <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 uval: 8000de7c
            <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 success

and here we have it.

Sebastian

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
       [not found]                               ` <20190130125955.GD5299@osiris>
  2019-01-30 13:24                                 ` Sebastian Sewior
@ 2019-01-30 13:25                                 ` Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 13:25 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Sebastian Sewior, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Wed, 30 Jan 2019, Heiko Carstens wrote:
> On Wed, Jan 30, 2019 at 01:15:18PM +0100, Thomas Gleixner wrote:
> > On Wed, 30 Jan 2019, Heiko Carstens wrote:
> > > On Tue, Jan 29, 2019 at 06:16:53PM +0100, Sebastian Sewior wrote:
> > > >  	if (unlikely(p->flags & PF_KTHREAD)) {
> > > >  		put_task_struct(p);
> > > 
> > > Last lines of the trace with your additional patch (full log attached):
> > > 
> > >            <...>-50539 [003] ....  2376.398223: sys_futex -> 0x0
> > >            <...>-50539 [003] ....  2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
> > >            <...>-50539 [003] ....  2376.398225: attach_to_pi_owner: Missing pid 50734
> > >            <...>-50539 [003] ....  2376.398226: handle_exit_race: uval2 vs uval 8000c62e vs 8000c62e (-1)
> > 
> > So the user space value is 8000c62e. The FUTEX_WAITERS bit is set and the owner
> > of the futex is PID 50734, which exited a long time ago:
> > 
> >            <...>-50734 [000] ....  2376.394936: sched_process_exit: comm=ld64.so.1 pid=50734 prio=120
> > 
> > But at least from the kernel view 50734 has released it last:
> > 
> >            <...>-50734 [000] ....  2376.394930: sys_futex(uaddr: 3ffb7700208, op: 7, val: 3ff00000007, utime: 3ffb3ef8910, uaddr2: 3ffb3ef8910, val3: 3ffc0afe987)
> >            <...>-50539 [003] ....  2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
> > 
> > Now, if it would have acquired it in userspace again before exiting, then
> > the robust list exit code should have set the OWNER_DIED bit as well, but
> > that's not set....
> > 
> > debug patch for the robust list exit handling below.
> 
> Last lines of trace below (full log attached):

SNIP...

It's the same picture as last time and the only occurrence of the futex in
question in the context of the dead task is:

           <...>-56956 [007] ....   658.804018: sys_futex(uaddr: 3ff9e880050, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)

The robust list of that task does not contain the user space address 3ff9e880050.

Confused and of course the problem does not reproduce on x86. Sigh.

I'll think about it some more.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-30 13:24                                 ` Sebastian Sewior
@ 2019-01-30 13:29                                   ` Thomas Gleixner
  2019-01-30 14:33                                     ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 13:29 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Heiko Carstens, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler

[-- Attachment #1: Type: text/plain, Size: 1435 bytes --]

On Wed, 30 Jan 2019, Sebastian Sewior wrote:

> On 2019-01-30 13:59:55 [+0100], Heiko Carstens wrote:
> > Last lines of trace below (full log attached):
> 
>             <...>-56956 [005] ....   658.931364: handle_futex_death: uaddr: 3ff9e880c58 pi: 1
> …
>             <...>-56956 [005] ....   658.931369: handle_futex_death: uaddr: 3ff9e8808c0 success
>             <...>-56956 [005] ....   658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
> 
> not including 3ff9e880140
> 
>             <...>-56956 [005] ....   658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
> 
>             <...>-56496 [001] ....   658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
>             <...>-56496 [001] ....   658.932405: attach_to_pi_owner: Missing pid 56956
>             <...>-56496 [001] ....   658.932406: handle_exit_race: uval2 vs uval 8000de7c vs 8000de7c (-1)
>             <...>-56496 [001] ....   658.932501: sys_futex -> 0xfffffffffffffffd
> …
>             <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 pi: 1
>             <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 uval: 8000de7c
>             <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 success
> 
> and here we have it.

Bah. I was looking at the wrong trace. /me goes to give a talk and stares
at that mess later.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-30 13:29                                   ` Thomas Gleixner
@ 2019-01-30 14:33                                     ` Thomas Gleixner
  2019-01-30 17:56                                       ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 14:33 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Heiko Carstens, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler

[-- Attachment #1: Type: text/plain, Size: 2732 bytes --]

On Wed, 30 Jan 2019, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Sebastian Sewior wrote:
> 
> > On 2019-01-30 13:59:55 [+0100], Heiko Carstens wrote:
> > > Last lines of trace below (full log attached):
> > 
> >             <...>-56956 [005] ....   658.931364: handle_futex_death: uaddr: 3ff9e880c58 pi: 1
> > …
> >             <...>-56956 [005] ....   658.931369: handle_futex_death: uaddr: 3ff9e8808c0 success
> >             <...>-56956 [005] ....   658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
> > 
> > not including 3ff9e880140
> > 
> >             <...>-56956 [005] ....   658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
> > 
> >             <...>-56496 [001] ....   658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
> >             <...>-56496 [001] ....   658.932405: attach_to_pi_owner: Missing pid 56956
> >             <...>-56496 [001] ....   658.932406: handle_exit_race: uval2 vs uval 8000de7c vs 8000de7c (-1)
> >             <...>-56496 [001] ....   658.932501: sys_futex -> 0xfffffffffffffffd
> > …
> >             <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 pi: 1
> >             <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 uval: 8000de7c
> >             <...>-56496 [001] ....   659.020750: handle_futex_death: uaddr: 3ff9e880140 success
> > 
> > and here we have it.

Well no. That futex is in the robust list of the dying task because it
tried to lock it and then died, probably before removing it from its
robust list.

handle_futex_death() of that task does not touch it because the userspace
TID value is de7c, which is 56956.

The last entries with that uaddr are:

          <...>-56956 [005] ....   658.923608: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)

UNLOCK

           <...>-56945 [006] ....   658.923612: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 1003ff0, uaddr2: 3ff9e87f910, val3: 3ff0000de71)

LOCK

           <...>-56956 [005] ....   658.923612: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)

UNLOCK

           <...>-56945 [006] ....   658.923830: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9e87f910, uaddr2: 3ff9e87f910, val3: 3ffea67e3f7)

UNLOCK

           <...>-56496 [001] ....   658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)

LOCK which fails.

This does not make any sense. The last kernel visible operation of 56956 on
that uaddr is the UNLOCK above.

I need to think some more about what might happen.

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-30 14:33                                     ` Thomas Gleixner
@ 2019-01-30 17:56                                       ` Thomas Gleixner
  2019-01-30 21:07                                         ` Sebastian Sewior
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 17:56 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Heiko Carstens, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler

On Wed, 30 Jan 2019, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Thomas Gleixner wrote:
> The last entries with that uaddr are:
> 
>           <...>-56956 [005] ....   658.923608: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
> 
> UNLOCK
> 
>            <...>-56945 [006] ....   658.923612: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 1003ff0, uaddr2: 3ff9e87f910, val3: 3ff0000de71)
> 
> LOCK
> 
>            <...>-56956 [005] ....   658.923612: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
> 
> UNLOCK
> 
>            <...>-56945 [006] ....   658.923830: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9e87f910, uaddr2: 3ff9e87f910, val3: 3ffea67e3f7)
> 
> UNLOCK
> 
>            <...>-56496 [001] ....   658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
> 
> LOCK which fails.
> 
> This does not make any sense. The last kernel visible operation of 56956 on
> that uaddr is the UNLOCK above.
> 
> I need to think some more about what might happen.

TBH, no clue. Below are some more traceprintks which hopefully shed some
light on that mystery. See kernel/futex.c line 30  ...

Thanks,

	tglx
8<--------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1502,6 +1502,8 @@ static int wake_futex_pi(u32 __user *uad
 	 * died bit, because we are the owner.
 	 */
 	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	trace_printk("uaddr: %lx cur: %x new: %x\n",
+		     (unsigned long) uaddr, uval, newval);
 
 	if (unlikely(should_fail_futex(true)))
 		ret = -EFAULT;
@@ -2431,6 +2433,8 @@ static int fixup_pi_state_owner(u32 __us
 	for (;;) {
 		newval = (uval & FUTEX_OWNER_DIED) | newtid;
 
+		trace_printk("uaddr: %lx cur: %x new: %x\n",
+			     (unsigned long) uaddr, uval, newval);
 		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
 			goto handle_fault;
 		if (curval == uval)
@@ -2438,6 +2442,8 @@ static int fixup_pi_state_owner(u32 __us
 		uval = curval;
 	}
 
+	trace_printk("uaddr: %lx cur: %x new: %x\n",
+		     (unsigned long) uaddr, uval, newval);
 	/*
 	 * We fixed up user space. Now we need to fix the pi_state
 	 * itself.
@@ -3028,6 +3034,9 @@ static int futex_unlock_pi(u32 __user *u
 		/* drops pi_state->pi_mutex.wait_lock */
 		ret = wake_futex_pi(uaddr, uval, pi_state);
 
+		trace_printk("uaddr: %lx wake: %d\n",
+			     (unsigned long) uaddr, ret);
+
 		put_pi_state(pi_state);
 
 		/*
@@ -3056,6 +3065,8 @@ static int futex_unlock_pi(u32 __user *u
 		goto out_putkey;
 	}
 
+	trace_printk("uaddr: %lx cur: %x new: %x\n",
+		     (unsigned long) uaddr, uval, 0);
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-01-30 17:56                                       ` Thomas Gleixner
@ 2019-01-30 21:07                                         ` Sebastian Sewior
  2019-01-30 23:13                                           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Sebastian Sewior @ 2019-01-30 21:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Heiko Carstens, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler

On 2019-01-30 18:56:54 [+0100], Thomas Gleixner wrote:
> TBH, no clue. Below are some more traceprintks which hopefully shed some
> light on that mystery. See kernel/futex.c line 30  ...

The robust list is somehow buggy. In the last trace we had the
handle_futex_death() of uaddr 3ff9e880140 as the last action. That means
it was 56496's ->list_op_pending entry. This makes sense
because it tried to acquire the lock, failed, and got killed.

According to uaddr pid 56956 is the owner. So 56956 invoked one of
pthread_mutex_lock() / pthread_mutex_timedlock() /
pthread_mutex_trylock() and should have obtained the lock in userland.
Depending on where it got killed, that mutex should be either recorded
in ->list_op_pending or the robust_list (or both if it didn't clear
->list_op_pending yet). But it is not.
Similar for pthread_mutex_unlock().
We don't have a trace_point if we abort processing the list.
On the other hand, it didn't trigger on x86 for hours. Could the atomic
ops be the culprit?
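
(To make the terms above concrete, a minimal illustrative sketch of the
per-thread robust list that glibc registers and that the kernel walks on
exit; in reality glibc sets this up in the TCB at thread start, so the
static variable and the explicit syscall here are purely for the sketch:)

#include <stdio.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>	/* struct robust_list_head */

/* One of these per thread; glibc keeps it in thread-local storage. */
static struct robust_list_head head = {
	.list		 = { &head.list },	/* empty circular list of held robust mutexes */
	.futex_offset	 = 0,			/* offset from a list entry to its futex word */
	.list_op_pending = NULL,		/* mutex whose lock/unlock is currently in flight */
};

int main(void)
{
	/* Tell the kernel where this thread's robust list lives so the exit
	 * path (exit_robust_list()/handle_futex_death()) can clean up. */
	if (syscall(SYS_set_robust_list, &head, sizeof(head)) != 0)
		perror("set_robust_list");

	/* glibc's locking paths then: store the mutex address into
	 * ->list_op_pending, acquire the lock (in user space or via
	 * FUTEX_LOCK_PI), enqueue it on ->list, and finally clear
	 * ->list_op_pending - the ordering of exactly those stores is what
	 * the rest of this thread digs into. */
	return 0;
}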

> Thanks,
> 
> 	tglx

Sebastian

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-30 21:07                                         ` Sebastian Sewior
@ 2019-01-30 23:13                                           ` Thomas Gleixner
  2019-01-30 23:35                                             ` Paul E. McKenney
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 23:13 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Heiko Carstens, Peter Zijlstra, Ingo Molnar, Martin Schwidefsky,
	LKML, linux-s390, Stefan Liebler, Paul E. McKenney

On Wed, 30 Jan 2019, Sebastian Sewior wrote:

> On 2019-01-30 18:56:54 [+0100], Thomas Gleixner wrote:
> > TBH, no clue. Below are some more traceprintks which hopefully shed some
> > light on that mystery. See kernel/futex.c line 30  ...
> 
> The robust list is somehow buggy. In the last trace we had the
> handle_futex_death() of uaddr 3ff9e880140 as the last action. That means
> it was 56496's ->list_op_pending entry. This makes sense
> because it tried to acquire the lock, failed, and got killed.

The robust list of the failing task seems to be correct. 

> According to uaddr pid 56956 is the owner. So 56956 invoked one of
> pthread_mutex_lock() / pthread_mutex_timedlock() /
> pthread_mutex_trylock() and should have obtained the lock in userland.
> Depending on where it got killed, that mutex should be either recorded in
> ->list_op_pending or the robust_list (or both if it didn't clear
> ->list_op_pending yet). But it is not.
> Similar for pthread_mutex_unlock().

> We don't have a trace_point if we abort processing the list.

The only reason why it would abort is due to a page fault because that cannot
be handled in the exit code anymore.

> On the other hand, it didn't trigger on x86 for hours. Could the atomic

s/hours/days/ ....

> ops be the culprit?

The glibc code does:

       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
       		     (void *) (((uintptr_t) &mutex->__data.__list.__next)
                                   | 1));

       ....
       lock in user space

       or

       lock in kernel space

       ENQUEUE_MUTEX_PI (mutex);
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

       ENQUEUE_MUTEX_PI() resolves to a THREAD_GETMEM() which reads the
       list head from TLS, some list manipulation operations and the final
       THREAD_SETMEM() which stores the new list head

Now on x86 THREAD_GETMEM() and THREAD_SETMEM() are resolving to

    asm volatile ("movX .....")

on s390 they are

   descr->member

based operations.

Now the important part of the robust list is the store sequence, i.e. the
list head and final update to the TLS visible part need to come _before_
list_op_pending is cleared.

I might be missing something, but there is no compiler barrier in that code
which would prevent the compiler from reordering the stores. It can
rightfully do so because there is no compiler visible dependency of these
two operations.

On x86_64 the asm volatile might prevent it by chance, but it does not have
a 'memory' clobber specified which would guarantee a compiler barrier.

On s390 there is certainly nothing.

So assuming that clearing list_op_pending comes before the list head update,
the robust exit code in the kernel will fail to see either of
them. FAIL.

I might be wrong as usual, but this would definitely explain the fail very
well.
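
(To make the suspected reordering concrete, a stripped-down sketch with
simplified names - this is not glibc's actual code, only the shape of the
store sequence and of the compiler barrier that pins it down:)

#include <stddef.h>

struct mutex_like {
	struct mutex_like *next;
};

/* Stand-ins for the TLS fields glibc updates via THREAD_SETMEM(). */
static struct mutex_like *robust_list;
static struct mutex_like *list_op_pending;

static void lock_robust(struct mutex_like *m)
{
	list_op_pending = m;		/* "I am about to own m" */

	/* ... acquire the lock in user space or via FUTEX_LOCK_PI ... */

	/* Enqueue on the robust list (ENQUEUE_MUTEX_PI in glibc). */
	m->next = robust_list;
	robust_list = m;

	/* There is no compiler-visible dependency between the list head
	 * update above and the store below, so without a barrier the
	 * compiler may clear list_op_pending first - and a death in between
	 * leaves the kernel seeing neither. */
	__asm__ __volatile__("" ::: "memory");

	list_op_pending = NULL;		/* lock is now recorded on the list */
}

int main(void)
{
	static struct mutex_like m;

	lock_robust(&m);
	return 0;
}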

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-30 23:13                                           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede Thomas Gleixner
@ 2019-01-30 23:35                                             ` Paul E. McKenney
  2019-01-30 23:55                                               ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Paul E. McKenney @ 2019-01-30 23:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Sewior, Heiko Carstens, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Sebastian Sewior wrote:
> 
> > On 2019-01-30 18:56:54 [+0100], Thomas Gleixner wrote:
> > > TBH, no clue. Below are some more traceprintks which hopefully shed some
> > > light on that mystery. See kernel/futex.c line 30  ...
> > 
> > The robust list is somehow buggy. In the last trace we had the
> > handle_futex_death() of uaddr 3ff9e880140 as the last action. That means
> > it was 56496's ->list_op_pending entry. This makes sense
> > because it tried to acquire the lock, failed, and got killed.
> 
> The robust list of the failing task seems to be correct. 
> 
> > According to uaddr pid 56956 is the owner. So 56956 invoked one of
> > pthread_mutex_lock() / pthread_mutex_timedlock() /
> > pthread_mutex_trylock() and should have obtained the lock in userland.
> > Depending on where it got killed, that mutex should be either recorded in
> > ->list_op_pending or the robust_list (or both if it didn't clear
> > ->list_op_pending yet). But it is not.
> > Similar for pthread_mutex_unlock().
> 
> > We don't have a trace_point if we abort processing the list.
> 
> The only reason why it would abort is due to a page fault because that cannot
> be handled in the exit code anymore.
> 
> > On the other hand, it didn't trigger on x86 for hours. Could the atomic
> 
> s/hours/days/ ....
> 
> > ops be the culprit?
> 
> The glibc code does:
> 
>        THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>        		     (void *) (((uintptr_t) &mutex->__data.__list.__next)
>                                    | 1));
> 
>        ....
>        lock in user space
> 
>        or
> 
>        lock in kernel space
> 
>        ENQUEUE_MUTEX_PI (mutex);
>        THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> 
>        ENQUEUE_MUTEX_PI() resolves to a THREAD_GETMEM() which reads the
>        list head from TLS, some list manipulation operations and the final
>        THREAD_SETMEM() which stores the new list head
> 
> Now on x86 THREAD_GETMEM() and THREAD_SETMEM() are resolving to
> 
>     asm volatile ("movX .....")
> 
> on s390 they are
> 
>    descr->member
> 
> based operations.
> 
> Now the important part of the robust list is the store sequence, i.e. the
> list head and final update to the TLS visible part need to come _before_
> list_op_pending is cleared.
> 
> I might be missing something, but there is no compiler barrier in that code
> which would prevent the compiler from reordering the stores. It can
> rightfully do so because there is no compiler visible dependency of these
> two operations.
> 
> On x86_64 the asm volatile might prevent it by chance, but it does not have
> a 'memory' clobber specified which would guarantee a compiler barrier.
> 
> On s390 there is certainly nothing.
> 
> So assuming that clearing list_op_pending comes before the list head update,
> the robust exit code in the kernel will fail to see either of
> them. FAIL.
> 
> I might be wrong as usual, but this would definitely explain the fail very
> well.

On recent versions of GCC, the fix would be to put this between the two
stores that need ordering:

	__atomic_thread_fence(__ATOMIC_RELEASE);

I must defer to Heiko on whether s390 GCC might tear the stores.  My
guess is "probably not".  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-30 23:35                                             ` Paul E. McKenney
@ 2019-01-30 23:55                                               ` Thomas Gleixner
  2019-01-31  0:27                                                 ` Thomas Gleixner
  2019-01-31  1:44                                                 ` Paul E. McKenney
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-30 23:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Sebastian Sewior, Heiko Carstens, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > I might be wrong as usual, but this would definitely explain the fail very
> > well.
> 
> On recent versions of GCC, the fix would be to put this between the two
> stores that need ordering:
> 
> 	__atomic_thread_fence(__ATOMIC_RELEASE);
> 
> I must defer to Heiko on whether s390 GCC might tear the stores.  My
> guess is "probably not".  ;-)

So I just checked the latest glibc code. It has:

            /* We must not enqueue the mutex before we have acquired it.
               Also see comments at ENQUEUE_MUTEX.  */
            __asm ("" ::: "memory");
            ENQUEUE_MUTEX_PI (mutex);
            /* We need to clear op_pending after we enqueue the mutex.  */
            __asm ("" ::: "memory");
            THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")

in the glibc repository, There since Dec 24 2016 ...

So the question is whether this is sufficient. That ordering only
matters vs. the thread itself and not for others.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-30 23:55                                               ` Thomas Gleixner
@ 2019-01-31  0:27                                                 ` Thomas Gleixner
  2019-01-31  1:45                                                   ` Paul E. McKenney
  2019-01-31 16:52                                                   ` Heiko Carstens
  2019-01-31  1:44                                                 ` Paul E. McKenney
  1 sibling, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2019-01-31  0:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Sebastian Sewior, Heiko Carstens, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, 31 Jan 2019, Thomas Gleixner wrote:

> On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > I might be wrong as usual, but this would definitely explain the fail very
> > > well.
> > 
> > On recent versions of GCC, the fix would be to put this between the two
> > stores that need ordering:
> > 
> > 	__atomic_thread_fence(__ATOMIC_RELEASE);
> > 
> > I must defer to Heiko on whether s390 GCC might tear the stores.  My
> > guess is "probably not".  ;-)
> 
> So I just checked the latest glibc code. It has:
> 
>             /* We must not enqueue the mutex before we have acquired it.
>                Also see comments at ENQUEUE_MUTEX.  */
>             __asm ("" ::: "memory");
>             ENQUEUE_MUTEX_PI (mutex);
>             /* We need to clear op_pending after we enqueue the mutex.  */
>             __asm ("" ::: "memory");
>             THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> 
> 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
> 
> in the glibc repository, There since Dec 24 2016 ...

And of course, I'm using the latest greatest glibc for testing that, so I'm
not at all surprised that it just does not reproduce on my tests.

I just hacked the ordering and restarted the test. If the theory holds,
then this should die sooner than later.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-30 23:55                                               ` Thomas Gleixner
  2019-01-31  0:27                                                 ` Thomas Gleixner
@ 2019-01-31  1:44                                                 ` Paul E. McKenney
  1 sibling, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2019-01-31  1:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Sewior, Heiko Carstens, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, Jan 31, 2019 at 12:55:18AM +0100, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > I might be wrong as usual, but this would definitely explain the fail very
> > > well.
> > 
> > On recent versions of GCC, the fix would be to put this between the two
> > stores that need ordering:
> > 
> > 	__atomic_thread_fence(__ATOMIC_RELEASE);
> > 
> > I must defer to Heiko on whether s390 GCC might tear the stores.  My
> > guess is "probably not".  ;-)
> 
> So I just checked the latest glibc code. It has:
> 
>             /* We must not enqueue the mutex before we have acquired it.
>                Also see comments at ENQUEUE_MUTEX.  */
>             __asm ("" ::: "memory");
>             ENQUEUE_MUTEX_PI (mutex);
>             /* We need to clear op_pending after we enqueue the mutex.  */
>             __asm ("" ::: "memory");
>             THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> 
> 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
> 
> in the glibc repository, There since Dec 24 2016 ...
> 
> So the question is whether this is sufficient. That ordering only
> matters vs. the thread itself and not for others.

Ah, in that case you can use __atomic_signal_fence(__ATOMIC_RELEASE)
instead of the __atomic_thread_fence(__ATOMIC_RELEASE).

The __atomic_thread_fence(__ATOMIC_RELEASE) provides ordering between
threads, but __atomic_signal_fence(__ATOMIC_RELEASE) only does so
within a thread.
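
(A small illustrative contrast of the two, using the C11 <stdatomic.h>
spellings of those builtins; the variable names are simplified stand-ins,
not glibc code:)

#include <stdatomic.h>
#include <stddef.h>

static void *robust_list;
static void *list_op_pending;

/* Orders the stores against other CPUs/threads; may emit a real fence
 * instruction on weakly ordered architectures. */
void publish_cross_thread(void *m)
{
	robust_list = m;
	atomic_thread_fence(memory_order_release);
	list_op_pending = NULL;
}

/* Compiler-only barrier: keeps the compiler from reordering the stores,
 * which suffices when the values are only ever consumed from this
 * thread's own context - here, the kernel walking the robust list at
 * this task's exit. */
void publish_same_thread(void *m)
{
	robust_list = m;
	atomic_signal_fence(memory_order_release);
	list_op_pending = NULL;
}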

							Thanx, Paul


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-31  0:27                                                 ` Thomas Gleixner
@ 2019-01-31  1:45                                                   ` Paul E. McKenney
  2019-01-31 16:52                                                   ` Heiko Carstens
  1 sibling, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2019-01-31  1:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Sewior, Heiko Carstens, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, Jan 31, 2019 at 01:27:25AM +0100, Thomas Gleixner wrote:
> On Thu, 31 Jan 2019, Thomas Gleixner wrote:
> 
> > On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > > I might be wrong as usual, but this would definitely explain the fail very
> > > > well.
> > > 
> > > On recent versions of GCC, the fix would be to put this between the two
> > > stores that need ordering:
> > > 
> > > 	__atomic_thread_fence(__ATOMIC_RELEASE);
> > > 
> > > I must defer to Heiko on whether s390 GCC might tear the stores.  My
> > > guess is "probably not".  ;-)
> > 
> > So I just checked the latest glibc code. It has:
> > 
> >             /* We must not enqueue the mutex before we have acquired it.
> >                Also see comments at ENQUEUE_MUTEX.  */
> >             __asm ("" ::: "memory");
> >             ENQUEUE_MUTEX_PI (mutex);
> >             /* We need to clear op_pending after we enqueue the mutex.  */
> >             __asm ("" ::: "memory");
> >             THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> > 
> > 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
> > 
> > in the glibc repository, There since Dec 24 2016 ...
> 
> And of course, I'm using the latest greatest glibc for testing that, so I'm
> not at all surprised that it just does not reproduce on my tests.

Sounds about right.  :-/

> I just hacked the ordering and restarted the test. If the theory holds,
> then this should die sooner than later.

;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-31  0:27                                                 ` Thomas Gleixner
  2019-01-31  1:45                                                   ` Paul E. McKenney
@ 2019-01-31 16:52                                                   ` Heiko Carstens
  2019-01-31 17:06                                                     ` Sebastian Sewior
  1 sibling, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-01-31 16:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, Sebastian Sewior, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, Jan 31, 2019 at 01:27:25AM +0100, Thomas Gleixner wrote:
> On Thu, 31 Jan 2019, Thomas Gleixner wrote:
> 
> > On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > > I might be wrong as usual, but this would definitely explain the fail very
> > > > well.
> > > 
> > > On recent versions of GCC, the fix would be to put this between the two
> > > stores that need ordering:
> > > 
> > > 	__atomic_thread_fence(__ATOMIC_RELEASE);
> > > 
> > > I must defer to Heiko on whether s390 GCC might tear the stores.  My
> > > guess is "probably not".  ;-)
> > 
> > So I just checked the latest glibc code. It has:
> > 
> >             /* We must not enqueue the mutex before we have acquired it.
> >                Also see comments at ENQUEUE_MUTEX.  */
> >             __asm ("" ::: "memory");
> >             ENQUEUE_MUTEX_PI (mutex);
> >             /* We need to clear op_pending after we enqueue the mutex.  */
> >             __asm ("" ::: "memory");
> >             THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> > 
> > 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
> > 
> > in the glibc repository, There since Dec 24 2016 ...
> 
> And of course, I'm using the latest greatest glibc for testing that, so I'm
> not at all surprised that it just does not reproduce on my tests.

As discussed on IRC: I used plain vanilla glibc version 2.28 for my
tests. This version already contains the commit you mentioned above.

> I just hacked the ordering and restarted the test. If the theory holds,
> then this should die sooner than later.

...nevertheless Stefan and I looked through the lovely disassembly of
_pthread_mutex_lock_full() to verify if the compiler barriers are
actually doing what they are supposed to do. The generated code
however does look correct.
So, it must be something different.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-31 16:52                                                   ` Heiko Carstens
@ 2019-01-31 17:06                                                     ` Sebastian Sewior
  2019-01-31 20:42                                                       ` Heiko Carstens
  2019-02-01 16:12                                                       ` Heiko Carstens
  0 siblings, 2 replies; 47+ messages in thread
From: Sebastian Sewior @ 2019-01-31 17:06 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> ...nevertheless Stefan and I looked through the lovely disassembly of
> _pthread_mutex_lock_full() to verify if the compiler barriers are
> actually doing what they are supposed to do. The generated code
> however does look correct.
> So, it must be something different.

would it make sense to use one locking function instead of all three (lock,
try-lock, timed) in the test case to figure out if this is related to
one of the locking functions?

Sebastian

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-31 17:06                                                     ` Sebastian Sewior
@ 2019-01-31 20:42                                                       ` Heiko Carstens
  2019-02-01 16:12                                                       ` Heiko Carstens
  1 sibling, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2019-01-31 20:42 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Thomas Gleixner, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, Jan 31, 2019 at 06:06:53PM +0100, Sebastian Sewior wrote:
> On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> > ...nevertheless Stefan and I looked through the lovely disassembly of
> > _pthread_mutex_lock_full() to verify if the compiler barriers are
> > actually doing what they are supposed to do. The generated code
> > however does look correct.
> > So, it must be something different.
> 
> would it make sense to use one locking function instead of all three (lock,
> try-lock, timed) in the test case to figure out if this is related to
> one of the locking functions?

Sure, I will give that a try tomorrow.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-01-31 17:06                                                     ` Sebastian Sewior
  2019-01-31 20:42                                                       ` Heiko Carstens
@ 2019-02-01 16:12                                                       ` Heiko Carstens
  2019-02-01 21:59                                                         ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-02-01 16:12 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Thomas Gleixner, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Thu, Jan 31, 2019 at 06:06:53PM +0100, Sebastian Sewior wrote:
> On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> > ...nevertheless Stefan and I looked through the lovely disassembly of
> > _pthread_mutex_lock_full() to verify if the compiler barriers are
> > actually doing what they are supposed to do. The generated code
> > however does look correct.
> > So, it must be something different.
> 
> would it make sense to use one locking function instead of all three (lock,
> try-lock, timed) in the test case to figure out if this is related to
> one of the locking functions?

I tried all three variants, but then it seems to be close to impossible to
re-create the problem. I had a single fail when using only the trylock
variant, but I wouldn't say that means anything.

Only if all three variants run in parallel does it seem to be quite
reliably reproducible, even though sometimes it still takes an hour.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-02-01 16:12                                                       ` Heiko Carstens
@ 2019-02-01 21:59                                                         ` Thomas Gleixner
       [not found]                                                           ` <20190202091043.GA3381@osiris>
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-01 21:59 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Sebastian Sewior, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Fri, 1 Feb 2019, Heiko Carstens wrote:
> On Thu, Jan 31, 2019 at 06:06:53PM +0100, Sebastian Sewior wrote:
> > On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> > > ...nevertheless Stefan and I looked through the lovely disassembly of
> > > _pthread_mutex_lock_full() to verify if the compiler barriers are
> > > actually doing what they are supposed to do. The generated code
> > > however does look correct.
> > > So, it must be something different.
> > 
> > would it make sense to use one locking function instead of all three (lock,
> > try-lock, timed) in the test case to figure out if this is related to
> > one of the locking functions?
> 
> I tried all three variants, but then it seems to be close to impossible to
> re-create the problem. I had a single fail when using only the trylock
> variant, but I wouldn't say that means anything.
> 
> Only if all three variants run in parallel does it seem to be quite
> reliably reproducible, even though sometimes it still takes an hour.

Were you able to capture a trace with the last set of additional trace
printks?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
       [not found]                                                           ` <20190202091043.GA3381@osiris>
@ 2019-02-02 10:14                                                             ` Thomas Gleixner
  2019-02-02 11:20                                                               ` Heiko Carstens
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-02 10:14 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Sebastian Sewior, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Sat, 2 Feb 2019, Heiko Carstens wrote:
> On Fri, Feb 01, 2019 at 10:59:08PM +0100, Thomas Gleixner wrote:
> > Were you able to capture a trace with the last set of additional trace
> > printks?
> 
> Of course I forgot to collect that, sorry! But just reproduced; see
> log below (last 1000 lines) and attachment for full log.

The failing futex is here:

<...>-48786 [002] ....   337.231645: sys_futex(uaddr: 3ff90c00460, op: 6, val: 1, utime: 0, uaddr2: 4, val3: 0)
<...>-48786 [002] ....   337.231646: attach_to_pi_owner: Missing pid 49011
<...>-48786 [002] ....   337.231646: handle_exit_race: uval2 vs uval 8000bf73 vs 8000bf73 (-1)
<...>-48786 [002] ....   337.231741: sys_futex -> 0xfffffffffffffffd

Lets look were it was handled in the kernel right before that:

<...>-49014 [006] ....   337.215675: sys_futex(uaddr: 3ff90c00460, op: 7, val: 3ff00000007, utime: 3ff8d3f8910, uaddr2: 3ff8d3f8910, val3: 3ffc64fe8f7)
<...>-49014 [006] ....   337.215675: do_futex: uaddr: 3ff90c00460 cur: 8000bf76 new: 0

49014 unlocks the futex in the kernel and due to lack of waiters it sets it
to unlocked ---> new: 0.

Between this and the failing sys_futex() invocation, the missing task exits:

<...>-49011 [000] ....   337.221543: handle_futex_death: uaddr: 3ff90c00a00 pi: 1
...
<...>-49011 [000] ....   337.221547: handle_futex_death: uaddr: 3ff90c00488 success
<...>-49011 [000] ....   337.221548: sched_process_exit: comm=ld64.so.1 pid=49011 prio=120

but it does not have futex 3ff90c00460 in its robust list.

So after the unlock @timestamp 337.215675 the kernel does not deal with
that futex at all until the failed lock attempt where it rightfully rejects
the attempt due to the alleged owner being gone.

So this looks more like user space doing something stupid...

As we talked about the missing barriers before, I just looked at
pthread_mutex_trylock() and that does still:

	if (robust)
          {
            ENQUEUE_MUTEX_PI (mutex);
            THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
          }

So it's missing the barriers which pthread_mutex_lock() has. Grasping for
straws obviously....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggerede
  2019-02-02 10:14                                                             ` Thomas Gleixner
@ 2019-02-02 11:20                                                               ` Heiko Carstens
  2019-02-03 16:30                                                                 ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Heiko Carstens @ 2019-02-02 11:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Sewior, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Sat, Feb 02, 2019 at 11:14:27AM +0100, Thomas Gleixner wrote:
> On Sat, 2 Feb 2019, Heiko Carstens wrote:
> So after the unlock @timestamp 337.215675 the kernel does not deal with
> that futex at all until the failed lock attempt where it rightfully rejects
> the attempt due to the alleged owner being gone.
> 
> So this looks more like user space doing something stupid...
> 
> As we talked about the missing barriers before, I just looked at
> pthread_mutex_trylock() and that does still:
> 
> 	if (robust)
>           {
>             ENQUEUE_MUTEX_PI (mutex);
>             THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>           }
> 
> So it's missing the barriers which pthread_mutex_lock() has. Grasping for
> straws obviously....

Excellent! Taking a look into the disassembly of nptl/pthread_mutex_trylock.o
reveals this part:

140:   a5 1b 00 01             oill    %r1,1
144:   e5 48 a0 f0 00 00       mvghi   240(%r10),0   <--- THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
14a:   e3 10 a0 e0 00 24       stg     %r1,224(%r10) <--- last THREAD_SETMEM of ENQUEUE_MUTEX_PI

I added a barrier between those two and now the code looks like this:

140:   a5 1b 00 01             oill    %r1,1
144:   e3 10 a0 e0 00 24       stg     %r1,224(%r10)
14a:   e5 48 a0 f0 00 00       mvghi   240(%r10),0

Looks like this was a one instruction race...
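
Just to spell out why that single instruction matters: on exit the kernel
walks the robust list and the pending entry, roughly like the sketch below.
This is a simplified user-space model of exit_robust_list() /
handle_futex_death(); mark_owner_died() is a made-up stand-in for the real
cmpxchg-and-wake:

#include <linux/futex.h>	/* struct robust_list(_head), FUTEX_* bits */
#include <stdint.h>

static void mark_owner_died(uint32_t *word, uint32_t tid)
{
	/* the real code cmpxchgs FUTEX_OWNER_DIED into the word and
	   wakes a waiter, and only if the dead task is the owner */
	if ((*word & FUTEX_TID_MASK) == tid)
		*word |= FUTEX_OWNER_DIED;
}

static void robust_exit_walk(struct robust_list_head *head, uint32_t tid)
{
	struct robust_list *e;

	/* every lock which is fully enqueued is handled ... */
	for (e = head->list.next; e != &head->list; e = e->next)
		mark_owner_died((uint32_t *)((char *)e + head->futex_offset),
				tid);

	/* ... and so is a still pending operation, but only if
	   list_op_pending is set at this point */
	if (head->list_op_pending)
		mark_owner_died((uint32_t *)((char *)head->list_op_pending +
					     head->futex_offset), tid);
}

With the original instruction order (the mvghi before the stg), a thread
dying between the two is visible through neither path, the futex stays
owned by the dead TID, and the next FUTEX_LOCK_PI gets -ESRCH, which is
exactly what the trace shows.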

I'll try to reproduce with the patch below (sprinkling compiler
barriers just like the other files have).

diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
index 7de61f4f68..3b093cb09c 100644
--- a/nptl/pthread_mutex_trylock.c
+++ b/nptl/pthread_mutex_trylock.c
@@ -116,8 +116,12 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	      mutex->__data.__count = 1;
 	      /* But it is inconsistent unless marked otherwise.  */
 	      mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
-
+	      /* We must not enqueue the mutex before we have acquired it.
+		 Also see comments at ENQUEUE_MUTEX.  */
+	      __asm ("" ::: "memory");
 	      ENQUEUE_MUTEX (mutex);
+	      /* We need to clear op_pending after we enqueue the mutex.  */
+	      __asm ("" ::: "memory");
 	      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 
 	      /* Note that we deliberately exist here.  If we fall
@@ -177,7 +181,12 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	}
       while ((oldval & FUTEX_OWNER_DIED) != 0);
 
+      /* We must not enqueue the mutex before we have acquired it.
+         Also see comments at ENQUEUE_MUTEX.  */
+      __asm ("" ::: "memory");
       ENQUEUE_MUTEX (mutex);
+      /* We need to clear op_pending after we enqueue the mutex.  */
+      __asm ("" ::: "memory");
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 
       mutex->__data.__owner = id;
@@ -279,7 +288,11 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	    /* But it is inconsistent unless marked otherwise.  */
 	    mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
 
+	    /* We must not enqueue the mutex before we have acquired it.
+	       Also see comments at ENQUEUE_MUTEX.  */
+	    __asm ("" ::: "memory");
 	    ENQUEUE_MUTEX (mutex);
+	    __asm ("" ::: "memory");
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 
 	    /* Note that we deliberately exit here.  If we fall
@@ -308,7 +321,12 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
 
 	if (robust)
 	  {
+	    /* We must not enqueue the mutex before we have acquired it.
+	       Also see comments at ENQUEUE_MUTEX.  */
+	    __asm ("" ::: "memory");
 	    ENQUEUE_MUTEX_PI (mutex);
+	    /* We need to clear op_pending after we enqueue the mutex.  */
+	    __asm ("" ::: "memory");
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
 	  }
 


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-02-02 11:20                                                               ` Heiko Carstens
@ 2019-02-03 16:30                                                                 ` Thomas Gleixner
  2019-02-04 11:40                                                                   ` Heiko Carstens
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2019-02-03 16:30 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Sebastian Sewior, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Sat, 2 Feb 2019, Heiko Carstens wrote:

> On Sat, Feb 02, 2019 at 11:14:27AM +0100, Thomas Gleixner wrote:
> > On Sat, 2 Feb 2019, Heiko Carstens wrote:
> > So after the unlock @timestamp 337.215675 the kernel does not deal with
> > that futex at all until the failed lock attempt where it rightfully rejects
> > the attempt due to the alleged owner being gone.
> > 
> > So this looks more like user space doing something stupid...
> > 
> > As we talked about the missing barriers before, I just looked at
> > pthread_mutex_trylock() and that does still:
> > 
> > 	if (robust)
> >           {
> >             ENQUEUE_MUTEX_PI (mutex);
> >             THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> >           }
> > 
> > So it's missing the barriers which pthread_mutex_lock() has. Grasping for
> > straws obviously....

Looks more like a solid tree than a straw now. :)

> Excellent! Taking a look into the disassembly of nptl/pthread_mutex_trylock.o
> reveals this part:
> 
> 140:   a5 1b 00 01             oill    %r1,1
> 144:   e5 48 a0 f0 00 00       mvghi   240(%r10),0   <--- THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> 14a:   e3 10 a0 e0 00 24       stg     %r1,224(%r10) <--- last THREAD_SETMEM of ENQUEUE_MUTEX_PI

Awesome.

> I added a barrier between those two and now the code looks like this:
> 
> 140:   a5 1b 00 01             oill    %r1,1
> 144:   e3 10 a0 e0 00 24       stg     %r1,224(%r10)
> 14a:   e5 48 a0 f0 00 00       mvghi   240(%r10),0
> 
> Looks like this was a one instruction race...

Fun. JFYI, I said that I reversed the stores in glibc and on my x86 test VM
it took more than _3_ days to trigger. But the good news is that the trace
looks exactly like the ones you provided. So it looks like we are on the right
track.

> I'll try to reproduce with the patch below (sprinkling compiler
> barriers just like the other files have).

Looks about right.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered
  2019-02-03 16:30                                                                 ` Thomas Gleixner
@ 2019-02-04 11:40                                                                   ` Heiko Carstens
  0 siblings, 0 replies; 47+ messages in thread
From: Heiko Carstens @ 2019-02-04 11:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Sewior, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Martin Schwidefsky, LKML, linux-s390, Stefan Liebler

On Sun, Feb 03, 2019 at 05:30:39PM +0100, Thomas Gleixner wrote:
> On Sat, 2 Feb 2019, Heiko Carstens wrote:
> > I added a barrier between those two and now the code looks like this:
> > 
> > 140:   a5 1b 00 01             oill    %r1,1
> > 144:   e3 10 a0 e0 00 24       stg     %r1,224(%r10)
> > 14a:   e5 48 a0 f0 00 00       mvghi   240(%r10),0
> > 
> > Looks like this was a one instruction race...
> 
> Fun. JFYI, I said that I reversed the stores in glibc and on my x86 test VM
> it took more than _3_ days to trigger. But the good news is, that the trace
> looks exactly like the ones you provided. So it looks we are on the right
> track.
> 
> > I'll try to reproduce with the patch below (sprinkling compiler
> > barriers just like the other files have).
> 
> Looks about right.

The test case has now been running for two days without failures. So it looks
like you found the bug! Thank you for debugging this!

My glibc patch missed at least one place where I should have added another
barrier, but the current version was good enough for the test case ;)

Stefan Liebler is kind enough to take care of getting this fixed in glibc.

Thanks!


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [tip:locking/urgent] futex: Handle early deadlock return correctly
  2019-01-29 22:15             ` [PATCH] futex: Handle early deadlock return correctly Thomas Gleixner
  2019-01-30 12:01               ` Thomas Gleixner
@ 2019-02-08 12:05               ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: tip-bot for Thomas Gleixner @ 2019-02-08 12:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, bigeasy, tglx, peterz, heiko.carstens, hpa,
	stli, schwidefsky

Commit-ID:  1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434
Gitweb:     https://git.kernel.org/tip/1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 29 Jan 2019 23:15:12 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 8 Feb 2019 13:00:36 +0100

futex: Handle early deadlock return correctly

commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is no longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_proxy_start() due to
a detected deadlock was not updated accordingly. That allows a concurrent
unlocker to observe inconsistent state which triggers the warning in the
unlock path.

futex_lock_pi()                         futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)				lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                                        // acquired hb->lock
                                        hb_waiter = futex_top_waiter()
                                        lock(rtmutex->wait_lock)
  __rt_mutex_proxy_start()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                                        // acquired wait_lock
                                        wake_futex_pi()
                                        rt_mutex_next_owner()
					  --> returns NULL
                                          --> WARN

  lock(hb->lock)
  unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_proxy_start() in the same way and let
the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
removal for the failure case, so that the other callsites are still
operating correctly.

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.

Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

---
 kernel/futex.c           | 28 ++++++++++++++++++----------
 kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++++-----
 2 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5ec2473a3497..a0514e01c3eb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@ retry_private:
 	 * and BUG when futex_unlock_pi() interleaves with this.
 	 *
 	 * Therefore acquire wait_lock while holding hb->lock, but drop the
-	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
-	 * serializes against futex_unlock_pi() as that does the exact same
-	 * lock handoff sequence.
+	 * latter before calling __rt_mutex_start_proxy_lock(). This
+	 * interleaves with futex_unlock_pi() -- which does a similar lock
+	 * handoff -- such that the latter can observe the futex_q::pi_state
+	 * before __rt_mutex_start_proxy_lock() is done.
 	 */
 	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 	spin_unlock(q.lock_ptr);
+	/*
+	 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+	 * such that futex_unlock_pi() is guaranteed to observe the waiter when
+	 * it sees the futex_q::pi_state.
+	 */
 	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
 	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
 
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
-
-		spin_lock(q.lock_ptr);
-		goto no_block;
+		goto cleanup;
 	}
 
-
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+cleanup:
 	spin_lock(q.lock_ptr);
 	/*
-	 * If we failed to acquire the lock (signal/timeout), we must
+	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
-	 * wait lists consistent.
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+	 * lists consistent.
 	 *
 	 * In particular; it is important that futex_unlock_pi() can not
 	 * observe this inconsistency.
@@ -3013,6 +3017,10 @@ retry:
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
 		 * observed.
+		 *
+		 * In particular; this forces __rt_mutex_start_proxy() to
+		 * complete such that we're guaranteed to observe the
+		 * rt_waiter. Also see the WARN in wake_futex_pi().
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 581edcc63c26..978d63a8261c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 	rt_mutex_set_owner(lock, NULL);
 }
 
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:		the rt_mutex to take
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ * @task:		the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
 int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
 
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 		ret = 0;
 	}
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
  * @waiter:		the pre-initialized rt_mutex_waiter
  * @task:		the task to prepare
  *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
  * Returns:
  *  0 - task blocked on lock
  *  1 - acquired the lock for task, caller should wake it up
  * <0 - error
  *
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
  */
 int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 
 	raw_spin_lock_irq(&lock->wait_lock);
 	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	if (unlikely(ret))
+		remove_waiter(lock, waiter);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
  * @lock:		the rt_mutex we were woken on
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
  *
  * Unless we acquired the lock; we're still enqueued on the wait-list and can
  * in fact still be granted ownership until we're removed. Therefore we can

^ permalink raw reply related	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2019-02-08 12:05 UTC | newest]

Thread overview: 47+ messages
2018-11-27  8:11 WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
2018-11-28 14:32 ` Thomas Gleixner
2018-11-29 11:23   ` Heiko Carstens
2019-01-21 12:21     ` Heiko Carstens
2019-01-21 13:12       ` Thomas Gleixner
2019-01-22 21:14         ` Thomas Gleixner
2019-01-23  9:24           ` Heiko Carstens
2019-01-23 12:33             ` Thomas Gleixner
2019-01-23 12:40               ` Heiko Carstens
2019-01-28 13:44     ` Peter Zijlstra
2019-01-28 13:58       ` Peter Zijlstra
2019-01-28 15:53         ` Thomas Gleixner
2019-01-29  8:49           ` Peter Zijlstra
2019-01-29 22:15             ` [PATCH] futex: Handle early deadlock return correctly Thomas Gleixner
2019-01-30 12:01               ` Thomas Gleixner
2019-02-08 12:05               ` [tip:locking/urgent] " tip-bot for Thomas Gleixner
2019-01-29  9:01           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Heiko Carstens
2019-01-29  9:33             ` Peter Zijlstra
2019-01-29  9:45             ` Thomas Gleixner
2019-01-29 10:24               ` Heiko Carstens
2019-01-29 10:35                 ` Peter Zijlstra
2019-01-29 13:03                   ` Thomas Gleixner
2019-01-29 13:23                     ` Heiko Carstens
     [not found]                       ` <20190129151058.GG26906@osiris>
2019-01-29 17:16                         ` Sebastian Sewior
2019-01-29 21:45                           ` Thomas Gleixner
     [not found]                           ` <20190130094913.GC5299@osiris>
2019-01-30 12:15                             ` Thomas Gleixner
     [not found]                               ` <20190130125955.GD5299@osiris>
2019-01-30 13:24                                 ` Sebastian Sewior
2019-01-30 13:29                                   ` Thomas Gleixner
2019-01-30 14:33                                     ` Thomas Gleixner
2019-01-30 17:56                                       ` Thomas Gleixner
2019-01-30 21:07                                         ` Sebastian Sewior
2019-01-30 23:13                                           ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Thomas Gleixner
2019-01-30 23:35                                             ` Paul E. McKenney
2019-01-30 23:55                                               ` Thomas Gleixner
2019-01-31  0:27                                                 ` Thomas Gleixner
2019-01-31  1:45                                                   ` Paul E. McKenney
2019-01-31 16:52                                                   ` Heiko Carstens
2019-01-31 17:06                                                     ` Sebastian Sewior
2019-01-31 20:42                                                       ` Heiko Carstens
2019-02-01 16:12                                                       ` Heiko Carstens
2019-02-01 21:59                                                         ` Thomas Gleixner
     [not found]                                                           ` <20190202091043.GA3381@osiris>
2019-02-02 10:14                                                             ` Thomas Gleixner
2019-02-02 11:20                                                               ` Heiko Carstens
2019-02-03 16:30                                                                 ` Thomas Gleixner
2019-02-04 11:40                                                                   ` Heiko Carstens
2019-01-31  1:44                                                 ` Paul E. McKenney
2019-01-30 13:25                                 ` WARN_ON_ONCE(!new_owner) within wake_futex_pi() triggered Thomas Gleixner
