All of lore.kernel.org
 help / color / mirror / Atom feed
* process 'stuck' at exit.
@ 2013-12-10 15:47 Dave Jones
  2013-12-10 18:40 ` Linus Torvalds
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 15:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel

I woke up to find my fuzzer in a curious state.

 1121 pts/5    SN+    0:00  |       \_ ../trinity -q -l off -N 999999 -C 42
 1130 pts/5    SN+    0:01  |           \_ ../trinity -q -l off -N 999999 -C 42
 1131 pts/5    SN+    0:17  |           \_ ../trinity -q -l off -N 999999 -C 42
10818 ?        RNs  21115152:53  |               \_ ../trinity -q -l off -N 999999 -C 42

(ignore the first 3 pids, they're waiting on 10818, which is the interesting one)

It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit.

/proc/10818/stack just shows

[<ffffffffffffffff>] 0xffffffffffffffff

Top reports a core is spinning in the kernel 100%, so I ran perf top -a
and saw..

  8.63%  [kernel]                [k] trace_hardirqs_off_caller
  7.68%  [kernel]                [k] __lock_acquire
  5.35%  [kernel]                [k] gup_huge_pmd
  5.31%  [kernel]                [k] put_compound_page
  4.93%  [kernel]                [k] gup_pud_range
  4.76%  [kernel]                [k] trace_hardirqs_on_caller
  4.58%  [kernel]                [k] get_user_pages_fast
  4.00%  [kernel]                [k] debug_smp_processor_id
  4.00%  [kernel]                [k] lock_is_held
  3.73%  [kernel]                [k] lock_acquired
  3.67%  [kernel]                [k] lock_release


sysrq-t shows..

trinity-child27 R  running task     5520 10818   1131 0x00080004
 0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108
 0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0
 ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50
Call Trace:
 [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
 [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
 [<ffffffff8109624f>] ? local_clock+0xf/0x50
 [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
 [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
 [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
 [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
 [<ffffffff810a8a2f>] ? up_read+0x1f/0x40
 [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
 [<ffffffff8115f76c>] ? put_page+0x3c/0x50
 [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
 [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
 [<ffffffff810e019e>] ? do_futex+0xae/0xc80
 [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
 [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
 [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
 [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
 [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
 [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
 [<ffffffff81760be4>] ? tracesys+0xdd/0xe2


any ideas ?

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 15:47 process 'stuck' at exit Dave Jones
@ 2013-12-10 18:40 ` Linus Torvalds
  2013-12-10 19:18   ` Thomas Gleixner
  2013-12-10 20:57   ` Darren Hart
  0 siblings, 2 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 18:40 UTC (permalink / raw)
  To: Dave Jones, Thomas Gleixner, Darren Hart, Andrea Arcangeli
  Cc: Linux Kernel Mailing List

Hmm. Looks like the futex code is somehow stuck in a loop, calling
get_user_pages_fast().

The futex code itself is apparently so low-overhead that it doesn't
show up in your 'perf top' report (which is dominated by all the
expensive debug things that get_user_pages_fast() etc ends up doing),
but that's the only looping I can see. Perhaps the "goto again" case
for transparent huge pages in get_futex_key()? Or the
"retry[_private]" cases in futex_requeue()? Some error condition that
causes us to retry forever, rather than returning the error code?

Added a few more people to the cc.. Ideas?

              Linus


On Tue, Dec 10, 2013 at 7:47 AM, Dave Jones <davej@redhat.com> wrote:
> I woke up to find my fuzzer in a curious state.
>
>  1121 pts/5    SN+    0:00  |       \_ ../trinity -q -l off -N 999999 -C 42
>  1130 pts/5    SN+    0:01  |           \_ ../trinity -q -l off -N 999999 -C 42
>  1131 pts/5    SN+    0:17  |           \_ ../trinity -q -l off -N 999999 -C 42
> 10818 ?        RNs  21115152:53  |               \_ ../trinity -q -l off -N 999999 -C 42
>
> (ignore the first 3 pids, they're waiting on 10818, which is the interesting one)
>
> It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit.
>
> /proc/10818/stack just shows
>
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Top reports a core is spinning in the kernel 100%, so I ran perf top -a
> and saw..
>
>   8.63%  [kernel]                [k] trace_hardirqs_off_caller
>   7.68%  [kernel]                [k] __lock_acquire
>   5.35%  [kernel]                [k] gup_huge_pmd
>   5.31%  [kernel]                [k] put_compound_page
>   4.93%  [kernel]                [k] gup_pud_range
>   4.76%  [kernel]                [k] trace_hardirqs_on_caller
>   4.58%  [kernel]                [k] get_user_pages_fast
>   4.00%  [kernel]                [k] debug_smp_processor_id
>   4.00%  [kernel]                [k] lock_is_held
>   3.73%  [kernel]                [k] lock_acquired
>   3.67%  [kernel]                [k] lock_release
>
>
> sysrq-t shows..
>
> trinity-child27 R  running task     5520 10818   1131 0x00080004
>  0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108
>  0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0
>  ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50
> Call Trace:
>  [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
>  [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
>  [<ffffffff8109624f>] ? local_clock+0xf/0x50
>  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
>  [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
>  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
>  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810a8a2f>] ? up_read+0x1f/0x40
>  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
>  [<ffffffff8115f76c>] ? put_page+0x3c/0x50
>  [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
>  [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
>  [<ffffffff810e019e>] ? do_futex+0xae/0xc80
>  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
>  [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
>  [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
>  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
>  [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
>  [<ffffffff81760be4>] ? tracesys+0xdd/0xe2
>
>
> any ideas ?
>
>         Dave
>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 18:40 ` Linus Torvalds
@ 2013-12-10 19:18   ` Thomas Gleixner
  2013-12-10 19:55     ` Linus Torvalds
  2013-12-11  1:02     ` Mel Gorman
  2013-12-10 20:57   ` Darren Hart
  1 sibling, 2 replies; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 19:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 10 Dec 2013, Linus Torvalds wrote:

> Hmm. Looks like the futex code is somehow stuck in a loop, calling
> get_user_pages_fast().
> 
> The futex code itself is apparently so low-overhead that it doesn't
> show up in your 'perf top' report (which is dominated by all the
> expensive debug things that get_user_pages_fast() etc ends up doing),
> but that's the only looping I can see. Perhaps the "goto again" case
> for transparent huge pages in get_futex_key()? Or the

Cc'ng more folks on that.

> "retry[_private]" cases in futex_requeue()? Some error condition that
> causes us to retry forever, rather than returning the error code?

So this is pretty unlikely. The retry requires:

   get_futex_value_locked() == EFAULT;
   
Now we drop the hash bucket locks and do:

   get_user();

And if that get_user() faults again, we bail out.

If it succeeds we try again. So between the get_user() and the next
get_futex_value_locked() the effect of get_user() must have been
undone. I guess this can happen, but a gazillion times in a row?

> 
> Added a few more people to the cc.. Ideas?

...

> > Top reports a core is spinning in the kernel 100%, so I ran perf top -a
> > and saw..
> >
> >   8.63%  [kernel]                [k] trace_hardirqs_off_caller
> >   7.68%  [kernel]                [k] __lock_acquire
> >   5.35%  [kernel]                [k] gup_huge_pmd
> >   5.31%  [kernel]                [k] put_compound_page
> >   4.93%  [kernel]                [k] gup_pud_range
> >   4.76%  [kernel]                [k] trace_hardirqs_on_caller
> >   4.58%  [kernel]                [k] get_user_pages_fast
> >   4.00%  [kernel]                [k] debug_smp_processor_id
> >   4.00%  [kernel]                [k] lock_is_held
> >   3.73%  [kernel]                [k] lock_acquired
> >   3.67%  [kernel]                [k] lock_release
> >
> >
> > sysrq-t shows..
> >
> > trinity-child27 R  running task     5520 10818   1131 0x00080004
> >  0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108
> >  0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0
> >  ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50
> > Call Trace:
> >  [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
> >  [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
> >  [<ffffffff8109624f>] ? local_clock+0xf/0x50
> >  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
> >  [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
> >  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
> >  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> >  [<ffffffff810a8a2f>] ? up_read+0x1f/0x40
> >  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
> >  [<ffffffff8115f76c>] ? put_page+0x3c/0x50
> >  [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
> >  [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
> >  [<ffffffff810e019e>] ? do_futex+0xae/0xc80
> >  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
> >  [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
> >  [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
> >  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> >  [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
> >  [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
> >  [<ffffffff81760be4>] ? tracesys+0xdd/0xe2
> >
> >
> > any ideas ?
> >
> >         Dave
> >
> 

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 19:18   ` Thomas Gleixner
@ 2013-12-10 19:55     ` Linus Torvalds
  2013-12-10 20:27       ` Dave Jones
                         ` (2 more replies)
  2013-12-11  1:02     ` Mel Gorman
  1 sibling, 3 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 19:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> So this is pretty unlikely. The retry requires:
>
>    get_futex_value_locked() == EFAULT;
>
> Now we drop the hash bucket locks and do:
>
>    get_user();
>
> And if that get_user() faults again, we bail out.

I think you need to look closer.

We have at least also that "futex_proxy_trylock_atomic() returns
-EAGAIN" case. Which triggers at some exit condition. Another thread
in the same group, perhaps never completing the exit because it's
waiting for this one? I dunno, I didn't look any closer (but this does
make me think "Hey, we should add Oleg to the Cc too", since
PF_EXITING is involved).. So maybe there is some situation where that
EAGAIN will keep happening, forever..

Now, I'm *not* saying that that is it. It's quite possible/likely some
other loop, but I do have to say that it sure isn't _obvious_. And
that whole EAGAIN return case is quite deep and special, so ...

                 Linus

PS: Oleg - the whole thread is on lkml. Ping me if you need more context.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 19:55     ` Linus Torvalds
@ 2013-12-10 20:27       ` Dave Jones
  2013-12-10 20:34         ` Thomas Gleixner
  2013-12-10 20:33       ` Thomas Gleixner
  2013-12-10 20:35       ` Oleg Nesterov
  2 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 20:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote:
 > On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
 > >
 > > So this is pretty unlikely. The retry requires:
 > >
 > >    get_futex_value_locked() == EFAULT;
 > >
 > > Now we drop the hash bucket locks and do:
 > >
 > >    get_user();
 > >
 > > And if that get_user() faults again, we bail out.
 > 
 > I think you need to look closer.
 > 
 > We have at least also that "futex_proxy_trylock_atomic() returns
 > -EAGAIN" case. Which triggers at some exit condition. Another thread
 > in the same group, perhaps never completing the exit because it's
 > waiting for this one? I dunno, I didn't look any closer (but this does
 > make me think "Hey, we should add Oleg to the Cc too", since
 > PF_EXITING is involved).. So maybe there is some situation where that
 > EAGAIN will keep happening, forever..
 > 
 > Now, I'm *not* saying that that is it. It's quite possible/likely some
 > other loop, but I do have to say that it sure isn't _obvious_. And
 > that whole EAGAIN return case is quite deep and special, so ...
 > 
 >                  Linus
 > 
 > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.

btw, I've left the machine in that state, and will for as long as necesary
in case someone has any ideas for further tracing experiments.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 19:55     ` Linus Torvalds
  2013-12-10 20:27       ` Dave Jones
@ 2013-12-10 20:33       ` Thomas Gleixner
  2013-12-10 20:43         ` Linus Torvalds
  2013-12-10 20:35       ` Oleg Nesterov
  2 siblings, 1 reply; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > So this is pretty unlikely. The retry requires:
> >
> >    get_futex_value_locked() == EFAULT;
> >
> > Now we drop the hash bucket locks and do:
> >
> >    get_user();
> >
> > And if that get_user() faults again, we bail out.
> 
> I think you need to look closer.
> 
> We have at least also that "futex_proxy_trylock_atomic() returns
> -EAGAIN" case. Which triggers at some exit condition. Another thread
> in the same group, perhaps never completing the exit because it's
> waiting for this one? I dunno, I didn't look any closer (but this does
> make me think "Hey, we should add Oleg to the Cc too", since
> PF_EXITING is involved).. So maybe there is some situation where that
> EAGAIN will keep happening, forever..
> 
> Now, I'm *not* saying that that is it. It's quite possible/likely some
> other loop, but I do have to say that it sure isn't _obvious_. And
> that whole EAGAIN return case is quite deep and special, so ...

The -EAGAIN is when the user value changed, simplified:

    oldval = *uval;
    sys_futex(....., oldval)
      do_futex(...) {		     
	if (oldval != *uval)
	   return -EAGAIN;
      }
    sys_exit();

And user space loops on that.

Thanks,

	tglx
	
	

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:27       ` Dave Jones
@ 2013-12-10 20:34         ` Thomas Gleixner
  2013-12-10 20:55           ` Dave Jones
  0 siblings, 1 reply; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 20:34 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, 10 Dec 2013, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote:
>  > On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>  > >
>  > > So this is pretty unlikely. The retry requires:
>  > >
>  > >    get_futex_value_locked() == EFAULT;
>  > >
>  > > Now we drop the hash bucket locks and do:
>  > >
>  > >    get_user();
>  > >
>  > > And if that get_user() faults again, we bail out.
>  > 
>  > I think you need to look closer.
>  > 
>  > We have at least also that "futex_proxy_trylock_atomic() returns
>  > -EAGAIN" case. Which triggers at some exit condition. Another thread
>  > in the same group, perhaps never completing the exit because it's
>  > waiting for this one? I dunno, I didn't look any closer (but this does
>  > make me think "Hey, we should add Oleg to the Cc too", since
>  > PF_EXITING is involved).. So maybe there is some situation where that
>  > EAGAIN will keep happening, forever..
>  > 
>  > Now, I'm *not* saying that that is it. It's quite possible/likely some
>  > other loop, but I do have to say that it sure isn't _obvious_. And
>  > that whole EAGAIN return case is quite deep and special, so ...
>  > 
>  >                  Linus
>  > 
>  > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> 
> btw, I've left the machine in that state, and will for as long as necesary
> in case someone has any ideas for further tracing experiments.

Can you gather a trace with the function tracer? That will tell us
what the thing is actually doing.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 19:55     ` Linus Torvalds
  2013-12-10 20:27       ` Dave Jones
  2013-12-10 20:33       ` Thomas Gleixner
@ 2013-12-10 20:35       ` Oleg Nesterov
  2013-12-10 20:49         ` Dave Jones
  2 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-10 20:35 UTC (permalink / raw)
  To: Linus Torvalds, Dave Jones
  Cc: Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

Dave, I must have missed something, help.

I am looking at the first message and I can't understand who stuck
"at exit".

The trace shows that the task with pid=10818 called sys_futex() ?

Perhaps "exit" means the userspace paths?

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:33       ` Thomas Gleixner
@ 2013-12-10 20:43         ` Linus Torvalds
  2013-12-10 21:17           ` Thomas Gleixner
  0 siblings, 1 reply; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 20:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The -EAGAIN is when the user value changed, simplified:

No it's not.

Thomas, stop this crap already. Look at the f*cking code carefully
instead of just dismissing cases.

The worrisome EAGAIN case is

  futex_requeue
    futex_proxy_trylock_atomic
      futex_lock_pi_atomic
        lookup_pi_state:
          ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;

and now futex_requeue() will do "goto repeat" for that EAGAIN case.

So, Christ, Thomas, you have now *twice* dismissed a real concern with
totally bogus "that can never happen" by explaining some totally
unrelated *simple* case rather than the much more complex case.

So please. Really. Truly look at the code and thing about it, or shut
the f*ck up. No more of this shit where you glance at the code, find
some simple case, and say "that can't happen", and dismiss the
bug-report.

So far Dave's bug-reports have generally pretty much universally shown
real bugs. Being dismissive about it is not helpful, quite the
reverse.

Maybe the loop I'm pointing at cannot happen, but *your* explanation
for why it couldn't happen was pure and utter garbage, and was clearly
because you hadn't even bothered to look at all the cases.

                Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:35       ` Oleg Nesterov
@ 2013-12-10 20:49         ` Dave Jones
  2013-12-10 21:06           ` Darren Hart
  2013-12-10 21:34           ` Oleg Nesterov
  0 siblings, 2 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 20:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
 > Dave, I must have missed something, help.
 > 
 > I am looking at the first message and I can't understand who stuck
 > "at exit".
 > 
 > The trace shows that the task with pid=10818 called sys_futex() ?
 >
 > Perhaps "exit" means the userspace paths?

pid 1131 is wait()'ing for 10818 to exit

pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
tired of waiting. 10818 is ignoring these because it's stuck in a loop
somewhere in the kernel.

I tried attaching to 10818 with gdb, and it just hangs.
(possibly because its weird stack situation [see 1st post])

by inspecting the shared mapping that all processes have (by gdb'ing 1130)
I can see that 10818 did all its full run without incident, and the
"exit child" flag in the fuzzer had been in set.

The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call
must come from somewhere in libc maybe ?

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:34         ` Thomas Gleixner
@ 2013-12-10 20:55           ` Dave Jones
  2013-12-10 21:25             ` Darren Hart
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 20:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
 
 > >  > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
 > > 
 > > btw, I've left the machine in that state, and will for as long as necesary
 > > in case someone has any ideas for further tracing experiments.
 > 
 > Can you gather a trace with the function tracer? That will tell us
 > what the thing is actually doing.

Feed me command lines, and I'll see what I can do.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 18:40 ` Linus Torvalds
  2013-12-10 19:18   ` Thomas Gleixner
@ 2013-12-10 20:57   ` Darren Hart
  2013-12-10 21:09     ` Dave Jones
  1 sibling, 1 reply; 114+ messages in thread
From: Darren Hart @ 2013-12-10 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List

On Tue, 2013-12-10 at 10:40 -0800, Linus Torvalds wrote:
> Hmm. Looks like the futex code is somehow stuck in a loop, calling
> get_user_pages_fast().
> 
> The futex code itself is apparently so low-overhead that it doesn't
> show up in your 'perf top' report (which is dominated by all the
> expensive debug things that get_user_pages_fast() etc ends up doing),
> but that's the only looping I can see. Perhaps the "goto again" case
> for transparent huge pages in get_futex_key()? Or the
> "retry[_private]" cases in futex_requeue()? Some error condition that
> causes us to retry forever, rather than returning the error code?
> 
> Added a few more people to the cc.. Ideas?
> 
>               Linus
> 
> 
> On Tue, Dec 10, 2013 at 7:47 AM, Dave Jones <davej@redhat.com> wrote:
> > I woke up to find my fuzzer in a curious state.
> >
> >  1121 pts/5    SN+    0:00  |       \_ ../trinity -q -l off -N 999999 -C 42
> >  1130 pts/5    SN+    0:01  |           \_ ../trinity -q -l off -N 999999 -C 42
> >  1131 pts/5    SN+    0:17  |           \_ ../trinity -q -l off -N 999999 -C 42
> > 10818 ?        RNs  21115152:53  |               \_ ../trinity -q -l off -N 999999 -C 42
> >
> > (ignore the first 3 pids, they're waiting on 10818, which is the interesting one)
> >
> > It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit.
> >
> > /proc/10818/stack just shows
> >
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > Top reports a core is spinning in the kernel 100%, so I ran perf top -a
> > and saw..
> >
> >   8.63%  [kernel]                [k] trace_hardirqs_off_caller
> >   7.68%  [kernel]                [k] __lock_acquire
> >   5.35%  [kernel]                [k] gup_huge_pmd
> >   5.31%  [kernel]                [k] put_compound_page
> >   4.93%  [kernel]                [k] gup_pud_range
> >   4.76%  [kernel]                [k] trace_hardirqs_on_caller
> >   4.58%  [kernel]                [k] get_user_pages_fast
> >   4.00%  [kernel]                [k] debug_smp_processor_id
> >   4.00%  [kernel]                [k] lock_is_held
> >   3.73%  [kernel]                [k] lock_acquired
> >   3.67%  [kernel]                [k] lock_release
> >
> >
> > sysrq-t shows..
> >
> > trinity-child27 R  running task     5520 10818   1131 0x00080004
> >  0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108
> >  0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0
> >  ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50
> > Call Trace:
> >  [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
> >  [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
> >  [<ffffffff8109624f>] ? local_clock+0xf/0x50
> >  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
> >  [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
> >  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
> >  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> >  [<ffffffff810a8a2f>] ? up_read+0x1f/0x40
> >  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
> >  [<ffffffff8115f76c>] ? put_page+0x3c/0x50
> >  [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
> >  [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
> >  [<ffffffff810e019e>] ? do_futex+0xae/0xc80
> >  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
> >  [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
> >  [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
> >  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> >  [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
> >  [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
> >  [<ffffffff81760be4>] ? tracesys+0xdd/0xe2
> >

Hi Dave,

Can you get us an idea of the arguments trinity is tossing into
SYS_futex?

Op code? Would help to know if this was requeue_pi for example.
Type of memory being used for the uaddr?

I see futex_requeue in the stack, which means the opcode is one of:

FUTEX_REQUEUE
FUTEX_CMP_REQUEUE
FUTEX_CMP_REQUEUE_PI

FUTEX_REQUEUE has a known issue and was replaced with FUTEX_CMP_REQUEUE,
for details, test cases, and an analysis see the historic tree:

commit 9b91d73bde9d68800f9e5c338c0cf9d0fe3bc862
Author: Andrew Morton <akpm@osdl.org>
Date:   2004-05-31

    [PATCH] Add FUTEX_CMP_REQUEUE futex op

Specifically:
http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html


Trinity is going to trigger hangs in futexes just by it's very nature,
but I believe you have watchdogs in place to kill such malformed tests
after a timeout?

I'll keep digging.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:49         ` Dave Jones
@ 2013-12-10 21:06           ` Darren Hart
  2013-12-10 21:12             ` Dave Jones
  2013-12-10 21:18             ` Linus Torvalds
  2013-12-10 21:34           ` Oleg Nesterov
  1 sibling, 2 replies; 114+ messages in thread
From: Darren Hart @ 2013-12-10 21:06 UTC (permalink / raw)
  To: Dave Jones
  Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 2013-12-10 at 15:49 -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
>  > Dave, I must have missed something, help.
>  > 
>  > I am looking at the first message and I can't understand who stuck
>  > "at exit".
>  > 
>  > The trace shows that the task with pid=10818 called sys_futex() ?
>  >
>  > Perhaps "exit" means the userspace paths?
> 
> pid 1131 is wait()'ing for 10818 to exit
> 
> pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
> tired of waiting. 10818 is ignoring these because it's stuck in a loop
> somewhere in the kernel.
> 
> I tried attaching to 10818 with gdb, and it just hangs.
> (possibly because its weird stack situation [see 1st post])
> 
> by inspecting the shared mapping that all processes have (by gdb'ing 1130)
> I can see that 10818 did all its full run without incident, and the
> "exit child" flag in the fuzzer had been in set.
> 
> The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call
> must come from somewhere in libc maybe ?

If that is the case, then Linus' requeue_pi path is highly unlikely as
FUTEX_CMP_REQUEUE_PI is not used by glibc (yet). That gives me hope as
that way there be dragons. Knowing exactly what syscall was made would
be very useful, but I don't know if that information is even available
anymore.

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:57   ` Darren Hart
@ 2013-12-10 21:09     ` Dave Jones
  0 siblings, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 21:09 UTC (permalink / raw)
  To: Darren Hart
  Cc: Linus Torvalds, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List

On Tue, Dec 10, 2013 at 12:57:57PM -0800, Darren Hart wrote:


 > > > Call Trace:
 > > >  [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
 > > >  [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 > > >  [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
 > > >  [<ffffffff8109624f>] ? local_clock+0xf/0x50
 > > >  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
 > > >  [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
 > > >  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
 > > >  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
 > > >  [<ffffffff810a8a2f>] ? up_read+0x1f/0x40
 > > >  [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
 > > >  [<ffffffff8115f76c>] ? put_page+0x3c/0x50
 > > >  [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
 > > >  [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
 > > >  [<ffffffff810e019e>] ? do_futex+0xae/0xc80
 > > >  [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
 > > >  [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
 > > >  [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
 > > >  [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
 > > >  [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
 > > >  [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
 > > >  [<ffffffff81760be4>] ? tracesys+0xdd/0xe2
 > > >
 > 
 > Can you get us an idea of the arguments trinity is tossing into
 > SYS_futex?
 > 
 > Op code? Would help to know if this was requeue_pi for example.
 > Type of memory being used for the uaddr?

As is always the case, the interesting bugs only seem to happen
when I have logging disabled. So other than what I can glean from what's
left in the shm, no idea.

One of the other child processes (which exited already) did do a sys_futex.
the params it passed were..

1cb5000, -1, c57, 1cb5004, ffffffffffd8f420, 90000000091a6311

The result of this syscall was -1

 > I see futex_requeue in the stack, which means the opcode is one of:
 > 
 > FUTEX_REQUEUE
 > FUTEX_CMP_REQUEUE
 > FUTEX_CMP_REQUEUE_PI
 > 
 > FUTEX_REQUEUE has a known issue and was replaced with FUTEX_CMP_REQUEUE,
 > for details, test cases, and an analysis see the historic tree:
 > 
 > commit 9b91d73bde9d68800f9e5c338c0cf9d0fe3bc862
 > Author: Andrew Morton <akpm@osdl.org>
 > Date:   2004-05-31
 > 
 >     [PATCH] Add FUTEX_CMP_REQUEUE futex op
 > 
 > Specifically:
 > http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html
 > 
 > 
 > Trinity is going to trigger hangs in futexes just by it's very nature,
 > but I believe you have watchdogs in place to kill such malformed tests
 > after a timeout?

It should. Though that pid is happily ignoring the SIGKILL's the watchdog
is continuing to send, because it's never getting around to processing the
signals apparently.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:06           ` Darren Hart
@ 2013-12-10 21:12             ` Dave Jones
  2013-12-10 21:18             ` Linus Torvalds
  1 sibling, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 21:12 UTC (permalink / raw)
  To: Darren Hart
  Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 01:06:23PM -0800, Darren Hart wrote:
 > On Tue, 2013-12-10 at 15:49 -0500, Dave Jones wrote:
 > > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
 > >  > Dave, I must have missed something, help.
 > >  > 
 > >  > I am looking at the first message and I can't understand who stuck
 > >  > "at exit".
 > >  > 
 > >  > The trace shows that the task with pid=10818 called sys_futex() ?
 > >  >
 > >  > Perhaps "exit" means the userspace paths?
 > > 
 > > pid 1131 is wait()'ing for 10818 to exit
 > > 
 > > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
 > > tired of waiting. 10818 is ignoring these because it's stuck in a loop
 > > somewhere in the kernel.
 > > 
 > > I tried attaching to 10818 with gdb, and it just hangs.
 > > (possibly because its weird stack situation [see 1st post])
 > > 
 > > by inspecting the shared mapping that all processes have (by gdb'ing 1130)
 > > I can see that 10818 did all its full run without incident, and the
 > > "exit child" flag in the fuzzer had been in set.
 > > 
 > > The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call
 > > must come from somewhere in libc maybe ?
 > 
 > If that is the case, then Linus' requeue_pi path is highly unlikely as
 > FUTEX_CMP_REQUEUE_PI is not used by glibc (yet). That gives me hope as
 > that way there be dragons. Knowing exactly what syscall was made would
 > be very useful, but I don't know if that information is even available
 > anymore.

So that last syscall _that_ 'stuck' thread did was accept4, but I found
another pid that had done a futex call just before it exited.
(see other mail)

What I don't understand is how the running child has futex as part of
its stack trace, when the internal log that trinity keeps has on record
for that particular pid having called it.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:43         ` Linus Torvalds
@ 2013-12-10 21:17           ` Thomas Gleixner
  0 siblings, 0 replies; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > The -EAGAIN is when the user value changed, simplified:
> 
> No it's not.
> 
> Thomas, stop this crap already. Look at the f*cking code carefully
> instead of just dismissing cases.
> 
> The worrisome EAGAIN case is
> 
>   futex_requeue
>     futex_proxy_trylock_atomic
>       futex_lock_pi_atomic
>         lookup_pi_state:
>           ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
> 
> and now futex_requeue() will do "goto repeat" for that EAGAIN case.
> 
> So, Christ, Thomas, you have now *twice* dismissed a real concern with
> totally bogus "that can never happen" by explaining some totally
> unrelated *simple* case rather than the much more complex case.
> 
> So please. Really. Truly look at the code and thing about it, or shut
> the f*ck up. No more of this shit where you glance at the code, find
> some simple case, and say "that can't happen", and dismiss the
> bug-report.

Well, I spent a fricking long time to work on that code and find the
absurdest bugs in it and I'm well aware of the exit issue.
 
> So far Dave's bug-reports have generally pretty much universally shown
> real bugs. Being dismissive about it is not helpful, quite the
> reverse.

I know and I used that information more than once to carefully track
down the real reason. I never dismissed a single report.
 
> Maybe the loop I'm pointing at cannot happen, but *your* explanation
> for why it couldn't happen was pure and utter garbage, and was clearly
> because you hadn't even bothered to look at all the cases.

I might have been sloppy and not really explaining why I think, that
the requeue_pi exit case is not likely.

To make that loop happen it requires the following:

1) An actual user of requeue_pi, which can only be the fuzzer as glibc
   does not support it. But Daves last fuzzed syscall was something
   else.

2) The report said, that the last fuzzing test was already done and
   the fuzzer app is about to exit.

   That involves futex_requeue from deep inside glibc thread
   library. And that is NOT using REQUEUE_PI.

3) So now even IF it would use requeue_pi then this would require,
   that the outer PI lock is already held by some other task which is
   already in the process of exiting. IOW, this lock would be held by
   something which did not release it before exit and already set
   PF_EXITING in exit_signals() and then got stuck for ever before
   reaching: tsk->flags |= PF_EXITPIDONE;

I still think, that it is highly unlikely, but to make sure I already
asked Dave before reading your rant to fire up the tracer, so we know
for sure where hell the thing is looping.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:06           ` Darren Hart
  2013-12-10 21:12             ` Dave Jones
@ 2013-12-10 21:18             ` Linus Torvalds
  2013-12-10 21:24               ` Linus Torvalds
  2013-12-10 21:32               ` Dave Jones
  1 sibling, 2 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 21:18 UTC (permalink / raw)
  To: Darren Hart
  Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 1:06 PM, Darren Hart <dvhart@linux.intel.com> wrote:
>
>    . Knowing exactly what syscall was made would
> be very useful, but I don't know if that information is even available
> anymore.

Well, the loop should be visible in the profile, since it's still active.

So doing something like

     perf record -e cycles:pp -g -a sleep 60

to get a good profile for a minute, and then looking at the
instruction-level profiles in futex_requeue() should be possible.

Of course, inlining etc often makes those rather hard to see, and the
bulk of the profile is clearly all about the lock debugging overhead,
but if the loop is in futex_requeue() then that *should* be visible in
the profile, even if it may not be anywhere near the top..

Dave?

               Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:18             ` Linus Torvalds
@ 2013-12-10 21:24               ` Linus Torvalds
  2013-12-10 21:32               ` Dave Jones
  1 sibling, 0 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 21:24 UTC (permalink / raw)
  To: Darren Hart
  Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 1:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> to get a good profile for a minute, and then looking at the
> instruction-level profiles in futex_requeue() should be possible.

Ugh. Looking at kernel/futex.s even *without* debugging enabled is
pretty messy. Although much of it seems to be because the hash is
actually fairly complex. jhash2() really ends up expanding to a lot of
instructions.

But especially if that kernel was compiled with debug information,
then 'perf report' is pretty good about matching it up with the source
code, so it might not be *too* hard to try to figure out where the
loop is.

And if it's not in futex_requeue() (there are loops at other levels,
like futex_get_key()), then having the callchain information for the
costly operations will hopefully give a hint where the top loop is. So
profiling data can be very powerful for things like this.

              Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:55           ` Dave Jones
@ 2013-12-10 21:25             ` Darren Hart
  2013-12-10 21:28               ` Thomas Gleixner
  0 siblings, 1 reply; 114+ messages in thread
From: Darren Hart @ 2013-12-10 21:25 UTC (permalink / raw)
  To: Dave Jones, Steven Rostedt
  Cc: Thomas Gleixner, Linus Torvalds, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
>  
>  > >  > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
>  > > 
>  > > btw, I've left the machine in that state, and will for as long as necesary
>  > > in case someone has any ideas for further tracing experiments.
>  > 
>  > Can you gather a trace with the function tracer? That will tell us
>  > what the thing is actually doing.
> 
> Feed me command lines, and I'll see what I can do.
> 
> 	Dave
> 

So let's see if we can determine the call chain inside futex_requeue, if
there is one. We want to see if we are calling the following functions:

futex.c: free_pi_state
futex.c: get_futex_value_locked
futex.c: futex_proxy_trylock_atomic
futex.c: lookup_pi_state
futex.c: fault_in_user_writeable

I'm hoping we can make use of ftrace here?

Consider:

$ trace-cmd record -P <PID> -p function -l '*futex*'

And to try outside of futex:

$ trace-cmd record -P <PID> -p function

Then ^C after a few seconds. The trace will be in trace.dat



-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:25             ` Darren Hart
@ 2013-12-10 21:28               ` Thomas Gleixner
  2013-12-10 21:39                 ` Steven Rostedt
  0 siblings, 1 reply; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 21:28 UTC (permalink / raw)
  To: Darren Hart
  Cc: Dave Jones, Steven Rostedt, Linus Torvalds, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, 10 Dec 2013, Darren Hart wrote:
> On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote:
> > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
> >  
> >  > >  > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> >  > > 
> >  > > btw, I've left the machine in that state, and will for as long as necesary
> >  > > in case someone has any ideas for further tracing experiments.
> >  > 
> >  > Can you gather a trace with the function tracer? That will tell us
> >  > what the thing is actually doing.
> > 
> > Feed me command lines, and I'll see what I can do.
> > 
> > 	Dave
> > 
> 
> So let's see if we can determine the call chain inside futex_requeue, if
> there is one. We want to see if we are calling the following functions:
> 
> futex.c: free_pi_state
> futex.c: get_futex_value_locked
> futex.c: futex_proxy_trylock_atomic
> futex.c: lookup_pi_state
> futex.c: fault_in_user_writeable
> 
> I'm hoping we can make use of ftrace here?
> 
> Consider:
> 
> $ trace-cmd record -P <PID> -p function -l '*futex*'
> 
> And to try outside of futex:
> 
> $ trace-cmd record -P <PID> -p function
> 
> Then ^C after a few seconds. The trace will be in trace.dat

We definitely want the latter. The futex code gets compiled into some
badly traceable mess.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:18             ` Linus Torvalds
  2013-12-10 21:24               ` Linus Torvalds
@ 2013-12-10 21:32               ` Dave Jones
  2013-12-10 21:49                 ` Linus Torvalds
  1 sibling, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 01:18:20PM -0800, Linus Torvalds wrote:
 > On Tue, Dec 10, 2013 at 1:06 PM, Darren Hart <dvhart@linux.intel.com> wrote:
 > >
 > >    . Knowing exactly what syscall was made would
 > > be very useful, but I don't know if that information is even available
 > > anymore.
 > 
 > Well, the loop should be visible in the profile, since it's still active.
 > 
 > So doing something like
 > 
 >      perf record -e cycles:pp -g -a sleep 60
 > 
 > to get a good profile for a minute, and then looking at the
 > instruction-level profiles in futex_requeue() should be possible.
 > 
 > Of course, inlining etc often makes those rather hard to see, and the
 > bulk of the profile is clearly all about the lock debugging overhead,
 > but if the loop is in futex_requeue() then that *should* be visible in
 > the profile, even if it may not be anywhere near the top..
 > 
 > Dave?

http://www.codemonkey.org.uk/junk/perf.data.xz

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 20:49         ` Dave Jones
  2013-12-10 21:06           ` Darren Hart
@ 2013-12-10 21:34           ` Oleg Nesterov
  2013-12-10 21:41             ` Dave Jones
  1 sibling, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-10 21:34 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Thomas Gleixner, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On 12/10, Dave Jones wrote:
>
> On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
>  >
>  > I am looking at the first message and I can't understand who stuck
>  > "at exit".
>  >
>  > The trace shows that the task with pid=10818 called sys_futex() ?
>  >
>  > Perhaps "exit" means the userspace paths?
>
> pid 1131 is wait()'ing for 10818 to exit
>
> pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
> tired of waiting. 10818 is ignoring these because it's stuck in a loop
> somewhere in the kernel.

OK, thanks. So it doesn't return to user-space.

could you do

	cd /sys/kernel/debug/tracing/
	echo 10818 >> set_ftrace_pid
	echo function_graph >> current_tracer
	echo 1 >> tracing_on

and look into "trace" file to find out how exactly it loops?

> I tried attaching to 10818 with gdb, and it just hangs.

This is clear, 10818 obviously can't stop.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:28               ` Thomas Gleixner
@ 2013-12-10 21:39                 ` Steven Rostedt
  0 siblings, 0 replies; 114+ messages in thread
From: Steven Rostedt @ 2013-12-10 21:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, Dave Jones, Linus Torvalds, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Oleg Nesterov

On Tue, 10 Dec 2013 22:28:01 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, 10 Dec 2013, Darren Hart wrote:
> > On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote:
> > > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
> > >  
> > >  > >  > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> > >  > > 
> > >  > > btw, I've left the machine in that state, and will for as long as necesary
> > >  > > in case someone has any ideas for further tracing experiments.
> > >  > 
> > >  > Can you gather a trace with the function tracer? That will tell us
> > >  > what the thing is actually doing.
> > > 
> > > Feed me command lines, and I'll see what I can do.
> > > 
> > > 	Dave
> > > 
> > 
> > So let's see if we can determine the call chain inside futex_requeue, if
> > there is one. We want to see if we are calling the following functions:
> > 
> > futex.c: free_pi_state
> > futex.c: get_futex_value_locked
> > futex.c: futex_proxy_trylock_atomic
> > futex.c: lookup_pi_state
> > futex.c: fault_in_user_writeable
> > 
> > I'm hoping we can make use of ftrace here?
> > 
> > Consider:
> > 
> > $ trace-cmd record -P <PID> -p function -l '*futex*'

Note the above just picks functions that have "futex" in its name, and
will not include all the functions you listed above.

> > 
> > And to try outside of futex:
> > 
> > $ trace-cmd record -P <PID> -p function
> > 
> > Then ^C after a few seconds. The trace will be in trace.dat
> 
> We definitely want the latter. The futex code gets compiled into some
> badly traceable mess.

Yeah, lets use all functions. If its in a loop, we should see exactly
what is happening, with full function tracing.

You may be able to simplify it to just:

	trace-cmd record -p function sleep 10

and then read the trace. Nasty loopers usually dominate these types of
traces so no need to filter on them in the recording.

-- Steve

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:34           ` Oleg Nesterov
@ 2013-12-10 21:41             ` Dave Jones
  2013-12-10 21:57               ` Linus Torvalds
  2013-12-10 22:09               ` Steven Rostedt
  0 siblings, 2 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 21:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 10:34:31PM +0100, Oleg Nesterov wrote:
 > On 12/10, Dave Jones wrote:
 > >
 > > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
 > >  >
 > >  > I am looking at the first message and I can't understand who stuck
 > >  > "at exit".
 > >  >
 > >  > The trace shows that the task with pid=10818 called sys_futex() ?
 > >  >
 > >  > Perhaps "exit" means the userspace paths?
 > >
 > > pid 1131 is wait()'ing for 10818 to exit
 > >
 > > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
 > > tired of waiting. 10818 is ignoring these because it's stuck in a loop
 > > somewhere in the kernel.
 > 
 > OK, thanks. So it doesn't return to user-space.
 > 
 > could you do
 > 
 > 	cd /sys/kernel/debug/tracing/
 > 	echo 10818 >> set_ftrace_pid
 > 	echo function_graph >> current_tracer
 > 	echo 1 >> tracing_on
 > 
 > and look into "trace" file to find out how exactly it loops?

http://codemonkey.org.uk/junk/trace

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:32               ` Dave Jones
@ 2013-12-10 21:49                 ` Linus Torvalds
  2013-12-10 21:56                   ` Dave Jones
  2013-12-11 12:45                   ` Ingo Molnar
  0 siblings, 2 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 21:49 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Darren Hart, Oleg Nesterov,
	Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <davej@redhat.com> wrote:
>
> http://www.codemonkey.org.uk/junk/perf.data.xz

 "Forbidden

  You don't have permission to access /junk/perf.data.xz on this server."

also, we'd need the vmlinux file to actually decode the data, I think.

              Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:49                 ` Linus Torvalds
@ 2013-12-10 21:56                   ` Dave Jones
  2013-12-10 21:59                     ` Linus Torvalds
  2013-12-11 12:45                   ` Ingo Molnar
  1 sibling, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 01:49:19PM -0800, Linus Torvalds wrote:
 > On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > http://www.codemonkey.org.uk/junk/perf.data.xz
 > 
 >  "Forbidden
 > 
 >   You don't have permission to access /junk/perf.data.xz on this server."

fixed.

 > also, we'd need the vmlinux file to actually decode the data, I think.

Hmm, the only vmlinux I have still around is newer than the running kernel,
so that's not going to be much help.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:41             ` Dave Jones
@ 2013-12-10 21:57               ` Linus Torvalds
  2013-12-10 22:02                 ` Dave Jones
                                   ` (3 more replies)
  2013-12-10 22:09               ` Steven Rostedt
  1 sibling, 4 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 21:57 UTC (permalink / raw)
  To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <davej@redhat.com> wrote:
>
> http://codemonkey.org.uk/junk/trace

Hmm. Ok, so something is calling [__]get_user_pages_fast() and
put_page() in a loop, but the trace doesn't show what that "something"
is, because it is itself not ever called.

However, that pattern does seem to imply that the loop is in
get_futex_key(), because all the other loops I see seem to be calling
other things as well.

And the __get_user_pages_fast() call implies that it's the THP case
that triggers the "unlikely(PageTail(page))" case. And anyway,
otherwise we'd see lock_page()/unlock_page() too.

So it looks like __get_user_pages_fast() fails, and keeps failing.
Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp:
update futex compound knowledge") to be exact.

                   Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:56                   ` Dave Jones
@ 2013-12-10 21:59                     ` Linus Torvalds
  2013-12-10 22:07                       ` Dave Jones
  0 siblings, 1 reply; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 21:59 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Darren Hart, Oleg Nesterov,
	Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 1:56 PM, Dave Jones <davej@redhat.com> wrote:
>
> Hmm, the only vmlinux I have still around is newer than the running kernel,
> so that's not going to be much help.

Ok, /proc/kallsyms would do it, but never mind. I think you already
pinpointed where the loop is with the trace file, so no need for
trying to decode the profile.

            Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:57               ` Linus Torvalds
@ 2013-12-10 22:02                 ` Dave Jones
  2013-12-10 22:09                 ` Oleg Nesterov
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 22:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 01:57:49PM -0800, Linus Torvalds wrote:
 > On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > http://codemonkey.org.uk/junk/trace
 > 
 > Hmm. Ok, so something is calling [__]get_user_pages_fast() and
 > put_page() in a loop, but the trace doesn't show what that "something"
 > is, because it is itself not ever called.
 > 
 > However, that pattern does seem to imply that the loop is in
 > get_futex_key(), because all the other loops I see seem to be calling
 > other things as well.
 > 
 > And the __get_user_pages_fast() call implies that it's the THP case
 > that triggers the "unlikely(PageTail(page))" case. And anyway,
 > otherwise we'd see lock_page()/unlock_page() too.
 > 
 > So it looks like __get_user_pages_fast() fails, and keeps failing.
 > Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp:
 > update futex compound knowledge") to be exact.

So, a reason that this might only be showing up now, is that in the last
week I added support to trinity to explicitly do huge page mmaps in the children,
whereas before it only ever did that with MAP_SHARED in the main pid, and then
every child inherited them on fork().

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:59                     ` Linus Torvalds
@ 2013-12-10 22:07                       ` Dave Jones
  0 siblings, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 01:59:30PM -0800, Linus Torvalds wrote:
 > On Tue, Dec 10, 2013 at 1:56 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > Hmm, the only vmlinux I have still around is newer than the running kernel,
 > > so that's not going to be much help.
 > 
 > Ok, /proc/kallsyms would do it, but never mind. I think you already
 > pinpointed where the loop is with the trace file, so no need for
 > trying to decode the profile.

just for completeness in case someone else is following along and prefers to try that:
http://www.codemonkey.org.uk/junk/syms

	Dave



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:57               ` Linus Torvalds
  2013-12-10 22:02                 ` Dave Jones
@ 2013-12-10 22:09                 ` Oleg Nesterov
  2013-12-10 22:19                 ` Linus Torvalds
  2013-12-12 19:00                 ` Andrea Arcangeli
  3 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-10 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/10, Linus Torvalds wrote:
>
> So it looks like __get_user_pages_fast() fails, and keeps failing.

And "again:" does get_user_pages_fast().

And according to this trace get_user_pages_fast() takes mmap_sem
and calls __get_user_pages().

But __get_user_pages() should fail even before follow_page_mask()
due to fatal_signal_pending(), in this case get_futex_key() should
return the error. is_vm_hugetlb_page(vma).

So it seems that this is not THP but VM_HUGETLB and
follow_hugetlb_page() succeeds every time?

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:41             ` Dave Jones
  2013-12-10 21:57               ` Linus Torvalds
@ 2013-12-10 22:09               ` Steven Rostedt
  2013-12-10 22:16                 ` Dave Jones
  1 sibling, 1 reply; 114+ messages in thread
From: Steven Rostedt @ 2013-12-10 22:09 UTC (permalink / raw)
  To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 04:41:43PM -0500, Dave Jones wrote:
>  > 
>  > OK, thanks. So it doesn't return to user-space.
>  > 
>  > could you do
>  > 
>  > 	cd /sys/kernel/debug/tracing/
>  > 	echo 10818 >> set_ftrace_pid
>  > 	echo function_graph >> current_tracer
>  > 	echo 1 >> tracing_on
>  > 
>  > and look into "trace" file to find out how exactly it loops?
> 
> http://codemonkey.org.uk/junk/trace

Because we are already in the function that is looping, we don't see what
that function is (it's never called).

So you can do either:

trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5

Which will trace the get_user_pages_fast and spit out a full call trace.

Or if you don't want to use trace-cmd, you can do it by hand.
But be warned! If you don't do this right, you can live lock the system.
Or make it extremely slow. That is, you must have a filter on the
functions you trace before you set the function stack trace flag.
(/me needs to prevent that from happening)

cd /sys/kernel/debug/tracing
echo get_user_pages_fast > set_ftrace_filter
cat set_ftrace_filter
# make sure get_user_pages_fast is there
echo function > current_tracer
echo 1 > options/func_stack_trace

read your trace. Either by:

  cat trace

or

  trace-cmd show

And then after you recorded that.

echo 0 > options/func_stack_trace

to make sure you don't accidently enable stack tracing on *all*
functions. I haven't had that really live lock the system, but
it took about two minutes to disable it again, as each key stroke
took several seconds to compete.

-- Steve


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:09               ` Steven Rostedt
@ 2013-12-10 22:16                 ` Dave Jones
  2013-12-10 22:21                   ` Steven Rostedt
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 22:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On Tue, Dec 10, 2013 at 05:09:46PM -0500, Steven Rostedt wrote:
 
 > So you can do either:
 > 
 > trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5
 > 
 > Which will trace the get_user_pages_fast and spit out a full call trace.
This gives me a 100M trace.dat, but trace show shows..

# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

and that's it.
 
 > Or if you don't want to use trace-cmd, you can do it by hand.

Given the state the system is in, I don't want to perturb it too much.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:57               ` Linus Torvalds
  2013-12-10 22:02                 ` Dave Jones
  2013-12-10 22:09                 ` Oleg Nesterov
@ 2013-12-10 22:19                 ` Linus Torvalds
  2013-12-10 22:33                   ` Linus Torvalds
  2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
  2013-12-12 19:00                 ` Andrea Arcangeli
  3 siblings, 2 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 22:19 UTC (permalink / raw)
  To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it looks like __get_user_pages_fast() fails, and keeps failing.

Hmm.. Is any of the addresses unchecked, perhaps?
__get_user_pages_fast() does an access_ok() check, while
get_user_pages_fast() does *not* seem to do one.

That looks a bit dangerous. Yeah, users should have checked the
address range, but there really is no reason not to do it in
get_user_pages_fast().

And it looks like the futex code is actually seriously buggered. It
only does the access_ok() check for the non-shared case.

Why?

Shouldn't we do something like the attached?

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1379 bytes --]

 arch/x86/mm/gup.c | 4 +---
 kernel/futex.c    | 5 +++--
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46828c0..61e726597599 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -326,10 +326,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (end < start)
 		goto slow_irqon;
 
-#ifdef CONFIG_X86_64
-	if (end >> __VIRTUAL_MASK_SHIFT)
+	if (end > TASK_SIZE_MAX)
 		goto slow_irqon;
-#endif
 
 	/*
 	 * XXX: batch / limit 'nr', to avoid large irq off latency
diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086f021d..23c12a35dca2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -251,6 +251,9 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 		return -EINVAL;
 	address -= key->both.offset;
 
+	if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
+		return -EFAULT;
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
@@ -259,8 +262,6 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	 *        but access_ok() should be faster than find_vma()
 	 */
 	if (!fshared) {
-		if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
-			return -EFAULT;
 		key->private.mm = mm;
 		key->private.address = address;
 		get_futex_key_refs(key);

^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:16                 ` Dave Jones
@ 2013-12-10 22:21                   ` Steven Rostedt
  2013-12-10 22:27                     ` Dave Jones
  0 siblings, 1 reply; 114+ messages in thread
From: Steven Rostedt @ 2013-12-10 22:21 UTC (permalink / raw)
  To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 05:16:21PM -0500, Dave Jones wrote:
> 46.GA6962@home.goodmis.org>
> User-Agent: Mutt/1.5.21 (2010-09-15)
> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.23
> Sender:	linux-kernel-owner@vger.kernel.org
> Precedence: bulk
> List-ID: <linux-kernel.vger.kernel.org>
> X-Mailing-List:	linux-kernel@vger.kernel.org
> X-UID: 421684                                        
> Content-Length: 1147
> Status: O
> 
> On Tue, Dec 10, 2013 at 05:09:46PM -0500, Steven Rostedt wrote:
>  
>  > So you can do either:
>  > 
>  > trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5
>  > 
>  > Which will trace the get_user_pages_fast and spit out a full call trace.
> This gives me a 100M trace.dat, but trace show shows..
> 
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 0/0   #P:4
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
> 
> and that's it.

Oh, if you use trace-cmd record. You need to do trace-cmd report.

The trace-cmd show, shows you what's in the trace file, which was for the
"manual" version.

Sorry for the confusion.

-- Steve

>  
>  > Or if you don't want to use trace-cmd, you can 

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:21                   ` Steven Rostedt
@ 2013-12-10 22:27                     ` Dave Jones
  0 siblings, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 22:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On Tue, Dec 10, 2013 at 05:21:50PM -0500, Steven Rostedt wrote:
 
 > The trace-cmd show, shows you what's in the trace file, which was for the
 > "manual" version.
 > 
 > Sorry for the confusion.

 ah, thanks.

That shows..

version = 6
CPU 0 is empty
CPU 2 is empty
CPU 3 is empty
cpus=4
 trinity-child27-10818 [001] 89790.703544: kernel_stack:         <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)
 trinity-child27-10818 [001] 89790.703545: function:             get_user_pages_fast
 trinity-child27-10818 [001] 89790.703547: kernel_stack:         <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)
 trinity-child27-10818 [001] 89790.703547: function:             get_user_pages_fast
 trinity-child27-10818 [001] 89790.703549: kernel_stack:         <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)
 trinity-child27-10818 [001] 89790.703550: function:             get_user_pages_fast
 trinity-child27-10818 [001] 89790.703552: kernel_stack:         <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)


over and over and..

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:19                 ` Linus Torvalds
@ 2013-12-10 22:33                   ` Linus Torvalds
  2013-12-10 22:38                     ` Darren Hart
                                       ` (2 more replies)
  2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
  1 sibling, 3 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 22:33 UTC (permalink / raw)
  To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Shouldn't we do something like the attached?

So I think that kernel/futex.c part of the patch might be a good idea,
but on x86-64 (which is what Dave is running), the

        if (end >> __VIRTUAL_MASK_SHIFT)

test in get_user_pages_fast() should have been equivalent. And even on
32-bit, we do check the _PAGE_USER bits in the page tables, so I guess
it's all good on a get_user_pages_fast() side.

So never mind. It's not the address checking.

And I think I see what's up.

I think what happens is:
 - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
 - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
 - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).

so what triggers this is likely that Dave now does large-pages, and
one of them is a read-only mapping.

So I would suggest replacing the second "1" in the
__get_user_pages_fast() call with a "!ro" instead. So how about this
second patch instead (the access_ok() move remains).

Comments?

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1255 bytes --]

 kernel/futex.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086f021d..6272f560385c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -251,6 +251,9 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 		return -EINVAL;
 	address -= key->both.offset;
 
+	if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
+		return -EFAULT;
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
@@ -259,8 +262,6 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	 *        but access_ok() should be faster than find_vma()
 	 */
 	if (!fshared) {
-		if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
-			return -EFAULT;
 		key->private.mm = mm;
 		key->private.address = address;
 		get_futex_key_refs(key);
@@ -288,7 +289,7 @@ again:
 		put_page(page);
 		/* serialize against __split_huge_page_splitting() */
 		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
 			page_head = compound_head(page);
 			/*
 			 * page_head is valid pointer but we must pin

^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:33                   ` Linus Torvalds
@ 2013-12-10 22:38                     ` Darren Hart
  2013-12-10 22:57                     ` Thomas Gleixner
  2013-12-11 17:08                     ` Oleg Nesterov
  2 siblings, 0 replies; 114+ messages in thread
From: Darren Hart @ 2013-12-10 22:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 2013-12-10 at 14:33 -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Shouldn't we do something like the attached?
> 
> So I think that kernel/futex.c part of the patch might be a good idea,
> but on x86-64 (which is what Dave is running), the
> 
>         if (end >> __VIRTUAL_MASK_SHIFT)
> 
> test in get_user_pages_fast() should have been equivalent. And even on
> 32-bit, we do check the _PAGE_USER bits in the page tables, so I guess
> it's all good on a get_user_pages_fast() side.
> 
> So never mind. It's not the address checking.
> 
> And I think I see what's up.
> 
> I think what happens is:
>  - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
>  - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
>  - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
> 
> so what triggers this is likely that Dave now does large-pages, and
> one of them is a read-only mapping.
> 
> So I would suggest replacing the second "1" in the
> __get_user_pages_fast() call with a "!ro" instead. So how about this
> second patch instead (the access_ok() move remains).
> 
> Comments?
> 

You're too fast for me, but I'm trying to keep up.

It was my understanding that access_ok() was an optimization for private
futexes, but something more heavy weight was required for shared. I
believe that was find_vma() in the past, but Peter Z changed that with
fast gup. Trying to page that all in now and get the exact details....

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:19                 ` Linus Torvalds
  2013-12-10 22:33                   ` Linus Torvalds
@ 2013-12-10 22:42                   ` Thomas Gleixner
  2013-12-10 22:48                     ` Linus Torvalds
                                       ` (3 more replies)
  1 sibling, 4 replies; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So it looks like __get_user_pages_fast() fails, and keeps failing.
> 
> Hmm.. Is any of the addresses unchecked, perhaps?
> __get_user_pages_fast() does an access_ok() check, while
> get_user_pages_fast() does *not* seem to do one.
> 
> That looks a bit dangerous. Yeah, users should have checked the
> address range, but there really is no reason not to do it in
> get_user_pages_fast().
> 
> And it looks like the futex code is actually seriously buggered. It
> only does the access_ok() check for the non-shared case.
> 
> Why?

The !fshared case is the fast path which does not even reach
get_user_pages_fast().

We had this discussion some time ago already, where the access_ok()
check was missing in the !fshared case or the check was buggered for
some reason. Need to dig up the gory details.

And yes, I remember that we do not do an extra check for the fshared
case, because get_user_pages_fast() should do it for us already. If
not we are fubared not only in the futex code.

But there is a subtle detail:

    err = get_user_pages_fast(address, 1, 1, &page);

So we ask for write access as the write argument is 1. In case that
fails we have that fallback path:

        /*
         * If write access is not required (eg. FUTEX_WAIT), try
         * and get read-only access.
         */
        if (err == -EFAULT && rw == VERIFY_READ) {
                err = get_user_pages_fast(address, 1, 0, &page);

That's a legitimate use case. And futex_requeue only requests
VERIFY_READ for the !requeue_pi case.

Now, if that map is RO, i.e. we took the fallback path then the THP
one will fail as it has write=1 unconditionally.

      if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) 

Thanks,

	tglx
 

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
@ 2013-12-10 22:48                     ` Linus Torvalds
  2013-12-10 22:58                       ` Darren Hart
                                         ` (3 more replies)
  2013-12-10 22:51                     ` Darren Hart
                                       ` (2 subsequent siblings)
  3 siblings, 4 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 22:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> And yes, I remember that we do not do an extra check for the fshared
> case, because get_user_pages_fast() should do it for us already. If
> not we are fubared not only in the futex code.

Yeah. It turns out we do do the access check indirectly - by looking
at the PAGE_USER bit, even if we don't necessarily check the actual
limits. So get_user_pages_fast() is fine.

> But there is a subtle detail:

Yup, see my email from ten minutes ago, we found the same thing. And
that would seem to explain the endless loop, and also the timing
(since Dave mentions he started doing large-pages lately).

So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
should work.

Dave, can you re-create that trinity run and test that patch? I think
we've got this, but it might be nice to leave the hung machine up and
running until it's verified.. Although I don't really see what else we
could need or get out of it, so..

             Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
  2013-12-10 22:48                     ` Linus Torvalds
@ 2013-12-10 22:51                     ` Darren Hart
  2013-12-10 23:24                     ` Al Viro
  2013-12-11 16:31                     ` Mel Gorman
  3 siblings, 0 replies; 114+ messages in thread
From: Darren Hart @ 2013-12-10 22:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 2013-12-10 at 23:42 +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
> 
> > On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > So it looks like __get_user_pages_fast() fails, and keeps failing.
> > 
> > Hmm.. Is any of the addresses unchecked, perhaps?
> > __get_user_pages_fast() does an access_ok() check, while
> > get_user_pages_fast() does *not* seem to do one.
> > 
> > That looks a bit dangerous. Yeah, users should have checked the
> > address range, but there really is no reason not to do it in
> > get_user_pages_fast().
> > 
> > And it looks like the futex code is actually seriously buggered. It
> > only does the access_ok() check for the non-shared case.
> > 
> > Why?
> 
> The !fshared case is the fast path which does not even reach
> get_user_pages_fast().
> 
> We had this discussion some time ago already, where the access_ok()
> check was missing in the !fshared case or the check was buggered for
> some reason. Need to dig up the gory details.
> 
> And yes, I remember that we do not do an extra check for the fshared
> case, because get_user_pages_fast() should do it for us already. If
> not we are fubared not only in the futex code.
> 
> But there is a subtle detail:
> 
>     err = get_user_pages_fast(address, 1, 1, &page);
> 
> So we ask for write access as the write argument is 1. In case that
> fails we have that fallback path:
> 
>         /*
>          * If write access is not required (eg. FUTEX_WAIT), try
>          * and get read-only access.
>          */
>         if (err == -EFAULT && rw == VERIFY_READ) {
>                 err = get_user_pages_fast(address, 1, 0, &page);
> 
> That's a legitimate use case. And futex_requeue only requests
> VERIFY_READ for the !requeue_pi case.
> 
> Now, if that map is RO, i.e. we took the fallback path then the THP
> one will fail as it has write=1 unconditionally.
> 
>       if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) 
> 

Is there a reason THP requires unconditional rw? Andrea?

Or is the following actually the answer here?

diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086..02febad 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -288,7 +288,7 @@ again:
                put_page(page);
                /* serialize against __split_huge_page_splitting() */
                local_irq_disable();
-               if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+               if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
                        page_head = compound_head(page);
                        /*
                         * page_head is valid pointer but we must pin




-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:33                   ` Linus Torvalds
  2013-12-10 22:38                     ` Darren Hart
@ 2013-12-10 22:57                     ` Thomas Gleixner
  2013-12-10 23:05                       ` Linus Torvalds
  2013-12-11 17:08                     ` Oleg Nesterov
  2 siblings, 1 reply; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 22:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Shouldn't we do something like the attached?
> 
> So I think that kernel/futex.c part of the patch might be a good idea,
> but on x86-64 (which is what Dave is running), the
> 
>         if (end >> __VIRTUAL_MASK_SHIFT)
> 
> test in get_user_pages_fast() should have been equivalent. And even on
> 32-bit, we do check the _PAGE_USER bits in the page tables, so I guess
> it's all good on a get_user_pages_fast() side.
> 
> So never mind. It's not the address checking.
> 
> And I think I see what's up.
> 
> I think what happens is:
>  - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
>  - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
>  - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
> 
> so what triggers this is likely that Dave now does large-pages, and
> one of them is a read-only mapping.
> 
> So I would suggest replacing the second "1" in the
> __get_user_pages_fast() call with a "!ro" instead. So how about this
> second patch instead (the access_ok() move remains).

Yes, that's what I decoded as well.

But how does the access_ok() move do anything helpful here?

We really need it for the fastpath !fshared case, but for the fshared
case you actively break working code, because you force a VERIFY_WRITE
check into it. The VERIFY_WRITE is necessary for !fshared, because
there is no way that one thread can map the futex RO and the other RW,
right?

But for fshared it's legitimate to have a RO mapping if you just wait
for the futex. Note, that futexes are (ab)used as user space
waitqueues, so RO makes sense. And your move would break them.

If [__]get_user_pages_fast() does not do the right checks, then we
need to fix that and not add random access_ok() checks into the call
sites.

Thanks,

	tglx






^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:48                     ` Linus Torvalds
@ 2013-12-10 22:58                       ` Darren Hart
  2013-12-10 23:01                         ` Dave Jones
  2013-12-10 23:00                       ` Dave Jones
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 114+ messages in thread
From: Darren Hart @ 2013-12-10 22:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Dave Jones, Oleg Nesterov, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 2013-12-10 at 14:48 -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > And yes, I remember that we do not do an extra check for the fshared
> > case, because get_user_pages_fast() should do it for us already. If
> > not we are fubared not only in the futex code.
> 
> Yeah. It turns out we do do the access check indirectly - by looking
> at the PAGE_USER bit, even if we don't necessarily check the actual
> limits. So get_user_pages_fast() is fine.
> 
> > But there is a subtle detail:
> 
> Yup, see my email from ten minutes ago, we found the same thing. And
> that would seem to explain the endless loop, and also the timing
> (since Dave mentions he started doing large-pages lately).
> 
> So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
> should work.
> 
> Dave, can you re-create that trinity run and test that patch? I think
> we've got this, but it might be nice to leave the hung machine up and
> running until it's verified.. Although I don't really see what else we
> could need or get out of it, so..

Would it be possible to limit the options to only pass FUTEX_CMP_REQUEUE
and a read-only uaddr? That should improve confidence when it doesn't
fail :-)

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:48                     ` Linus Torvalds
  2013-12-10 22:58                       ` Darren Hart
@ 2013-12-10 23:00                       ` Dave Jones
  2013-12-11  0:05                         ` Steven Rostedt
  2013-12-11  4:09                       ` Dave Jones
  2013-12-12  4:26                       ` Dave Jones
  3 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-10 23:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:
 > On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
 > >
 > > And yes, I remember that we do not do an extra check for the fshared
 > > case, because get_user_pages_fast() should do it for us already. If
 > > not we are fubared not only in the futex code.
 > 
 > Yeah. It turns out we do do the access check indirectly - by looking
 > at the PAGE_USER bit, even if we don't necessarily check the actual
 > limits. So get_user_pages_fast() is fine.
 > 
 > > But there is a subtle detail:
 > 
 > Yup, see my email from ten minutes ago, we found the same thing. And
 > that would seem to explain the endless loop, and also the timing
 > (since Dave mentions he started doing large-pages lately).
 > 
 > So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
 > should work.
 > 
 > Dave, can you re-create that trinity run and test that patch? I think
 > we've got this, but it might be nice to leave the hung machine up and
 > running until it's verified.. Although I don't really see what else we
 > could need or get out of it, so..

The only thing I'm still unclear on, is how that pid allegedly wasn't doing
a futex call as part of its run. The only thing I can think of is that
the other pid that _did_ do a futex call did it on a page that was MAP_SHARED
between all the other children, and this 'spin forever' thing only
happens when the last process with a reference on that page exits ?

does that make sense?

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:58                       ` Darren Hart
@ 2013-12-10 23:01                         ` Dave Jones
  0 siblings, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-10 23:01 UTC (permalink / raw)
  To: Darren Hart
  Cc: Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 02:58:19PM -0800, Darren Hart wrote:
 > On Tue, 2013-12-10 at 14:48 -0800, Linus Torvalds wrote:
 > > On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
 > > >
 > > > And yes, I remember that we do not do an extra check for the fshared
 > > > case, because get_user_pages_fast() should do it for us already. If
 > > > not we are fubared not only in the futex code.
 > > 
 > > Yeah. It turns out we do do the access check indirectly - by looking
 > > at the PAGE_USER bit, even if we don't necessarily check the actual
 > > limits. So get_user_pages_fast() is fine.
 > > 
 > > > But there is a subtle detail:
 > > 
 > > Yup, see my email from ten minutes ago, we found the same thing. And
 > > that would seem to explain the endless loop, and also the timing
 > > (since Dave mentions he started doing large-pages lately).
 > > 
 > > So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
 > > should work.
 > > 
 > > Dave, can you re-create that trinity run and test that patch? I think
 > > we've got this, but it might be nice to leave the hung machine up and
 > > running until it's verified.. Although I don't really see what else we
 > > could need or get out of it, so..
 > 
 > Would it be possible to limit the options to only pass FUTEX_CMP_REQUEUE
 > and a read-only uaddr? That should improve confidence when it doesn't
 > fail :-)

easy enough to hack into the code yeah. A bit complicated to come up with
a sensible grammar for a command line parser for such cases sadly.

I'll give the patch a try after dinner.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:57                     ` Thomas Gleixner
@ 2013-12-10 23:05                       ` Linus Torvalds
  2013-12-10 23:28                         ` Thomas Gleixner
  2013-12-10 23:31                         ` Al Viro
  0 siblings, 2 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-10 23:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 2:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> But how does the access_ok() move do anything helpful here?

Just making it all more obvious.

> We really need it for the fastpath !fshared case, but for the fshared
> case you actively break working code, because you force a VERIFY_WRITE
> check into it. The VERIFY_WRITE is necessary for !fshared, because
> there is no way that one thread can map the futex RO and the other RW,
> right?

Nobody actually uses that argument any more (it goes back to the old
i386 "let's manually verify that we have write permissions, because
the CPU doesn't do it for us in the trap handling"), and it should
probably be removed.

But you're right that it's at least misleading. I'd love to remove it
entirely, because it's not even syntax-checked, and it's confusing.
But that would be a humongous patch.

So these days, "access_ok()" literally just checks that the address is
in the user address space range. And that would seem to always be
appropriate for futexes, so why not just do it in the generic code?

             Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
  2013-12-10 22:48                     ` Linus Torvalds
  2013-12-10 22:51                     ` Darren Hart
@ 2013-12-10 23:24                     ` Al Viro
  2013-12-11 16:31                     ` Mel Gorman
  3 siblings, 0 replies; 114+ messages in thread
From: Al Viro @ 2013-12-10 23:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
>         /*
>          * If write access is not required (eg. FUTEX_WAIT), try
>          * and get read-only access.
>          */
>         if (err == -EFAULT && rw == VERIFY_READ) {
>                 err = get_user_pages_fast(address, 1, 0, &page);
> 
> That's a legitimate use case. And futex_requeue only requests
> VERIFY_READ for the !requeue_pi case.
> 
> Now, if that map is RO, i.e. we took the fallback path then the THP
> one will fail as it has write=1 unconditionally.

access_ok() has nothing whatsoever to do with RO vs. RW mappings.
It checks whether the address is OK for userland on architectures
with userland and kernel sharing the same address space (e.g. x86).
On something like e.g. sparc64 or s390 it's constant 1.

Note that there's nothing to stop another thread from remapping an RW
area RO just as you've returned from access_ok(), so checking for
writability in access_ok() would've been racy as hell.  Ditto for
address being mapped at all...

Moreover, there are exactly two architectures that do not ignore the
first argument of access_ok() - microblaze and um.  The former uses
it in debugging printk in failure case.  The latter... AFAICS, it's
pointless - it's a special dispensation for read access to host
vsyscall page from guest process.  The thing is, writes there are
going to fail anyway - host kernel won't let the guest kernel to
modify that page, period.  IOW, it looks like um might as well drop
the (type == VERIFY_READ) part in __access_ok_vsyscall().

Why do we have the 'type' argument of access_ok(), anyway?

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 23:05                       ` Linus Torvalds
@ 2013-12-10 23:28                         ` Thomas Gleixner
  2013-12-10 23:31                         ` Al Viro
  1 sibling, 0 replies; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-10 23:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 2:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > But how does the access_ok() move do anything helpful here?
> 
> Just making it all more obvious.
> 
> > We really need it for the fastpath !fshared case, but for the fshared
> > case you actively break working code, because you force a VERIFY_WRITE
> > check into it. The VERIFY_WRITE is necessary for !fshared, because
> > there is no way that one thread can map the futex RO and the other RW,
> > right?
> 
> Nobody actually uses that argument any more (it goes back to the old
> i386 "let's manually verify that we have write permissions, because
> the CPU doesn't do it for us in the trap handling"), and it should
> probably be removed.

Fair enough.
 
> But you're right that it's at least misleading. I'd love to remove it
> entirely, because it's not even syntax-checked, and it's confusing.
> But that would be a humongous patch.

Well, we should ask Julia for a coccinelle patch to limit the
wreckage. :)

Seriously, if that VERIFY_WRITE is completely useless we really want
to get rid of it.
 
> So these days, "access_ok()" literally just checks that the address is
> in the user address space range. And that would seem to always be
> appropriate for futexes, so why not just do it in the generic code?

Agreed, but as long as the VERIFY_WRITE argument is there it needs at
least a big fat comment :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 23:05                       ` Linus Torvalds
  2013-12-10 23:28                         ` Thomas Gleixner
@ 2013-12-10 23:31                         ` Al Viro
  1 sibling, 0 replies; 114+ messages in thread
From: Al Viro @ 2013-12-10 23:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Dave Jones, Oleg Nesterov, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On Tue, Dec 10, 2013 at 03:05:46PM -0800, Linus Torvalds wrote:

> Nobody actually uses that argument any more (it goes back to the old
> i386 "let's manually verify that we have write permissions, because
> the CPU doesn't do it for us in the trap handling"), and it should
> probably be removed.

Ah...

> But you're right that it's at least misleading. I'd love to remove it
> entirely, because it's not even syntax-checked, and it's confusing.
> But that would be a humongous patch.
> 
> So these days, "access_ok()" literally just checks that the address is
> in the user address space range. And that would seem to always be
> appropriate for futexes, so why not just do it in the generic code?

So let's turn it into
#define access_ok(type, addr, size) address_ok(addr, size)

and then users can be converted at leisure.  Eventually we'll just remove
the unused macros...

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 23:00                       ` Dave Jones
@ 2013-12-11  0:05                         ` Steven Rostedt
  2013-12-11  0:23                           ` Dave Jones
  0 siblings, 1 reply; 114+ messages in thread
From: Steven Rostedt @ 2013-12-11  0:05 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Thomas Gleixner, Oleg Nesterov,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 06:00:09PM -0500, Dave Jones wrote:
> 
> The only thing I'm still unclear on, is how that pid allegedly wasn't doing
> a futex call as part of its run. The only thing I can think of is that
> the other pid that _did_ do a futex call did it on a page that was MAP_SHARED
> between all the other children, and this 'spin forever' thing only
> happens when the last process with a reference on that page exits ?

Which thread did not do the futex call? The one that was spinning? No, that one
most definitely was, at least according to the stack trace trace you posted:

 trinity-child27-10818 [001] 89790.703547: kernel_stack:         <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)  

It did a futex() system call.

Or are you talking about another thread?

-- Steve


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11  0:05                         ` Steven Rostedt
@ 2013-12-11  0:23                           ` Dave Jones
  2013-12-11  0:55                             ` Dave Jones
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-11  0:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On Tue, Dec 10, 2013 at 07:05:04PM -0500, Steven Rostedt wrote:
 > On Tue, Dec 10, 2013 at 06:00:09PM -0500, Dave Jones wrote:
 > > 
 > > The only thing I'm still unclear on, is how that pid allegedly wasn't doing
 > > a futex call as part of its run. The only thing I can think of is that
 > > the other pid that _did_ do a futex call did it on a page that was MAP_SHARED
 > > between all the other children, and this 'spin forever' thing only
 > > happens when the last process with a reference on that page exits ?
 > 
 > Which thread did not do the futex call? The one that was spinning? No, that one
 > most definitely was, at least according to the stack trace trace you posted:
 > 
 >  trinity-child27-10818 [001] 89790.703547: kernel_stack:         <stack trace>
 > => futex_requeue (ffffffff810df18a)
 > => do_futex (ffffffff810e019e)
 > => SyS_futex (ffffffff810e0de1)
 > => tracesys (ffffffff81760be4)  
 > 
 > It did a futex() system call.
 > 
 > Or are you talking about another thread?

It's the same thread. but here's what it says the last thing it did was..

(gdb) print shm->previous_syscallno[27]
$1 = 288

accept4.

Just to verify I'm looking at the right array member..

(gdb) print shm->pids[27]
$2 = 10818

Oh, hmm. Wait, I'm an idiot.
I only update ->previous when we come back from the syscall.
It's _still_ doing this syscall.  

(gdb) print shm->syscallno[27]
$4 = 202

I was distracted by seeing all the other threads exiting, so I was only looking at
what this one had already done.

ok, mystery solved. derp.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11  0:23                           ` Dave Jones
@ 2013-12-11  0:55                             ` Dave Jones
  2013-12-14 20:17                               ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-11  0:55 UTC (permalink / raw)
  To: Steven Rostedt, Linus Torvalds, Thomas Gleixner, Oleg Nesterov,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 07:23:30PM -0500, Dave Jones wrote:
 
 > I was distracted by seeing all the other threads exiting, so I was only looking at
 > what this one had already done.
 
another thing that distracted me was that /proc/10818/stack was just showing that

[<ffffffffffffffff>] 0xffffffffffffffff

output.

For my own education, what causes that ?
How come it didn't show the same trace I saw when I sysrq-t'd ?

	Dave

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 19:18   ` Thomas Gleixner
  2013-12-10 19:55     ` Linus Torvalds
@ 2013-12-11  1:02     ` Mel Gorman
  1 sibling, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2013-12-11  1:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra

On Tue, Dec 10, 2013 at 08:18:29PM +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
> 
> > Hmm. Looks like the futex code is somehow stuck in a loop, calling
> > get_user_pages_fast().
> > 
> > The futex code itself is apparently so low-overhead that it doesn't
> > show up in your 'perf top' report (which is dominated by all the
> > expensive debug things that get_user_pages_fast() etc ends up doing),
> > but that's the only looping I can see. Perhaps the "goto again" case
> > for transparent huge pages in get_futex_key()? Or the
> 
> Cc'ng more folks on that.
> 

I just saw this before heading to bed and have not read the thread. I'll
read it in the morning but in the meantime the following might ring a bell
for someone elses investigation or someone more familiar with how futexs
work from end to end.

Was NUMA balancing enabled and was this a NUMA machine?

I ask because of these two patches that are currently in flight

  mm: numa: Serialise parallel get_user_page against THP migration mm
  fix TLB flush race between migration, and change_protection_range

There are related patches but these two are the most important for what
I have in mind. The two in combination address a problem whereby a write
from one thread can be lost due to a THP migration but it's specific to
automatic NUMA balancing. If the lost update was for a page containing a
futex then the lost write could confuse waiters. The downside is that this
is a bad fit for the problem description in the first mail. A lost update
might result in processes waiting forever on a value that never changes
but offhand it's less clear why it might result in a loop. Unless of
course there is a combination of events that allows for a busy wait on a
value that will never change due to the lost write.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:48                     ` Linus Torvalds
  2013-12-10 22:58                       ` Darren Hart
  2013-12-10 23:00                       ` Dave Jones
@ 2013-12-11  4:09                       ` Dave Jones
  2013-12-12  4:26                       ` Dave Jones
  3 siblings, 0 replies; 114+ messages in thread
From: Dave Jones @ 2013-12-11  4:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:
 
 > Dave, can you re-create that trinity run and test that patch?

Looks ok so far, but I'll leave it run overnight to be sure.

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:49                 ` Linus Torvalds
  2013-12-10 21:56                   ` Dave Jones
@ 2013-12-11 12:45                   ` Ingo Molnar
  1 sibling, 0 replies; 114+ messages in thread
From: Ingo Molnar @ 2013-12-11 12:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Darren Hart, Oleg Nesterov, Thomas Gleixner,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman, Arnaldo Carvalho de Melo


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > http://www.codemonkey.org.uk/junk/perf.data.xz
> 
>  "Forbidden
> 
>   You don't have permission to access /junk/perf.data.xz on this server."
> 
> also, we'd need the vmlinux file to actually decode the data, I think.

So my point is somewhat moot in light of a fix patch, but the best way 
to export all interesting data is via running 'perf archive' after the 
perf.data got created:

  comet:~/tip> perf archive
  Now please run:

  $ tar xvf perf.data.tar.bz2 -C ~/.debug

  wherever you need to run 'perf report' on.
  comet:~/tip> 

and copying over both the perf.data and the perf.data.tar.bz2.

Arnaldo, I think we should make it even easier and more obvious to 
export/import perf data, via something like:

   perf clean              # cleans out ~/.debug
   perf record ...
   perf export             # creates perf.data.tar.bz2 which includes perf.data as well, not just .debug

and then whoever gets a perf.data.tar.bz2 can do:

   perf import             # does 'tar xjvf perf.data.tar.bz2'
   perf report             # analyze the data as if it was captured on your own box

which extracts it into the local directory. This makes the whole thing 
rather easy and makes the result complete and trustable.

What do you think?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
                                       ` (2 preceding siblings ...)
  2013-12-10 23:24                     ` Al Viro
@ 2013-12-11 16:31                     ` Mel Gorman
  2013-12-11 16:38                       ` Thomas Gleixner
  3 siblings, 1 reply; 114+ messages in thread
From: Mel Gorman @ 2013-12-11 16:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra

On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
> 
> > On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > So it looks like __get_user_pages_fast() fails, and keeps failing.
> > 
> > Hmm.. Is any of the addresses unchecked, perhaps?
> > __get_user_pages_fast() does an access_ok() check, while
> > get_user_pages_fast() does *not* seem to do one.
> > 
> > That looks a bit dangerous. Yeah, users should have checked the
> > address range, but there really is no reason not to do it in
> > get_user_pages_fast().
> > 
> > And it looks like the futex code is actually seriously buggered. It
> > only does the access_ok() check for the non-shared case.
> > 
> > Why?
> 
> The !fshared case is the fast path which does not even reach
> get_user_pages_fast().
> 
> We had this discussion some time ago already, where the access_ok()
> check was missing in the !fshared case or the check was buggered for
> some reason. Need to dig up the gory details.
> 
> And yes, I remember that we do not do an extra check for the fshared
> case, because get_user_pages_fast() should do it for us already. If
> not we are fubared not only in the futex code.
> 
> But there is a subtle detail:
> 
>     err = get_user_pages_fast(address, 1, 1, &page);
> 
> So we ask for write access as the write argument is 1. In case that
> fails we have that fallback path:
> 
>         /*
>          * If write access is not required (eg. FUTEX_WAIT), try
>          * and get read-only access.
>          */
>         if (err == -EFAULT && rw == VERIFY_READ) {
>                 err = get_user_pages_fast(address, 1, 0, &page);
> 
> That's a legitimate use case. And futex_requeue only requests
> VERIFY_READ for the !requeue_pi case.
> 
> Now, if that map is RO, i.e. we took the fallback path then the THP
> one will fail as it has write=1 unconditionally.
> 
>       if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) 
> 

Not that it really matters but the naming and comments around that
particular __get_user_pages_fast call are a little misleading.  The ifdef
CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has
to be taken against THP splits, not because it is dealing exclusively with
THP. The PageTail check applies to either hugetlbfs or THPs and similarly
gup_huge_pmd handles both. The whole path should be very rare for THPs
considering that THPs exist on MAP_PRIVATE anonymous mappings and how many
shared futexes exist backed by such mappings? A RO mapping makes that seem
even more strange because what thread is updating the value the caller is
waiting on? It seems more like that was a shared futex on a hugetlbfs-backed
mappings which might explain why the bug was undiscovered for so long.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11 16:31                     ` Mel Gorman
@ 2013-12-11 16:38                       ` Thomas Gleixner
  2013-12-11 17:57                         ` Mel Gorman
  0 siblings, 1 reply; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-11 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra

On Wed, 11 Dec 2013, Mel Gorman wrote:
> On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
> > Now, if that map is RO, i.e. we took the fallback path then the THP
> > one will fail as it has write=1 unconditionally.
> > 
> >       if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) 
> > 
> 
> Not that it really matters but the naming and comments around that
> particular __get_user_pages_fast call are a little misleading.  The ifdef
> CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has
> to be taken against THP splits, not because it is dealing exclusively with
> THP. The PageTail check applies to either hugetlbfs or THPs and similarly
> gup_huge_pmd handles both. The whole path should be very rare for THPs
> considering that THPs exist on MAP_PRIVATE anonymous mappings and how many
> shared futexes exist backed by such mappings? A RO mapping makes that seem
> even more strange because what thread is updating the value the caller is
> waiting on? It seems more like that was a shared futex on a hugetlbfs-backed
> mappings which might explain why the bug was undiscovered for so long.

This is the fshared path. The MAP_PRIVATE path does not do that dance
at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:33                   ` Linus Torvalds
  2013-12-10 22:38                     ` Darren Hart
  2013-12-10 22:57                     ` Thomas Gleixner
@ 2013-12-11 17:08                     ` Oleg Nesterov
  2013-12-11 17:18                       ` Thomas Gleixner
  2 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-11 17:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Thomas Gleixner, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/10, Linus Torvalds wrote:
>
> I think what happens is:
>  - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
>  - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
>  - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
>
> so what triggers this is likely that Dave now does large-pages, and
> one of them is a read-only mapping.
>
> So I would suggest replacing the second "1" in the
> __get_user_pages_fast() call with a "!ro" instead. So how about this
> second patch instead (the access_ok() move remains).

I know almost nothing about THP, but why we may need write == true in
this case?

IOW,

> -		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> +		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {

can't
		if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) {

work?


I have to admit, I do not understand why we can't avoid this altogether.

__get_page_tail() can find the stable ->first_page, why get_futex_key()
can't ?

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11 17:08                     ` Oleg Nesterov
@ 2013-12-11 17:18                       ` Thomas Gleixner
  2013-12-11 17:56                         ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Thomas Gleixner @ 2013-12-11 17:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Wed, 11 Dec 2013, Oleg Nesterov wrote:
> On 12/10, Linus Torvalds wrote:
> >
> > I think what happens is:
> >  - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
> >  - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
> >  - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
> >
> > so what triggers this is likely that Dave now does large-pages, and
> > one of them is a read-only mapping.
> >
> > So I would suggest replacing the second "1" in the
> > __get_user_pages_fast() call with a "!ro" instead. So how about this
> > second patch instead (the access_ok() move remains).
> 
> I know almost nothing about THP, but why we may need write == true in
> this case?
> 
> IOW,
> 
> > -		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> > +		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
> 
> can't
> 		if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) {
> 
> work?

If the futex call needs to write to that address, we need a writeable
mapping, right? And get_user_pages_fast() will verify that for us.
 
> I have to admit, I do not understand why we can't avoid this altogether.
> 
> __get_page_tail() can find the stable ->first_page, why get_futex_key()
> can't ?

Because it can be split under us.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11 17:18                       ` Thomas Gleixner
@ 2013-12-11 17:56                         ` Oleg Nesterov
  2013-12-11 19:18                           ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-11 17:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/11, Thomas Gleixner wrote:
>
> On Wed, 11 Dec 2013, Oleg Nesterov wrote:
> >
> > I know almost nothing about THP, but why we may need write == true in
> > this case?
> >
> > IOW,
> >
> > > -		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> > > +		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
> >
> > can't
> > 		if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) {
> >
> > work?
>
> If the futex call needs to write to that address, we need a writeable
> mapping, right? And get_user_pages_fast() will verify that for us.

Yes sure. I meant __get_user_pages_fast() which is called to find
page_head.

> > I have to admit, I do not understand why we can't avoid this altogether.
> >
> > __get_page_tail() can find the stable ->first_page, why get_futex_key()
> > can't ?
>
> Because it can be split under us.

I understand, but why we can't rely on compound_lock() like
__get_page_tail() does?

OK, could you please explain why something like the pseudo-code
below can't work?

Oleg.


--- x/mm/swap.c
+++ x/mm/swap.c
@@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page);
  * This function is exported but must not be called by anything other
  * than get_page(). It implements the slow path of get_page().
  */
-bool __get_page_tail(struct page *page)
+struct page *__get_page_tail(struct page *page)
 {
 	/*
 	 * This takes care of get_page() if run on a tail page
@@ -235,7 +235,7 @@ bool __get_page_tail(struct page *page)
 				 */
 				VM_BUG_ON(!PageHead(page_head));
 				__get_page_tail_foll(page, false);
-				return true;
+				return page_head;
 			} else {
 				/*
 				 * __split_huge_page_refcount run
@@ -247,7 +247,7 @@ bool __get_page_tail(struct page *page)
 				 * slab on x86).
 				 */
 				put_page(page_head);
-				return false;
+				return NULL;
 			}
 		}
 
@@ -264,10 +264,12 @@ bool __get_page_tail(struct page *page)
 			got = true;
 		}
 		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
+		if (likely(got))
+			return page_head;
+		else
 			put_page(page_head);
 	}
-	return got;
+	return NULL;
 }
 EXPORT_SYMBOL(__get_page_tail);
 
--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -282,41 +282,11 @@ again:
 	else
 		err = 0;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page_head = page;
-	if (unlikely(PageTail(page))) {
+	page_head = __get_page_tail(page);
+	if (page_head) {
 		put_page(page);
-		/* serialize against __split_huge_page_splitting() */
-		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
-			page_head = compound_head(page);
-			/*
-			 * page_head is valid pointer but we must pin
-			 * it before taking the PG_lock and/or
-			 * PG_compound_lock. The moment we re-enable
-			 * irqs __split_huge_page_splitting() can
-			 * return and the head page can be freed from
-			 * under us. We can't take the PG_lock and/or
-			 * PG_compound_lock on a page that could be
-			 * freed from under us.
-			 */
-			if (page != page_head) {
-				get_page(page_head);
-				put_page(page);
-			}
-			local_irq_enable();
-		} else {
-			local_irq_enable();
-			goto again;
-		}
-	}
-#else
-	page_head = compound_head(page);
-	if (page != page_head) {
-		get_page(page_head);
-		put_page(page);
-	}
-#endif
+	else
+		page_head = page;
 
 	lock_page(page_head);
 


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11 16:38                       ` Thomas Gleixner
@ 2013-12-11 17:57                         ` Mel Gorman
  0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2013-12-11 17:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart,
	Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra

On Wed, Dec 11, 2013 at 05:38:55PM +0100, Thomas Gleixner wrote:
> On Wed, 11 Dec 2013, Mel Gorman wrote:
> > On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
> > > Now, if that map is RO, i.e. we took the fallback path then the THP
> > > one will fail as it has write=1 unconditionally.
> > > 
> > >       if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) 
> > > 
> > 
> > Not that it really matters but the naming and comments around that
> > particular __get_user_pages_fast call are a little misleading.  The ifdef
> > CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has
> > to be taken against THP splits, not because it is dealing exclusively with
> > THP. The PageTail check applies to either hugetlbfs or THPs and similarly
> > gup_huge_pmd handles both. The whole path should be very rare for THPs
> > considering that THPs exist on MAP_PRIVATE anonymous mappings and how many
> > shared futexes exist backed by such mappings? A RO mapping makes that seem
> > even more strange because what thread is updating the value the caller is
> > waiting on? It seems more like that was a shared futex on a hugetlbfs-backed
> > mappings which might explain why the bug was undiscovered for so long.
> 
> This is the fshared path. The MAP_PRIVATE path does not do that dance
> at all.
> 

do_futex takes an op parameter and sets FLAGS_SHARED on that op parameter,
not whether the VMA backing that address was created with the MAP_PRIVATE
or MAP_SHARED flag. For example;

do_futex(addr, op) where !(ops & FUTEX_PRIVATE_FLAG) and cmd == FUTEX_WAIT
  -> futex_wait
    -> futex_wait_setup
      -> get_futex_key(fshared == true)

THP pages encountered in the fshared path should be rare because why create
a shared futex on a private mapping? That does not make it impossible
which is why splits are guarded against. If it was genuinely impossible
to be in this path for MAP_PRIVATE then there is no point worrying about
THP as it is currently supported at least. Overall, it still seems more
likely this was a hugetlbfs page.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-11 17:56                         ` Oleg Nesterov
@ 2013-12-11 19:18                           ` Oleg Nesterov
  2013-12-13 15:10                             ` Andrea Arcangeli
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-11 19:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/11, Oleg Nesterov wrote:
>
> On 12/11, Thomas Gleixner wrote:
> >
> > On Wed, 11 Dec 2013, Oleg Nesterov wrote:
> > >
> > > I have to admit, I do not understand why we can't avoid this altogether.
> > >
> > > __get_page_tail() can find the stable ->first_page, why get_futex_key()
> > > can't ?
> >
> > Because it can be split under us.
>
> I understand, but why we can't rely on compound_lock() like
> __get_page_tail() does?
>
> OK, could you please explain why something like the pseudo-code
> below can't work?

Seriously. Can the path below work?

With this change get_futex_key() can simply do

	page_head = get_compound_head(page);
	if (page_head)
		put_page(page);
	else
		page_head = page;

instead of CONFIG_TRANSPARENT_HUGEPAGE/else code.

(and we can also remove "bool get_page_head" from __get_page_tail_foll).

I am sure I missed something. But I am curious and I'd like to understand
what I have missed.

Thanks,

Oleg.

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page);
  * This function is exported but must not be called by anything other
  * than get_page(). It implements the slow path of get_page().
  */
-bool __get_page_tail(struct page *page)
+static struct page *__get_compound_head(struct page *page, bool and_tail)
 {
 	/*
 	 * This takes care of get_page() if run on a tail page
@@ -234,8 +234,9 @@ bool __get_page_tail(struct page *page)
 				 * cannot race here.
 				 */
 				VM_BUG_ON(!PageHead(page_head));
-				__get_page_tail_foll(page, false);
-				return true;
+				if (and_tail)
+					get_huge_page_tail(page);
+				return page_head;
 			} else {
 				/*
 				 * __split_huge_page_refcount run
@@ -260,17 +261,34 @@ bool __get_page_tail(struct page *page)
 		flags = compound_lock_irqsave(page_head);
 		/* here __split_huge_page_refcount won't run anymore */
 		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
+			if (and_tail)
+				get_huge_page_tail(page);
 			got = true;
 		}
 		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
+
+		if (likely(got))
+			return page_head;
+		else
 			put_page(page_head);
 	}
-	return got;
+	return NULL;
+}
+
+bool __get_page_tail(struct page *page)
+{
+	return !!__get_compound_head(page, true);
 }
 EXPORT_SYMBOL(__get_page_tail);
 
+struct page *get_compound_head(struct page *page)
+{
+	if (PageTail(page))
+		return __get_compound_head(page, false);
+	else
+		return NULL;
+}
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 22:48                     ` Linus Torvalds
                                         ` (2 preceding siblings ...)
  2013-12-11  4:09                       ` Dave Jones
@ 2013-12-12  4:26                       ` Dave Jones
  2013-12-12  5:29                         ` Darren Hart
  3 siblings, 1 reply; 114+ messages in thread
From: Dave Jones @ 2013-12-12  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:
 
 > Dave, can you re-create that trinity run and test that patch? I think
 > we've got this

24 hours later, all is well.  I think we can call this one done.

Tested-by: Dave Jones <davej@fedoraproject.org>

	Dave


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-12  4:26                       ` Dave Jones
@ 2013-12-12  5:29                         ` Darren Hart
  0 siblings, 0 replies; 114+ messages in thread
From: Darren Hart @ 2013-12-12  5:29 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Andrea Arcangeli,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Wed, 2013-12-11 at 23:26 -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:
>  
>  > Dave, can you re-create that trinity run and test that patch? I think
>  > we've got this
> 
> 24 hours later, all is well.  I think we can call this one done.
> 
> Tested-by: Dave Jones <davej@fedoraproject.org>

Thank you again for a fine preemptive bug catch Dave!

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-10 21:57               ` Linus Torvalds
                                   ` (2 preceding siblings ...)
  2013-12-10 22:19                 ` Linus Torvalds
@ 2013-12-12 19:00                 ` Andrea Arcangeli
  2013-12-12 19:03                   ` Linus Torvalds
  3 siblings, 1 reply; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-12 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

Hello,

On Tue, Dec 10, 2013 at 01:57:49PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > http://codemonkey.org.uk/junk/trace
> 
> Hmm. Ok, so something is calling [__]get_user_pages_fast() and
> put_page() in a loop, but the trace doesn't show what that "something"
> is, because it is itself not ever called.
> 
> However, that pattern does seem to imply that the loop is in
> get_futex_key(), because all the other loops I see seem to be calling
> other things as well.
> 
> And the __get_user_pages_fast() call implies that it's the THP case
> that triggers the "unlikely(PageTail(page))" case. And anyway,
> otherwise we'd see lock_page()/unlock_page() too.
> 
> So it looks like __get_user_pages_fast() fails, and keeps failing.
> Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp:
> update futex compound knowledge") to be exact.

I can only agree that the __get_user_pages_fast "write" parameter
should have been !ro instead of 1.

However it wasn't me introducing the bug, my code when patched in
early 2011 would work fine. The bug was introduced half a year later
in commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae .

@@ -262,8 +264,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)
 
 again:
        err = get_user_pages_fast(address, 1, 1, &page);
+       /*
+        * If write access is not required (eg. FUTEX_WAIT), try
+        * and get read-only access.
+        */
+       if (err == -EFAULT && rw == VERIFY_READ) {
+               err = get_user_pages_fast(address, 1, 0, &page);
+               ro = 1;
+       }
        if (err < 0)
                return err;
+       else
+               err = 0;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        page_head = page;
@@ -305,6 +317,13 @@ again:
        if (!page_head->mapping) {
                unlock_page(page_head);
                put_page(page_head);
+               /*
+               * ZERO_PAGE pages don't have a mapping. Avoid a busy
        loop
+               * trying to find one. RW mapping would have COW'd (and
        thus
+               * have a mapping) so this page is RO and won't ever
        change.
+               */
+               if ((page_head == ZERO_PAGE(address)))
+                       return -EFAULT;
                goto again;
        }

[..]

This commit didn't update the __get_user_pages_fast to use !ro instead
of 1, as it would have been required after the above change.

I'll now review Oleg proposed cleanup.

Thanks for tracking this down!

Andrea

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-12 19:00                 ` Andrea Arcangeli
@ 2013-12-12 19:03                   ` Linus Torvalds
  0 siblings, 0 replies; 114+ messages in thread
From: Linus Torvalds @ 2013-12-12 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Thu, Dec 12, 2013 at 11:00 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> However it wasn't me introducing the bug, my code when patched in
> early 2011 would work fine. The bug was introduced half a year later
> in commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae .

I'd argue that half a year later the bug was *fixed*, but it was only
fixed for the regular pages. Leaving largepages buggy.

            Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-11 19:18                           ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov
@ 2013-12-13 15:10                             ` Andrea Arcangeli
  2013-12-13 16:22                               ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-13 15:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote:
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page);
>   * This function is exported but must not be called by anything other
>   * than get_page(). It implements the slow path of get_page().
>   */
> -bool __get_page_tail(struct page *page)
> +static struct page *__get_compound_head(struct page *page, bool and_tail)
>  {

This is not necessarily inline (unlike __get_page_tail_foll which is
inline and optimizes away the build time known "false"/"true" params).

>  	/*
>  	 * This takes care of get_page() if run on a tail page
> @@ -234,8 +234,9 @@ bool __get_page_tail(struct page *page)
>  				 * cannot race here.
>  				 */
>  				VM_BUG_ON(!PageHead(page_head));
> -				__get_page_tail_foll(page, false);
> -				return true;
> +				if (and_tail)
> +					get_huge_page_tail(page);

So you may be introducing a real branch here unless gcc decides to
self-inline the static func.

> +				return page_head;
>  			} else {
>  				/*
>  				 * __split_huge_page_refcount run
> @@ -260,17 +261,34 @@ bool __get_page_tail(struct page *page)
>  		flags = compound_lock_irqsave(page_head);
>  		/* here __split_huge_page_refcount won't run anymore */
>  		if (likely(PageTail(page))) {
> -			__get_page_tail_foll(page, false);
> +			if (and_tail)
> +				get_huge_page_tail(page);

Same here.

> +bool __get_page_tail(struct page *page)
> +{
> +	return !!__get_compound_head(page, true);
>  }

I hope gcc would optimize away the !! overhead vs the old got =
true/false, once __get_compound_head is inline.

So I think you should add inline to __get_compound_head, there's no
benefit to keep the get_compound_head in the CPU icache.

get_huge_page_tail checks different invariants in the VM_BUG_ON and is
only used by gup.c not sure why to call that here.

But __get_page_tail has changed in -mm so the above will reject, in
-mm you will find __get_page_tail_foll called with "true" parameter
too in __get_page_tail for example (not equivalent to
get_huge_page_tail even ignoring the invariant checking in VM_BUG_ON)
and it defines a fast path for hugetlbfs and slab (faster than the one
upstream).

But here we have a "tail page" that has reference on both head and
tail, and you're taking one more reference on the head, just to
release it later by releasing the tail.

Your objective is to do:
	 
	page_head = get_compound_head(page);
	if (page_head)
		put_page(page);
	else
		page_head = page;

If we've to mess with the compound lock for this, considering
get_compound_head would have only this user anyway, we could do:

     page_head = put_compound_tail(page);
     if (!page_head)
     	page_head = page;

Implemented as:

put_compound_tail(page) {
	    page_head = compound_trans_head(page);
	    if (!__compound_tail_refcounted(page_head)) {
	       smp_rmb();
	       if (PageTail(page)) {
			VM_BUG_ON(!PageHead(page_head));
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			VM_BUG_ON(page_mapcount(page) != 0);
			return page_head;
		} else
		  return NULL;

	    }
	    flags = compound_lock_irqsave(page_head);
	    if (!PageTail(page)) {
	       unlock
	       return NULL
	    }
            VM_BUG_ON(page_head != page->first_page);
	    /* __split_huge_page_refcount will wait now */
	    VM_BUG_ON(page_mapcount(page) <= 0);
	    atomic_dec(&page->_mapcount);
	    VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
	    VM_BUG_ON(atomic_read(&page->_count) != 0);
	    compound_unlock_irqrestore(page_head, flags);
	    return page_head;
}

Considering this is going incremental with -mm, and that the -mm
previous patches the above would depend on are not upstream yet, I'd
suggest to fix the bug in futex with Linus patch now (s/1/!ro/).

The strict obviously safe fix Linus posted, is not going to interfere
with these optimization later, so there's no downside to apply it now
and defer the optimizations to the -mm queue.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-13 15:10                             ` Andrea Arcangeli
@ 2013-12-13 16:22                               ` Oleg Nesterov
  2013-12-13 17:34                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-13 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

Andrea. Thanks a lot for the detailed reply.

On 12/13, Andrea Arcangeli wrote:
>
> On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote:
>
> get_huge_page_tail checks different invariants in the VM_BUG_ON and is
> only used by gup.c not sure why to call that here.

Compared to __get_page_tail_foll(get_page_head => F) get_huge_page_tail()
lacks  the ->first_page->_count, but it is called right after
get_page_unless_zero(page_head) so to me get_huge_page_tail() looks a bit
better.

Personally, I think that even this change

	--- x/kernel/events/../../mm/internal.h
	+++ x/kernel/events/../../mm/internal.h
	@@ -47,11 +47,9 @@ static inline void __get_page_tail_foll(
		 * page_cache_get_speculative()) on tail pages.
		 */
		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
	-	VM_BUG_ON(atomic_read(&page->_count) != 0);
	-	VM_BUG_ON(page_mapcount(page) < 0);
		if (get_page_head)
			atomic_inc(&page->first_page->_count);
	-	atomic_inc(&page->_mapcount);
	+	get_huge_page_tail(page);
	 }
	 
	 /*

makes sense. But this is minor and subjective, I am not going to argue.

And I agree with other comments, the patch I sent only tried to explain
what I meant.

> If we've to mess with the compound lock for this, considering
> get_compound_head would have only this user anyway, we could do:
>
>      page_head = put_compound_tail(page);
>      if (!page_head)
>      	page_head = page;
>
> Implemented as:

OK, this looks better, I agree.

I'll try to make v2 based on -mm and your suggestions.

> Considering this is going incremental with -mm, and that the -mm
> previous patches the above would depend on are not upstream yet, I'd
> suggest to fix the bug in futex with Linus patch now (s/1/!ro/).

Yes, yes, sure. I didn't mean that this (potential) cleanup can replace
the simple fix from Linus.

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-13 16:22                               ` Oleg Nesterov
@ 2013-12-13 17:34                                 ` Andrea Arcangeli
  2013-12-16 18:36                                   ` Oleg Nesterov
  2013-12-16 20:19                                   ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
  0 siblings, 2 replies; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-13 17:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote:
> Andrea. Thanks a lot for the detailed reply.
> 
> On 12/13, Andrea Arcangeli wrote:
> >
> > On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote:
> >
> > get_huge_page_tail checks different invariants in the VM_BUG_ON and is
> > only used by gup.c not sure why to call that here.
> 
> Compared to __get_page_tail_foll(get_page_head => F) get_huge_page_tail()
> lacks  the ->first_page->_count, but it is called right after
> get_page_unless_zero(page_head) so to me get_huge_page_tail() looks a bit
> better.
> 
> Personally, I think that even this change
> 
> 	--- x/kernel/events/../../mm/internal.h
> 	+++ x/kernel/events/../../mm/internal.h
> 	@@ -47,11 +47,9 @@ static inline void __get_page_tail_foll(
> 		 * page_cache_get_speculative()) on tail pages.
> 		 */
> 		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> 	-	VM_BUG_ON(atomic_read(&page->_count) != 0);
> 	-	VM_BUG_ON(page_mapcount(page) < 0);
> 		if (get_page_head)
> 			atomic_inc(&page->first_page->_count);
> 	-	atomic_inc(&page->_mapcount);
> 	+	get_huge_page_tail(page);
> 	 }
> 	 
> 	 /*
> 
> makes sense. But this is minor and subjective, I am not going to argue.

The above diff looks a straightforward cleanup you could submit it as
a separate patch in a v2 series. This more clearly shows the
difference between the two functions, so there is less overlap
too.

Feel free to change it further if you want, but the current model was
to have get_huge_page_tail only for arch/*/mm/gup.c (in mm.h) and
__get_page_tail_foll in internal.h for mm/*.c and with the ability to
obtain the head page too (needed in some of the mm/*.c cases, and
never needed for arch/*/mm/gup.c).

> I'll try to make v2 based on -mm and your suggestions.

Ok great!

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: process 'stuck' at exit.
  2013-12-11  0:55                             ` Dave Jones
@ 2013-12-14 20:17                               ` Oleg Nesterov
  0 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-14 20:17 UTC (permalink / raw)
  To: Dave Jones, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List,
	Peter Zijlstra, Mel Gorman

On 12/10, Dave Jones wrote:
>
> On Tue, Dec 10, 2013 at 07:23:30PM -0500, Dave Jones wrote:
>
>  > I was distracted by seeing all the other threads exiting, so I was only looking at
>  > what this one had already done.
>
> another thing that distracted me was that /proc/10818/stack was just showing that
>
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> output.
>
> For my own education, what causes that ?

save_stack_trace_tsk() adds ULONG_MAX as the "last" entry.

and dump_trace() fails if task is running and != current (note that
cat /proc/self/stack works).

> How come it didn't show the same trace I saw when I sysrq-t'd ?

Because print_trace_address() does not skip !reliable entries,
unlike __save_stack_address(). This (afaics) makes the difference.

I'll try to make a patch but I am not sure... I am not even sure
it makes sense, but in any case this all doesn't look right to me.

First of all, stack = task->thread.sp is not really right if this
task is running. Worse, bp = *stack returned by stack_frame() is
random in this case. This equally applies to sysrq-t's output.
Not that bad, but still wrong and confusing, imho.

And lets look at dump_trace(),

	const unsigned cpu = get_cpu();
	unsigned long *irq_stack_end =
		(unsigned long *)per_cpu(irq_stack_ptr, cpu);

This (in general) has nothing to do with task_cpu(task).

And why dump_trace() checks irq_stack_end != NULL ? This is always true.

I think that these paths should not even try to guess what bp is
if the task is not running/current. But it is not clear to "disable"
reliable check in __save_stack_address(), we should report this fact
in proc_pid_stack()->seq_printf() somehow.

And proc_pid_stack() should drop ->cred_guard_mutex right after
save_stack_trace_tsk(), although this is off-topic.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-13 17:34                                 ` Andrea Arcangeli
@ 2013-12-16 18:36                                   ` Oleg Nesterov
  2013-12-16 20:19                                     ` Andrea Arcangeli
  2013-12-16 20:19                                   ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
  1 sibling, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-16 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/13, Andrea Arcangeli wrote:
>
> On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote:
> >
> > I'll try to make v2 based on -mm and your suggestions.
>
> Ok great!

Yes, it would be great, but I need your help again ;)


Let me quote the pseudo-code you sent me:

	put_compound_tail(page) {
		page_head = compound_trans_head(page);
		if (!__compound_tail_refcounted(page_head)) {
			...
			return ...;
		}

		flags = compound_lock_irqsave(page_head);
		...

Sure, put_compound_tail() should be the simplified version of
put_compound_page() which doesn't dec page_head->_count, this is clear.

But afaics, compound_lock_irqsave() above looks unsafe without
get_page_unless_zero(page_head) ? If we race with _split, page_head
can be freed and compound_lock() can race with, say, free_pages_check()
which plays with page->flags ?

So it seems that put_compound_tail() should also do get/put(head) like
put_compound_page() does, and this probably means we should factor out
the common code somehow.

Or I missed something?



OTOH, I can't really understand

	if (likely(page != page_head && get_page_unless_zero(page_head)))

in __get_page_tail() and put_compound_page().

First of all, is it really possible that page == compound_trans_head(page)?
We already verified that PG_tail was set. Of course this bit can be cleared
and ->first_page can be a dangling pointer but it can never be changed to
point to this page? (in fact, afaics it can be changed at all as long as we
have a reference, but this doesn't matter).



And compound_lock_irqsave() looks racy even after get_page_unless_zero().

For example, suppose that page_head was already freed and then re-allocated
as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
race with the non-atomic prep_compound_page()->__SetPageHead() ?



Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
confusing imho. I think this is correct, we already checked PageAnon() so it
can't be a thp page. But probably this needs a comment and __basepage_index()
should do BUG_ON(!PageHuge()).

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups
  2013-12-13 17:34                                 ` Andrea Arcangeli
  2013-12-16 18:36                                   ` Oleg Nesterov
@ 2013-12-16 20:19                                   ` Oleg Nesterov
  2013-12-16 20:19                                     ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov
                                                       ` (2 more replies)
  1 sibling, 3 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-16 20:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On top of

	mm-tail-page-refcounting-optimization-for-slab-and-hugetlbfs.patch

should not be applied without the ack from Andrea.

On 12/13, Andrea Arcangeli wrote:
>
> The above diff looks a straightforward cleanup you could submit it as
> a separate patch in a v2 series.

OK, let me send this separately, because (afaics) put_compound_tail()
needs more thinking.

See also 2/2. Again, I won't argue if you dislike this change even if
it is correct, so please review and ack/nack. To me compound_head() in
get_huge_page_tail() looks confusing, as if get_huge_page_tail() can
accept a !PageTail page. But perhaps this is only because I am new to
this code.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail()
  2013-12-16 20:19                                   ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
@ 2013-12-16 20:19                                     ` Oleg Nesterov
  2013-12-16 20:19                                     ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov
  2013-12-16 20:27                                     ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli
  2 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-16 20:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail()
to avoid the code duplication.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/internal.h |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a85a3ab..a346ba1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,12 +47,9 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * page_cache_get_speculative()) on tail pages.
 	 */
 	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	VM_BUG_ON(page_mapcount(page) < 0);
 	if (get_page_head)
 		atomic_inc(&page->first_page->_count);
-	if (compound_tail_refcounted(page->first_page))
-		atomic_inc(&page->_mapcount);
+	get_huge_page_tail(page);
 }
 
 /*
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail()
  2013-12-16 20:19                                   ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
  2013-12-16 20:19                                     ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov
@ 2013-12-16 20:19                                     ` Oleg Nesterov
  2013-12-16 20:27                                     ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli
  2 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-16 20:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

get_huge_page_tail()->compound_head() looks confusing. Every caller
must check PageTail(page), otherwise atomic_inc(&page->_mapcount) is
simply wrong if this page is compound-trans-head.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/mm.h |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 13bae9e..13495bd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -431,13 +431,12 @@ static inline bool compound_tail_refcounted(struct page *page)
 static inline void get_huge_page_tail(struct page *page)
 {
 	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 * In turn no need of compound_trans_head here.
+	 * __split_huge_page_refcount() cannot run from under us.
 	 */
+	VM_BUG_ON(!PageTail(page));
 	VM_BUG_ON(page_mapcount(page) < 0);
 	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	if (compound_tail_refcounted(compound_head(page)))
+	if (compound_tail_refcounted(page->first_page))
 		atomic_inc(&page->_mapcount);
 }
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-16 18:36                                   ` Oleg Nesterov
@ 2013-12-16 20:19                                     ` Andrea Arcangeli
  2013-12-16 20:46                                       ` Oleg Nesterov
  2013-12-19 19:08                                       ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov
  0 siblings, 2 replies; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-16 20:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> On 12/13, Andrea Arcangeli wrote:
> >
> > On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote:
> > >
> > > I'll try to make v2 based on -mm and your suggestions.
> >
> > Ok great!
> 
> Yes, it would be great, but I need your help again ;)
> 
> 
> Let me quote the pseudo-code you sent me:
> 
> 	put_compound_tail(page) {
> 		page_head = compound_trans_head(page);
> 		if (!__compound_tail_refcounted(page_head)) {
> 			...
> 			return ...;
> 		}
> 
> 		flags = compound_lock_irqsave(page_head);
> 		...
> 
> Sure, put_compound_tail() should be the simplified version of
> put_compound_page() which doesn't dec page_head->_count, this is clear.
> 
> But afaics, compound_lock_irqsave() above looks unsafe without
> get_page_unless_zero(page_head) ? If we race with _split, page_head
> can be freed and compound_lock() can race with, say, free_pages_check()
> which plays with page->flags ?
> 
> So it seems that put_compound_tail() should also do get/put(head) like
> put_compound_page() does, and this probably means we should factor out
> the common code somehow.
> 

Yes it was supposed to do the get_page_unless_zero before the
compound_lock.

> Or I missed something?
> 
> 
> 
> OTOH, I can't really understand
> 
> 	if (likely(page != page_head && get_page_unless_zero(page_head)))
> 
> in __get_page_tail() and put_compound_page().
> 
> First of all, is it really possible that page == compound_trans_head(page)?

It is possible if it become a regular page because split_huge_page
run in between.

compound_trans_head guarantees us it got us to a safe head page. And
the idea was to do get_page_unless_zero only on head pages and never
on tails to be safer/simpler. Same goes for the compound_lock that
follows still on the page_head where "page != page_head".

If you like, you can try to optimize away the check and reuse the
PageTail branch inside the compound_lock but I'm not sure if it's
worth it.

> We already verified that PG_tail was set. Of course this bit can be cleared
> and ->first_page can be a dangling pointer but it can never be changed to
> point to this page? (in fact, afaics it can be changed at all as long as we
> have a reference, but this doesn't matter).

If PG_tail gets cleared compound_trans_head returns "page".

> And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> 
> For example, suppose that page_head was already freed and then re-allocated
> as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> race with the non-atomic prep_compound_page()->__SetPageHead() ?

Yes. We need to change to:

	if (order && (gfp_flags & __GFP_COMP))
		prep_compound_page(page, order);
	smp_wmb();
	/* as the compound_lock can be taken after it's refcounted */
	set_page_refcounted(page);

__SetPageHead uses bts asm insn so literally only a "lock" prefix is
missing in a assembly instruction. So the race window is incredibly
small, but it must be fixed indeed. This also puts set_page_refcounted
as the last action of buffered_rmqueue so there shouldn't be any other
issues like this left in the page allocation code.

Can you reorder set_page_refcount in your v2?

I wonder if arch_alloc_page needs refcount 1, it sets the page as
stable on s390. the other way around is to move prep_compound_page
before set_page_refcounted (though I think if we can, keeping the
refcounted at the very last with a comment is preferable).

> Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
> confusing imho. I think this is correct, we already checked PageAnon() so it
> can't be a thp page. But probably this needs a comment and __basepage_index()
> should do BUG_ON(!PageHuge()).

This looks a bug if you apply the patches to add THP in pagecache, and
BUG_ON(!PageHuge) would also break THP in pagecache later (though
safer than silently broken like now).

It'd safer to read the basepage_index in a local variable just before
doing the put_compund_tail but I agree it's not going to make a
difference right now.

But the whole idea of working with the head of the THP despite the
address points to a tail, looks flakey if there is a split (THP in
pagecache). I mean chances are reading basepage_index(page) before
releasing the tail pin is not enough.

The Anon path looks sane for all cases, as it doesn't pretend to
return any information on the actual mapping and it limits itself to
do the PageAnon check on the actual head that has PageAnon set, which
is still valid thing to do even after a split, and in fact required if
a split didn't take place yet as the tail "page" isn't anon but just a
tail.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups
  2013-12-16 20:19                                   ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
  2013-12-16 20:19                                     ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov
  2013-12-16 20:19                                     ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov
@ 2013-12-16 20:27                                     ` Andrea Arcangeli
  2 siblings, 0 replies; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-16 20:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Darren Hart, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman

On Mon, Dec 16, 2013 at 09:19:00PM +0100, Oleg Nesterov wrote:
> On top of
> 
> 	mm-tail-page-refcounting-optimization-for-slab-and-hugetlbfs.patch
> 
> should not be applied without the ack from Andrea.
> 
> On 12/13, Andrea Arcangeli wrote:
> >
> > The above diff looks a straightforward cleanup you could submit it as
> > a separate patch in a v2 series.
> 
> OK, let me send this separately, because (afaics) put_compound_tail()
> needs more thinking.
> 
> See also 2/2. Again, I won't argue if you dislike this change even if
> it is correct, so please review and ack/nack. To me compound_head() in
> get_huge_page_tail() looks confusing, as if get_huge_page_tail() can
> accept a !PageTail page. But perhaps this is only because I am new to
> this code.

compound_head in get_huge_page_tail was just a more readable version
of page->first_page in __get_page_tail_foll. But page->first_page is
faster so it's better.

I reviewed all callers and there's no risk of the VM_BUG_ON triggering
but I prefer it too.

Both patches Acked.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-16 20:19                                     ` Andrea Arcangeli
@ 2013-12-16 20:46                                       ` Oleg Nesterov
  2013-12-17 16:53                                         ` Oleg Nesterov
  2013-12-19 19:08                                       ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov
  1 sibling, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-16 20:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/16, Andrea Arcangeli wrote:
>
> On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> >
> > So it seems that put_compound_tail() should also do get/put(head) like
> > put_compound_page() does, and this probably means we should factor out
> > the common code somehow.
> >
>
> Yes it was supposed to do the get_page_unless_zero before the
> compound_lock.

OK,

> > OTOH, I can't really understand
> >
> > 	if (likely(page != page_head && get_page_unless_zero(page_head)))
> >
> > in __get_page_tail() and put_compound_page().
> >
> > First of all, is it really possible that page == compound_trans_head(page)?
>
> ...
>
> If PG_tail gets cleared compound_trans_head returns "page".

Damn, indeed, thanks.

> > And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> >
> > For example, suppose that page_head was already freed and then re-allocated
> > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> > race with the non-atomic prep_compound_page()->__SetPageHead() ?
>
> Yes. We need to change to:
>
> 	if (order && (gfp_flags & __GFP_COMP))
> 		prep_compound_page(page, order);
> 	smp_wmb();
> 	/* as the compound_lock can be taken after it's refcounted */
> 	set_page_refcounted(page);
>
> __SetPageHead uses bts asm insn so literally only a "lock" prefix is
> missing in a assembly instruction. So the race window is incredibly
> small, but it must be fixed indeed. This also puts set_page_refcounted
> as the last action of buffered_rmqueue so there shouldn't be any other
> issues like this left in the page allocation code.
>
> Can you reorder set_page_refcount in your v2?

OK. I'll try to make something on Wednesday.


> > Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
> > confusing imho. I think this is correct, we already checked PageAnon() so it
> > can't be a thp page. But probably this needs a comment and __basepage_index()
> > should do BUG_ON(!PageHuge()).
>
> This looks a bug if you apply the patches to add THP in pagecache, and
> BUG_ON(!PageHuge) would also break THP in pagecache later (though
> safer than silently broken like now).
>
> It'd safer to read the basepage_index in a local variable just before
> doing the put_compund_tail but I agree it's not going to make a
> difference right now.

Yes, so lets discuss (and perhaps change) this later/separately.

Andrea, thanks again for your help.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-16 20:46                                       ` Oleg Nesterov
@ 2013-12-17 16:53                                         ` Oleg Nesterov
  2013-12-17 18:06                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-17 16:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/16, Oleg Nesterov wrote:
>
> On 12/16, Andrea Arcangeli wrote:
> >
> > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> > >
> > > And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> > >
> > > For example, suppose that page_head was already freed and then re-allocated
> > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> > > race with the non-atomic prep_compound_page()->__SetPageHead() ?
> >
> > Yes. We need to change to:
> >
> > 	if (order && (gfp_flags & __GFP_COMP))
> > 		prep_compound_page(page, order);
> > 	smp_wmb();
> > 	/* as the compound_lock can be taken after it's refcounted */
> > 	set_page_refcounted(page);
> >
> > __SetPageHead uses bts asm insn so literally only a "lock" prefix is
> > missing in a assembly instruction. So the race window is incredibly
> > small, but it must be fixed indeed. This also puts set_page_refcounted
> > as the last action of buffered_rmqueue so there shouldn't be any other
> > issues like this left in the page allocation code.
> >
> > Can you reorder set_page_refcount in your v2?
>
> OK. I'll try to make something on Wednesday.

Yes, I will, but...

I can't stop thinking about another change. What if we simply change
__split_huge_page_refcount() to also do compound_lock/unlock(page_tail)
in a main loop?

This way we can greatly simplify get/put_page paths, we can rely on
compound_lock(sub-page) and avoid get_page_unless_zero(page_head).
Yes, this will make _split a bit slower, but PG_compound_lock should
not be contended? And we should change page_tail->flags carefully, but
this looks simple.

Or this is not possible/desirable?

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
  2013-12-17 16:53                                         ` Oleg Nesterov
@ 2013-12-17 18:06                                           ` Andrea Arcangeli
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-17 18:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On Tue, Dec 17, 2013 at 05:53:35PM +0100, Oleg Nesterov wrote:
> On 12/16, Oleg Nesterov wrote:
> >
> > On 12/16, Andrea Arcangeli wrote:
> > >
> > > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> > > >
> > > > And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> > > >
> > > > For example, suppose that page_head was already freed and then re-allocated
> > > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> > > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> > > > race with the non-atomic prep_compound_page()->__SetPageHead() ?
> > >
> > > Yes. We need to change to:
> > >
> > > 	if (order && (gfp_flags & __GFP_COMP))
> > > 		prep_compound_page(page, order);
> > > 	smp_wmb();
> > > 	/* as the compound_lock can be taken after it's refcounted */
> > > 	set_page_refcounted(page);
> > >
> > > __SetPageHead uses bts asm insn so literally only a "lock" prefix is
> > > missing in a assembly instruction. So the race window is incredibly
> > > small, but it must be fixed indeed. This also puts set_page_refcounted
> > > as the last action of buffered_rmqueue so there shouldn't be any other
> > > issues like this left in the page allocation code.
> > >
> > > Can you reorder set_page_refcount in your v2?
> >
> > OK. I'll try to make something on Wednesday.
> 
> Yes, I will, but...
> 
> I can't stop thinking about another change. What if we simply change
> __split_huge_page_refcount() to also do compound_lock/unlock(page_tail)
> in a main loop?
> 
> This way we can greatly simplify get/put_page paths, we can rely on
> compound_lock(sub-page) and avoid get_page_unless_zero(page_head).
> Yes, this will make _split a bit slower, but PG_compound_lock should
> not be contended? And we should change page_tail->flags carefully, but
> this looks simple.
> 
> Or this is not possible/desirable?

That would be 512 nested spinlocks instead of 1, last time I did
something like that in mm_take_all_locks people weren't too pleased as
it started firing lockdeps complains too. Generally I try to avoid
taking too many locks nested if I can.

mm_take_all_locks is fine because it only runs when you register an mm
into a device driver, so it is a very rare event and not performance
critical at all, it is a slow path by all means (only runs when you
start a virtual machine or start X with nvidia etc..). So it is not a
concern. split_huge_page to the contrary could run in a flood if
you're unlucky. split_huge_page is needed not just to handle non-THP
aware paths that mostly disappeared by now, but also when you truncate
a vma so that a THP doesn't fit in it anymore. So it's up to userland
how frequently it needs to run.

I think it's reasonable to consider it though, but then it's not
guaranteed that a put_page on a THP tail is more frequent than
split_huge_page. Keep in mind we do the get_page_unless_zero to
stabilize the head to take the compound_lock on it, only for the
tails, never for the heads. So this restricts it to _only_ the
put_page following a gup_fast. Only gup_fast can ever take a reference
on the tail pages of a THP. Nothing else can.

I intend to add the foll_flags to gup_fast parameter so we remove
FOLL_GET from it in the KVM page fault to avoid doing atomic_inc
immediately followed by atomic_dec when establishing sptes. So the
only transient mappings that cannot be converted to pin-less
mmu_notifier will have to run put_page in gup_fast.

In short you're only going to help O_DIRECT with that change, and
removing 1 locked op for every 4k written to disk may not offset the
cost of 511 locked ops in split_huge_page.

Still worth thinking about it, but not obvious win in my view (plus
the lockdep trouble with taking too many locks nested). Comments welcome.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH -mm 0/7] (Was: introduce get_compound_page)
  2013-12-17 18:06                                           ` Andrea Arcangeli
@ 2013-12-18 19:19                                             ` Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov
                                                                 ` (6 more replies)
  0 siblings, 7 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman

On 12/17, Andrea Arcangeli wrote:
>
> On Tue, Dec 17, 2013 at 05:53:35PM +0100, Oleg Nesterov wrote:
> >
> > I can't stop thinking about another change. What if we simply change
> > __split_huge_page_refcount() to also do compound_lock/unlock(page_tail)
> > in a main loop?
> >
> > This way we can greatly simplify get/put_page paths, we can rely on
> > compound_lock(sub-page) and avoid get_page_unless_zero(page_head).
> > Yes, this will make _split a bit slower, but PG_compound_lock should
> > not be contended? And we should change page_tail->flags carefully, but
> > this looks simple.
> >
> > Or this is not possible/desirable?
>
> That would be 512 nested spinlocks instead of 1,

Yes. But, just in case, we do not need to take them all at once. We
only need to take/release the additional lock per page_tail in a loop.

> last time I did
> something like that in mm_take_all_locks people weren't too pleased as
> it started firing lockdeps complains too.

bit_spin_lock() doesn't have the lockdep annotations and it would be
trivial to add _nested if necessary.

> split_huge_page to the contrary could run in a flood if
> you're unlucky. split_huge_page is needed not just to handle non-THP
> aware paths that mostly disappeared by now, but also when you truncate
> a vma so that a THP doesn't fit in it anymore. So it's up to userland
> how frequently it needs to run.

Yes, I understand. I _think_ this should be fine performance-wise,
compared to other things split_huge_page() paths should do these 511
test_and_set_bit + clear_bit should not be noticeable.

But of course I can be wrong.

> I think it's reasonable to consider it though, but then it's not
> guaranteed that a put_page on a THP tail is more frequent than
> split_huge_page. Keep in mind we do the get_page_unless_zero to
> stabilize the head to take the compound_lock on it, only for the
> tails, never for the heads. So this restricts it to _only_ the
> put_page following a gup_fast. Only gup_fast can ever take a reference
> on the tail pages of a THP. Nothing else can.

I see. But my main point was simplification.

And I'm afraid this compound_lock() in get/put can race with someone
else playing with page->flags "because I own this freshly allocated
page". Perhaps.


However. I do understand your concerns, lets forget about this for now.


---------------------------------------------------------------------------
Please review get_futex_key() changes we discussed before. As for the race
with prep_new_page() I'll send another patch, it is completely orthogonal
to this series.

Testing. I simply have no idea how can I test this series so I didn't
even try ;) I'll try to do something tomorrow, but in any case I have
to rely on your review.

Most of the cleanups in this series are not needed to change get_futex_key(),
but imho make sense. And we need more cleanups after this series, I'll do
this later if this series makes any sense.

Oleg.

 include/linux/mm.h |    2 +
 kernel/futex.c     |   37 +--------
 mm/swap.c          |  251 ++++++++++++++++++++++++++--------------------------
 3 files changed, 128 insertions(+), 162 deletions(-)


^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page()
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
@ 2013-12-18 19:19                                               ` Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov
                                                                 ` (5 subsequent siblings)
  6 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

Add the new helper, __put_nontail_page(page), which simply does
compound/single put depending on PageHead(). put_compound_page()
can use it 3 times, probably it can have other users.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/swap.c |   60 ++++++++++++++++++++++++++----------------------------------
 1 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 0040d9c..f83de1f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,21 +79,26 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
+static void __put_nontail_page(struct page *page)
+{
+	if (put_page_testzero(page)) {
+		/*
+		 * By the time all refcounts have been released
+		 * split_huge_page cannot run anymore from under us.
+		 */
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
+	}
+}
+
 static void put_compound_page(struct page *page)
 {
 	struct page *page_head;
 
 	if (likely(!PageTail(page))) {
-		if (put_page_testzero(page)) {
-			/*
-			 * By the time all refcounts have been released
-			 * split_huge_page cannot run anymore from under us.
-			 */
-			if (PageHead(page))
-				__put_compound_page(page);
-			else
-				__put_single_page(page);
-		}
+		__put_nontail_page(page);
 		return;
 	}
 
@@ -176,24 +181,16 @@ static void put_compound_page(struct page *page)
 		if (unlikely(!PageTail(page))) {
 			/* __split_huge_page_refcount run before us */
 			compound_unlock_irqrestore(page_head, flags);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * The head page may have been freed
-				 * and reallocated as a compound page
-				 * of smaller order and then freed
-				 * again.  All we know is that it
-				 * cannot have become: a THP page, a
-				 * compound page of higher order, a
-				 * tail page.  That is because we
-				 * still hold the refcount of the
-				 * split THP tail and page_head was
-				 * the THP head before the split.
-				 */
-				if (PageHead(page_head))
-					__put_compound_page(page_head);
-				else
-					__put_single_page(page_head);
-			}
+			/*
+			 * The head page may have been freed and reallocated
+			 * as a compound page of smaller order and then freed
+			 * again. All we know is that it cannot have become:
+			 * a THP page, a compound page of higher order, a tail
+			 * page. That is because we still hold the refcount of
+			 * the split THP tail and page_head was the THP head
+			 * before the split.
+			 */
+			__put_nontail_page(page_head);
 out_put_single:
 			if (put_page_testzero(page))
 				__put_single_page(page);
@@ -215,12 +212,7 @@ out_put_single:
 		VM_BUG_ON(atomic_read(&page->_count) != 0);
 		compound_unlock_irqrestore(page_head, flags);
 
-		if (put_page_testzero(page_head)) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
-		}
+		__put_nontail_page(page_head);
 	} else {
 		/* page_head is a dangling pointer */
 		VM_BUG_ON(PageTail(page));
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page()
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov
@ 2013-12-18 19:19                                               ` Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov
                                                                 ` (4 subsequent siblings)
  6 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

__get_page_tail() can use __put_nontail_page() instead of put_page()
if it races with split_huge_page(). This is what put_compound_page()
does, so this is equally safe. And this allows us to factor out this
code later.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/swap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index f83de1f..aae24fe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -289,7 +289,7 @@ bool __get_page_tail(struct page *page)
 		}
 		compound_unlock_irqrestore(page_head, flags);
 		if (unlikely(!got))
-			put_page(page_head);
+			__put_nontail_page(page_head);
 	}
 	return got;
 }
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page()
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov
@ 2013-12-18 19:19                                               ` Oleg Nesterov
  2013-12-18 19:19                                               ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov
                                                                 ` (3 subsequent siblings)
  6 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

Change release_pages() to avoid put_compound_page() and simply
call put_page(). This adds the additional PageCompound() check
into the unlikely path but allows us to change the semantics of
put_compound_page().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/swap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index aae24fe..b0e65b6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -819,7 +819,7 @@ void release_pages(struct page **pages, int nr, int cold)
 				spin_unlock_irqrestore(&zone->lru_lock, flags);
 				zone = NULL;
 			}
-			put_compound_page(page);
+			put_page(page);
 			continue;
 		}
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail()
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
                                                                 ` (2 preceding siblings ...)
  2013-12-18 19:19                                               ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov
@ 2013-12-18 19:19                                               ` Oleg Nesterov
  2013-12-18 19:36                                                 ` Peter Zijlstra
  2013-12-18 19:20                                               ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov
                                                                 ` (2 subsequent siblings)
  6 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

put_page(page) calls put_compound_page() if PageCompound(), but
put_compound_page() falls back to __put_nontail_page() if the page
is non-tail.

So we can simply shift the PageTail() logic from put_compound_page()
to put_page(), and put_page() can use __put_nontail_page() otherwise.

The patch also renames put_compound_page() to __put_page_tail() to
reflect the semantics change. And this way put_page() looks more
symmetrical with get_page().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/swap.c |   17 ++++++-----------
 1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index b0e65b6..0400f3b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -93,15 +93,10 @@ static void __put_nontail_page(struct page *page)
 	}
 }
 
-static void put_compound_page(struct page *page)
+static void __put_page_tail(struct page *page)
 {
 	struct page *page_head;
 
-	if (likely(!PageTail(page))) {
-		__put_nontail_page(page);
-		return;
-	}
-
 	/* __split_huge_page_refcount can run under us */
 	page_head = compound_trans_head(page);
 
@@ -222,10 +217,10 @@ out_put_single:
 
 void put_page(struct page *page)
 {
-	if (unlikely(PageCompound(page)))
-		put_compound_page(page);
-	else if (put_page_testzero(page))
-		__put_single_page(page);
+	if (unlikely(PageTail(page)))
+		__put_page_tail(page);
+	else
+		__put_nontail_page(page);
 }
 EXPORT_SYMBOL(put_page);
 
@@ -247,7 +242,7 @@ bool __get_page_tail(struct page *page)
 	bool got;
 	struct page *page_head = compound_trans_head(page);
 
-	/* Ref to put_compound_page() comment. */
+	/* Ref to __put_page_tail() comment. */
 	if (!__compound_tail_refcounted(page_head)) {
 		smp_rmb();
 		if (likely(PageTail(page))) {
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail()
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
                                                                 ` (3 preceding siblings ...)
  2013-12-18 19:19                                               ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov
@ 2013-12-18 19:20                                               ` Oleg Nesterov
  2013-12-18 19:20                                               ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov
  2013-12-18 19:20                                               ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov
  6 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:20 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

The patch looks complicated but it is not. It simply moves the
out_put_single label at the end of the function and changes the
!__compound_tail_refcounted() block to save a tab stop.

This cleanups the code layout a bit, and this is the preparation
for the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/swap.c |   70 ++++++++++++++++++++++++------------------------------------
 1 files changed, 28 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 0400f3b..5f3dda6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -125,42 +125,29 @@ static void __put_page_tail(struct page *page)
 		 * can be freed and reallocated.
 		 */
 		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * __split_huge_page_refcount cannot race
-			 * here.
-			 */
-			VM_BUG_ON(!PageHead(page_head));
-			VM_BUG_ON(page_mapcount(page) != 0);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * If this is the tail of a slab
-				 * compound page, the tail pin must
-				 * not be the last reference held on
-				 * the page, because the PG_slab
-				 * cannot be cleared before all tail
-				 * pins (which skips the _mapcount
-				 * tail refcounting) have been
-				 * released. For hugetlbfs the tail
-				 * pin may be the last reference on
-				 * the page instead, because
-				 * PageHeadHuge will not go away until
-				 * the compound page enters the buddy
-				 * allocator.
-				 */
-				VM_BUG_ON(PageSlab(page_head));
-				__put_compound_page(page_head);
-			}
-			return;
-		} else
+		if (unlikely(!PageTail(page)))
+			goto out_put_single;
+
+		/*
+		 * __split_huge_page_refcount cannot race here.
+		 */
+		VM_BUG_ON(!PageHead(page_head));
+		VM_BUG_ON(page_mapcount(page) != 0);
+		if (put_page_testzero(page_head)) {
 			/*
-			 * __split_huge_page_refcount run before us,
-			 * "page" was a THP tail. The split page_head
-			 * has been freed and reallocated as slab or
-			 * hugetlbfs page of smaller order (only
-			 * possible if reallocated as slab on x86).
+			 * If this is the tail of a slab compound page, the
+			 * tail pin must not be the last reference held on
+			 * the page, because the PG_slab cannot be cleared
+			 * before all tail pins (which skips the _mapcount
+			 * tail refcounting) have been released. For hugetlbfs
+			 * the tail pin may be the last reference on the page
+			 * instead, because PageHeadHuge will not go away until
+			 * the compound page enters the buddy allocator.
 			 */
-			goto out_put_single;
+			VM_BUG_ON(PageSlab(page_head));
+			__put_compound_page(page_head);
+		}
+		return;
 	}
 
 	if (likely(page != page_head && get_page_unless_zero(page_head))) {
@@ -186,10 +173,7 @@ static void __put_page_tail(struct page *page)
 			 * before the split.
 			 */
 			__put_nontail_page(page_head);
-out_put_single:
-			if (put_page_testzero(page))
-				__put_single_page(page);
-			return;
+			goto out_put_single;
 		}
 		VM_BUG_ON(page_head != page->first_page);
 		/*
@@ -208,11 +192,13 @@ out_put_single:
 		compound_unlock_irqrestore(page_head, flags);
 
 		__put_nontail_page(page_head);
-	} else {
-		/* page_head is a dangling pointer */
-		VM_BUG_ON(PageTail(page));
-		goto out_put_single;
+		return;
 	}
+
+out_put_single:
+	VM_BUG_ON(PageTail(page));
+	if (put_page_testzero(page))
+		__put_single_page(page);
 }
 
 void put_page(struct page *page)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
                                                                 ` (4 preceding siblings ...)
  2013-12-18 19:20                                               ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov
@ 2013-12-18 19:20                                               ` Oleg Nesterov
  2013-12-18 21:37                                                 ` Linus Torvalds
  2013-12-18 19:20                                               ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov
  6 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:20 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

Both __put_page_tail() and __get_page_tail() need to carefully
take a reference on page_head, take compound_lock() and recheck
PageTail(page) under this lock.

This patch extracts this code into the new helper, it will have
another user. This also means that __get_page_tail() gets the
same VM_BUG_ON() checks.

Note: this change can also help if we decide to change the locking,
perhaps it makes sense to change __split_huge_page_refcount() to
also do compound_lock/unlock(page_tail) in a loop. In this case it
would be simple to adapt this helper and its usage.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/swap.c |   99 +++++++++++++++++++++++++++---------------------------------
 1 files changed, 45 insertions(+), 54 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5f3dda6..972923d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -93,12 +93,45 @@ static void __put_nontail_page(struct page *page)
 	}
 }
 
-static void __put_page_tail(struct page *page)
+static bool get_lock_thp_head(struct page *head, struct page *tail,
+				unsigned long *flags)
 {
-	struct page *page_head;
+	if (unlikely(head == tail || !get_page_unless_zero(head)))
+		return false;
+
+	/*
+	 * page_head wasn't a dangling pointer but it may not be a head
+	 * page anymore by the time we obtain the lock. That is ok as
+	 * long as it can't be freed from under us.
+	 */
+	*flags = compound_lock_irqsave(head);
+	if (likely(PageTail(tail))) {
+		VM_BUG_ON(head != tail->first_page);
+		VM_BUG_ON(atomic_read(&head->_count) <= 1);
+		VM_BUG_ON(atomic_read(&tail->_count) != 0);
+		VM_BUG_ON(page_mapcount(tail) <= 1);
+		return true;
+	}
+
+	compound_unlock_irqrestore(head, *flags);
+	/*
+	 * The head page may have been freed and reallocated as a compound
+	 * page of smaller order and then freed again. All we know is that
+	 * it cannot have become: a THP page, a compound page of higher
+	 * order, a tail page. That is because we still hold the refcount
+	 * of the split THP tail and page_head was the THP head before the
+	 * split.
+	 */
+	__put_nontail_page(head);
+	return false;
+}
 
+
+static void __put_page_tail(struct page *page)
+{
 	/* __split_huge_page_refcount can run under us */
-	page_head = compound_trans_head(page);
+	struct page *page_head = compound_trans_head(page);
+	unsigned long flags;
 
 	/*
 	 * THP can not break up slab pages so avoid taking
@@ -150,45 +183,16 @@ static void __put_page_tail(struct page *page)
 		return;
 	}
 
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		unsigned long flags;
-
-		/*
-		 * page_head wasn't a dangling pointer but it may not
-		 * be a head page anymore by the time we obtain the
-		 * lock. That is ok as long as it can't be freed from
-		 * under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		if (unlikely(!PageTail(page))) {
-			/* __split_huge_page_refcount run before us */
-			compound_unlock_irqrestore(page_head, flags);
-			/*
-			 * The head page may have been freed and reallocated
-			 * as a compound page of smaller order and then freed
-			 * again. All we know is that it cannot have become:
-			 * a THP page, a compound page of higher order, a tail
-			 * page. That is because we still hold the refcount of
-			 * the split THP tail and page_head was the THP head
-			 * before the split.
-			 */
-			__put_nontail_page(page_head);
-			goto out_put_single;
-		}
-		VM_BUG_ON(page_head != page->first_page);
+	if (likely(get_lock_thp_head(page_head, page, &flags))) {
 		/*
-		 * We can release the refcount taken by
-		 * get_page_unless_zero() now that
-		 * __split_huge_page_refcount() is blocked on the
+		 * We can release the refcount taken by get_lock_thp_head()
+		 * now that __split_huge_page_refcount() is blocked on the
 		 * compound_lock.
 		 */
 		if (put_page_testzero(page_head))
 			VM_BUG_ON(1);
 		/* __split_huge_page_refcount will wait now */
-		VM_BUG_ON(page_mapcount(page) <= 0);
 		atomic_dec(&page->_mapcount);
-		VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
-		VM_BUG_ON(atomic_read(&page->_count) != 0);
 		compound_unlock_irqrestore(page_head, flags);
 
 		__put_nontail_page(page_head);
@@ -224,9 +228,8 @@ bool __get_page_tail(struct page *page)
 	 * proper PT lock that already serializes against
 	 * split_huge_page().
 	 */
-	unsigned long flags;
-	bool got;
 	struct page *page_head = compound_trans_head(page);
+	unsigned long flags;
 
 	/* Ref to __put_page_tail() comment. */
 	if (!__compound_tail_refcounted(page_head)) {
@@ -254,25 +257,13 @@ bool __get_page_tail(struct page *page)
 		}
 	}
 
-	got = false;
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
+	if (likely(get_lock_thp_head(page_head, page, &flags))) {
+		__get_page_tail_foll(page, false);
 		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			__put_nontail_page(page_head);
+		return true;
 	}
-	return got;
+
+	return false;
 }
 EXPORT_SYMBOL(__get_page_tail);
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it
  2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
                                                                 ` (5 preceding siblings ...)
  2013-12-18 19:20                                               ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov
@ 2013-12-18 19:20                                               ` Oleg Nesterov
  2013-12-18 19:28                                                 ` Peter Zijlstra
  6 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:20 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Peter Zijlstra, Mel Gorman, linux-kernel

1. Add the new helper, compound_head_put_tail(page) which returns
   the stable compound_head() and drops the reference to this page
   if it is the sub-page of __compound_tail_refcounted(head).

   Note: this patch tries to be as simple as possible. But we need
   to unify the new helper with __put_page_tail() later, or at least
   factor out the common code.

2. Remove the nasty __get_user_pages_fast() code in get_futex_key(),
   it can use the new helper to get page_head.

   Note: with or without this change basepage_index(page) after
   put_page(page) looks confusing at least, this will be addressed
   later.

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/mm.h |    2 ++
 kernel/futex.c     |   37 +------------------------------------
 mm/swap.c          |   35 +++++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 13495bd..545df45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -501,6 +501,8 @@ static inline void __ClearPageBuddy(struct page *page)
 void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
+struct page *compound_head_put_tail(struct page *page);
+
 void split_page(struct page *page, unsigned int order);
 int split_free_page(struct page *page);
 
diff --git a/kernel/futex.c b/kernel/futex.c
index c3a1a55..1fd7031 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -282,42 +282,7 @@ again:
 	else
 		err = 0;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page_head = page;
-	if (unlikely(PageTail(page))) {
-		put_page(page);
-		/* serialize against __split_huge_page_splitting() */
-		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
-			page_head = compound_head(page);
-			/*
-			 * page_head is valid pointer but we must pin
-			 * it before taking the PG_lock and/or
-			 * PG_compound_lock. The moment we re-enable
-			 * irqs __split_huge_page_splitting() can
-			 * return and the head page can be freed from
-			 * under us. We can't take the PG_lock and/or
-			 * PG_compound_lock on a page that could be
-			 * freed from under us.
-			 */
-			if (page != page_head) {
-				get_page(page_head);
-				put_page(page);
-			}
-			local_irq_enable();
-		} else {
-			local_irq_enable();
-			goto again;
-		}
-	}
-#else
-	page_head = compound_head(page);
-	if (page != page_head) {
-		get_page(page_head);
-		put_page(page);
-	}
-#endif
-
+	page_head = compound_head_put_tail(page);
 	lock_page(page_head);
 
 	/*
diff --git a/mm/swap.c b/mm/swap.c
index 972923d..6742c85 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -215,6 +215,41 @@ void put_page(struct page *page)
 EXPORT_SYMBOL(put_page);
 
 /*
+ * Like compound_head() but also drops the additional reference
+ * this page can have. Unlike compound_head() it returns the page
+ * which has a reference, and can't race with split_huge_page().
+ */
+struct page *compound_head_put_tail(struct page *page)
+{
+	struct page *page_head;
+	unsigned long flags;
+
+	if (!PageTail(page))
+		return page;
+
+	page_head = compound_trans_head(page);
+
+	if (!__compound_tail_refcounted(page_head)) {
+		smp_rmb();
+		if (likely(PageTail(page)))
+			return page_head;
+		else
+			return page;
+	}
+
+	if (likely(get_lock_thp_head(page_head, page, &flags))) {
+		if (put_page_testzero(page_head))
+			VM_BUG_ON(1);
+		atomic_dec(&page->_mapcount);
+		compound_unlock_irqrestore(page_head, flags);
+
+		return page_head;
+	}
+
+	return page;
+}
+
+/*
  * This function is exported but must not be called by anything other
  * than get_page(). It implements the slow path of get_page().
  */
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it
  2013-12-18 19:20                                               ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov
@ 2013-12-18 19:28                                                 ` Peter Zijlstra
  2013-12-18 19:40                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Peter Zijlstra @ 2013-12-18 19:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Mel Gorman, linux-kernel

On Wed, Dec 18, 2013 at 08:20:08PM +0100, Oleg Nesterov wrote:
> +struct page *compound_head_put_tail(struct page *page)
> +{
> +	struct page *page_head;
> +	unsigned long flags;
> +
> +	if (!PageTail(page))
> +		return page;
> +
> +	page_head = compound_trans_head(page);
> +
> +	if (!__compound_tail_refcounted(page_head)) {

This barrier is missing a comment describing the ordering and the
pairing.

> +		smp_rmb();
> +		if (likely(PageTail(page)))
> +			return page_head;
> +		else
> +			return page;
> +	}
> +
> +	if (likely(get_lock_thp_head(page_head, page, &flags))) {
> +		if (put_page_testzero(page_head))
> +			VM_BUG_ON(1);
> +		atomic_dec(&page->_mapcount);
> +		compound_unlock_irqrestore(page_head, flags);
> +
> +		return page_head;
> +	}
> +
> +	return page;
> +}

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail()
  2013-12-18 19:19                                               ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov
@ 2013-12-18 19:36                                                 ` Peter Zijlstra
  2013-12-18 19:50                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Peter Zijlstra @ 2013-12-18 19:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Mel Gorman, linux-kernel

On Wed, Dec 18, 2013 at 08:19:58PM +0100, Oleg Nesterov wrote:
> @@ -247,7 +242,7 @@ bool __get_page_tail(struct page *page)
>  	bool got;
>  	struct page *page_head = compound_trans_head(page);
>  
> -	/* Ref to put_compound_page() comment. */
> +	/* Ref to __put_page_tail() comment. */
>  	if (!__compound_tail_refcounted(page_head)) {
>  		smp_rmb();
>  		if (likely(PageTail(page))) {

What code is this against, my local tree doesn't have that smp_rmb().

This suggests its a recent patch; which is good since then we can still
drop it and wait for people to send one with a proper comment in.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it
  2013-12-18 19:28                                                 ` Peter Zijlstra
@ 2013-12-18 19:40                                                   ` Oleg Nesterov
  0 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Mel Gorman, linux-kernel

On 12/18, Peter Zijlstra wrote:
>
> On Wed, Dec 18, 2013 at 08:20:08PM +0100, Oleg Nesterov wrote:
> > +struct page *compound_head_put_tail(struct page *page)
> > +{
> > +	struct page *page_head;
> > +	unsigned long flags;
> > +
> > +	if (!PageTail(page))
> > +		return page;
> > +
> > +	page_head = compound_trans_head(page);
> > +
> > +	if (!__compound_tail_refcounted(page_head)) {
>
> This barrier is missing a comment describing the ordering and the
> pairing.

Yes, and this function lacks other comments it needs, they should be
copy-and-past'ed from the very similar __put_page_tail().

But I am going to unify them later if this series passes the review,
can't the comments wait till then?

OTOH, I won't mind to send v2 with the additional comment about this
rmb() at least. Or perhaps I should simply add /* See the comments in
__put_page_tail() */ into the new helper.

OK. But I'll wait for other reviews first.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail()
  2013-12-18 19:36                                                 ` Peter Zijlstra
@ 2013-12-18 19:50                                                   ` Oleg Nesterov
  0 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-18 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Mel Gorman, linux-kernel

On 12/18, Peter Zijlstra wrote:
>
> On Wed, Dec 18, 2013 at 08:19:58PM +0100, Oleg Nesterov wrote:
> > @@ -247,7 +242,7 @@ bool __get_page_tail(struct page *page)
> >  	bool got;
> >  	struct page *page_head = compound_trans_head(page);
> >
> > -	/* Ref to put_compound_page() comment. */
> > +	/* Ref to __put_page_tail() comment. */
> >  	if (!__compound_tail_refcounted(page_head)) {
> >  		smp_rmb();
> >  		if (likely(PageTail(page))) {
>
> What code is this against, my local tree doesn't have that smp_rmb().

This is against -mm code.

> This suggests its a recent patch; which is good since then we can still
> drop it and wait for people to send one with a proper comment in.

put_compound_page() explains this barrier, this is what "Ref" above means,
I guess.

To remind, I think we can do more cleanups here.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()
  2013-12-18 19:20                                               ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov
@ 2013-12-18 21:37                                                 ` Linus Torvalds
  2013-12-19 16:29                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Linus Torvalds @ 2013-12-18 21:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Dave Jones,
	Darren Hart, Peter Zijlstra, Mel Gorman,
	Linux Kernel Mailing List

On Wed, Dec 18, 2013 at 11:20 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Both __put_page_tail() and __get_page_tail() need to carefully
> take a reference on page_head, take compound_lock() and recheck
> PageTail(page) under this lock.

Btw I suspect this is just disgustingly expensive, and I don't think
there's a really good reason for it.

May I suggest:

 - getting rid of the PG_compound_lock bit-lock

   bitlocks are expensive and unfair, and don't even get lockdep checking

 - replace it with a (small, say 32-256 entries) array of hashed sequence locks

 - just hash based on the "struct page" pointer, and teach this code
to do a read_seqcount_begin/read_seqcount_retry sequence instead for
the page lookup.

I think you can get rid of all the irq disables too, and the sequence
lock should be pure memory reads for the read-case that we care about.

Hmm? This is obviously orthogonal to your series, I just reacted to
seeing that bitlock thing that needs atomics for both locking and
unlocking and the irq disable, and just generally looks like the worst
possible way to do these things.

              Linus

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()
  2013-12-18 21:37                                                 ` Linus Torvalds
@ 2013-12-19 16:29                                                   ` Oleg Nesterov
  0 siblings, 0 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-19 16:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Dave Jones,
	Darren Hart, Peter Zijlstra, Mel Gorman,
	Linux Kernel Mailing List


On 12/18, Linus Torvalds wrote:
>
> On Wed, Dec 18, 2013 at 11:20 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > Both __put_page_tail() and __get_page_tail() need to carefully
> > take a reference on page_head, take compound_lock() and recheck
> > PageTail(page) under this lock.
>
> Btw I suspect this is just disgustingly expensive, and I don't think
> there's a really good reason for it.
>
> May I suggest:
>
>  - getting rid of the PG_compound_lock bit-lock
>
>    bitlocks are expensive and unfair, and don't even get lockdep checking
>
>  - replace it with a (small, say 32-256 entries) array of hashed sequence locks
>
>  - just hash based on the "struct page" pointer, and teach this code
> to do a read_seqcount_begin/read_seqcount_retry sequence instead for
> the page lookup.

Yes, I thought about this too but didn't dare to suggest. Not sure
about seqlock/irqs and other details, this needs more discussion anyway.
And of course I am not sure this will be actually better.

> This is obviously orthogonal to your series,

Yes.

But please note that one of the reasons for the new helper is simplify
the potential locking changes. The changelog only mentions "even more
bitlocks" change, but this doesn't matter.

And in fact I think that this allows to do more cleanups even if we
do not change the locking, get_lock_thp_head() should return page_head
or NULL, tail != head is just another PageTail() check. Perhaps.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2013-12-16 20:19                                     ` Andrea Arcangeli
  2013-12-16 20:46                                       ` Oleg Nesterov
@ 2013-12-19 19:08                                       ` Oleg Nesterov
  2013-12-19 19:09                                         ` [PATCH 1/1] " Oleg Nesterov
  2013-12-20 14:19                                         ` [PATCH 0/1] " Martin Schwidefsky
  1 sibling, 2 replies; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-19 19:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Martin Schwidefsky, Heiko Carstens

On 12/16, Andrea Arcangeli wrote:
>
> Can you reorder set_page_refcount in your v2?

Please see the patch.

> I wonder if arch_alloc_page needs refcount 1, it sets the page as
> stable on s390.

Obviously I have no idea what set_page_stable() does, but it works
with page_to_phys(), unlikely the content of "struct page" can matter.
And only s390 HAVE_ARCH_ALLOC_PAGE, I added Martin and Heiko.

> the other way around is to move prep_compound_page
> before set_page_refcounted (though I think if we can, keeping the
> refcounted at the very last with a comment is preferable).

Yes, yes, this looks much more natural.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2013-12-19 19:08                                       ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov
@ 2013-12-19 19:09                                         ` Oleg Nesterov
  2013-12-23 11:43                                           ` Andrea Arcangeli
  2013-12-20 14:19                                         ` [PATCH 0/1] " Martin Schwidefsky
  1 sibling, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2013-12-19 19:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Martin Schwidefsky, Heiko Carstens

get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
compound_lock(). In theory this page_head can be already freed and
reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
get_page_unless_zero() can succeed right after set_page_refcounted(),
and compound_lock() can race with the non-atomic __SetPageHead().

Perhaps we should rework the thp locking (under discussion), but
until then this patch moves set_page_refcounted() and adds wmb()
to ensure that page->_count != 0 comes as a last change.

I am not sure about other callers of set_page_refcounted(), but at
first glance they look fine to me.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/page_alloc.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 115b23b..9402337 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,8 +865,6 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	}
 
 	set_page_private(page, 0);
-	set_page_refcounted(page);
-
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 
@@ -876,6 +874,14 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
 
+	/*
+	 * Make sure the caller of get_page_unless_zero() will see the
+	 * fully initialized page. Say, to ensure that compound_lock()
+	 * can't race with the non-atomic __SetPage*() above.
+	 */
+	smp_wmb();
+	set_page_refcounted(page);
+
 	return 0;
 }
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2013-12-19 19:08                                       ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov
  2013-12-19 19:09                                         ` [PATCH 1/1] " Oleg Nesterov
@ 2013-12-20 14:19                                         ` Martin Schwidefsky
  1 sibling, 0 replies; 114+ messages in thread
From: Martin Schwidefsky @ 2013-12-20 14:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Darren Hart, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman, Heiko Carstens

On Thu, 19 Dec 2013 20:08:46 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 12/16, Andrea Arcangeli wrote:
> >
> > Can you reorder set_page_refcount in your v2?
> 
> Please see the patch.
> 
> > I wonder if arch_alloc_page needs refcount 1, it sets the page as
> > stable on s390.
> 
> Obviously I have no idea what set_page_stable() does, but it works
> with page_to_phys(), unlikely the content of "struct page" can matter.
> And only s390 HAVE_ARCH_ALLOC_PAGE, I added Martin and Heiko.

On s390 the arch_alloc_page primitive is used to tell the hipervisor
that a page is going to be used. While the page is free it is marked
as "unused" which allows the hipervisor to throw away the page content
if the page is selected to be swapped. We do have a patch to add the
host support for KVM somewhere in our queue.

The content of the "struct page" does not matter at all for the
set-stable/set-unused state transition, s390 does not care about
the refcount in its arch_alloc_page function.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2013-12-19 19:09                                         ` [PATCH 1/1] " Oleg Nesterov
@ 2013-12-23 11:43                                           ` Andrea Arcangeli
  2014-01-03 19:55                                             ` [PATCH v2 0/1] " Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Andrea Arcangeli @ 2013-12-23 11:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Martin Schwidefsky, Heiko Carstens

On Thu, Dec 19, 2013 at 08:09:20PM +0100, Oleg Nesterov wrote:
> get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> compound_lock(). In theory this page_head can be already freed and
> reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> get_page_unless_zero() can succeed right after set_page_refcounted(),
> and compound_lock() can race with the non-atomic __SetPageHead().
> 
> Perhaps we should rework the thp locking (under discussion), but
> until then this patch moves set_page_refcounted() and adds wmb()
> to ensure that page->_count != 0 comes as a last change.
> 
> I am not sure about other callers of set_page_refcounted(), but at
> first glance they look fine to me.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

Only one improvement possible, the smp_wmb() could have been put under
CONFIG_TRANSPARENT_HUGEPAGE somehow. No difference for x86-64 though.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH v2 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2013-12-23 11:43                                           ` Andrea Arcangeli
@ 2014-01-03 19:55                                             ` Oleg Nesterov
  2014-01-03 19:55                                               ` [PATCH v2 1/1] " Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-03 19:55 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Martin Schwidefsky, Heiko Carstens

On 12/23, Andrea Arcangeli wrote:
>
> Acked-by: Andrea Arcangeli <aarcange@redhat.com>
>
> Only one improvement possible, the smp_wmb() could have been put under
> CONFIG_TRANSPARENT_HUGEPAGE somehow. No difference for x86-64 though.

I thought it would be better to serialize it with get_page_unless_zero()
in general, but OK, As the changelog says I didn't find another problematic
caller, so please see v2. I preserved your ack, thanks.

And I would like to know your opinion about other changes I sent ;)

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-03 19:55                                             ` [PATCH v2 0/1] " Oleg Nesterov
@ 2014-01-03 19:55                                               ` Oleg Nesterov
  2014-01-03 21:00                                                 ` Andrew Morton
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-03 19:55 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman,
	Martin Schwidefsky, Heiko Carstens

get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
compound_lock(). In theory this page_head can be already freed and
reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
get_page_unless_zero() can succeed right after set_page_refcounted(),
and compound_lock() can race with the non-atomic __SetPageHead().

Perhaps we should rework the thp locking (under discussion), but
until then this patch moves set_page_refcounted() and adds wmb()
to ensure that page->_count != 0 comes as a last change.

I am not sure about other callers of set_page_refcounted(), but at
first glance they look fine to me.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 115b23b..d323102 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,8 +865,6 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	}
 
 	set_page_private(page, 0);
-	set_page_refcounted(page);
-
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 
@@ -876,6 +874,16 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
 
+	/*
+	 * Make sure the caller of get_page_unless_zero() will see the
+	 * fully initialized page. Say, to ensure that compound_lock()
+	 * can't race with the non-atomic __SetPage*() above.
+	 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	smp_wmb();
+#endif
+	set_page_refcounted(page);
+
 	return 0;
 }
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-03 19:55                                               ` [PATCH v2 1/1] " Oleg Nesterov
@ 2014-01-03 21:00                                                 ` Andrew Morton
  2014-01-04 16:43                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Andrew Morton @ 2014-01-03 21:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Darren Hart, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman, Martin Schwidefsky, Heiko Carstens

On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote:

> get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> compound_lock(). In theory this page_head can be already freed and
> reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> get_page_unless_zero() can succeed right after set_page_refcounted(),
> and compound_lock() can race with the non-atomic __SetPageHead().

Would be useful to mention that these things are happening inside
prep_compound_opage() (yes?).

> Perhaps we should rework the thp locking (under discussion), but
> until then this patch moves set_page_refcounted() and adds wmb()
> to ensure that page->_count != 0 comes as a last change.
> 
> I am not sure about other callers of set_page_refcounted(), but at
> first glance they look fine to me.

I don't get it.  We're in prep_new_page() - this page is freshly
allocated and no other thread yet has any means by which to look it up
and start fiddling with it?


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-03 21:00                                                 ` Andrew Morton
@ 2014-01-04 16:43                                                   ` Oleg Nesterov
  2014-01-08 11:54                                                     ` Mel Gorman
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-04 16:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Darren Hart, Linux Kernel Mailing List, Peter Zijlstra,
	Mel Gorman, Martin Schwidefsky, Heiko Carstens

On 01/03, Andrew Morton wrote:
>
> On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > compound_lock(). In theory this page_head can be already freed and
> > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > and compound_lock() can race with the non-atomic __SetPageHead().
>
> Would be useful to mention that these things are happening inside
> prep_compound_opage() (yes?).

Agreed. Added "in prep_compound_opage()" into the changelog:

	get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
	compound_lock(). In theory this page_head can be already freed and
	reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
	get_page_unless_zero() can succeed right after set_page_refcounted(),
	and compound_lock() can race with the non-atomic __SetPageHead() in
	prep_compound_page().

	Perhaps we should rework the thp locking (under discussion), but
	until then this patch moves set_page_refcounted() and adds wmb()
	to ensure that page->_count != 0 comes as a last change.

	I am not sure about other callers of set_page_refcounted(), but at
	first glance they look fine to me.

or should I send v3?

> > Perhaps we should rework the thp locking (under discussion), but
> > until then this patch moves set_page_refcounted() and adds wmb()
> > to ensure that page->_count != 0 comes as a last change.
> >
> > I am not sure about other callers of set_page_refcounted(), but at
> > first glance they look fine to me.
>
> I don't get it.  We're in prep_new_page() - this page is freshly
> allocated and no other thread yet has any means by which to look it up
> and start fiddling with it?

Yes, but thp can access this page_head via stale pointer, tail->first_page,
if it races with split_huge_page_refcount(). In this case we rely on
compound_lock() to detect this race, the problem is that compound_lock()
itself can race with head_page->flags manipulations.

For example, __get_page_tail() roughly does:

	// PageTail(page) was already checked

	page_head = page->first_page;

	/* WINDOW */

	get_page_unless_zero(page_head);

	compound_lock(page_head);

	recheck PageTail(page) to ensure page_head is still valid

However, in the WINDOW above, split_huge_page() can split this huge page.
After that its head can be freed and reallocated. Of course, I don't think
it is possible to hit this race in practice, but still this looks wrong.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-04 16:43                                                   ` Oleg Nesterov
@ 2014-01-08 11:54                                                     ` Mel Gorman
  2014-01-08 13:14                                                       ` Mel Gorman
  2014-01-08 16:13                                                       ` Oleg Nesterov
  0 siblings, 2 replies; 114+ messages in thread
From: Mel Gorman @ 2014-01-08 11:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> On 01/03, Andrew Morton wrote:
> >
> > On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > > compound_lock(). In theory this page_head can be already freed and
> > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > > and compound_lock() can race with the non-atomic __SetPageHead().
> >
> > Would be useful to mention that these things are happening inside
> > prep_compound_opage() (yes?).
> 
> Agreed. Added "in prep_compound_opage()" into the changelog:
> 
> 	get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> 	compound_lock(). In theory this page_head can be already freed and
> 	reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> 	get_page_unless_zero() can succeed right after set_page_refcounted(),
> 	and compound_lock() can race with the non-atomic __SetPageHead() in
> 	prep_compound_page().
> 
> 	Perhaps we should rework the thp locking (under discussion), but
> 	until then this patch moves set_page_refcounted() and adds wmb()
> 	to ensure that page->_count != 0 comes as a last change.
> 
> 	I am not sure about other callers of set_page_refcounted(), but at
> 	first glance they look fine to me.
> 
> or should I send v3?
> 

This patch is putting a write barrier in the page allocator fast path and
that is going to be a leading cause of Sad Face. We already have seen
large regressions before when write barriers were introduced to the page
allocator paths for cpusets.  Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
does not really address the issue.

> > > Perhaps we should rework the thp locking (under discussion), but
> > > until then this patch moves set_page_refcounted() and adds wmb()
> > > to ensure that page->_count != 0 comes as a last change.
> > >
> > > I am not sure about other callers of set_page_refcounted(), but at
> > > first glance they look fine to me.
> >
> > I don't get it.  We're in prep_new_page() - this page is freshly
> > allocated and no other thread yet has any means by which to look it up
> > and start fiddling with it?
> 
> Yes, but thp can access this page_head via stale pointer, tail->first_page,
> if it races with split_huge_page_refcount().

To justify the introduction of a performance regression we need to be 100%
sure this race actually exists and not just theoretical. I had lost track
of this thread but did not see a description of how this bug can actually
happen. At the risk of sounding stupid -- what are the actual circumstances
when this race can occur?

For example, in the reclaim paths, we are going to be dealing with the head
pages as they are on the LRU. They get split into base pages and then the
compound handling becomes irrelevant. I cannot see problems there.

For futex, the THP page (and the tail) must have been discovered via
the page tables in which case the page tables are temporarily preventing
the page being freed to the allocator. GUP fast paths are protected from
parallel teardowns and the slow paths are protected from parallel unmaps
and frees by mmap_sem.

Compaction and other walkers mostly deal with the head and/or deal with
pages on the LRU where there will be an elevated reference count.

The changelog needs a specific example where the problem can occur and
even then we should consider if there is an option other than smacking
the page allocator.

>  In this case we rely on
> compound_lock() to detect this race, the problem is that compound_lock()
> itself can race with head_page->flags manipulations.
> 
> For example, __get_page_tail() roughly does:
> 
> 	// PageTail(page) was already checked
> 
> 	page_head = page->first_page;
> 
> 	/* WINDOW */
> 
> 	get_page_unless_zero(page_head);
> 
> 	compound_lock(page_head);
> 
> 	recheck PageTail(page) to ensure page_head is still valid
> 
> However, in the WINDOW above, split_huge_page() can split this huge page.
> After that its head can be freed and reallocated. Of course, I don't think
> it is possible to hit this race in practice, but still this looks wrong.
> 

I can't think of a reason why we would actually hit that race in practice
but I might have tunnel vision due to disliking that barrier so much. Unless
there is an example with no other possible solution, I do not think we
should stick a write barrier into the page allocator fast path.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-08 11:54                                                     ` Mel Gorman
@ 2014-01-08 13:14                                                       ` Mel Gorman
  2014-01-08 16:13                                                       ` Oleg Nesterov
  1 sibling, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2014-01-08 13:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Wed, Jan 08, 2014 at 11:54:00AM +0000, Mel Gorman wrote:
> On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> > On 01/03, Andrew Morton wrote:
> > >
> > > On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > > > compound_lock(). In theory this page_head can be already freed and
> > > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > > > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > > > and compound_lock() can race with the non-atomic __SetPageHead().
> > >
> > > Would be useful to mention that these things are happening inside
> > > prep_compound_opage() (yes?).
> > 
> > Agreed. Added "in prep_compound_opage()" into the changelog:
> > 
> > 	get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > 	compound_lock(). In theory this page_head can be already freed and
> > 	reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > 	get_page_unless_zero() can succeed right after set_page_refcounted(),
> > 	and compound_lock() can race with the non-atomic __SetPageHead() in
> > 	prep_compound_page().
> > 
> > 	Perhaps we should rework the thp locking (under discussion), but
> > 	until then this patch moves set_page_refcounted() and adds wmb()
> > 	to ensure that page->_count != 0 comes as a last change.
> > 
> > 	I am not sure about other callers of set_page_refcounted(), but at
> > 	first glance they look fine to me.
> > 
> > or should I send v3?
> > 
> 
> This patch is putting a write barrier in the page allocator fast path and
> that is going to be a leading cause of Sad Face.

Peter Zijlstra correctly pointed out to me that on x86 that we generally
would not care/notice a write barrier as it almost always is a no-op.
X86 (which is all I test any more) can execute an sfence for a smp_wmb
but not in any configuration that matters. The previous barrier damage in
page_alloc.c was due to full barriers but I generally assume barriers have a
cost in core code when I see them regardless of the underlying architecture
details. So 99% of the time, we will not care and I won't be making Sad
Face but eventually someone using an affected architecture will whinge --
ppc64 probably as write barriers on sparc are compile barriers.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-08 11:54                                                     ` Mel Gorman
  2014-01-08 13:14                                                       ` Mel Gorman
@ 2014-01-08 16:13                                                       ` Oleg Nesterov
  2014-01-08 18:02                                                         ` Mel Gorman
  1 sibling, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-08 16:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On 01/08, Mel Gorman wrote:
>
> On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> >
> > 	get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > 	compound_lock(). In theory this page_head can be already freed and
> > 	reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > 	get_page_unless_zero() can succeed right after set_page_refcounted(),
> > 	and compound_lock() can race with the non-atomic __SetPageHead() in
> > 	prep_compound_page().
> >
> This patch is putting a write barrier in the page allocator fast path and
> that is going to be a leading cause of Sad Face. We already have seen
> large regressions before when write barriers were introduced to the page
> allocator paths for cpusets.  Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
> does not really address the issue.

As you already mentioned in another email, smp_wmb() is mostly nop. On
x86_64 at least. Although perhaps it would be nice to have

	static inline void atomic_store_release(atomic_t *v, int i)
	{
		smp_store_release(&v->counter, i);
	}

> > Yes, but thp can access this page_head via stale pointer, tail->first_page,
> > if it races with split_huge_page_refcount().
>
> To justify the introduction of a performance regression we need to be 100%
> sure this race actually exists

See below. But let me remind that I never looked at this code before,
I can be easily wrong.

> and not just theoretical.

It is theoretical anyway, I guess.

> For futex, the THP page (and the tail) must have been discovered via
> the page tables in which case the page tables are temporarily preventing
> the page being freed to the allocator.

Yes. But, for example, get_futex_key() does

	if (unlikely(PageTail(page))) {
		put_page(page);

why this put_page() can't race with _split? If nothing else, another thread
can unmap the part of this vma.

> > For example, __get_page_tail() roughly does:
> >
> > 	// PageTail(page) was already checked
> >
> > 	page_head = page->first_page;
> >
> > 	/* WINDOW */
> >
> > 	get_page_unless_zero(page_head);
> >
> > 	compound_lock(page_head);
> >
> > 	recheck PageTail(page) to ensure page_head is still valid
> >
> > However, in the WINDOW above, split_huge_page() can split this huge page.
> > After that its head can be freed and reallocated. Of course, I don't think
> > it is possible to hit this race in practice, but still this looks wrong.
> >
>
> I can't think of a reason why we would actually hit that race in practice

Agreed, the window is tiny, unlikely this possible.

> I do not think we
> should stick a write barrier into the page allocator fast path.

OK, I won't argue, I leave this to you and Andrea.


But I still think this code needs other cleanups/simplifications. In
particular get_futex_key()->__get_user_pages_fast() should die imho.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-08 16:13                                                       ` Oleg Nesterov
@ 2014-01-08 18:02                                                         ` Mel Gorman
  2014-01-08 19:04                                                           ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Mel Gorman @ 2014-01-08 18:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote:
> On 01/08, Mel Gorman wrote:
> >
> > On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> > >
> > > 	get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > > 	compound_lock(). In theory this page_head can be already freed and
> > > 	reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > > 	get_page_unless_zero() can succeed right after set_page_refcounted(),
> > > 	and compound_lock() can race with the non-atomic __SetPageHead() in
> > > 	prep_compound_page().
> > >
> > This patch is putting a write barrier in the page allocator fast path and
> > that is going to be a leading cause of Sad Face. We already have seen
> > large regressions before when write barriers were introduced to the page
> > allocator paths for cpusets.  Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
> > does not really address the issue.
> 
> As you already mentioned in another email, smp_wmb() is mostly nop. On
> x86_64 at least.

Which sometimes means that it'll just take longer for someone to find it
and bitch about it.

> Although perhaps it would be nice to have
> 
> 	static inline void atomic_store_release(atomic_t *v, int i)
> 	{
> 		smp_store_release(&v->counter, i);
> 	}
> 
> > > Yes, but thp can access this page_head via stale pointer, tail->first_page,
> > > if it races with split_huge_page_refcount().
> >
> > To justify the introduction of a performance regression we need to be 100%
> > sure this race actually exists
> 
> See below. But let me remind that I never looked at this code before,
> I can be easily wrong.
> 
> > and not just theoretical.
> 
> It is theoretical anyway, I guess.
> 
> > For futex, the THP page (and the tail) must have been discovered via
> > the page tables in which case the page tables are temporarily preventing
> > the page being freed to the allocator.
> 
> Yes. But, for example, get_futex_key() does
> 
> 	if (unlikely(PageTail(page))) {
> 		put_page(page);
> 
> why this put_page() can't race with _split? If nothing else, another thread
> can unmap the part of this vma.
> 

The race is not prevented but that does not mean it matters. Basic
scenario where a split starts after the PageTail check but before the
put_page in get_futex_key

CPU A
get_futex_key
  -> fast gup, page table removing prevents parallel unmap and free
    -> gup_huge_pmd (arch/x86/mm/gup.c at least)
      -> get_huge_page_tail (increment page tail _map_count)
      -> get_huge_page_multiple (increment ref on head page)
  -> Check PageTail
					CPU B
					split_huge_page_to_list
					  -> split_huge_page_refcount
					     spin_lock_irq(lru_lock)
					     compound_lock
  -> put_page(tail_page)
    ->put_compound_page
       looks up head page
       takes reference unless zero
       compound_lock (block)
					     complete split
					     compound_unlock
       check PageTail

This put_page blocks on the compound lock, finds the page is no longer a
PageTail as the split barriers correctly and backs out gracefully. So sure,
splits can race but the case is cared for.

The parallel unmap is prevented by get_huge_page_multiple in the gup path
and held in place until put_page_compound frees it later.

I still don't see the case where a page gets freed to the page allocator
that causes weird problems here. Unfortunately, I also recognise I have
tunnel vision because subconsciously I don't *want* to see a problem here
that justifies adding a write barrier. Andrea may stomp all over this
reasoning in which case we'll get a good comment for the smp_wmb :/

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-08 18:02                                                         ` Mel Gorman
@ 2014-01-08 19:04                                                           ` Oleg Nesterov
  2014-01-09 11:27                                                             ` Mel Gorman
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-08 19:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On 01/08, Mel Gorman wrote:
>
> On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote:
> >
> > Yes. But, for example, get_futex_key() does
> >
> > 	if (unlikely(PageTail(page))) {
> > 		put_page(page);
> >
> > why this put_page() can't race with _split? If nothing else, another thread
> > can unmap the part of this vma.
> >
>
> The race is not prevented but that does not mean it matters. Basic
> scenario where a split starts after the PageTail check but before the
> put_page in get_futex_key
>
> CPU A
> get_futex_key
>   -> fast gup, page table removing prevents parallel unmap and free
>     -> gup_huge_pmd (arch/x86/mm/gup.c at least)
>       -> get_huge_page_tail (increment page tail _map_count)
>       -> get_huge_page_multiple (increment ref on head page)
>   -> Check PageTail
> 					CPU B
> 					split_huge_page_to_list
> 					  -> split_huge_page_refcount
> 					     spin_lock_irq(lru_lock)
> 					     compound_lock
>   -> put_page(tail_page)
>     ->put_compound_page
>        looks up head page

Yes.

But suppose that CPU B completes split_huge_page_to_list/munmap/etc
and frees this head page.

>        takes reference unless zero

suppose this page_head was reallocated and get_page_unless_zero()
succeds right after set_page_refcounted(),

>        compound_lock (block)
> 					     complete split
> 					     compound_unlock
>        check PageTail
>
> This put_page blocks on the compound lock, finds the page is no longer a
> PageTail

Sure. The problem is that compound_lock() itself can race with prep_new_page()
or I missed something.

> The parallel unmap is prevented by get_huge_page_multiple in the gup path
> and held in place until put_page_compound frees it later.

Again, I can easily miss something. And yes, page_tail returned by gup
has a reference to its page_head (via page_head->_count). But
__split_huge_page_refcount() destroys this connection and decrements
page_head->_count.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-08 19:04                                                           ` Oleg Nesterov
@ 2014-01-09 11:27                                                             ` Mel Gorman
  2014-01-09 14:04                                                               ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Mel Gorman @ 2014-01-09 11:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Wed, Jan 08, 2014 at 08:04:43PM +0100, Oleg Nesterov wrote:
> On 01/08, Mel Gorman wrote:
> >
> > On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote:
> > >
> > > Yes. But, for example, get_futex_key() does
> > >
> > > 	if (unlikely(PageTail(page))) {
> > > 		put_page(page);
> > >
> > > why this put_page() can't race with _split? If nothing else, another thread
> > > can unmap the part of this vma.
> > >
> >
> > The race is not prevented but that does not mean it matters. Basic
> > scenario where a split starts after the PageTail check but before the
> > put_page in get_futex_key
> >
> > CPU A
> > get_futex_key
> >   -> fast gup, page table removing prevents parallel unmap and free
> >     -> gup_huge_pmd (arch/x86/mm/gup.c at least)
> >       -> get_huge_page_tail (increment page tail _map_count)
> >       -> get_huge_page_multiple (increment ref on head page)
> >   -> Check PageTail
> > 					CPU B
> > 					split_huge_page_to_list
> > 					  -> split_huge_page_refcount
> > 					     spin_lock_irq(lru_lock)
> > 					     compound_lock
> >   -> put_page(tail_page)
> >     ->put_compound_page
> >        looks up head page
> 
> Yes.
> 
> But suppose that CPU B completes split_huge_page_to_list/munmap/etc
> and frees this head page.
> 

Where did the reference taken by get_huge_page_multiple go?

CPU A
static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
                unsigned long end, int write, struct page **pages, int *nr)
{
	....
	do {
		...
		if (PageTail(page))
			/* Increment page->_mapcount */
			get_huge_page_tail(page);
		...
		refs++;
	} while (...)
	get_head_page_multiple(head, refs);
}

CPU A in get_futex_key has taken multiple references to the head page,
one for every base page on the huge page

Now the splitter comes along which does a bunch of stuff but the
important part is in __split_huge_page_refcount()

static void __split_huge_page_refcount(struct page *page,
                                       struct list_head *list)
{
	...
	spin_lock_irq(&zone->lru_lock);
	compound_lock(page);

	for_every_tail_page() {
		/* This picks up refcounts from GUP get_huge_page_tail */
		tail_count += page_mapcount(page_tail);

		/* Propogate all mapcounts to the "real" refcount in the tail page */
		atomic_add(page_mapcount(head) + page_mapcount(tail), tail->_count)

		.... flag reinits with barriers ...
	}
	atomic_sub(tail_count, headpage->_count);
	...
	unlock stuff
}

The refcounts on page->_mapcount taken while the page was huge is
propogated to the tail pages so it's still pinned in place.

	
> >        takes reference unless zero
> 
> suppose this page_head was reallocated and get_page_unless_zero()
> succeds right after set_page_refcounted(),
> 

You're right. The head page can still be freed and reallocated as a *smaller*
compound page but futex.c is doing the reference count on the tail page
that should have an elevated count even after the split

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        page_head = page;
        if (unlikely(PageTail(page))) {
                put_page(page);


so I'm still not seeing how a tail page racing with a split ends up with
mayhem.

> >        compound_lock (block)
> > 					     complete split
> > 					     compound_unlock
> >        check PageTail
> >
> > This put_page blocks on the compound lock, finds the page is no longer a
> > PageTail
> 
> Sure. The problem is that compound_lock() itself can race with prep_new_page()
> or I missed something.
> 

I could also still be stuck in a "la la la, everything is fine" mode.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-09 11:27                                                             ` Mel Gorman
@ 2014-01-09 14:04                                                               ` Oleg Nesterov
  2014-01-09 18:52                                                                 ` Andrea Arcangeli
  2014-01-10 16:12                                                                 ` Mel Gorman
  0 siblings, 2 replies; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-09 14:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On 01/09, Mel Gorman wrote:
>
> On Wed, Jan 08, 2014 at 08:04:43PM +0100, Oleg Nesterov wrote:
> >
> > Yes.
> >
> > But suppose that CPU B completes split_huge_page_to_list/munmap/etc
> > and frees this head page.
> >
>
> Where did the reference taken by get_huge_page_multiple go?

In __split_huge_page_refcount(), see below.

> CPU A
> static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
>                 unsigned long end, int write, struct page **pages, int *nr)
> {
> 	....
> 	do {
> 		...
> 		if (PageTail(page))
> 			/* Increment page->_mapcount */
> 			get_huge_page_tail(page);
> 		...
> 		refs++;
> 	} while (...)
> 	get_head_page_multiple(head, refs);
> }
>
> CPU A in get_futex_key has taken multiple references to the head page,
> one for every base page on the huge page

Not sure I understand "multiple references to the head page" above...
I mean, in this particular case case nr == 1.

IOW, If gup returns a tail page, this page_tail has 1 in ->_mapcount
(to simplify) and its ->first_page gets the additional 1 in ->_count.

> static void __split_huge_page_refcount(struct page *page,
>                                        struct list_head *list)
> {
> 	...
> 	spin_lock_irq(&zone->lru_lock);
> 	compound_lock(page);
>
> 	for_every_tail_page() {
> 		/* This picks up refcounts from GUP get_huge_page_tail */
> 		tail_count += page_mapcount(page_tail);
>
> 		/* Propogate all mapcounts to the "real" refcount in the tail page */
> 		atomic_add(page_mapcount(head) + page_mapcount(tail), tail->_count)
>
> 		.... flag reinits with barriers ...
> 	}
> 	atomic_sub(tail_count, headpage->_count);
> 	...
> 	unlock stuff
> }
>
> The refcounts on page->_mapcount taken while the page was huge is
> propogated to the tail pages so it's still pinned in place.

Yes. But at the same time atomic_sub(tail_count, headpage->_count)
above reverts the result of get_head_page_multiple(head) done by
gup() above.

IOW, after __split_huge_page_refcount() page_tail no longer pins its
former page_head.

> > >        takes reference unless zero
> >
> > suppose this page_head was reallocated and get_page_unless_zero()
> > succeds right after set_page_refcounted(),
> >
>
> You're right. The head page can still be freed and reallocated as a *smaller*
> compound page but futex.c is doing the reference count on the tail page
> that should have an elevated count even after the split

Yes, page_tail can't go away, the reference was moved to page_tail->_count.

> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         page_head = page;
>         if (unlikely(PageTail(page))) {
>                 put_page(page);
>
>
> so I'm still not seeing how a tail page racing with a split ends up with
> mayhem.

But get/put(page_tail) plays with page_head which can be freed/reallocated,
it does compound_lock(page_head).

> I could also still be stuck in a "la la la, everything is fine" mode.

More likely it is me who tries to deny the fact I missed something ;)



But let me try again. Lets ignore PageSlab/PageHeadHuge. put_compound_page()
is complicated, but roughly it does:

	CPU 0						CPU 1

	if (!PageTail(page_tail))
		return;

	page_head = page_tail->first_page;
	
							unmap this vma, free everything.
							page_tail can't go away, its
							->_count was incremented by
							__split_huge_page_refcount().
							

							alloc_pages(GFP_COMP, 1) reallocates
							page_head, prep_new_page() does
							set_page_refcounted(page_head),
							

	// succeeds after set_page_refcounted()
	get_page_unless_zero(page_head);

	compound_lock(page_head)			prep_compound_page(page_head);

Now, both compound_lock() and prep_compound_page() play with the same
page_head->flags, but __SetPageHead(page_head) is non-atomic.

OK. Even if I am right, we can probably make another fix.
put_compound_page() and __get_page_tail() can do yet another PageTail()
check _before_ compound_lock().

Although personally I'd prefer this patch. And if we change get/put
I think it would be  better to do this on top of

	"[PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()"
	http://marc.info/?l=linux-kernel&m=138739438800899

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-09 14:04                                                               ` Oleg Nesterov
@ 2014-01-09 18:52                                                                 ` Andrea Arcangeli
  2014-01-09 19:43                                                                   ` Oleg Nesterov
  2014-01-10 16:12                                                                 ` Mel Gorman
  1 sibling, 1 reply; 114+ messages in thread
From: Andrea Arcangeli @ 2014-01-09 18:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Thu, Jan 09, 2014 at 03:04:47PM +0100, Oleg Nesterov wrote:
> OK. Even if I am right, we can probably make another fix.

I think the confusion here was to think this was related to the futex
code, it isn't. This was just a generic theoretical problem found
doing the futex cleanups but it's not related to the futex code.

> put_compound_page() and __get_page_tail() can do yet another PageTail()
> check _before_ compound_lock().

The above alternate fix looks good to me too.

Only thing to sort out is in the common code (not just x86) then we
may need a smp_mb() between PageTail check and the bit_spin_lock... We
just can't risk writing the bit_spin_lock before reading PageTail. Not
sure if the branch that skips the bit_spin_lock helps on some arch and
we can depend on that but I doubt.

smp_mb and smp_rmb are not noops like smp_wmb on x86 so the other
patch would be more efficient on x86 if we have to add a smp_mb()
before compound_lock(), but then the put/get_page slow path is only
taken when releasing or getting tail pages after gup_fast.

And regardless of gup_fast, like Linus said, for increased NUMA
fairness we could move the compound lock from page->flags to an hashed
array of proper spinlocks sized in function of ram. The contention on
these locks is so low that I doubt we can run into lock starvation,
but because the contention is so low, the array would be fine as well,
and it would be more theoretically correct for NUMA usages than the
bit spinlock. So this problem also goes away if we convert the
bit_spin_lock to an hashed array of spin_lock.

So in the longer term it doesn't matter how we fix it now, and we
should document it in the fix that the additional PageTail check
should be dropped after converting the bit_spin_lock to an array of
hashed spinlocks (or the smp_wmb() if we go with the previous fix).

I personally prefer to keep the complexity in one place so adding to
get/put_page but the other way is a bit more efficient for x86 maybe
(until we have smp_mb__before_spin_lock/bit_spin_lock at least).

> Although personally I'd prefer this patch. And if we change get/put
> I think it would be  better to do this on top of
> 
> 	"[PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()"
> 	http://marc.info/?l=linux-kernel&m=138739438800899

Not against the cleanups of course, but about the order, it gets
harder to backport it for distros if applied after the cleanups.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-09 18:52                                                                 ` Andrea Arcangeli
@ 2014-01-09 19:43                                                                   ` Oleg Nesterov
  2014-01-09 21:36                                                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-09 19:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On 01/09, Andrea Arcangeli wrote:
>
> On Thu, Jan 09, 2014 at 03:04:47PM +0100, Oleg Nesterov wrote:
> > OK. Even if I am right, we can probably make another fix.
>
> I think the confusion here was to think this was related to the futex
> code, it isn't. This was just a generic theoretical problem found
> doing the futex cleanups but it's not related to the futex code.

Yes, yes, sure. I mentioned get_futex_key() just for example.

> > put_compound_page() and __get_page_tail() can do yet another PageTail()
> > check _before_ compound_lock().
>
> The above alternate fix looks good to me too.
>
> Only thing to sort out is in the common code (not just x86) then we
> may need a smp_mb() between PageTail check and the bit_spin_lock... We
> just can't risk writing the bit_spin_lock before reading PageTail.

I do not think we need mb() in between... other callers of compound_lock()
look fine, get/put(page_tail) can't have the false positive after successful
get_page_unless_zero(), and recently it was documented that the kernel can
rely on the control dependency to serialize LOAD + STORE.

But we probably need barrier() in between, we can't use ACCESS_ONCE().

> And regardless of gup_fast, like Linus said, for increased NUMA
> fairness we could move the compound lock from page->flags to an hashed
> array of proper spinlocks sized in function of ram. The contention on
> these locks is so low that I doubt we can run into lock starvation,
> but because the contention is so low, the array would be fine as well,
> and it would be more theoretically correct for NUMA usages than the
> bit spinlock. So this problem also goes away if we convert the
> bit_spin_lock to an hashed array of spin_lock.

Yes. But in this case I really think we should cleanup get/put first
and add the helper, like the patch I mentioned does.

> I personally prefer to keep the complexity in one place so adding to
> get/put_page

OK. I'll send v3.

> > Although personally I'd prefer this patch. And if we change get/put
> > I think it would be  better to do this on top of
> >
> > 	"[PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()"
> > 	http://marc.info/?l=linux-kernel&m=138739438800899
>
> Not against the cleanups of course, but about the order, it gets
> harder to backport it for distros if applied after the cleanups.

Oh, I don't think this highly theoreitical fix should be backported
but I agree, lets fix the bug first.

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-09 19:43                                                                   ` Oleg Nesterov
@ 2014-01-09 21:36                                                                     ` Andrea Arcangeli
  2014-01-10 16:12                                                                       ` Oleg Nesterov
  0 siblings, 1 reply; 114+ messages in thread
From: Andrea Arcangeli @ 2014-01-09 21:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Thu, Jan 09, 2014 at 08:43:50PM +0100, Oleg Nesterov wrote:
> get_page_unless_zero(), and recently it was documented that the kernel can
> rely on the control dependency to serialize LOAD + STORE.

Ok, that's cheap then.

> 
> But we probably need barrier() in between, we can't use ACCESS_ONCE().

After get_page_unless_zero I don't think there's any need of
barrier(). barrier() should have been implicit in __atomic_add_unless
in fact it should be a full smp_mb() equivalent too. Memory is always
clobbered there and the asm is volatile.

My wondering was only about the runtime (not compiler) barrier after
running PageTail and before compound_lock, because bit_spin_lock has
only acquire semantics so in absence of the branch that bails out the
lock, the spinlock could run before PageTail. If the branch is good
enough guarantee for all archs it's good and cheap solution. Clearly
this is not an x86 concern, which is always fine without anything when
surrounded by locked ops like here.

> Yes. But in this case I really think we should cleanup get/put first
> and add the helper, like the patch I mentioned does.

Ok, up to you, I'll check it.

> Oh, I don't think this highly theoreitical fix should be backported
> but I agree, lets fix the bug first.

I think it should, but the risk free version of it, so either a few
liner addition before compound_lock or the previous with smp_wmb()
inside the ifdef (considering it only matters on x86 where smp_wmb is
zero cost and the only operational change is actually the reordering
of the set_page_refcounted not the smp_wmb irrelevant for x86).

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-09 21:36                                                                     ` Andrea Arcangeli
@ 2014-01-10 16:12                                                                       ` Oleg Nesterov
  2014-01-10 16:50                                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 114+ messages in thread
From: Oleg Nesterov @ 2014-01-10 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli, Paul E. McKenney
  Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On 01/09, Andrea Arcangeli wrote:
>
> >
> > But we probably need barrier() in between, we can't use ACCESS_ONCE().
>
> After get_page_unless_zero I don't think there's any need of
> barrier(). barrier() should have been implicit in __atomic_add_unless
> in fact it should be a full smp_mb() equivalent too. Memory is always
> clobbered there and the asm is volatile.

Yes, yes,

> My wondering was only about the runtime (not compiler) barrier after
> running PageTail and before compound_lock,

Yes, this is what I meant.

Except I really meant the compiler barrier, although I do not think it
is actually needed, test_and_set_bit() implies mb().

> because bit_spin_lock has
> only acquire semantics so in absence of the branch that bails out the
> lock, the spinlock could run before PageTail. If the branch is good
> enough guarantee for all archs it's good and cheap solution.

The recent "[PATCH v6 tip/core/locking 3/8] Documentation/memory-barriers.txt:
Prohibit speculative writes" from Paul says:

	No SMP architecture currently supporting Linux allows speculative writes,

	...

	+ACCESS_ONCE(), which preserves the ordering between
	+the load from variable 'a' and the store to variable 'b':
	+
	+       q = ACCESS_ONCE(a);
	+       if (q) {
	+               ACCESS_ONCE(b) = p;
	+               do_something();
	+       }


We can't use ACCESS_ONCE(), but I think that

		if (PageTail(page)) {
			barrier();
			compound_lock(page_head);
		}

should obviously work (even if compound_lock() didn't imply mb).

Oleg.


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-09 14:04                                                               ` Oleg Nesterov
  2014-01-09 18:52                                                                 ` Andrea Arcangeli
@ 2014-01-10 16:12                                                                 ` Mel Gorman
  1 sibling, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2014-01-10 16:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Darren Hart, Linux Kernel Mailing List,
	Peter Zijlstra, Martin Schwidefsky, Heiko Carstens

On Thu, Jan 09, 2014 at 03:04:47PM +0100, Oleg Nesterov wrote:
> 
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         page_head = page;
> >         if (unlikely(PageTail(page))) {
> >                 put_page(page);
> >
> >
> > so I'm still not seeing how a tail page racing with a split ends up with
> > mayhem.
> 
> But get/put(page_tail) plays with page_head which can be freed/reallocated,
> it does compound_lock(page_head).
> 
> > I could also still be stuck in a "la la la, everything is fine" mode.
> 
> More likely it is me who tries to deny the fact I missed something ;)
> 

My hangup was that this was related to futex and I was focusing it as
a specific example that made the patch necessary. However, this is a
therotical case that potentially impacts a put_page if it mistakenly
believes it is still a tail page when it's not due a a parallel split. I
see and understand that race and while I think the patch is overkill, I
have no problem with including it at the start of a series that reexamines
the locking in that area. It makes for a suitable -stable backport and
I hope/expect the reworked locking would then remove the barrier again
for upstream.

I haven't looked at the reworked locking but understand there is a v3 on
the way so I'll wait until that happens and work my way through it.

Thanks and sorry for the noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race
  2014-01-10 16:12                                                                       ` Oleg Nesterov
@ 2014-01-10 16:50                                                                         ` Peter Zijlstra
  0 siblings, 0 replies; 114+ messages in thread
From: Peter Zijlstra @ 2014-01-10 16:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrea Arcangeli, Paul E. McKenney, Mel Gorman, Andrew Morton,
	Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart,
	Linux Kernel Mailing List, Martin Schwidefsky, Heiko Carstens

On Fri, Jan 10, 2014 at 05:12:27PM +0100, Oleg Nesterov wrote:
> The recent "[PATCH v6 tip/core/locking 3/8] Documentation/memory-barriers.txt:
> Prohibit speculative writes" from Paul says:
> 
> 	No SMP architecture currently supporting Linux allows speculative writes,
> 
> 	...
> 
> 	+ACCESS_ONCE(), which preserves the ordering between
> 	+the load from variable 'a' and the store to variable 'b':
> 	+
> 	+       q = ACCESS_ONCE(a);
> 	+       if (q) {
> 	+               ACCESS_ONCE(b) = p;
> 	+               do_something();
> 	+       }
> 
> 
> We can't use ACCESS_ONCE(), but I think that
> 
> 		if (PageTail(page)) {
> 			barrier();
> 			compound_lock(page_head);
> 		}
> 
> should obviously work (even if compound_lock() didn't imply mb).

The compiler can actually screw you over if that's preceded by something
like: SetPageTail(page). In which case it can prove that if (PageTail())
is a non-condition.

But yes, barring that, the version with barrier() in should stop the
compiler from doing most terrible things and it ought to work out.

^ permalink raw reply	[flat|nested] 114+ messages in thread

end of thread, other threads:[~2014-01-10 16:50 UTC | newest]

Thread overview: 114+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-10 15:47 process 'stuck' at exit Dave Jones
2013-12-10 18:40 ` Linus Torvalds
2013-12-10 19:18   ` Thomas Gleixner
2013-12-10 19:55     ` Linus Torvalds
2013-12-10 20:27       ` Dave Jones
2013-12-10 20:34         ` Thomas Gleixner
2013-12-10 20:55           ` Dave Jones
2013-12-10 21:25             ` Darren Hart
2013-12-10 21:28               ` Thomas Gleixner
2013-12-10 21:39                 ` Steven Rostedt
2013-12-10 20:33       ` Thomas Gleixner
2013-12-10 20:43         ` Linus Torvalds
2013-12-10 21:17           ` Thomas Gleixner
2013-12-10 20:35       ` Oleg Nesterov
2013-12-10 20:49         ` Dave Jones
2013-12-10 21:06           ` Darren Hart
2013-12-10 21:12             ` Dave Jones
2013-12-10 21:18             ` Linus Torvalds
2013-12-10 21:24               ` Linus Torvalds
2013-12-10 21:32               ` Dave Jones
2013-12-10 21:49                 ` Linus Torvalds
2013-12-10 21:56                   ` Dave Jones
2013-12-10 21:59                     ` Linus Torvalds
2013-12-10 22:07                       ` Dave Jones
2013-12-11 12:45                   ` Ingo Molnar
2013-12-10 21:34           ` Oleg Nesterov
2013-12-10 21:41             ` Dave Jones
2013-12-10 21:57               ` Linus Torvalds
2013-12-10 22:02                 ` Dave Jones
2013-12-10 22:09                 ` Oleg Nesterov
2013-12-10 22:19                 ` Linus Torvalds
2013-12-10 22:33                   ` Linus Torvalds
2013-12-10 22:38                     ` Darren Hart
2013-12-10 22:57                     ` Thomas Gleixner
2013-12-10 23:05                       ` Linus Torvalds
2013-12-10 23:28                         ` Thomas Gleixner
2013-12-10 23:31                         ` Al Viro
2013-12-11 17:08                     ` Oleg Nesterov
2013-12-11 17:18                       ` Thomas Gleixner
2013-12-11 17:56                         ` Oleg Nesterov
2013-12-11 19:18                           ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov
2013-12-13 15:10                             ` Andrea Arcangeli
2013-12-13 16:22                               ` Oleg Nesterov
2013-12-13 17:34                                 ` Andrea Arcangeli
2013-12-16 18:36                                   ` Oleg Nesterov
2013-12-16 20:19                                     ` Andrea Arcangeli
2013-12-16 20:46                                       ` Oleg Nesterov
2013-12-17 16:53                                         ` Oleg Nesterov
2013-12-17 18:06                                           ` Andrea Arcangeli
2013-12-18 19:19                                             ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
2013-12-18 19:19                                               ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov
2013-12-18 19:19                                               ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov
2013-12-18 19:19                                               ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov
2013-12-18 19:19                                               ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov
2013-12-18 19:36                                                 ` Peter Zijlstra
2013-12-18 19:50                                                   ` Oleg Nesterov
2013-12-18 19:20                                               ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov
2013-12-18 19:20                                               ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov
2013-12-18 21:37                                                 ` Linus Torvalds
2013-12-19 16:29                                                   ` Oleg Nesterov
2013-12-18 19:20                                               ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov
2013-12-18 19:28                                                 ` Peter Zijlstra
2013-12-18 19:40                                                   ` Oleg Nesterov
2013-12-19 19:08                                       ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov
2013-12-19 19:09                                         ` [PATCH 1/1] " Oleg Nesterov
2013-12-23 11:43                                           ` Andrea Arcangeli
2014-01-03 19:55                                             ` [PATCH v2 0/1] " Oleg Nesterov
2014-01-03 19:55                                               ` [PATCH v2 1/1] " Oleg Nesterov
2014-01-03 21:00                                                 ` Andrew Morton
2014-01-04 16:43                                                   ` Oleg Nesterov
2014-01-08 11:54                                                     ` Mel Gorman
2014-01-08 13:14                                                       ` Mel Gorman
2014-01-08 16:13                                                       ` Oleg Nesterov
2014-01-08 18:02                                                         ` Mel Gorman
2014-01-08 19:04                                                           ` Oleg Nesterov
2014-01-09 11:27                                                             ` Mel Gorman
2014-01-09 14:04                                                               ` Oleg Nesterov
2014-01-09 18:52                                                                 ` Andrea Arcangeli
2014-01-09 19:43                                                                   ` Oleg Nesterov
2014-01-09 21:36                                                                     ` Andrea Arcangeli
2014-01-10 16:12                                                                       ` Oleg Nesterov
2014-01-10 16:50                                                                         ` Peter Zijlstra
2014-01-10 16:12                                                                 ` Mel Gorman
2013-12-20 14:19                                         ` [PATCH 0/1] " Martin Schwidefsky
2013-12-16 20:19                                   ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
2013-12-16 20:19                                     ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov
2013-12-16 20:19                                     ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov
2013-12-16 20:27                                     ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli
2013-12-10 22:42                   ` process 'stuck' at exit Thomas Gleixner
2013-12-10 22:48                     ` Linus Torvalds
2013-12-10 22:58                       ` Darren Hart
2013-12-10 23:01                         ` Dave Jones
2013-12-10 23:00                       ` Dave Jones
2013-12-11  0:05                         ` Steven Rostedt
2013-12-11  0:23                           ` Dave Jones
2013-12-11  0:55                             ` Dave Jones
2013-12-14 20:17                               ` Oleg Nesterov
2013-12-11  4:09                       ` Dave Jones
2013-12-12  4:26                       ` Dave Jones
2013-12-12  5:29                         ` Darren Hart
2013-12-10 22:51                     ` Darren Hart
2013-12-10 23:24                     ` Al Viro
2013-12-11 16:31                     ` Mel Gorman
2013-12-11 16:38                       ` Thomas Gleixner
2013-12-11 17:57                         ` Mel Gorman
2013-12-12 19:00                 ` Andrea Arcangeli
2013-12-12 19:03                   ` Linus Torvalds
2013-12-10 22:09               ` Steven Rostedt
2013-12-10 22:16                 ` Dave Jones
2013-12-10 22:21                   ` Steven Rostedt
2013-12-10 22:27                     ` Dave Jones
2013-12-11  1:02     ` Mel Gorman
2013-12-10 20:57   ` Darren Hart
2013-12-10 21:09     ` Dave Jones

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.