* tty/perf lockdep trace.
@ 2013-10-03 19:08 Dave Jones
  2013-10-03 19:42 ` tty^Wrcu/perf " Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Jones @ 2013-10-03 19:08 UTC (permalink / raw)
  To: Linux Kernel; +Cc: a.p.zijlstra, gregkh, peter

 ======================================================
 [ INFO: possible circular locking dependency detected ]
 3.12.0-rc3+ #92 Not tainted
 -------------------------------------------------------
 trinity-child2/15191 is trying to acquire lock:
  (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
 
but task is already holding lock:
  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&ctx->lock){-.-...}:
        [<ffffffff810cc243>] lock_acquire+0x93/0x200
        [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
        [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
        [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
        [<ffffffff81732052>] __schedule+0x1d2/0xa20
        [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
        [<ffffffff817352b6>] retint_kernel+0x26/0x30
        [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
        [<ffffffff813f0504>] pty_write+0x54/0x60
        [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
        [<ffffffff813e5838>] tty_write+0x158/0x2d0
        [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
        [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
        [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
 
-> #2 (&rq->lock){-.-.-.}:
        [<ffffffff810cc243>] lock_acquire+0x93/0x200
        [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
        [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
        [<ffffffff81054336>] do_fork+0x126/0x460
        [<ffffffff81054696>] kernel_thread+0x26/0x30
        [<ffffffff8171ff93>] rest_init+0x23/0x140
        [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
        [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
        [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
 
-> #1 (&p->pi_lock){-.-.-.}:
        [<ffffffff810cc243>] lock_acquire+0x93/0x200
        [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
        [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
        [<ffffffff81097d62>] default_wake_function+0x12/0x20
        [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
        [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
        [<ffffffff8108ff59>] __wake_up+0x39/0x50
        [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
        [<ffffffff81111450>] __call_rcu+0x140/0x820
        [<ffffffff81111b8d>] call_rcu+0x1d/0x20
        [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
        [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
        [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
        [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
        [<ffffffff817200be>] kernel_init+0xe/0x190
        [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
 
-> #0 (&rdp->nocb_wq){......}:
        [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
        [<ffffffff810cc243>] lock_acquire+0x93/0x200
        [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
        [<ffffffff8108ff43>] __wake_up+0x23/0x50
        [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
        [<ffffffff81111450>] __call_rcu+0x140/0x820
        [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
        [<ffffffff81149abf>] put_ctx+0x4f/0x70
        [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
        [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
        [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
        [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
        [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
 
other info that might help us debug this:

Chain exists of:
  &rdp->nocb_wq --> &rq->lock --> &ctx->lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&ctx->lock);
                                lock(&rq->lock);
                                lock(&ctx->lock);
   lock(&rdp->nocb_wq);
 
 *** DEADLOCK ***

1 lock held by trinity-child2/15191:
 #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230

stack backtrace:
CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92 
 ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
 ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
 ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
Call Trace:
 [<ffffffff8172a363>] dump_stack+0x4e/0x82
 [<ffffffff81726741>] print_circular_bug+0x200/0x20f
 [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
 [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
 [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
 [<ffffffff810cc243>] lock_acquire+0x93/0x200
 [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
 [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
 [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
 [<ffffffff8108ff43>] __wake_up+0x23/0x50
 [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
 [<ffffffff81111450>] __call_rcu+0x140/0x820
 [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
 [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
 [<ffffffff81149abf>] put_ctx+0x4f/0x70
 [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
 [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
 [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
 [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
 [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
 [<ffffffff8173d4e4>] tracesys+0xdd/0xe2



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-03 19:08 tty/perf lockdep trace Dave Jones
@ 2013-10-03 19:42 ` Peter Zijlstra
  2013-10-03 19:58   ` Paul E. McKenney
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-03 19:42 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, gregkh, peter; +Cc: paulmck


That's not tty; that's RCU..

On Thu, Oct 03, 2013 at 03:08:30PM -0400, Dave Jones wrote:
>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
> 
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #3 (&ctx->lock){-.-...}:
> 
> -> #2 (&rq->lock){-.-.-.}:
> 
> -> #1 (&p->pi_lock){-.-.-.}:
> 
> -> #0 (&rdp->nocb_wq){......}:
> 


* Re: tty^Wrcu/perf lockdep trace.
  2013-10-03 19:42 ` tty^Wrcu/perf " Peter Zijlstra
@ 2013-10-03 19:58   ` Paul E. McKenney
  2013-10-04  6:58     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-03 19:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Thu, Oct 03, 2013 at 09:42:26PM +0200, Peter Zijlstra wrote:
> 
> That's not tty; that's RCU..
> 
> On Thu, Oct 03, 2013 at 03:08:30PM -0400, Dave Jones wrote:
> >  ======================================================
> >  [ INFO: possible circular locking dependency detected ]
> >  3.12.0-rc3+ #92 Not tainted
> >  -------------------------------------------------------
> >  trinity-child2/15191 is trying to acquire lock:
> >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
> > 
> > but task is already holding lock:
> >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
> > 
> > which lock already depends on the new lock.
> > 
> > the existing dependency chain (in reverse order) is:
> > 
> > -> #3 (&ctx->lock){-.-...}:
> > 
> > -> #2 (&rq->lock){-.-.-.}:
> > 
> > -> #1 (&p->pi_lock){-.-.-.}:
> > 
> > -> #0 (&rdp->nocb_wq){......}:

I suppose I could defer the ->nocb_wq wakeup until the next context switch
or transition to idle/userspace, but it might be simpler for put_ctx()
to maintain a per-CPU chain of callbacks which are kfree_rcu()ed when
ctx->lock is dropped.  Also easier on the kernel/user and kernel/idle
transition overhead/latency...
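
Concretely, that second option might look something like the sketch
below.  This is only an illustration of the idea; the per-CPU list, the
extra llist field in perf_event_context, and both helpers are invented
names rather than existing code, and preemption/migration details are
glossed over:

	/*
	 * Sketch: instead of calling kfree_rcu() while ctx->lock is held,
	 * put_ctx() callers stash the context on a per-CPU lock-free list,
	 * and the list is drained (kfree_rcu()ed) only after ctx->lock has
	 * been dropped.  Assumes a new "struct llist_node defer_node" field
	 * in struct perf_event_context.
	 */
	static DEFINE_PER_CPU(struct llist_head, deferred_ctx_free);

	static void put_ctx_deferred(struct perf_event_context *ctx)
	{
		/* llist_add() is lock-free, so this is fine under ctx->lock. */
		llist_add(&ctx->defer_node, this_cpu_ptr(&deferred_ctx_free));
	}

	static void flush_deferred_ctx_free(void)
	{
		/* To be called once ctx->lock has been released. */
		struct llist_node *node, *next;
		struct perf_event_context *ctx;

		node = llist_del_all(this_cpu_ptr(&deferred_ctx_free));
		while (node) {
			next = node->next;
			ctx = llist_entry(node, struct perf_event_context,
					  defer_node);
			/* The wakeup inside call_rcu() is now deadlock-free. */
			kfree_rcu(ctx, rcu_head);
			node = next;
		}
	}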

Other thoughts?

							Thanx, Paul




* Re: tty^Wrcu/perf lockdep trace.
  2013-10-03 19:58   ` Paul E. McKenney
@ 2013-10-04  6:58     ` Peter Zijlstra
  2013-10-04 16:03       ` Paul E. McKenney
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-04  6:58 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Thu, Oct 03, 2013 at 12:58:32PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 03, 2013 at 09:42:26PM +0200, Peter Zijlstra wrote:
> > 
> > That's not tty; that's RCU..
> > 
> > On Thu, Oct 03, 2013 at 03:08:30PM -0400, Dave Jones wrote:
> > >  ======================================================
> > >  [ INFO: possible circular locking dependency detected ]
> > >  3.12.0-rc3+ #92 Not tainted
> > >  -------------------------------------------------------
> > >  trinity-child2/15191 is trying to acquire lock:
> > >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
> > >  
> > > but task is already holding lock:
> > >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
> > > 
> > > which lock already depends on the new lock.
> > > 
> > > 
> > > the existing dependency chain (in reverse order) is:
> > > 
> > > -> #3 (&ctx->lock){-.-...}:
> > >  
> > > -> #2 (&rq->lock){-.-.-.}:
> > >  
> > > -> #1 (&p->pi_lock){-.-.-.}:
> > >  
> > > -> #0 (&rdp->nocb_wq){......}:
> 
> I suppose I could defer the ->nocb_wq wakeup until the next context switch
> or transition to idle/userspace, but it might be simpler for put_ctx()
> to maintain a per-CPU chain of callbacks which are kfree_rcu()ed when
> ctx->lock is dropped.  Also easier on the kernel/user and kernel/idle
> transition overhead/latency...
> 
> Other thoughts?

What's caused this? We've had that kfree_rcu() in there for ages. I need
to audit all the get/put_ctx calls anyway for an unrelated issue but I
fear it's going to be messy to defer that kfree_rcu() call, but I can
try.




* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04  6:58     ` Peter Zijlstra
@ 2013-10-04 16:03       ` Paul E. McKenney
  2013-10-04 16:15         ` Paul E. McKenney
  2013-10-04 16:50         ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-04 16:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 08:58:35AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2013 at 12:58:32PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 03, 2013 at 09:42:26PM +0200, Peter Zijlstra wrote:
> > > 
> > > That's not tty; that's RCU..
> > > 
> > > On Thu, Oct 03, 2013 at 03:08:30PM -0400, Dave Jones wrote:
> > > >  ======================================================
> > > >  [ INFO: possible circular locking dependency detected ]
> > > >  3.12.0-rc3+ #92 Not tainted
> > > >  -------------------------------------------------------
> > > >  trinity-child2/15191 is trying to acquire lock:
> > > >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
> > > >  
> > > > but task is already holding lock:
> > > >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
> > > > 
> > > > which lock already depends on the new lock.
> > > > 
> > > > 
> > > > the existing dependency chain (in reverse order) is:
> > > > 
> > > > -> #3 (&ctx->lock){-.-...}:
> > > >  
> > > > -> #2 (&rq->lock){-.-.-.}:
> > > >  
> > > > -> #1 (&p->pi_lock){-.-.-.}:
> > > >  
> > > > -> #0 (&rdp->nocb_wq){......}:
> > 
> > I suppose I could defer the ->nocb_wq wakeup until the next context switch
> > or transition to idle/userspace, but it might be simpler for put_ctx()
> > to maintain a per-CPU chain of callbacks which are kfree_rcu()ed when
> > ctx->lock is dropped.  Also easier on the kernel/user and kernel/idle
> > transition overhead/latency...
> > 
> > Other thoughts?
> 
> What's caused this? We've had that kfree_rcu() in there for ages. I need
> to audit all the get/put_ctx calls anyway for an unrelated issue but I
> fear its going to be messy to defer that kfree_rcu() call, but I can
> try.

The problem exists, but NOCB made it much more probable.  With non-NOCB
kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
kernel, the wake_up() happens on the first callback.
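
To make the contrast concrete, the two enqueue paths look roughly like
this (paraphrased from 3.12-era kernel/rcutree.c and rcutree_plugin.h,
not quoted verbatim):

	/*
	 * Non-NOCB (__call_rcu_core()): extra processing, and possibly a
	 * wakeup, happens only once roughly qhimark (default 10000)
	 * callbacks have piled up on this CPU:
	 */
	if (unlikely(rdp->qlen > rdp->qlen_last_fqs_check + qhimark)) {
		/* ... may end up kicking the grace-period machinery ... */
	}

	/*
	 * NOCB (__call_rcu_nocb_enqueue()): the rcuo kthread is woken as
	 * soon as the list goes from empty to non-empty, i.e. on the very
	 * first queued callback:
	 */
	if (old_rhpp == &rdp->nocb_head)
		wake_up(&rdp->nocb_wq);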

So let's look at what is required to solve this within RCU.  Currently,
I cannot safely do any sort of wakeup or even a resched_cpu() from
within a call_rcu() that is called with interrupts disabled because of
this deadlock.  I could require that the rcu_nocb_poll boot parameter
always be set, but the energy-efficiency guys are not going to be happy
with the resulting wakeups on idle systems.

I could try deferring the wake_up(), Lai Jiangshan style.  The question
is then "to where do I defer it?"  The straightforward answer is to
check on each context switch, each transition to RCU idle, and each
scheduling-clock interrupt from userspace execution.  The scenario that
defeats this is where the CPU has a single runnable task, but where that
task spends much of its time in the kernel, so that the scheduling-clock
interrupts always hit kernel-mode execution.  The callback is then
deferred forever.

We could keep Frederic Weisbecker's kernel/user transition hooks,
currently in place only for NO_HZ_FULL, and propagate these to all
architectures, and do the additional checking on those transitions.
This would work, but is not an immediate solution.  And adds overhead
that is not otherwise needed.

Another approach that just now occurred to me is to do a mod_timer()
each time the first callback is posted with irqs disabled, and to
cancel that timer if the wake_up() gets done later.  (I can safely and
unconditionally do a wake_up() from a timer handler, IIRC.)  So, does
perf ever want to invoke call_rcu() holding a timer lock?
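
A minimal sketch of that timer-based deferral, assuming a new
nocb_defer_timer field in struct rcu_data (untested, purely to show the
shape of the idea):

	/* Timer handler: runs in softirq context, where wake_up() is safe. */
	static void rcu_nocb_defer_wake(unsigned long data)
	{
		struct rcu_data *rdp = (struct rcu_data *)data;

		wake_up(&rdp->nocb_wq);
	}

	/* Once, at init time: */
	setup_timer(&rdp->nocb_defer_timer, rcu_nocb_defer_wake,
		    (unsigned long)rdp);

	/* At enqueue time, when the list has just gone non-empty: */
	if (irqs_disabled()) {
		/* Cannot wake safely here; let the timer do it shortly. */
		mod_timer(&rdp->nocb_defer_timer, jiffies + 1);
	} else {
		/* Safe to wake directly; cancel any pending deferred wake. */
		del_timer(&rdp->nocb_defer_timer);
		wake_up(&rdp->nocb_wq);
	}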

I am not too happy about the complexity of deferring, but maybe it is
the right approach, at least assuming perf isn't going to whack me
with a timer lock.  ;-)

Any other approaches that I am missing?

							Thanx, Paul



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 16:03       ` Paul E. McKenney
@ 2013-10-04 16:15         ` Paul E. McKenney
  2013-10-04 16:50         ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-04 16:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 04, 2013 at 08:58:35AM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 03, 2013 at 12:58:32PM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 03, 2013 at 09:42:26PM +0200, Peter Zijlstra wrote:
> > > > 
> > > > That's not tty; that's RCU..
> > > > 
> > > > On Thu, Oct 03, 2013 at 03:08:30PM -0400, Dave Jones wrote:
> > > > >  ======================================================
> > > > >  [ INFO: possible circular locking dependency detected ]
> > > > >  3.12.0-rc3+ #92 Not tainted
> > > > >  -------------------------------------------------------
> > > > >  trinity-child2/15191 is trying to acquire lock:
> > > > >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
> > > > >  
> > > > > but task is already holding lock:
> > > > >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
> > > > > 
> > > > > which lock already depends on the new lock.
> > > > > 
> > > > > 
> > > > > the existing dependency chain (in reverse order) is:
> > > > > 
> > > > > -> #3 (&ctx->lock){-.-...}:
> > > > >  
> > > > > -> #2 (&rq->lock){-.-.-.}:
> > > > >  
> > > > > -> #1 (&p->pi_lock){-.-.-.}:
> > > > >  
> > > > > -> #0 (&rdp->nocb_wq){......}:
> > > 
> > > I suppose I could defer the ->nocb_wq wakeup until the next context switch
> > > or transition to idle/userspace, but it might be simpler for put_ctx()
> > > to maintain a per-CPU chain of callbacks which are kfree_rcu()ed when
> > > ctx->lock is dropped.  Also easier on the kernel/user and kernel/idle
> > > transition overhead/latency...
> > > 
> > > Other thoughts?
> > 
> > What's caused this? We've had that kfree_rcu() in there for ages. I need
> > to audit all the get/put_ctx calls anyway for an unrelated issue but I
> > fear its going to be messy to defer that kfree_rcu() call, but I can
> > try.
> 
> The problem exists, but NOCB made it much more probable.  With non-NOCB
> kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> kernel, the wake_up() happens on the first callback.
> 
> So let's look at what is required to solve this within RCU.  Currently,
> I cannot safely do any sort of wakeup or even a resched_cpu() from
> within an call_rcu() that is called with interrupts disabled because of
> this deadlock.  I could require that the rcu_nocb_poll sysfs parameter
> always be set, but the energy-efficiency guys are not going to be happy
> with the resulting wakeups on idle systems.
> 
> I could try defering the wake_up(), Lai Jiangshan style.  The question
> is then "to where do I defer it?"  The straightforward answer is to
> check on each context switch, each transition to RCU idle, and each
> scheduling-clock interrupt from userspace execution.  The scenario that
> defeats this is where the CPU has a single runnable task, but where that
> task spends much of its time in the kernel, so that the scheduling-clock
> interrupts always hit kernel-mode execution.  The callback is then
> deferred forever.

Ah, but it is safe to call wake_up() from a scheduling-clock interrupt,
because these cannot interrupt an irq-disabled lock critical section.
So maybe I can unconditionally defer to a scheduling-clock interrupt.

> We could keep Frederic Weisbecker's kernel/user transition hooks,
> currently in place only for NO_HZ_FULL, and propagate these to all
> architectures, and do the additional checking on those transitions.
> This would work, but is not an immediate solution.  And adds overhead
> that is not otherwise needed.

But if !NO_HZ_FULL, there will eventually be a scheduling-clock interrupt.

So maybe check on each context switch, transition to idle, subsequent
non-irq-disabled call_rcu(), and scheduling-clock interrupt?

Does that actually avoid this deadlock?

							Thanx, Paul




* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 16:03       ` Paul E. McKenney
  2013-10-04 16:15         ` Paul E. McKenney
@ 2013-10-04 16:50         ` Peter Zijlstra
  2013-10-04 17:09           ` Paul E. McKenney
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-04 16:50 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> The problem exists, but NOCB made it much more probable.  With non-NOCB
> kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> kernel, the wake_up() happens on the first callback.

Oh I see.. so I was hoping this was some NOCB crackbrained damage we
could still 'fix'.

And that wakeup is because we moved grace-period advancing into
kthreads, right?

> I am not too happy about the complexity of deferring, but maybe it is
> the right approach, at least assuming perf isn't going to whack me
> with a timer lock.  ;-)

I'm not too thrilled about trying to move the call_rcu() usage either.

> Any other approaches that I am missing?

Probably; so the regular no-NOCB would be easy to work around by
providing me a call_rcu variant that never does the wakeup.

NOCB might be a little more difficult; depending on the reason why it
needs to do this wakeup on every single invocation; that seems
particularly expensive.

Man, RCU was so much easier when all it was was a strict per-cpu state
with a timer-interrupt-driven state machine; none of all this nonsense.


* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 16:50         ` Peter Zijlstra
@ 2013-10-04 17:09           ` Paul E. McKenney
  2013-10-04 18:52             ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-04 17:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 06:50:44PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> > The problem exists, but NOCB made it much more probable.  With non-NOCB
> > kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> > there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> > kernel, the wake_up() happens on the first callback.
> 
> Oh I see.. so I was hoping this was some NOCB crackbrained damage we
> could still 'fix'.
> 
> And that wakeup is because we moved grace-period advancing into
> kthreads, right?

Yep, in earlier kernels we would instead be doing raise_softirq().
Which would instead wake up ksoftirqd, if I am reading the code
correctly -- spin_lock_irq() does not affect preempt_count.

> > I am not too happy about the complexity of deferring, but maybe it is
> > the right approach, at least assuming perf isn't going to whack me
> > with a timer lock.  ;-)
> 
> I'm not too thrilled about trying to move the call_rcu() usage either.

Understood!

> > Any other approaches that I am missing?
> 
> Probably; so the regular no-NOCB would be easy to work around by
> providing me a call_rcu variant that never does the wakeup.

Well, if we can safely, sanely, and reliably defer the wakeup, there is
no reason not to make plain old call_rcu() do what you need.  If there
is no such way to defer the wakeup, then I don't see how to make that
variant.

> NOCB might be a little more difficult; depending on the reason why it
> needs to do this wakeup on every single invocation; that seems
> particularly expensive.

Not on every single invocation, just on those invocations where the list
is initially empty.  So the first call_rcu() on a CPU whose rcuo kthread
is sleeping will do a wakeup, but subsequent call_rcu()s will just queue,
at least until rcuo goes to sleep again.  Which takes awhile, since it
has to wait for a grace period before invoking that first RCU callback.

> Man, RCU was so much easier when all it was was a strict per-cpu state
> with timer-interrupt driven state machine; non of all this nonsense.

Tell me about it!  This bit about avoiding scheduling-clock interrupts
for all sorts of reasons has most definitely added to my collection of
gray hairs.  ;-)

							Thanx, Paul



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 17:09           ` Paul E. McKenney
@ 2013-10-04 18:52             ` Peter Zijlstra
  2013-10-04 21:25               ` Paul E. McKenney
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-04 18:52 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 10:09:54AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 04, 2013 at 06:50:44PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> > > The problem exists, but NOCB made it much more probable.  With non-NOCB
> > > kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> > > there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> > > kernel, the wake_up() happens on the first callback.
> > 
> > Oh I see.. so I was hoping this was some NOCB crackbrained damage we
> > could still 'fix'.
> > 
> > And that wakeup is because we moved grace-period advancing into
> > kthreads, right?
> 
> Yep, in earlier kernels we would instead be doing raise_softirq().
> Which would instead wake up ksoftirqd, if I am reading the code
> correctly -- spin_lock_irq() does not affect preempt_count.

I suspect you got lost in the indirection fest; but have a look at
__raw_spin_lock_irqsave(). It does:

	local_irq_save();
	preempt_disable();

> > Probably; so the regular no-NOCB would be easy to work around by
> > providing me a call_rcu variant that never does the wakeup.
> 
> Well, if we can safely, sanely, and reliably defer the wakeup, there is
> no reason not to make plain old call_rcu() do what you need.

Agreed.

> If there
> is no such way to defer the wakeup, then I don't see how to make that
> variant.

Wouldn't it be a simple matter of making __call_rcu_core() return early,
just like it does for irqs_disabled_flags()?
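
For reference, the early return in question looks roughly like this in
the 3.12-era kernel/rcutree.c (paraphrased, not an exact quote):

	/*
	 * Near the top of __call_rcu_core(): if the caller had irqs
	 * disabled (or the CPU is offline), skip the extra core processing
	 * and let a later softirq/scheduling-clock pass pick it up.
	 */
	if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id()))
		return;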

> > NOCB might be a little more difficult; depending on the reason why it
> > needs to do this wakeup on every single invocation; that seems
> > particularly expensive.
> 
> Not on every single invocation, just on those invocations where the list
> is initially empty.  So the first call_rcu() on a CPU whose rcuo kthread
> is sleeping will do a wakeup, but subsequent call_rcu()s will just queue,
> at least until rcuo goes to sleep again.  Which takes awhile, since it
> has to wait for a grace period before invoking that first RCU callback.

So I've not kept up with RCU the last year or so due to circumstance, so
please bear with me ( http://www.youtube.com/watch?v=4sxtHODemi0 ). Why
do we still have a per-cpu kthread in nocb mode? The idea is that we do
not disturb the cpu, right? So I suppose these kthreads get to run on
another cpu.

Since it's running on another cpu, we get into atomics and memory barriers
anyway; so why not keep the logic the same as no-nocb but have another
cpu check our nocb cpu's state?

That is, I'm fumbling to understand how all this works and why it needs
to be different.


* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 18:52             ` Peter Zijlstra
@ 2013-10-04 21:25               ` Paul E. McKenney
  2013-10-04 22:02                 ` Paul E. McKenney
  2013-10-05 16:05                 ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-04 21:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 08:52:39PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2013 at 10:09:54AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 04, 2013 at 06:50:44PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> > > > The problem exists, but NOCB made it much more probable.  With non-NOCB
> > > > kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> > > > there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> > > > kernel, the wake_up() happens on the first callback.
> > > 
> > > Oh I see.. so I was hoping this was some NOCB crackbrained damage we
> > > could still 'fix'.
> > > 
> > > And that wakeup is because we moved grace-period advancing into
> > > kthreads, right?
> > 
> > Yep, in earlier kernels we would instead be doing raise_softirq().
> > Which would instead wake up ksoftirqd, if I am reading the code
> > correctly -- spin_lock_irq() does not affect preempt_count.
> 
> I suspect you got lost in the indirection fest; but have a look at
> __raw_spin_lock_irqsave(). It does:
> 
> 	local_irq_save();
> 	preempt_disable();

ETOOMANYLEVELS.  ;-)

OK, then I should be able to just do raise_softirq() in this
situation.  As long as irqs were disabled due to something other than
local_irq_disable(), but I would get around that by doing an explicit
preempt_disable() in call_rcu() or friends.

The real-time guys won't like that much -- they are trying to get rid
of softirq -- but hopefully we can come up with something better later.
Or maybe they have other tricks available to them.

> > > Probably; so the regular no-NOCB would be easy to work around by
> > > providing me a call_rcu variant that never does the wakeup.
> > 
> > Well, if we can safely, sanely, and reliably defer the wakeup, there is
> > no reason not to make plain old call_rcu() do what you need.
> 
> Agreed.
> 
> > If there
> > is no such way to defer the wakeup, then I don't see how to make that
> > variant.
> 
> Wouldn't it be a simple matter of making __call_rcu_core() return early,
> just like it does for irqs_disabled_flags()?

Given that raise_softirq() works, it just might be.  Except that I would
need to leave an indication that there are deferred callbacks and do
an invoke_rcu_core().  Plus make __rcu_process_callbacks() check this
indication and do the wakeup.

I should be able to use in_interrupt() despite its inaccuracy near
the beginnings and ends of interrupt handlers because all I really
care about is whether one of the scheduler or perf locks might be
held, and they will cause in_interrupt() to return true regardless.

> > > NOCB might be a little more difficult; depending on the reason why it
> > > needs to do this wakeup on every single invocation; that seems
> > > particularly expensive.
> > 
> > Not on every single invocation, just on those invocations where the list
> > is initially empty.  So the first call_rcu() on a CPU whose rcuo kthread
> > is sleeping will do a wakeup, but subsequent call_rcu()s will just queue,
> > at least until rcuo goes to sleep again.  Which takes awhile, since it
> > has to wait for a grace period before invoking that first RCU callback.
> 
> So I've not kept up with RCU the last year or so due to circumstance, so
> please bear with me ( http://www.youtube.com/watch?v=4sxtHODemi0 ).

I would, but the people sitting near me might be disturbed.  ;-)

>                                                                     Why
> do we still have a per-cpu kthread in nocb mode? The idea is that we do
> not disturb the cpu, right? So I suppose these kthreads get to run on
> another cpu.

Yep, the idea is that usermode figures out where to run them.  Even if
usermode doesn't do that, this has the effect of getting them to be
more out of the way of real-time tasks.

> Since its running on another cpu; we get into atomic and memory barriers
> anyway; so why not keep the logic the same as no-nocb but have another
> cpu check our nocb cpu's state.

You can do that today by setting rcu_nocb_poll, but that results in
frequent polling wakeups even when the system is completely idle, which
is out of the question for the battery-powered embedded guys.

> That is; I'm fumbling to understand how all this works and needs to be
> different.

Here is an untested patch.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a33a24d10611..1a0a14349b29 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2310,6 +2310,9 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	/* If there are callbacks ready, invoke them. */
 	if (cpu_has_callbacks_ready_to_invoke(rdp))
 		invoke_rcu_callbacks(rsp, rdp);
+
+	/* Do any needed deferred wakeups of rcuo kthreads. */
+	do_nocb_deferred_wakeup(rdp);
 }
 
 /*
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8e34d8674a4e..c20096cf8206 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -335,6 +335,7 @@ struct rcu_data {
 	int nocb_p_count_lazy;		/*  (approximate). */
 	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
 	struct task_struct *nocb_kthread;
+	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 	/* 8) RCU CPU stall data. */
@@ -553,6 +554,7 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
 			    bool lazy);
 static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 				      struct rcu_data *rdp);
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 747df70a4d64..7f54fb25a036 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2125,9 +2125,17 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 	}
 	len = atomic_long_read(&rdp->nocb_q_count);
 	if (old_rhpp == &rdp->nocb_head) {
-		wake_up(&rdp->nocb_wq); /* ... only if queue was empty ... */
+		if (!in_interrupt()) {
+			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    TPS("WakeEmpty"));
+		} else {
+			rdp->nocb_defer_wakeup = true;
+			invoke_rcu_core();
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    TPS("WakeEmptyIsDeferred"));
+		}
 		rdp->qlen_last_fqs_check = 0;
-		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeEmpty"));
 	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
 		wake_up_process(t); /* ... or if many callbacks queued. */
 		rdp->qlen_last_fqs_check = LONG_MAX / 2;
@@ -2314,6 +2322,16 @@ static int rcu_nocb_kthread(void *arg)
 	return 0;
 }
 
+/* Do a deferred wakeup of rcu_nocb_kthread(). */
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+{
+	if (!ACCESS_ONCE(rdp->nocb_defer_wakeup))
+		return;
+	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
+	wake_up(&rdp->nocb_wq);
+	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
+}
+
 /* Initialize per-rcu_data variables for no-CBs CPUs. */
 static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
@@ -2384,6 +2402,10 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 }
 
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+{
+}
+
 static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 {
 }



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 21:25               ` Paul E. McKenney
@ 2013-10-04 22:02                 ` Paul E. McKenney
  2013-10-05  0:23                   ` Paul E. McKenney
  2013-10-05 16:05                 ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-04 22:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 04, 2013 at 08:52:39PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 04, 2013 at 10:09:54AM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 04, 2013 at 06:50:44PM +0200, Peter Zijlstra wrote:
> > > > On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> > > > > The problem exists, but NOCB made it much more probable.  With non-NOCB
> > > > > kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> > > > > there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> > > > > kernel, the wake_up() happens on the first callback.
> > > > 
> > > > Oh I see.. so I was hoping this was some NOCB crackbrained damage we
> > > > could still 'fix'.
> > > > 
> > > > And that wakeup is because we moved grace-period advancing into
> > > > kthreads, right?
> > > 
> > > Yep, in earlier kernels we would instead be doing raise_softirq().
> > > Which would instead wake up ksoftirqd, if I am reading the code
> > > correctly -- spin_lock_irq() does not affect preempt_count.
> > 
> > I suspect you got lost in the indirection fest; but have a look at
> > __raw_spin_lock_irqsave(). It does:
> > 
> > 	local_irq_save();
> > 	preempt_disable();
> 
> ETOOMANYLEVELS.  ;-)
> 
> OK, then I should be able to just do raise_softirq() in this
> situation.  As long as irqs were disabled due to something other than
> local_irq_disable(), but I would get around that by doing an explicit
> preempt_disable() in call_rcu() or friends.
> 
> The real-time guys won't like that much -- they are trying to get rid
> of softirq -- but hopefully we can come up with something better later.
> Or maybe they have other tricks available to them.

But preempt_disable() won't set anything that in_interrupt() sees...
And it is a no-op in any case on CONFIG_PREEMPT=n kernels.

> > > > Probably; so the regular no-NOCB would be easy to work around by
> > > > providing me a call_rcu variant that never does the wakeup.
> > > 
> > > Well, if we can safely, sanely, and reliably defer the wakeup, there is
> > > no reason not to make plain old call_rcu() do what you need.
> > 
> > Agreed.
> > 
> > > If there
> > > is no such way to defer the wakeup, then I don't see how to make that
> > > variant.
> > 
> > Wouldn't it be a simple matter of making __call_rcu_core() return early,
> > just like it does for irqs_disabled_flags()?
> 
> Given that raise_softirq() works, it just might be.  Except that I would
> need to leave an indication that there are deferred callbacks and do
> an invoke_rcu_core().  Plus make __rcu_process_callbacks() check this
> indication and do the wakeup.
> 
> I should be able to use in_interrupt() despite its inaccuracy near
> the beginnings and ends of interrupt handlers because all I really
> care about is whether one of the scheduler or perf locks might be
> held, and they will cause in_interrupt() to return true regardless.

So I can't use in_interrupt(), because it doesn't see the bits that
preempt_disable() sets, even in kernels where preempt_disable isn't
a no-op.

I can instead use irqs_disabled_flags(), but then I am back to not
seeing how raise_softirq() is safe in non-irq-handler contexts where
the scheduler locks might be held.  The check of in_interrupt() only
looks at HARDIRQ_MASK, SOFTIRQ_MASK, and NMI_MASK, so it won't see
local_irq_disable() or spin_lock_irqsave() from process context.
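
For reference, the relevant definitions are roughly these (from the
3.12-era include/linux/hardirq.h, lightly abridged):

	/*
	 * in_interrupt() examines only the hardirq/softirq/NMI bits of
	 * preempt_count().  local_irq_save() touches none of them, and
	 * spin_lock_irqsave() bumps only the low PREEMPT bits via
	 * preempt_disable(), so neither is visible here from process
	 * context.
	 */
	#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
				 | NMI_MASK))
	#define in_interrupt()	(irq_count())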

This puts me back into the position of having to check for deferred
wakeups in many locations.

Back to the drawing board...

							Thanx, Paul



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 22:02                 ` Paul E. McKenney
@ 2013-10-05  0:23                   ` Paul E. McKenney
  2013-10-07 11:24                     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-05  0:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 03:02:32PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 04, 2013 at 08:52:39PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 04, 2013 at 10:09:54AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Oct 04, 2013 at 06:50:44PM +0200, Peter Zijlstra wrote:
> > > > > On Fri, Oct 04, 2013 at 09:03:52AM -0700, Paul E. McKenney wrote:
> > > > > > The problem exists, but NOCB made it much more probable.  With non-NOCB
> > > > > > kernels, an irq-disabled call_rcu() invocation does a wake_up() only if
> > > > > > there are more than 10,000 callbacks stacked up on the CPU.  With a NOCB
> > > > > > kernel, the wake_up() happens on the first callback.
> > > > > 
> > > > > Oh I see.. so I was hoping this was some NOCB crackbrained damage we
> > > > > could still 'fix'.
> > > > > 
> > > > > And that wakeup is because we moved grace-period advancing into
> > > > > kthreads, right?
> > > > 
> > > > Yep, in earlier kernels we would instead be doing raise_softirq().
> > > > Which would instead wake up ksoftirqd, if I am reading the code
> > > > correctly -- spin_lock_irq() does not affect preempt_count.
> > > 
> > > I suspect you got lost in the indirection fest; but have a look at
> > > __raw_spin_lock_irqsave(). It does:
> > > 
> > > 	local_irq_save();
> > > 	preempt_disable();
> > 
> > ETOOMANYLEVELS.  ;-)
> > 
> > OK, then I should be able to just do raise_softirq() in this
> > situation.  As long as irqs were disabled due to something other than
> > local_irq_disable(), but I would get around that by doing an explicit
> > preempt_disable() in call_rcu() or friends.
> > 
> > The real-time guys won't like that much -- they are trying to get rid
> > of softirq -- but hopefully we can come up with something better later.
> > Or maybe they have other tricks available to them.
> 
> But preempt_disable() won't set anything that in_interrupt() sees...
> And it is a no-op in any case on CONFIG_PREEMPT=n kernels.
> 
> > > > > Probably; so the regular no-NOCB would be easy to work around by
> > > > > providing me a call_rcu variant that never does the wakeup.
> > > > 
> > > > Well, if we can safely, sanely, and reliably defer the wakeup, there is
> > > > no reason not to make plain old call_rcu() do what you need.
> > > 
> > > Agreed.
> > > 
> > > > If there
> > > > is no such way to defer the wakeup, then I don't see how to make that
> > > > variant.
> > > 
> > > Wouldn't it be a simple matter of making __call_rcu_core() return early,
> > > just like it does for irqs_disabled_flags()?
> > 
> > Given that raise_softirq() works, it just might be.  Except that I would
> > need to leave an indication that there are deferred callbacks and do
> > an invoke_rcu_core().  Plus make __rcu_process_callbacks() check this
> > indication and do the wakeup.
> > 
> > I should be able to use in_interrupt() despite its inaccuracy near
> > the beginnings and ends of interrupt handlers because all I really
> > care about is whether one of the scheduler or perf locks might be
> > held, and they will cause in_interrupt() to return true regardless.
> 
> So I can't use in_interrupt(), because it doesn't see the bits that
> preempt_disable() sets, even in kernels where preempt_disable isn't
> a no-op.
> 
> I can instead use irqs_disabled_flags(), but then I am back to not
> seeing how raise_softirq() is safe in non-irq-handler contexts where
> the scheduler locks might be held.  The check of in_interrupt() only
> looks at HARDIRQ_MASK, SOFTIRQ_MASK, and NMI_MASK, so it won't see
> local_irq_disable() or spin_lock_irqsave() from process context.
> 
> This puts me back into the position of having to check for deferred
> wakeups in many locations.
> 
> Back to the drawing board...

OK, so I used whether or not irqs are disabled to defer the wakeup.
I will try testing this, but in the meantime...
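
A minimal sketch of the resulting test (the helper name here is
illustrative, not from the patch below; irqs_disabled_flags() inspects the
caller's saved flags word, so unlike in_interrupt() it does catch
spin_lock_irqsave() taken from process context):

	static bool rcu_wakeup_must_defer(unsigned long flags)
	{
		/*
		 * Conservative: true whenever the caller had irqs off,
		 * which covers every context in which the scheduler or
		 * perf locks can possibly be held.
		 */
		return irqs_disabled_flags(flags);
	}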

							Thanx, Paul

------------------------------------------------------------------------

rcu: Break call_rcu() deadlock involving scheduler and perf

Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
>         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
>         [<ffffffff81732052>] __schedule+0x1d2/0xa20
>         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
>         [<ffffffff817352b6>] retint_kernel+0x26/0x30
>         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
>         [<ffffffff813f0504>] pty_write+0x54/0x60
>         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
>         [<ffffffff813e5838>] tty_write+0x158/0x2d0
>         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
>         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
>         [<ffffffff81054336>] do_fork+0x126/0x460
>         [<ffffffff81054696>] kernel_thread+0x26/0x30
>         [<ffffffff8171ff93>] rest_init+0x23/0x140
>         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
>         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
>         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
>         [<ffffffff81097d62>] default_wake_function+0x12/0x20
>         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
>         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
>         [<ffffffff8108ff59>] __wake_up+0x39/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
>         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
>         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
>         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
>         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
>         [<ffffffff817200be>] kernel_init+0xe/0x190
>         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff8108ff43>] __wake_up+0x23/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>         [<ffffffff81149abf>] put_ctx+0x4f/0x70
>         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [<ffffffff8172a363>] dump_stack+0x4e/0x82
>  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
>  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
>  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
>  [<ffffffff810cc243>] lock_acquire+0x93/0x200
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8108ff43>] __wake_up+0x23/0x50
>  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [<ffffffff81111450>] __call_rcu+0x140/0x820
>  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
>  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>  [<ffffffff81149abf>] put_ctx+0x4f/0x70
>  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
    	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.
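
Condensed, the enqueue-side decision becomes the following sketch (not the
literal __call_rcu_nocb_enqueue() hunk below; "queue_was_empty" stands in
for the old_rhpp == &rdp->nocb_head test, and tracing is omitted):

	if (queue_was_empty) {
		if (!irqs_disabled_flags(flags))
			wake_up(&rdp->nocb_wq);		/* safe: scheduler locks cannot be held */
		else
			rdp->nocb_defer_wakeup = true;	/* redeem later from a safe context */
	}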

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing.  The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set.  Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!).  So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
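
In other words, a deferred wakeup is redeemed from three places, none of
which can be holding the scheduler or perf locks (a road map of the hunks
below, not literal code):

	__rcu_pending()			/* scheduling-clock interrupt: sees
					   ->nocb_defer_wakeup set and requests
					   RCU core processing */
	__rcu_process_callbacks()	/* RCU core: do_nocb_deferred_wakeup() */
	rcu_eqs_enter_common()		/* idle / NO_HZ_FULL user entry:
					   do_nocb_deferred_wakeup() for each
					   RCU flavor on this CPU */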

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index f3778f8952da..b8c3c813ea57 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -396,14 +396,14 @@ o	Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node
 
 The output of "cat rcu/rcu_sched/rcu_pending" looks as follows:
 
-  0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903
-  1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113
-  2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889
-  3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469
-  4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042
-  5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422
-  6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699
-  7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147
+  0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903 ndw=0
+  1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113 ndw=0
+  2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889 ndw=0
+  3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469 ndw=0
+  4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042 ndw=0
+  5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422 ndw=0
+  6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699 ndw=0
+  7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147 ndw=0
 
 The fields are as follows:
 
@@ -432,6 +432,10 @@ o	"gpc" is the number of times that an old grace period had
 o	"gps" is the number of times that a new grace period had started,
 	but this CPU was not yet aware of it.
 
+o	"ndw" is the number of times that a wakeup of an rcuo
+	callback-offload kthread had to be deferred in order to avoid
+	deadlock.
+
 o	"nn" is the number of times that this CPU needed nothing.
 
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a33a24d10611..106f7f5cdd1d 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -362,6 +362,9 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
 				bool user)
 {
+	struct rcu_state *rsp;
+	struct rcu_data *rdp;
+
 	trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
 	if (!user && !is_idle_task(current)) {
 		struct task_struct *idle __maybe_unused =
@@ -373,6 +376,10 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
 			  current->pid, current->comm,
 			  idle->pid, idle->comm); /* must be idle task! */
 	}
+	for_each_rcu_flavor(rsp) {
+		rdp = this_cpu_ptr(rsp->rda);
+		do_nocb_deferred_wakeup(rdp);
+	}
 	rcu_prepare_for_idle(smp_processor_id());
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
 	smp_mb__before_atomic_inc();  /* See above. */
@@ -1908,13 +1915,13 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
  * Adopt the RCU callbacks from the specified rcu_state structure's
  * orphanage.  The caller must hold the ->orphan_lock.
  */
-static void rcu_adopt_orphan_cbs(struct rcu_state *rsp)
+static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
 {
 	int i;
 	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
 
 	/* No-CBs CPUs are handled specially. */
-	if (rcu_nocb_adopt_orphan_cbs(rsp, rdp))
+	if (rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
 		return;
 
 	/* Do the accounting first. */
@@ -1993,7 +2000,7 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
 	rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
-	rcu_adopt_orphan_cbs(rsp);
+	rcu_adopt_orphan_cbs(rsp, flags);
 
 	/* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */
 	mask = rdp->grpmask;	/* rnp->grplo is constant. */
@@ -2310,6 +2317,9 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	/* If there are callbacks ready, invoke them. */
 	if (cpu_has_callbacks_ready_to_invoke(rdp))
 		invoke_rcu_callbacks(rsp, rdp);
+
+	/* Do any needed deferred wakeups of rcuo kthreads. */
+	do_nocb_deferred_wakeup(rdp);
 }
 
 /*
@@ -2444,7 +2454,7 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 
 		if (cpu != -1)
 			rdp = per_cpu_ptr(rsp->rda, cpu);
-		offline = !__call_rcu_nocb(rdp, head, lazy);
+		offline = !__call_rcu_nocb(rdp, head, lazy, flags);
 		WARN_ON_ONCE(offline);
 		/* _call_rcu() is illegal on offline CPU; leak the callback. */
 		local_irq_restore(flags);
@@ -2797,6 +2807,12 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 		return 1;
 	}
 
+	/* Does this CPU need a deferred NOCB wakeup? */
+	if (rcu_nocb_need_deferred_wakeup(rdp)) {
+		rdp->n_rp_nocb_defer_wakeup++;
+		return 1;
+	}
+
 	/* nothing to do */
 	rdp->n_rp_need_nothing++;
 	return 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8e34d8674a4e..a87adfc2916b 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -317,6 +317,7 @@ struct rcu_data {
 	unsigned long n_rp_cpu_needs_gp;
 	unsigned long n_rp_gp_completed;
 	unsigned long n_rp_gp_started;
+	unsigned long n_rp_nocb_defer_wakeup;
 	unsigned long n_rp_need_nothing;
 
 	/* 6) _rcu_barrier() and OOM callbacks. */
@@ -335,6 +336,7 @@ struct rcu_data {
 	int nocb_p_count_lazy;		/*  (approximate). */
 	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
 	struct task_struct *nocb_kthread;
+	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 	/* 8) RCU CPU stall data. */
@@ -550,9 +552,12 @@ static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy);
+			    bool lazy, unsigned long flags);
 static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
-				      struct rcu_data *rdp);
+				      struct rcu_data *rdp,
+				      unsigned long flags);
+static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 747df70a4d64..d42e193c44c0 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2104,7 +2104,8 @@ bool rcu_is_nocb_cpu(int cpu)
 static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 				    struct rcu_head *rhp,
 				    struct rcu_head **rhtp,
-				    int rhcount, int rhcount_lazy)
+				    int rhcount, int rhcount_lazy,
+				    unsigned long flags)
 {
 	int len;
 	struct rcu_head **old_rhpp;
@@ -2125,9 +2126,16 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 	}
 	len = atomic_long_read(&rdp->nocb_q_count);
 	if (old_rhpp == &rdp->nocb_head) {
-		wake_up(&rdp->nocb_wq); /* ... only if queue was empty ... */
+		if (!irqs_disabled_flags(flags)) {
+			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    TPS("WakeEmpty"));
+		} else {
+			rdp->nocb_defer_wakeup = true;
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    TPS("WakeEmptyIsDeferred"));
+		}
 		rdp->qlen_last_fqs_check = 0;
-		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeEmpty"));
 	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
 		wake_up_process(t); /* ... or if many callbacks queued. */
 		rdp->qlen_last_fqs_check = LONG_MAX / 2;
@@ -2148,12 +2156,12 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
  * "rcuo" kthread can find it.
  */
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy)
+			    bool lazy, unsigned long flags)
 {
 
 	if (!rcu_is_nocb_cpu(rdp->cpu))
 		return 0;
-	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
+	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy, flags);
 	if (__is_kfree_rcu_offset((unsigned long)rhp->func))
 		trace_rcu_kfree_callback(rdp->rsp->name, rhp,
 					 (unsigned long)rhp->func,
@@ -2171,7 +2179,8 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
  * not a no-CBs CPU.
  */
 static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
-						     struct rcu_data *rdp)
+						     struct rcu_data *rdp,
+						     unsigned long flags)
 {
 	long ql = rsp->qlen;
 	long qll = rsp->qlen_lazy;
@@ -2185,14 +2194,14 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 	/* First, enqueue the donelist, if any.  This preserves CB ordering. */
 	if (rsp->orphan_donelist != NULL) {
 		__call_rcu_nocb_enqueue(rdp, rsp->orphan_donelist,
-					rsp->orphan_donetail, ql, qll);
+					rsp->orphan_donetail, ql, qll, flags);
 		ql = qll = 0;
 		rsp->orphan_donelist = NULL;
 		rsp->orphan_donetail = &rsp->orphan_donelist;
 	}
 	if (rsp->orphan_nxtlist != NULL) {
 		__call_rcu_nocb_enqueue(rdp, rsp->orphan_nxtlist,
-					rsp->orphan_nxttail, ql, qll);
+					rsp->orphan_nxttail, ql, qll, flags);
 		ql = qll = 0;
 		rsp->orphan_nxtlist = NULL;
 		rsp->orphan_nxttail = &rsp->orphan_nxtlist;
@@ -2314,6 +2323,22 @@ static int rcu_nocb_kthread(void *arg)
 	return 0;
 }
 
+/* Is a deferred wakeup of rcu_nocb_kthread() required? */
+static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+{
+	return ACCESS_ONCE(rdp->nocb_defer_wakeup);
+}
+
+/* Do a deferred wakeup of rcu_nocb_kthread(). */
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+{
+	if (rcu_nocb_need_deferred_wakeup(rdp))
+		return;
+	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
+	wake_up(&rdp->nocb_wq);
+	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
+}
+
 /* Initialize per-rcu_data variables for no-CBs CPUs. */
 static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
@@ -2369,13 +2394,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy)
+			    bool lazy, unsigned long flags)
 {
 	return 0;
 }
 
 static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
-						     struct rcu_data *rdp)
+						     struct rcu_data *rdp,
+						     unsigned long flags)
 {
 	return 0;
 }
@@ -2384,6 +2410,15 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 }
 
+static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+{
+	return false;
+}
+
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+{
+}
+
 static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 {
 }
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index cf6c17412932..77a6831b9c51 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -364,9 +364,10 @@ static void print_one_rcu_pending(struct seq_file *m, struct rcu_data *rdp)
 		   rdp->n_rp_report_qs,
 		   rdp->n_rp_cb_ready,
 		   rdp->n_rp_cpu_needs_gp);
-	seq_printf(m, "gpc=%ld gps=%ld nn=%ld\n",
+	seq_printf(m, "gpc=%ld gps=%ld ndw=%ld nn=%ld\n",
 		   rdp->n_rp_gp_completed,
 		   rdp->n_rp_gp_started,
+		   rdp->n_rp_nocb_defer_wakeup,
 		   rdp->n_rp_need_nothing);
 }
 



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-04 21:25               ` Paul E. McKenney
  2013-10-04 22:02                 ` Paul E. McKenney
@ 2013-10-05 16:05                 ` Peter Zijlstra
  2013-10-05 16:28                   ` Paul E. McKenney
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-05 16:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> >                                                                     Why
> > do we still have a per-cpu kthread in nocb mode? The idea is that we do
> > not disturb the cpu, right? So I suppose these kthreads get to run on
> > another cpu.
> 
> Yep, the idea is that usermode figures out where to run them.  Even if
> usermode doesn't do that, this has the effect of getting them to be
> more out of the way of real-time tasks.
> 
> > Since its running on another cpu; we get into atomic and memory barriers
> > anyway; so why not keep the logic the same as no-nocb but have another
> > cpu check our nocb cpu's state.
> 
> You can do that today by setting rcu_nocb_poll, but that results in
> frequent polling wakeups even when the system is completely idle, which
> is out of the question for the battery-powered embedded guys.

So its this polling I don't get.. why is the different behaviour
required? And why would you continue polling if the cpus were actually
idle.

Is there some confusion between the nr_running==1 extended quiescent
state and the nr_running==0 extended quiescent state?

Now, none of this solves the issue at hand because even the 'regular'
no-nocb rcu mode has this issue of needing to wake kthreads, but I'd
like to get a better understanding of why nocb mode is as it is.


I've seen you've since sent a few more emails; I might find some of the
answers in there. Let me go read them :-)


* Re: tty^Wrcu/perf lockdep trace.
  2013-10-05 16:05                 ` Peter Zijlstra
@ 2013-10-05 16:28                   ` Paul E. McKenney
  2013-10-05 19:59                     ` Peter Zijlstra
  2013-10-07 17:35                     ` Paul E. McKenney
  0 siblings, 2 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-05 16:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Sat, Oct 05, 2013 at 06:05:11PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> > >                                                                     Why
> > > do we still have a per-cpu kthread in nocb mode? The idea is that we do
> > > not disturb the cpu, right? So I suppose these kthreads get to run on
> > > another cpu.
> > 
> > Yep, the idea is that usermode figures out where to run them.  Even if
> > usermode doesn't do that, this has the effect of getting them to be
> > more out of the way of real-time tasks.
> > 
> > > Since its running on another cpu; we get into atomic and memory barriers
> > > anyway; so why not keep the logic the same as no-nocb but have another
> > > cpu check our nocb cpu's state.
> > 
> > You can do that today by setting rcu_nocb_poll, but that results in
> > frequent polling wakeups even when the system is completely idle, which
> > is out of the question for the battery-powered embedded guys.
> 
> So its this polling I don't get.. why is the different behaviour
> required? And why would you continue polling if the cpus were actually
> idle.

The idea is to offload the overhead of doing the wakeup from (say)
a real-time thread/CPU onto some housekeeping CPU.

> Is there some confusion between the nr_running==1 extended quiescent
> state and the nr_running==0 extended quiescent state?

This is independent of the nr_running=1 extended quiescent state.  The
wakeups only happen when running in the kernel.  That said, a real-time
thread might want both rcu_nocb_poll=y and CONFIG_NO_HZ_FULL=y.

> Now, none of this solves the issue at hand because even the 'regular'
> no-nocb rcu mode has this issue of needing to wake kthreads, but I'd
> like to get a better understanding of why nocb mode is as it is.
> 
> 
> I've seen you've since sent a few more emails; I might find some of the
> answers in there. Let me go read them :-)

I -think- I have solved it, but much testing and review will of course
be required.  And fixing last night's test failures...

							Thanx, Paul



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-05 16:28                   ` Paul E. McKenney
@ 2013-10-05 19:59                     ` Peter Zijlstra
  2013-10-05 22:03                       ` Paul E. McKenney
  2013-10-07 17:35                     ` Paul E. McKenney
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-05 19:59 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Sat, Oct 05, 2013 at 09:28:02AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 05, 2013 at 06:05:11PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> > > >                                                                     Why
> > > > do we still have a per-cpu kthread in nocb mode? The idea is that we do
> > > > not disturb the cpu, right? So I suppose these kthreads get to run on
> > > > another cpu.
> > > 
> > > Yep, the idea is that usermode figures out where to run them.  Even if
> > > usermode doesn't do that, this has the effect of getting them to be
> > > more out of the way of real-time tasks.
> > > 
> > > > Since its running on another cpu; we get into atomic and memory barriers
> > > > anyway; so why not keep the logic the same as no-nocb but have another
> > > > cpu check our nocb cpu's state.
> > > 
> > > You can do that today by setting rcu_nocb_poll, but that results in
> > > frequent polling wakeups even when the system is completely idle, which
> > > is out of the question for the battery-powered embedded guys.
> > 
> > So its this polling I don't get.. why is the different behaviour
> > required? And why would you continue polling if the cpus were actually
> > idle.
> 
> The idea is to offload the overhead of doing the wakeup from (say)
> a real-time thread/CPU onto some housekeeping CPU.

Sure I get that that is the idea; what I don't get is why it needs to
behave differently depending on NOCB.

Why does a NOCB thingy need to wake up the kthread far more often?

> > Is there some confusion between the nr_running==1 extended quiescent
> > state and the nr_running==0 extended quiescent state?
> 
> This is independent of the nr_running=1 extended quiescent state.  The
> wakeups only happen when running in the kernel.  That said, a real-time
> thread might want both rcu_nocb_poll=y and CONFIG_NO_HZ_FULL=y.

So there's 3 behaviours?

 - CONFIG_NO_HZ_FULL=n
 - CONFIG_NO_HZ_FULL=y, rcu_nocb_poll=n
 - CONFIG_NO_HZ_FULL=y, rcu_nocb_poll=y

What I'm trying to understand is why do all those things behave
differently? For all 3 configs there's kthreads that do the GP advancing
and can run on different cpus.

And why does rcu_nocb_poll=y need to be terrible for power usage; surely
we know when cpus are actually idle and can stop polling them.


* Re: tty^Wrcu/perf lockdep trace.
  2013-10-05 19:59                     ` Peter Zijlstra
@ 2013-10-05 22:03                       ` Paul E. McKenney
  2013-10-07  8:42                         ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-05 22:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Sat, Oct 05, 2013 at 09:59:49PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 05, 2013 at 09:28:02AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 05, 2013 at 06:05:11PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> > > > >                                                                     Why
> > > > > do we still have a per-cpu kthread in nocb mode? The idea is that we do
> > > > > not disturb the cpu, right? So I suppose these kthreads get to run on
> > > > > another cpu.
> > > > 
> > > > Yep, the idea is that usermode figures out where to run them.  Even if
> > > > usermode doesn't do that, this has the effect of getting them to be
> > > > more out of the way of real-time tasks.
> > > > 
> > > > > Since its running on another cpu; we get into atomic and memory barriers
> > > > > anyway; so why not keep the logic the same as no-nocb but have another
> > > > > cpu check our nocb cpu's state.
> > > > 
> > > > You can do that today by setting rcu_nocb_poll, but that results in
> > > > frequent polling wakeups even when the system is completely idle, which
> > > > is out of the question for the battery-powered embedded guys.
> > > 
> > > So its this polling I don't get.. why is the different behaviour
> > > required? And why would you continue polling if the cpus were actually
> > > idle.
> > 
> > The idea is to offload the overhead of doing the wakeup from (say)
> > a real-time thread/CPU onto some housekeeping CPU.
> 
> Sure I get that that is the idea; what I don't get is why it needs to
> behave differently depending on NOCB.
> 
> Why does a NOCB thingy need to wake up the kthread far more often?

A better question would be "Why do NOCB wakeups have higher risk of
hitting RCU/sched/perf deadlocks?"

In the !NOCB case, we rely on the scheduling-clock interrupt handler and
transition to idle to do the wakeups of ksoftirqd.  These two environments
cannot possibly be holding scheduler or perf locks, so there is no risk
of deadlock in the common case.

Now there would be risk of deadlock in the !NOCB uncommon case where a
huge burst of call_rcu() invocations has caused more than 10,000 callbacks
to pile up on a single CPU -- in that case, call_rcu() will do the wakeup.
But only if interrupts are enabled, which again avoids the deadlock.

This deferral is OK because we are guaranteed that one of the following
things will eventually happen:

1.	Another call_rcu() arrives with more than 10,000 callbacks on
	the CPU, but when interrupts are enabled.

2.	In NO_HZ_PERIODIC kernels, or in workloads that remain in the
	kernel for a very long time, we eventually take a scheduling-clock
	interrupt.

3.	In !RCU_FAST_NO_HZ kernels, on attempted transition to an
	idle-RCU state (idle and, for NO_HZ_FULL, userspace with
	only one task runnable), rcu_needs_cpu() will refuse the
	request to turn off the scheduling-clock interrupt, again
	eventually taking a scheduling-clock interrupt.

4.	In RCU_FAST_NO_HZ kernels, on transition to an idle-RCU
	state, we advance the new callbacks and inform the RCU
	core of their existence.

But none of these currently apply in the NOCB case.  Instead, the idea
is to wake up the corresponding rcuo kthread when the first callback
arrives at an initially empty per-CPU list.  Wakeups are omitted if the
list already has callbacks in it.  Given that this produces a wakeup at
most once per grace period (several milliseconds at the very least),
I wasn't worried about it from a performance/scalability viewpoint.
Unfortunately, we have a deadlock issue.
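
Side by side, the two enqueue-time policies look roughly like this (an
illustrative sketch, not the literal code):

	/* !NOCB: only a large backlog provokes a wakeup from call_rcu()
	 * itself, and never with interrupts disabled: */
	if (!irqs_disabled_flags(flags) &&
	    rdp->qlen > rdp->qlen_last_fqs_check + qhimark)
		raise_softirq(RCU_SOFTIRQ);	/* may wake ksoftirqd */

	/* NOCB: the first callback landing on an empty list wakes the
	 * rcuo kthread no matter what locks the caller holds: */
	if (old_rhpp == &rdp->nocb_head)
		wake_up(&rdp->nocb_wq);		/* -> scheduler -> possible deadlock */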

My patch attempts to resolve this by moving the wakeup to the RCU core,
which is invoked by the scheduling-clock interrupt, and to the idle-entry
code.

> > > Is there some confusion between the nr_running==1 extended quiescent
> > > state and the nr_running==0 extended quiescent state?
> > 
> > This is independent of the nr_running=1 extended quiescent state.  The
> > wakeups only happen when running in the kernel.  That said, a real-time
> > thread might want both rcu_nocb_poll=y and CONFIG_NO_HZ_FULL=y.
> 
> So there's 3 behaviours?
> 
>  - CONFIG_NO_HZ_FULL=n
>  - CONFIG_NO_HZ_FULL=y, rcu_nocb_poll=n
>  - CONFIG_NO_HZ_FULL=y, rcu_nocb_poll=y

More like the following:

CONFIG_RCU_NOCB_CPU=n: Old behavior.
CONFIG_RCU_NOCB_CPU=y, rcu_nocb_poll=n: Wakeup deadlocks.
CONFIG_RCU_NOCB_CPU=y, rcu_nocb_poll=y: rcuo kthread periodically polls,
	so no wakeups and (presumably) no deadlocks.
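
For concreteness (the CPU list below is just an example; rcu_nocbs= and
rcu_nocb_poll are boot parameters):

	CONFIG_RCU_NOCB_CPU=n					# old behavior
	CONFIG_RCU_NOCB_CPU=y + rcu_nocbs=1-7			# offloaded, call_rcu() wakes rcuo
	CONFIG_RCU_NOCB_CPU=y + rcu_nocbs=1-7 rcu_nocb_poll	# offloaded, rcuo kthreads poll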

> What I'm trying to understand is why do all those things behave
> differently? For all 3 configs there's kthreads that do the GP advancing
> and can run on different cpus.

Yes, there are always the RCU grace-period kthreads, but these are
never awakened by call_rcu().  Instead, call_rcu() awakens either the
ksoftirqd kthread (old behavior) or the rcuo callback-offload kthread
(CONFIG_RCU_NOCB_CPU=y, rcu_nocb_poll=n behavior), in both cases, the
kthread for the current CPU.

The old ksoftirqd wakeup is avoided if interrupts were disabled on entry
to call_rcu(), as you noted earlier, which avoids the deadlock.  The basic
idea of the patch is to do something similar in the CONFIG_RCU_NOCB_CPU=y,
rcu_nocb_poll=n case.

> And why does rcu_nocb_poll=y need to be terrible for power usage; surely
> we know when cpus are actually idle and can stop polling them.

In theory, we could do that.  But in practice, what would wake us up
when the CPUs go non-idle?

1.	We could do a wakeup on the idle-to-non-idle transition.  That
	would increase idle-to-non-idle latency, defeating the purpose
	of rcu_nocb_poll=y.  Plus there are workloads that enter and
	exit idle extremely quickly, which would not be good for either
	performance, scalability, or energy efficiency.

2.	We could have some other thread poll all the CPUs for activity,
	for example, the RCU grace-period kthreads.  This might actually
	work, but there are some really ugly races involving CPUs becoming
	active just long enough to post a callback, going to sleep,
	with no other RCU activity in the system.  This could easily
	result in a system hang.

3.	We could post a timeout to check for the corresponding CPU
	being idle, but that just transfers the wakeups from idle from
	the rcuo kthreads to the other CPUs.

4.	I remove rcu_nocb_poll and see if anyone complains.  That doesn't
	solve the deadlock problem, but it does simplify RCU a bit.  ;-)

Other thoughts?

							Thanx, Paul



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-05 22:03                       ` Paul E. McKenney
@ 2013-10-07  8:42                         ` Peter Zijlstra
  2013-10-07 13:11                           ` Paul E. McKenney
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-07  8:42 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Sat, Oct 05, 2013 at 03:03:11PM -0700, Paul E. McKenney wrote:
> In theory, we could do that.  But in practice, what would wake us up
> when the CPUs go non-idle?
> 
> 1.	We could do a wakeup on the idle-to-non-idle transition.  That
> 	would increase idle-to-non-idle latency, defeating the purpose
> 	of rcu_nocb_poll=y.  Plus there are workloads that enter and
> 	exit idle extremely quickly, which would not be good for either
> 	performance, scalability, or energy efficiency.
> 
> 2.	We could have some other thread poll all the CPUs for activity,
> 	for example, the RCU grace-period kthreads.  This might actually
> 	work, but there are some really ugly races involving CPUs becoming
> 	active just long enough to post a callback, going to sleep,
> 	with no other RCU activity in the system.  This could easily
> 	result in a system hang.
> 
> 3.	We could post a timeout to check for the corresponding CPU
> 	being idle, but that just transfers the wakeups from idle from
> 	the rcuo kthreads to the other CPUs.
> 
> 4.	I remove rcu_nocb_poll and see if anyone complains.  That doesn't
> 	solve the deadlock problem, but it does simplify RCU a bit.  ;-)
> 
> Other thoughts?

So we already move all the nocb rcuo threads over to the timekeeping
cpu, right? Giving you n threads to wake and/or poll and that's
expensive.

So why doesn't the time-keeping cpu, which is awake when at least one of
the nocb cpus is awake, poll the nocb cpus' call lists?

Arguably you don't want to do that from the old scheduler tick interrupt
or softirq context thingy, but from a kthread instead -- and you've already
got all that around.

At that point; you've got a single kthread periodically being woken by
the scheduler timer interrupt -- which still goes away when the entire
machine goes idle -- which would do something like:


  for_each_cpu(cpu, nocb_cpus_mask) {
  	if (!list_empty_careful(&per_cpu(rcu_state, cpu)->callbacks))
		advance_cpu_callbacks(cpu);
  }


That fully preserves the !NOCB state of affairs while also dealing with
the NOCB stuff. And the single remote read only gets really expensive
once you go _very_ large or once the cpu in question actually touched
the cacheline and moved it into exclusive mode due to writing to it; at
which point you've saved yourself a wakeup and we're still faster.

It automatically deals with the full idle case, it basically gives you
'poll' behaviour for nr_running==1 and to me appears as the simplest and
most straight fwd extension of the RCU model.

More importantly it does away with that wakeup that so often happens on
nocb cpus. Although, rereading your email, I get the impression we do
this wakeup even on !nocb cpus when CONFIG_NOCB=y, which seems another
undesired feature.


Maybe you've already thought of this and there's a very good reason
things aren't like this; but like I said, I've been away for a little
while and need to catch up a bit.



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-05  0:23                   ` Paul E. McKenney
@ 2013-10-07 11:24                     ` Peter Zijlstra
  2013-10-07 12:59                       ` Paul E. McKenney
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2013-10-07 11:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Fri, Oct 04, 2013 at 05:23:48PM -0700, Paul E. McKenney wrote:
> The underlying problem is that perf is invoking call_rcu() with the
> scheduler locks held, but in NOCB mode, call_rcu() will with high
> probability invoke the scheduler -- which just might want to use its
> locks.  The reason that call_rcu() needs to invoke the scheduler is
> to wake up the corresponding rcuo callback-offload kthread, which
> does the job of starting up a grace period and invoking the callbacks
> afterwards.
> 
> One solution (championed on a related problem by Lai Jiangshan) is to

That's rcu_read_unlock_special(), right? 

> simply defer the wakeup to some point where scheduler locks are no longer
> held.  Since we don't want to unnecessarily incur the cost of such
> deferral, the task before us is threefold:
> 
> 1.	Determine when it is likely that a relevant scheduler lock is held.
> 
> 2.	Defer the wakeup in such cases.
> 
> 3.	Ensure that all deferred wakeups eventually happen, preferably
>     	sooner rather than later.
> 
> We use irqs_disabled_flags() as a proxy for relevant scheduler locks
> being held.  This works because the relevant locks are always acquired
> with interrupts disabled.  We may defer more often than needed, but that
> is at least safe.

Fair enough; do you feel the need for something more specific?

> The wakeup deferral is tracked via a new field in the per-CPU and
> per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
> 
> This flag is checked by the RCU core processing.  The __rcu_pending()
> function now checks this flag, which causes rcu_check_callbacks()
> to initiate RCU core processing at each scheduling-clock interrupt
> where this flag is set.  Of course this is not sufficient because
> scheduling-clock interrupts are often turned off (the things we used to
> be able to count on!).  So the flags are also checked on entry to any
> state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
> state and NO_HZ_FULL user-mode-execution state.

So RCU doesn't current differentiate between EQS for nr_running==1 and
nr_running==0?

> This approach should allow call_rcu() to be invoked regardless of what
> locks you might be holding, the key word being "should".

Agreed. Except it looks like you've inverted the deferred wakeup
condition :-)

> @@ -2314,6 +2323,22 @@ static int rcu_nocb_kthread(void *arg)
>  	return 0;
>  }
>  
> +/* Is a deferred wakeup of rcu_nocb_kthread() required? */
> +static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
> +{
> +	return ACCESS_ONCE(rdp->nocb_defer_wakeup);
> +}
> +
> +/* Do a deferred wakeup of rcu_nocb_kthread(). */
> +static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
> +{
> +	if (rcu_nocb_need_deferred_wakeup(rdp))

	!rcu_nocb_need_deferred_wakeup() ?

> +		return;
> +	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
> +	wake_up(&rdp->nocb_wq);
> +	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
> +}
> +
>  /* Initialize per-rcu_data variables for no-CBs CPUs. */
>  static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  {



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-07 11:24                     ` Peter Zijlstra
@ 2013-10-07 12:59                       ` Paul E. McKenney
  0 siblings, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-07 12:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Mon, Oct 07, 2013 at 01:24:21PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2013 at 05:23:48PM -0700, Paul E. McKenney wrote:
> > The underlying problem is that perf is invoking call_rcu() with the
> > scheduler locks held, but in NOCB mode, call_rcu() will with high
> > probability invoke the scheduler -- which just might want to use its
> > locks.  The reason that call_rcu() needs to invoke the scheduler is
> > to wake up the corresponding rcuo callback-offload kthread, which
> > does the job of starting up a grace period and invoking the callbacks
> > afterwards.
> > 
> > One solution (championed on a related problem by Lai Jiangshan) is to
> 
> That's rcu_read_unlock_special(), right? 

Yep, different location, same general approach.

> > simply defer the wakeup to some point where scheduler locks are no longer
> > held.  Since we don't want to unnecessarily incur the cost of such
> > deferral, the task before us is threefold:
> > 
> > 1.	Determine when it is likely that a relevant scheduler lock is held.
> > 
> > 2.	Defer the wakeup in such cases.
> > 
> > 3.	Ensure that all deferred wakeups eventually happen, preferably
> >     	sooner rather than later.
> > 
> > We use irqs_disabled_flags() as a proxy for relevant scheduler locks
> > being held.  This works because the relevant locks are always acquired
> > with interrupts disabled.  We may defer more often than needed, but that
> > is at least safe.
> 
> Fair enough; do you feel the need for something more specific?

Maybe...

One advantage of the current approach is that it deals with any possible
deadlock.  If we were to focus on specific locks, my fear is that we
would be repeatedly chasing this sort of issue as new locks were added.

> > The wakeup deferral is tracked via a new field in the per-CPU and
> > per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
> > 
> > This flag is checked by the RCU core processing.  The __rcu_pending()
> > function now checks this flag, which causes rcu_check_callbacks()
> > to initiate RCU core processing at each scheduling-clock interrupt
> > where this flag is set.  Of course this is not sufficient because
> > scheduling-clock interrupts are often turned off (the things we used to
> > be able to count on!).  So the flags are also checked on entry to any
> > state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
> > state and NO_HZ_FULL user-mode-execution state.
> 
> So RCU doesn't current differentiate between EQS for nr_running==1 and
> nr_running==0?

Between usermode nr_running==1 and idle, you mean?

It has separate APIs for these two cases, but they both call the same
functions.  So yes, I can put a check in one place and catch both of
these two cases.
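
Concretely, both entry paths funnel into the same helper, which is where
the deferred-wakeup check sits (sketch of the call chains, assuming the
3.12 naming):

	rcu_idle_enter() -> rcu_eqs_enter() -> rcu_eqs_enter_common()
	rcu_user_enter() -> rcu_eqs_enter() -> rcu_eqs_enter_common()
						/* do_nocb_deferred_wakeup() here */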

> > This approach should allow call_rcu() to be invoked regardless of what
> > locks you might be holding, the key word being "should".
> 
> Agreed. Except it looks like you've inverted the deferred wakeup
> condition :-)

Good point -- will send out updated patch.

/me wonders what else he messed up...

							Thanx, Paul

> > @@ -2314,6 +2323,22 @@ static int rcu_nocb_kthread(void *arg)
> >  	return 0;
> >  }
> >  
> > +/* Is a deferred wakeup of rcu_nocb_kthread() required? */
> > +static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
> > +{
> > +	return ACCESS_ONCE(rdp->nocb_defer_wakeup);
> > +}
> > +
> > +/* Do a deferred wakeup of rcu_nocb_kthread(). */
> > +static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
> > +{
> > +	if (rcu_nocb_need_deferred_wakeup(rdp))
> 
> 	!rcu_nocb_need_deferred_wakeup() ?
> 
> > +		return;
> > +	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
> > +	wake_up(&rdp->nocb_wq);
> > +	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
> > +}
> > +
> >  /* Initialize per-rcu_data variables for no-CBs CPUs. */
> >  static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
> >  {
> 



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-07  8:42                         ` Peter Zijlstra
@ 2013-10-07 13:11                           ` Paul E. McKenney
  0 siblings, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-07 13:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Mon, Oct 07, 2013 at 10:42:39AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 05, 2013 at 03:03:11PM -0700, Paul E. McKenney wrote:
> > In theory, we could do that.  But in practice, what would wake us up
> > when the CPUs go non-idle?
> > 
> > 1.	We could do a wakeup on the idle-to-non-idle transition.  That
> > 	would increase idle-to-non-idle latency, defeating the purpose
> > 	of rcu_nocb_poll=y.  Plus there are workloads that enter and
> > 	exit idle extremely quickly, which would not be good for either
> > 	performance, scalability, or energy efficiency.
> > 
> > 2.	We could have some other thread poll all the CPUs for activity,
> > 	for example, the RCU grace-period kthreads.  This might actually
> > 	work, but there are some really ugly races involving CPUs becoming
> > 	active just long enough to post a callback, going to sleep,
> > 	with no other RCU activity in the system.  This could easily
> > 	result in a system hang.
> > 
> > 3.	We could post a timeout to check for the corresponding CPU
> > 	being idle, but that just transfers the wakeups from idle from
> > 	the rcuo kthreads to the other CPUs.
> > 
> > 4.	I remove rcu_nocb_poll and see if anyone complains.  That doesn't
> > 	solve the deadlock problem, but it does simplify RCU a bit.  ;-)
> > 
> > Other thoughts?
> 
> So we already move all the nocb rcuo threads over to the timekeeping
> cpu, right? Giving you n threads to wake and/or poll and that's
> expensive.

I don't pin the rcuo threads anywhere, though I would expect people
to move them to some set of housekeeping CPUs, the timekeeping CPU
being a good candidate.

> So why doesn't the time-keeping cpu, which is awake when at least one of
> the nocb cpus is awake, poll the nocb cpus' call lists?

If !NO_HZ_FULL, there won't be a timekeeping CPU as such, if I remember
correctly.

> Arguably you don't want to do that from the old scheduler tick interrupt
> or softirq context thingy, but from a kthread instead -- and you've already
> got all that around.

The polling happens in the grace-period kthread, but it is not guaranteed
to be happening unless NO_HZ_FULL_SYSIDLE, in which case the system
will generate artificial grace periods as needed to make the required
polling happen.  On the other hand, if !NO_HZ_FULL_SYSIDLE, there will
not be any polling if there is no RCU update activity.

> At that point; you've got a single kthread periodically being woken by
> the scheduler timer interrupt -- which still goes away when the entire
> machine goes idle -- which would do something like:
> 
> 
>   for_each_cpu(cpu, nocb_cpus_mask) {
>   	if (!list_empty_careful(&per_cpu(rcu_state, cpu)->callbacks))
> 		advance_cpu_callbacks(cpu);
>   }
> 
> 
> That fully preserves the !NOCB state of affairs while also dealing with
> the NOCB stuff. And the single remote read only gets really expensive
> once you go _very_ large or once the cpu in question actually touched
> the cacheline and moved it into exclusive mode due to writing to it; at
> which point you've saved yourself a wakeup and we're still faster.
> 
> It automatically deals with the full idle case, it basically gives you
> 'poll' behaviour for nr_running==1 and to me appears as the simplest and
> most straight fwd extension of the RCU model.
> 
> More importantly it does away with that wakeup that so often happens on
> nocb cpus. Although, rereading your email, I get the impression we do
> this wakeup even on !nocb cpus when CONFIG_NOCB=y, which seems another
> undesired feature.

The __call_rcu_nocb_enqueue() wakeup happens only when CONFIG_NOCB=y,
and even then only on CPUs that have actually been offloaded.
Now my patch does the checking even on non-offloaded CPUs, but this
still only happens on CONFIG_NOCB=y and is only a check of a
per-CPU variable.

The other wakeups in __call_rcu_core() only happen in special cases,
which I believe avoid this deadlock condition.

> Maybe you've already thought of this and there's a very good reason
> things aren't like this; but like I said, I've been away for a little
> while and need to catch up a bit.

From what I can see, what you suggest would work quite well in special
cases, but I still have to solve the general case.  If I solve the
general case, I don't believe I need to work on the special cases.

							Thanx, Paul



* Re: tty^Wrcu/perf lockdep trace.
  2013-10-05 16:28                   ` Paul E. McKenney
  2013-10-05 19:59                     ` Peter Zijlstra
@ 2013-10-07 17:35                     ` Paul E. McKenney
  1 sibling, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2013-10-07 17:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, gregkh, peter

On Sat, Oct 05, 2013 at 09:28:02AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 05, 2013 at 06:05:11PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> > > >                                                                     Why
> > > > do we still have a per-cpu kthread in nocb mode? The idea is that we do
> > > > not disturb the cpu, right? So I suppose these kthreads get to run on
> > > > another cpu.
> > > 
> > > Yep, the idea is that usermode figures out where to run them.  Even if
> > > usermode doesn't do that, this has the effect of getting them to be
> > > more out of the way of real-time tasks.
> > > 
> > > > Since its running on another cpu; we get into atomic and memory barriers
> > > > anyway; so why not keep the logic the same as no-nocb but have another
> > > > cpu check our nocb cpu's state.
> > > 
> > > You can do that today by setting rcu_nocb_poll, but that results in
> > > frequent polling wakeups even when the system is completely idle, which
> > > is out of the question for the battery-powered embedded guys.
> > 
> > So its this polling I don't get.. why is the different behaviour
> > required? And why would you continue polling if the cpus were actually
> > idle.
> 
> The idea is to offload the overhead of doing the wakeup from (say)
> a real-time thread/CPU onto some housekeeping CPU.
> 
> > Is there some confusion between the nr_running==1 extended quiescent
> > state and the nr_running==0 extended quiescent state?
> 
> This is independent of the nr_running=1 extended quiescent state.  The
> wakeups only happen when running in the kernel.  That said, a real-time
> thread might want both rcu_nocb_poll=y and CONFIG_NO_HZ_FULL=y.
> 
> > Now, none of this solves the issue at hand because even the 'regular'
> > no-nocb rcu mode has this issue of needing to wake kthreads, but I'd
> > like to get a better understanding of why nocb mode is as it is.
> > 
> > 
> > I've seen you've since sent a few more emails; I might find some of the
> > answers in there. Let me go read them :-)
> 
> I -think- I have solved it, but much testing and review will of course
> be required.  And fixing last night's test failures...

For which the patch turned out to be an innocent bystander.  Apparently,
some of the more creative possible settings of CONFIG_RCU_FANOUT and
CONFIG_RCU_FANOUT_LEAF cause RCU some heartburn.  Looking into it.
(Yes, I am finally refactoring my RCU tests.)

Here is the patch which fixes the bug that Peter found.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Break call_rcu() deadlock involving scheduler and perf

Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
>         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
>         [<ffffffff81732052>] __schedule+0x1d2/0xa20
>         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
>         [<ffffffff817352b6>] retint_kernel+0x26/0x30
>         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
>         [<ffffffff813f0504>] pty_write+0x54/0x60
>         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
>         [<ffffffff813e5838>] tty_write+0x158/0x2d0
>         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
>         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
>         [<ffffffff81054336>] do_fork+0x126/0x460
>         [<ffffffff81054696>] kernel_thread+0x26/0x30
>         [<ffffffff8171ff93>] rest_init+0x23/0x140
>         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
>         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
>         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
>         [<ffffffff81097d62>] default_wake_function+0x12/0x20
>         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
>         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
>         [<ffffffff8108ff59>] __wake_up+0x39/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
>         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
>         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
>         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
>         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
>         [<ffffffff817200be>] kernel_init+0xe/0x190
>         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff8108ff43>] __wake_up+0x23/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>         [<ffffffff81149abf>] put_ctx+0x4f/0x70
>         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [<ffffffff8172a363>] dump_stack+0x4e/0x82
>  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
>  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
>  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
>  [<ffffffff810cc243>] lock_acquire+0x93/0x200
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8108ff43>] __wake_up+0x23/0x50
>  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [<ffffffff81111450>] __call_rcu+0x140/0x820
>  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
>  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>  [<ffffffff81149abf>] put_ctx+0x4f/0x70
>  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
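
To make the cycle easier to see than in the raw splat, here is a
deliberately collapsed userspace analogy (two pthread mutexes standing
in for the four kernel locks in the report above; none of this is
kernel code, and all names are invented for the illustration).  Built
with -pthread, the two threads take the locks in opposite orders and,
as the comments note, the program is expected to hang:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for perf's ctx->lock */
static pthread_mutex_t wq_lock  = PTHREAD_MUTEX_INITIALIZER; /* stands in for the wakeup-path locks */

/* Analog of the exit path: ctx->lock held, then call_rcu()'s wakeup. */
static void *exit_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&ctx_lock);
	usleep(1000);			/* widen the race window */
	pthread_mutex_lock(&wq_lock);	/* the wake_up() analog */
	pthread_mutex_unlock(&wq_lock);
	pthread_mutex_unlock(&ctx_lock);
	return NULL;
}

/* Analog of the sched-out path: wakeup-path locks held, then ctx->lock. */
static void *sched_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&wq_lock);
	usleep(1000);
	pthread_mutex_lock(&ctx_lock);	/* perf sched-out analog */
	pthread_mutex_unlock(&ctx_lock);
	pthread_mutex_unlock(&wq_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, exit_path, NULL);
	pthread_create(&b, NULL, sched_path, NULL);
	pthread_join(a, NULL);	/* with the sleeps above, this hangs: */
	pthread_join(b, NULL);	/* each thread waits for the other's lock */
	printf("unreached in practice\n");
	return 0;
}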

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.
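
For the shape of this scheme without the RCU plumbing, here is a minimal
userspace sketch (purely an analogy: enqueue(), deferred_wakeup(),
worker(), and defer_wakeup are invented names that correspond only
loosely to __call_rcu_nocb_enqueue(), do_nocb_deferred_wakeup(), the
rcuo kthread, and ->nocb_defer_wakeup in the patch below).  Build with
-pthread; the enqueue path wakes the worker directly only when it is
safe to do so, otherwise it just sets a flag, and a later known-safe
point turns the flag into the actual wakeup:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wq_cond = PTHREAD_COND_INITIALIZER;
static atomic_int  queued;		/* callbacks waiting for the worker */
static atomic_bool defer_wakeup;	/* analog of ->nocb_defer_wakeup */

/* Analog of the rcuo kthread: sleep until work arrives, then "invoke" it. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&wq_lock);
	while (atomic_load(&queued) == 0)
		pthread_cond_wait(&wq_cond, &wq_lock);
	pthread_mutex_unlock(&wq_lock);
	printf("worker: invoking %d callback(s)\n", atomic_exchange(&queued, 0));
	return NULL;
}

/* Analog of the enqueue path: wake the worker now if safe, else defer. */
static void enqueue(bool unsafe_context)
{
	atomic_fetch_add(&queued, 1);		/* lockless enqueue */
	if (!unsafe_context) {
		pthread_mutex_lock(&wq_lock);	/* the wake_up() analog */
		pthread_cond_signal(&wq_cond);
		pthread_mutex_unlock(&wq_lock);
	} else {
		/* Waking now could deadlock; leave a note for later. */
		atomic_store(&defer_wakeup, true);
	}
}

/* Analog of the deferred wakeup: run only from known-safe points. */
static void deferred_wakeup(void)
{
	if (!atomic_exchange(&defer_wakeup, false))
		return;
	pthread_mutex_lock(&wq_lock);		/* safe to take locks here */
	pthread_cond_signal(&wq_cond);
	pthread_mutex_unlock(&wq_lock);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	enqueue(true);		/* "scheduler locks held": wakeup is deferred */
	deferred_wakeup();	/* a later safe point performs the wakeup */
	pthread_join(t, NULL);
	return 0;
}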

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by RCU's core processing: the __rcu_pending()
function now tests it, which causes rcu_check_callbacks() to initiate
RCU core processing at each scheduling-clock interrupt where the flag
is set.  Of course this is not sufficient, because scheduling-clock
interrupts are often turned off (the things we used to be able to count
on!).  So the flag is also checked on entry to any state that RCU
considers to be idle, which includes both the NO_HZ_IDLE idle state and
the NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index f3778f8952da..b8c3c813ea57 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -396,14 +396,14 @@ o	Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node
 
 The output of "cat rcu/rcu_sched/rcu_pending" looks as follows:
 
-  0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903
-  1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113
-  2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889
-  3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469
-  4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042
-  5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422
-  6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699
-  7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147
+  0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903 ndw=0
+  1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113 ndw=0
+  2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889 ndw=0
+  3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469 ndw=0
+  4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042 ndw=0
+  5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422 ndw=0
+  6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699 ndw=0
+  7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147 ndw=0
 
 The fields are as follows:
 
@@ -432,6 +432,10 @@ o	"gpc" is the number of times that an old grace period had
 o	"gps" is the number of times that a new grace period had started,
 	but this CPU was not yet aware of it.
 
+o	"ndw" is the number of times that a wakeup of an rcuo
+	callback-offload kthread had to be deferred in order to avoid
+	deadlock.
+
 o	"nn" is the number of times that this CPU needed nothing.
 
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a33a24d10611..106f7f5cdd1d 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -362,6 +362,9 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
 				bool user)
 {
+	struct rcu_state *rsp;
+	struct rcu_data *rdp;
+
 	trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
 	if (!user && !is_idle_task(current)) {
 		struct task_struct *idle __maybe_unused =
@@ -373,6 +376,10 @@ static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
 			  current->pid, current->comm,
 			  idle->pid, idle->comm); /* must be idle task! */
 	}
+	for_each_rcu_flavor(rsp) {
+		rdp = this_cpu_ptr(rsp->rda);
+		do_nocb_deferred_wakeup(rdp);
+	}
 	rcu_prepare_for_idle(smp_processor_id());
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
 	smp_mb__before_atomic_inc();  /* See above. */
@@ -1908,13 +1915,13 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
  * Adopt the RCU callbacks from the specified rcu_state structure's
  * orphanage.  The caller must hold the ->orphan_lock.
  */
-static void rcu_adopt_orphan_cbs(struct rcu_state *rsp)
+static void rcu_adopt_orphan_cbs(struct rcu_state *rsp, unsigned long flags)
 {
 	int i;
 	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
 
 	/* No-CBs CPUs are handled specially. */
-	if (rcu_nocb_adopt_orphan_cbs(rsp, rdp))
+	if (rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
 		return;
 
 	/* Do the accounting first. */
@@ -1993,7 +2000,7 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
 	rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
-	rcu_adopt_orphan_cbs(rsp);
+	rcu_adopt_orphan_cbs(rsp, flags);
 
 	/* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */
 	mask = rdp->grpmask;	/* rnp->grplo is constant. */
@@ -2310,6 +2317,9 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	/* If there are callbacks ready, invoke them. */
 	if (cpu_has_callbacks_ready_to_invoke(rdp))
 		invoke_rcu_callbacks(rsp, rdp);
+
+	/* Do any needed deferred wakeups of rcuo kthreads. */
+	do_nocb_deferred_wakeup(rdp);
 }
 
 /*
@@ -2444,7 +2454,7 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 
 		if (cpu != -1)
 			rdp = per_cpu_ptr(rsp->rda, cpu);
-		offline = !__call_rcu_nocb(rdp, head, lazy);
+		offline = !__call_rcu_nocb(rdp, head, lazy, flags);
 		WARN_ON_ONCE(offline);
 		/* _call_rcu() is illegal on offline CPU; leak the callback. */
 		local_irq_restore(flags);
@@ -2797,6 +2807,12 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 		return 1;
 	}
 
+	/* Does this CPU need a deferred NOCB wakeup? */
+	if (rcu_nocb_need_deferred_wakeup(rdp)) {
+		rdp->n_rp_nocb_defer_wakeup++;
+		return 1;
+	}
+
 	/* nothing to do */
 	rdp->n_rp_need_nothing++;
 	return 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8e34d8674a4e..a87adfc2916b 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -317,6 +317,7 @@ struct rcu_data {
 	unsigned long n_rp_cpu_needs_gp;
 	unsigned long n_rp_gp_completed;
 	unsigned long n_rp_gp_started;
+	unsigned long n_rp_nocb_defer_wakeup;
 	unsigned long n_rp_need_nothing;
 
 	/* 6) _rcu_barrier() and OOM callbacks. */
@@ -335,6 +336,7 @@ struct rcu_data {
 	int nocb_p_count_lazy;		/*  (approximate). */
 	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
 	struct task_struct *nocb_kthread;
+	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 	/* 8) RCU CPU stall data. */
@@ -550,9 +552,12 @@ static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy);
+			    bool lazy, unsigned long flags);
 static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
-				      struct rcu_data *rdp);
+				      struct rcu_data *rdp,
+				      unsigned long flags);
+static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
 static void rcu_kick_nohz_cpu(int cpu);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 747df70a4d64..fe4caadae1b6 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2104,7 +2104,8 @@ bool rcu_is_nocb_cpu(int cpu)
 static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 				    struct rcu_head *rhp,
 				    struct rcu_head **rhtp,
-				    int rhcount, int rhcount_lazy)
+				    int rhcount, int rhcount_lazy,
+				    unsigned long flags)
 {
 	int len;
 	struct rcu_head **old_rhpp;
@@ -2125,9 +2126,16 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 	}
 	len = atomic_long_read(&rdp->nocb_q_count);
 	if (old_rhpp == &rdp->nocb_head) {
-		wake_up(&rdp->nocb_wq); /* ... only if queue was empty ... */
+		if (!irqs_disabled_flags(flags)) {
+			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    TPS("WakeEmpty"));
+		} else {
+			rdp->nocb_defer_wakeup = true;
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    TPS("WakeEmptyIsDeferred"));
+		}
 		rdp->qlen_last_fqs_check = 0;
-		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeEmpty"));
 	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
 		wake_up_process(t); /* ... or if many callbacks queued. */
 		rdp->qlen_last_fqs_check = LONG_MAX / 2;
@@ -2148,12 +2156,12 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
  * "rcuo" kthread can find it.
  */
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy)
+			    bool lazy, unsigned long flags)
 {
 
 	if (!rcu_is_nocb_cpu(rdp->cpu))
 		return 0;
-	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
+	__call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy, flags);
 	if (__is_kfree_rcu_offset((unsigned long)rhp->func))
 		trace_rcu_kfree_callback(rdp->rsp->name, rhp,
 					 (unsigned long)rhp->func,
@@ -2171,7 +2179,8 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
  * not a no-CBs CPU.
  */
 static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
-						     struct rcu_data *rdp)
+						     struct rcu_data *rdp,
+						     unsigned long flags)
 {
 	long ql = rsp->qlen;
 	long qll = rsp->qlen_lazy;
@@ -2185,14 +2194,14 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
 	/* First, enqueue the donelist, if any.  This preserves CB ordering. */
 	if (rsp->orphan_donelist != NULL) {
 		__call_rcu_nocb_enqueue(rdp, rsp->orphan_donelist,
-					rsp->orphan_donetail, ql, qll);
+					rsp->orphan_donetail, ql, qll, flags);
 		ql = qll = 0;
 		rsp->orphan_donelist = NULL;
 		rsp->orphan_donetail = &rsp->orphan_donelist;
 	}
 	if (rsp->orphan_nxtlist != NULL) {
 		__call_rcu_nocb_enqueue(rdp, rsp->orphan_nxtlist,
-					rsp->orphan_nxttail, ql, qll);
+					rsp->orphan_nxttail, ql, qll, flags);
 		ql = qll = 0;
 		rsp->orphan_nxtlist = NULL;
 		rsp->orphan_nxttail = &rsp->orphan_nxtlist;
@@ -2314,6 +2323,22 @@ static int rcu_nocb_kthread(void *arg)
 	return 0;
 }
 
+/* Is a deferred wakeup of rcu_nocb_kthread() required? */
+static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+{
+	return ACCESS_ONCE(rdp->nocb_defer_wakeup);
+}
+
+/* Do a deferred wakeup of rcu_nocb_kthread(). */
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+{
+	if (!rcu_nocb_need_deferred_wakeup(rdp))
+		return;
+	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
+	wake_up(&rdp->nocb_wq);
+	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
+}
+
 /* Initialize per-rcu_data variables for no-CBs CPUs. */
 static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
@@ -2369,13 +2394,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
-			    bool lazy)
+			    bool lazy, unsigned long flags)
 {
 	return 0;
 }
 
 static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
-						     struct rcu_data *rdp)
+						     struct rcu_data *rdp,
+						     unsigned long flags)
 {
 	return 0;
 }
@@ -2384,6 +2410,15 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 }
 
+static bool rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+{
+	return false;
+}
+
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+{
+}
+
 static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 {
 }
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index cf6c17412932..77a6831b9c51 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -364,9 +364,10 @@ static void print_one_rcu_pending(struct seq_file *m, struct rcu_data *rdp)
 		   rdp->n_rp_report_qs,
 		   rdp->n_rp_cb_ready,
 		   rdp->n_rp_cpu_needs_gp);
-	seq_printf(m, "gpc=%ld gps=%ld nn=%ld\n",
+	seq_printf(m, "gpc=%ld gps=%ld ndw=%ld nn=%ld\n",
 		   rdp->n_rp_gp_completed,
 		   rdp->n_rp_gp_started,
+		   rdp->n_rp_nocb_defer_wakeup,
 		   rdp->n_rp_need_nothing);
 }
 



end of thread, other threads:[~2013-10-07 17:35 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-03 19:08 tty/perf lockdep trace Dave Jones
2013-10-03 19:42 ` tty^Wrcu/perf " Peter Zijlstra
2013-10-03 19:58   ` Paul E. McKenney
2013-10-04  6:58     ` Peter Zijlstra
2013-10-04 16:03       ` Paul E. McKenney
2013-10-04 16:15         ` Paul E. McKenney
2013-10-04 16:50         ` Peter Zijlstra
2013-10-04 17:09           ` Paul E. McKenney
2013-10-04 18:52             ` Peter Zijlstra
2013-10-04 21:25               ` Paul E. McKenney
2013-10-04 22:02                 ` Paul E. McKenney
2013-10-05  0:23                   ` Paul E. McKenney
2013-10-07 11:24                     ` Peter Zijlstra
2013-10-07 12:59                       ` Paul E. McKenney
2013-10-05 16:05                 ` Peter Zijlstra
2013-10-05 16:28                   ` Paul E. McKenney
2013-10-05 19:59                     ` Peter Zijlstra
2013-10-05 22:03                       ` Paul E. McKenney
2013-10-07  8:42                         ` Peter Zijlstra
2013-10-07 13:11                           ` Paul E. McKenney
2013-10-07 17:35                     ` Paul E. McKenney
