* lockdep circular locking error (rcu_node_level_0 vs rq->lock)
@ 2011-07-11 23:19 Dave Jones
  2011-07-12 20:20 ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Jones @ 2011-07-11 23:19 UTC (permalink / raw)
  To: Linux Kernel

I was doing an install in a kvm guest, which wedged itself at the end.
This was in the host dmesg.


[ INFO: possible circular locking dependency detected ]
3.0.0-rc6+ #91
-------------------------------------------------------
libvirtd/5720 is trying to acquire lock:
 (rcu_node_level_0){..-.-.}, at: [<ffffffff814c6c12>] rcu_report_unblock_qs_rnp.part.5+0x3f/0x60

but task is already holding lock:
 (&rq->lock){-.-.-.}, at: [<ffffffff8105408e>] sched_ttwu_pending+0x39/0x5b

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&rq->lock){-.-.-.}:
       [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
       [<ffffffff814cf0ab>] _raw_spin_lock+0x40/0x73
       [<ffffffff8104663a>] __task_rq_lock+0x5e/0x8b
       [<ffffffff8105506d>] wake_up_new_task+0x46/0x10d
       [<ffffffff8105a1c9>] do_fork+0x231/0x331
       [<ffffffff81010c80>] kernel_thread+0x75/0x77
       [<ffffffff814abe82>] rest_init+0x26/0xdc
       [<ffffffff81d3dbc2>] start_kernel+0x401/0x40c
       [<ffffffff81d3d2c4>] x86_64_start_reservations+0xaf/0xb3
       [<ffffffff81d3d3ca>] x86_64_start_kernel+0x102/0x111

-> #2 (&p->pi_lock){-.-.-.}:
       [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
       [<ffffffff814cf238>] _raw_spin_lock_irqsave+0x4f/0x89
       [<ffffffff81054e3d>] try_to_wake_up+0x2e/0x1db
       [<ffffffff81054ffc>] default_wake_function+0x12/0x14
       [<ffffffff81079008>] autoremove_wake_function+0x18/0x3d
       [<ffffffff81045010>] __wake_up_common+0x4d/0x83
       [<ffffffff8104634e>] __wake_up+0x39/0x4d
       [<ffffffff810c3cd6>] rcu_report_exp_rnp+0x52/0x8b
       [<ffffffff810c4f18>] __rcu_read_unlock+0x1d0/0x231
       [<ffffffff8115202a>] rcu_read_unlock+0x26/0x28
       [<ffffffff8115465d>] __d_lookup+0x103/0x115
       [<ffffffff8114b9eb>] walk_component+0x1b1/0x3af
       [<ffffffff8114bd8a>] link_path_walk+0x1a1/0x43b
       [<ffffffff8114c148>] path_lookupat+0x5a/0x2af
       [<ffffffff8114d222>] do_path_lookup+0x28/0x97
       [<ffffffff8114d658>] user_path_at+0x59/0x96
       [<ffffffff81145214>] sys_readlinkat+0x33/0x95
       [<ffffffff81145291>] sys_readlink+0x1b/0x1d
       [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b

-> #1 (sync_rcu_preempt_exp_wq.lock){......}:
       [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
       [<ffffffff814cf238>] _raw_spin_lock_irqsave+0x4f/0x89
       [<ffffffff81046337>] __wake_up+0x22/0x4d
       [<ffffffff810c3cd6>] rcu_report_exp_rnp+0x52/0x8b
       [<ffffffff810c4f18>] __rcu_read_unlock+0x1d0/0x231
       [<ffffffff8115202a>] rcu_read_unlock+0x26/0x28
       [<ffffffff8115465d>] __d_lookup+0x103/0x115
       [<ffffffff8114b9eb>] walk_component+0x1b1/0x3af
       [<ffffffff8114bd8a>] link_path_walk+0x1a1/0x43b
       [<ffffffff8114c148>] path_lookupat+0x5a/0x2af
       [<ffffffff8114d222>] do_path_lookup+0x28/0x97
       [<ffffffff8114d658>] user_path_at+0x59/0x96
       [<ffffffff81145214>] sys_readlinkat+0x33/0x95
       [<ffffffff81145291>] sys_readlink+0x1b/0x1d
       [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b

-> #0 (rcu_node_level_0){..-.-.}:
       [<ffffffff8108d7e5>] __lock_acquire+0xa2f/0xd0c
       [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
       [<ffffffff814cf0ab>] _raw_spin_lock+0x40/0x73
       [<ffffffff814c6c12>] rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
       [<ffffffff810c4ed6>] __rcu_read_unlock+0x18e/0x231
       [<ffffffff810463f4>] rcu_read_unlock+0x26/0x28
       [<ffffffff8104b6db>] cpuacct_charge+0x58/0x61
       [<ffffffff81052f18>] update_curr+0x107/0x134
       [<ffffffff8105349b>] check_preempt_wakeup+0xc9/0x1d0
       [<ffffffff81049775>] check_preempt_curr+0x2f/0x6e
       [<ffffffff81053f5e>] ttwu_do_wakeup+0x7b/0x111
       [<ffffffff81054050>] ttwu_do_activate.constprop.76+0x5c/0x61
       [<ffffffff8105409e>] sched_ttwu_pending+0x49/0x5b
       [<ffffffff810540be>] scheduler_ipi+0xe/0x10
       [<ffffffff810224f6>] smp_reschedule_interrupt+0x1b/0x1d
       [<ffffffff814d6b33>] reschedule_interrupt+0x13/0x20
       [<ffffffff813fcc18>] rcu_read_unlock+0x26/0x28
       [<ffffffff813fe308>] sock_def_readable+0x88/0x8d
       [<ffffffff81497760>] unix_stream_sendmsg+0x264/0x2ff
       [<ffffffff813f83c4>] sock_aio_write+0x112/0x126
       [<ffffffff8114093b>] do_sync_write+0xbf/0xff
       [<ffffffff81141012>] vfs_write+0xb6/0xf6
       [<ffffffff81141206>] sys_write+0x4d/0x74
       [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

Chain exists of:
  rcu_node_level_0 --> &p->pi_lock --> &rq->lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock(rcu_node_level_0);

 *** DEADLOCK ***

1 lock held by libvirtd/5720:
 #0:  (&rq->lock){-.-.-.}, at: [<ffffffff8105408e>] sched_ttwu_pending+0x39/0x5b

stack backtrace:
Pid: 5720, comm: libvirtd Not tainted 3.0.0-rc6+ #91
Call Trace:
 <IRQ>  [<ffffffff814c51cf>] print_circular_bug+0x1f8/0x209
 [<ffffffff8108d7e5>] __lock_acquire+0xa2f/0xd0c
 [<ffffffff8107e905>] ? sched_clock_local+0x12/0x75
 [<ffffffff814c6c12>] ? rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
 [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
 [<ffffffff814c6c12>] ? rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
 [<ffffffff8108adab>] ? lock_release_holdtime.part.10+0x59/0x62
 [<ffffffff814cf0ab>] _raw_spin_lock+0x40/0x73
 [<ffffffff814c6c12>] ? rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
 [<ffffffff814cf855>] ? _raw_spin_unlock+0x47/0x54
 [<ffffffff814c6c12>] rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
 [<ffffffff810c4e00>] ? __rcu_read_unlock+0xb8/0x231
 [<ffffffff810c4ed6>] __rcu_read_unlock+0x18e/0x231
 [<ffffffff810463f4>] rcu_read_unlock+0x26/0x28
 [<ffffffff8104b6db>] cpuacct_charge+0x58/0x61
 [<ffffffff81052f18>] update_curr+0x107/0x134
 [<ffffffff8105349b>] check_preempt_wakeup+0xc9/0x1d0
 [<ffffffff81049775>] check_preempt_curr+0x2f/0x6e
 [<ffffffff81053f5e>] ttwu_do_wakeup+0x7b/0x111
 [<ffffffff81054050>] ttwu_do_activate.constprop.76+0x5c/0x61
 [<ffffffff8105409e>] sched_ttwu_pending+0x49/0x5b
 [<ffffffff810540be>] scheduler_ipi+0xe/0x10
 [<ffffffff810224f6>] smp_reschedule_interrupt+0x1b/0x1d
 [<ffffffff814d6b33>] reschedule_interrupt+0x13/0x20
 <EOI>  [<ffffffff8107e905>] ? sched_clock_local+0x12/0x75
 [<ffffffff810c4d91>] ? __rcu_read_unlock+0x49/0x231
 [<ffffffff8108dea5>] ? lock_release+0x1b1/0x1de
 [<ffffffff813fcc18>] rcu_read_unlock+0x26/0x28
 [<ffffffff813fe308>] sock_def_readable+0x88/0x8d
 [<ffffffff81497760>] unix_stream_sendmsg+0x264/0x2ff
 [<ffffffff813f83c4>] sock_aio_write+0x112/0x126
 [<ffffffff8121cd95>] ? inode_has_perm+0x6a/0x77
 [<ffffffff8114093b>] do_sync_write+0xbf/0xff
 [<ffffffff81219562>] ? security_file_permission+0x2e/0x33
 [<ffffffff81140d71>] ? rw_verify_area+0xb6/0xd3
 [<ffffffff81141012>] vfs_write+0xb6/0xf6
 [<ffffffff811426a0>] ? fget_light+0x97/0xa2
 [<ffffffff81141206>] sys_write+0x4d/0x74
 [<ffffffff81078f85>] ? remove_wait_queue+0x1a/0x3a
 [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b



* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-11 23:19 lockdep circular locking error (rcu_node_level_0 vs rq->lock) Dave Jones
@ 2011-07-12 20:20 ` Peter Zijlstra
  2011-07-12 20:51   ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2011-07-12 20:20 UTC (permalink / raw)
  To: Dave Jones, Paul E. McKenney; +Cc: Linux Kernel

On Mon, 2011-07-11 at 19:19 -0400, Dave Jones wrote:
> I was doing an install in a kvm guest, which wedged itself at the end.
> This was in the host dmesg.
> 
> 
> [ INFO: possible circular locking dependency detected ]
> 3.0.0-rc6+ #91
> -------------------------------------------------------
> libvirtd/5720 is trying to acquire lock:
>  (rcu_node_level_0){..-.-.}, at: [<ffffffff814c6c12>] rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
> 
> but task is already holding lock:
>  (&rq->lock){-.-.-.}, at: [<ffffffff8105408e>] sched_ttwu_pending+0x39/0x5b
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #3 (&rq->lock){-.-.-.}:
>        [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
>        [<ffffffff814cf0ab>] _raw_spin_lock+0x40/0x73
>        [<ffffffff8104663a>] __task_rq_lock+0x5e/0x8b
>        [<ffffffff8105506d>] wake_up_new_task+0x46/0x10d
>        [<ffffffff8105a1c9>] do_fork+0x231/0x331
>        [<ffffffff81010c80>] kernel_thread+0x75/0x77
>        [<ffffffff814abe82>] rest_init+0x26/0xdc
>        [<ffffffff81d3dbc2>] start_kernel+0x401/0x40c
>        [<ffffffff81d3d2c4>] x86_64_start_reservations+0xaf/0xb3
>        [<ffffffff81d3d3ca>] x86_64_start_kernel+0x102/0x111
> 
> -> #2 (&p->pi_lock){-.-.-.}:
>        [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
>        [<ffffffff814cf238>] _raw_spin_lock_irqsave+0x4f/0x89
>        [<ffffffff81054e3d>] try_to_wake_up+0x2e/0x1db
>        [<ffffffff81054ffc>] default_wake_function+0x12/0x14
>        [<ffffffff81079008>] autoremove_wake_function+0x18/0x3d
>        [<ffffffff81045010>] __wake_up_common+0x4d/0x83
>        [<ffffffff8104634e>] __wake_up+0x39/0x4d
>        [<ffffffff810c3cd6>] rcu_report_exp_rnp+0x52/0x8b
>        [<ffffffff810c4f18>] __rcu_read_unlock+0x1d0/0x231
>        [<ffffffff8115202a>] rcu_read_unlock+0x26/0x28
>        [<ffffffff8115465d>] __d_lookup+0x103/0x115
>        [<ffffffff8114b9eb>] walk_component+0x1b1/0x3af
>        [<ffffffff8114bd8a>] link_path_walk+0x1a1/0x43b
>        [<ffffffff8114c148>] path_lookupat+0x5a/0x2af
>        [<ffffffff8114d222>] do_path_lookup+0x28/0x97
>        [<ffffffff8114d658>] user_path_at+0x59/0x96
>        [<ffffffff81145214>] sys_readlinkat+0x33/0x95
>        [<ffffffff81145291>] sys_readlink+0x1b/0x1d
>        [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b
> 
> -> #1 (sync_rcu_preempt_exp_wq.lock){......}:
>        [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
>        [<ffffffff814cf238>] _raw_spin_lock_irqsave+0x4f/0x89
>        [<ffffffff81046337>] __wake_up+0x22/0x4d
>        [<ffffffff810c3cd6>] rcu_report_exp_rnp+0x52/0x8b
>        [<ffffffff810c4f18>] __rcu_read_unlock+0x1d0/0x231
>        [<ffffffff8115202a>] rcu_read_unlock+0x26/0x28
>        [<ffffffff8115465d>] __d_lookup+0x103/0x115
>        [<ffffffff8114b9eb>] walk_component+0x1b1/0x3af
>        [<ffffffff8114bd8a>] link_path_walk+0x1a1/0x43b
>        [<ffffffff8114c148>] path_lookupat+0x5a/0x2af
>        [<ffffffff8114d222>] do_path_lookup+0x28/0x97
>        [<ffffffff8114d658>] user_path_at+0x59/0x96
>        [<ffffffff81145214>] sys_readlinkat+0x33/0x95
>        [<ffffffff81145291>] sys_readlink+0x1b/0x1d
>        [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b
> 
> -> #0 (rcu_node_level_0){..-.-.}:
>        [<ffffffff8108d7e5>] __lock_acquire+0xa2f/0xd0c
>        [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
>        [<ffffffff814cf0ab>] _raw_spin_lock+0x40/0x73
>        [<ffffffff814c6c12>] rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
>        [<ffffffff810c4ed6>] __rcu_read_unlock+0x18e/0x231
>        [<ffffffff810463f4>] rcu_read_unlock+0x26/0x28
>        [<ffffffff8104b6db>] cpuacct_charge+0x58/0x61
>        [<ffffffff81052f18>] update_curr+0x107/0x134
>        [<ffffffff8105349b>] check_preempt_wakeup+0xc9/0x1d0
>        [<ffffffff81049775>] check_preempt_curr+0x2f/0x6e
>        [<ffffffff81053f5e>] ttwu_do_wakeup+0x7b/0x111
>        [<ffffffff81054050>] ttwu_do_activate.constprop.76+0x5c/0x61
>        [<ffffffff8105409e>] sched_ttwu_pending+0x49/0x5b
>        [<ffffffff810540be>] scheduler_ipi+0xe/0x10
>        [<ffffffff810224f6>] smp_reschedule_interrupt+0x1b/0x1d
>        [<ffffffff814d6b33>] reschedule_interrupt+0x13/0x20
>        [<ffffffff813fcc18>] rcu_read_unlock+0x26/0x28
>        [<ffffffff813fe308>] sock_def_readable+0x88/0x8d
>        [<ffffffff81497760>] unix_stream_sendmsg+0x264/0x2ff
>        [<ffffffff813f83c4>] sock_aio_write+0x112/0x126
>        [<ffffffff8114093b>] do_sync_write+0xbf/0xff
>        [<ffffffff81141012>] vfs_write+0xb6/0xf6
>        [<ffffffff81141206>] sys_write+0x4d/0x74
>        [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   rcu_node_level_0 --> &p->pi_lock --> &rq->lock
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&rq->lock);
>                                lock(&p->pi_lock);
>                                lock(&rq->lock);
>   lock(rcu_node_level_0);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by libvirtd/5720:
>  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff8105408e>] sched_ttwu_pending+0x39/0x5b
> 
> stack backtrace:
> Pid: 5720, comm: libvirtd Not tainted 3.0.0-rc6+ #91
> Call Trace:
>  <IRQ>  [<ffffffff814c51cf>] print_circular_bug+0x1f8/0x209
>  [<ffffffff8108d7e5>] __lock_acquire+0xa2f/0xd0c
>  [<ffffffff8107e905>] ? sched_clock_local+0x12/0x75
>  [<ffffffff814c6c12>] ? rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
>  [<ffffffff8108dfc5>] lock_acquire+0xf3/0x13e
>  [<ffffffff814c6c12>] ? rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
>  [<ffffffff8108adab>] ? lock_release_holdtime.part.10+0x59/0x62
>  [<ffffffff814cf0ab>] _raw_spin_lock+0x40/0x73
>  [<ffffffff814c6c12>] ? rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
>  [<ffffffff814cf855>] ? _raw_spin_unlock+0x47/0x54
>  [<ffffffff814c6c12>] rcu_report_unblock_qs_rnp.part.5+0x3f/0x60
>  [<ffffffff810c4e00>] ? __rcu_read_unlock+0xb8/0x231
>  [<ffffffff810c4ed6>] __rcu_read_unlock+0x18e/0x231
>  [<ffffffff810463f4>] rcu_read_unlock+0x26/0x28
>  [<ffffffff8104b6db>] cpuacct_charge+0x58/0x61
>  [<ffffffff81052f18>] update_curr+0x107/0x134
>  [<ffffffff8105349b>] check_preempt_wakeup+0xc9/0x1d0
>  [<ffffffff81049775>] check_preempt_curr+0x2f/0x6e
>  [<ffffffff81053f5e>] ttwu_do_wakeup+0x7b/0x111
>  [<ffffffff81054050>] ttwu_do_activate.constprop.76+0x5c/0x61
>  [<ffffffff8105409e>] sched_ttwu_pending+0x49/0x5b
>  [<ffffffff810540be>] scheduler_ipi+0xe/0x10
>  [<ffffffff810224f6>] smp_reschedule_interrupt+0x1b/0x1d
>  [<ffffffff814d6b33>] reschedule_interrupt+0x13/0x20
>  <EOI>  [<ffffffff8107e905>] ? sched_clock_local+0x12/0x75
>  [<ffffffff810c4d91>] ? __rcu_read_unlock+0x49/0x231
>  [<ffffffff8108dea5>] ? lock_release+0x1b1/0x1de
>  [<ffffffff813fcc18>] rcu_read_unlock+0x26/0x28
>  [<ffffffff813fe308>] sock_def_readable+0x88/0x8d
>  [<ffffffff81497760>] unix_stream_sendmsg+0x264/0x2ff
>  [<ffffffff813f83c4>] sock_aio_write+0x112/0x126
>  [<ffffffff8121cd95>] ? inode_has_perm+0x6a/0x77
>  [<ffffffff8114093b>] do_sync_write+0xbf/0xff
>  [<ffffffff81219562>] ? security_file_permission+0x2e/0x33
>  [<ffffffff81140d71>] ? rw_verify_area+0xb6/0xd3
>  [<ffffffff81141012>] vfs_write+0xb6/0xf6
>  [<ffffffff811426a0>] ? fget_light+0x97/0xa2
>  [<ffffffff81141206>] sys_write+0x4d/0x74
>  [<ffffffff81078f85>] ? remove_wait_queue+0x1a/0x3a
>  [<ffffffff814d5c02>] system_call_fastpath+0x16/0x1b

Hi Paul, RCU is doing really bad things here: rcu_read_unlock() ->
rcu_read_unlock_special() can do all kinds of nasty things, including,
as per the splat above, calling back into the scheduler, which is kind
of a death warrant seeing as the scheduler itself can use RCU.
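
The gate into that slow path is roughly this (paraphrased, not the
literal rcutree_plugin.h source; details elided), with the problem spots
annotated against the splat above:

	void __rcu_read_unlock(void)
	{
		struct task_struct *t = current;

		--t->rcu_read_lock_nesting;
		barrier();
		/*
		 * The slow path below is the trouble: per chains #2/#1 in
		 * the splat it can do rcu_report_exp_rnp() -> __wake_up()
		 * -> try_to_wake_up(), taking p->pi_lock and rq->lock, and
		 * per chain #0 it can take the rcu_node lock via
		 * rcu_report_unblock_qs_rnp() -- here while the caller,
		 * cpuacct_charge(), already holds rq->lock.
		 */
		if (t->rcu_read_lock_nesting == 0 &&
		    unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
			rcu_read_unlock_special(t);
	}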





* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-12 20:20 ` Peter Zijlstra
@ 2011-07-12 20:51   ` Paul E. McKenney
  2011-07-12 21:44     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2011-07-12 20:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel

On Tue, Jul 12, 2011 at 10:20:14PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-07-11 at 19:19 -0400, Dave Jones wrote:
> > I was doing an install in a kvm guest, which wedged itself at the end.
> > This was in the host dmesg.
[ . . . ]
> 
> Hi Paul, RCU is doing really bad things here: rcu_read_unlock() ->
> rcu_read_unlock_special() can do all kinds of nasty things, including,
> as per the splat above, calling back into the scheduler, which is kind
> of a death warrant seeing as the scheduler itself can use RCU.

As we discussed on IRC, it doesn't look like RCU can safely defer
this processing.  For example, if softirqd is lower priority than an
RCU priority boosted task, softirqd might never get a chance to lower
the priority.  Plus this could mess up the accounting for the rt_mutex
used to carry out the priority boosting.  And for reporting the end of
an expedited RCU grace period, the possibly long-term deferral could
easily defeat the whole purpose of expediting.

So I am hoping that your idea of doing rcu_read_lock() before acquiring
rq locks (and pi locks) and doing rcu_read_unlock() after releasing them
does work out!
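
In the abstract, that would order things like this (sketch of the intent
only; where exactly the calls end up going is the interesting part):

	rcu_read_lock();		/* enter the RCU read side first */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	raw_spin_lock(&rq->lock);

	/* ... scheduler work under the locks ... */

	raw_spin_unlock(&rq->lock);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	rcu_read_unlock();	/* any slow-path work now runs with no rq/pi locks held */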

							Thanx, Paul


* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-12 20:51   ` Paul E. McKenney
@ 2011-07-12 21:44     ` Peter Zijlstra
  2011-07-12 22:54       ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2011-07-12 21:44 UTC (permalink / raw)
  To: paulmck; +Cc: Dave Jones, Linux Kernel, Ingo Molnar

On Tue, 2011-07-12 at 13:51 -0700, Paul E. McKenney wrote:
> 
> So I am hoping that your idea of doing rcu_read_lock() before acquiring
> rq locks (and pi locks) and doing rcu_read_unlock() after releasing them
> does work out! 

It would look something like the below, except that it needs some
performance testing; it does add a lot of RCU fiddling to hot paths.

Need sleep now... will poke more in the morning.

---
 kernel/sched.c      |   76 ++++++++++++++++++++++++++++++++++++++-------------
 kernel/sched_fair.c |   14 +++++++---
 kernel/sched_rt.c   |    6 ++++
 3 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9769c75..4bf0951 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * prev into current:
 	 */
 	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+	rcu_read_acquire();
 
 	raw_spin_unlock_irq(&rq->lock);
 }
@@ -960,6 +961,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
 	struct rq *rq;
 
 	for (;;) {
+		rcu_read_lock();
 		raw_spin_lock_irqsave(&p->pi_lock, *flags);
 		rq = task_rq(p);
 		raw_spin_lock(&rq->lock);
@@ -967,6 +969,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
 			return rq;
 		raw_spin_unlock(&rq->lock);
 		raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
+		rcu_read_unlock();
 	}
 }
 
@@ -983,6 +986,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, unsigned long *flags)
 {
 	raw_spin_unlock(&rq->lock);
 	raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
+	rcu_read_unlock();
 }
 
 /*
@@ -995,6 +999,7 @@ static struct rq *this_rq_lock(void)
 
 	local_irq_disable();
 	rq = this_rq();
+	rcu_read_lock();
 	raw_spin_lock(&rq->lock);
 
 	return rq;
@@ -1042,10 +1047,12 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	rcu_read_lock();
 	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
 	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
 	raw_spin_unlock(&rq->lock);
+	rcu_read_unlock();
 
 	return HRTIMER_NORESTART;
 }
@@ -1058,10 +1065,12 @@ static void __hrtick_start(void *arg)
 {
 	struct rq *rq = arg;
 
+	rcu_read_lock();
 	raw_spin_lock(&rq->lock);
 	hrtimer_restart(&rq->hrtick_timer);
 	rq->hrtick_csd_pending = 0;
 	raw_spin_unlock(&rq->lock);
+	rcu_read_unlock();
 }
 
 /*
@@ -1190,10 +1199,14 @@ static void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	if (!raw_spin_trylock_irqsave(&rq->lock, flags))
+	rcu_read_lock();
+	if (!raw_spin_trylock_irqsave(&rq->lock, flags)) {
+		rcu_read_unlock();
 		return;
+	}
 	resched_task(cpu_curr(cpu));
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 }
 
 #ifdef CONFIG_NO_HZ
@@ -1685,6 +1698,7 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq1->lock)
 	__acquires(rq2->lock)
 {
+	rcu_read_lock();
 	BUG_ON(!irqs_disabled());
 	if (rq1 == rq2) {
 		raw_spin_lock(&rq1->lock);
@@ -1715,6 +1729,7 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 		raw_spin_unlock(&rq2->lock);
 	else
 		__release(rq2->lock);
+	rcu_read_unlock();
 }
 
 #else /* CONFIG_SMP */
@@ -1731,6 +1746,7 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
+	rcu_read_lock();
 	raw_spin_lock(&rq1->lock);
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
@@ -1747,6 +1763,7 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(rq1 != rq2);
 	raw_spin_unlock(&rq1->lock);
+	rcu_read_unlock();
 	__release(rq2->lock);
 }
 
@@ -2552,6 +2569,7 @@ static void sched_ttwu_pending(void)
 	if (!list)
 		return;
 
+	rcu_read_lock();
 	raw_spin_lock(&rq->lock);
 
 	while (list) {
@@ -2561,6 +2579,7 @@ static void sched_ttwu_pending(void)
 	}
 
 	raw_spin_unlock(&rq->lock);
+	rcu_read_unlock();
 }
 
 void scheduler_ipi(void)
@@ -2645,6 +2664,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	int cpu, success = 0;
 
 	smp_wmb();
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2698,6 +2718,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	rcu_read_unlock();
 
 	return success;
 }
@@ -2718,6 +2739,7 @@ static void try_to_wake_up_local(struct task_struct *p)
 	BUG_ON(p == current);
 	lockdep_assert_held(&rq->lock);
 
+	rcu_read_lock();
 	if (!raw_spin_trylock(&p->pi_lock)) {
 		raw_spin_unlock(&rq->lock);
 		raw_spin_lock(&p->pi_lock);
@@ -2734,6 +2756,7 @@ static void try_to_wake_up_local(struct task_struct *p)
 	ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
+	rcu_read_unlock();
 }
 
 /**
@@ -2843,9 +2866,11 @@ void sched_fork(struct task_struct *p)
 	 *
 	 * Silence PROVE_RCU.
 	 */
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	set_task_cpu(p, cpu);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	rcu_read_unlock();
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	if (likely(sched_info_on()))
@@ -2877,6 +2902,7 @@ void wake_up_new_task(struct task_struct *p)
 	unsigned long flags;
 	struct rq *rq;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_SMP
 	/*
@@ -3026,6 +3052,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	local_irq_enable();
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
+	rcu_read_unlock();
 
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
@@ -3055,10 +3082,12 @@ static inline void post_schedule(struct rq *rq)
 	if (rq->post_schedule) {
 		unsigned long flags;
 
+		rcu_read_lock();
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->curr->sched_class->post_schedule)
 			rq->curr->sched_class->post_schedule(rq);
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		rcu_read_unlock();
 
 		rq->post_schedule = 0;
 	}
@@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 #ifndef __ARCH_WANT_UNLOCKED_CTXSW
 	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
+	rcu_read_release();
 #endif
 
 	/* Here we just switch the register state and the stack. */
@@ -3607,6 +3637,7 @@ void sched_exec(void)
 	unsigned long flags;
 	int dest_cpu;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
@@ -3621,6 +3652,7 @@ void sched_exec(void)
 	}
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	rcu_read_unlock();
 }
 
 #endif
@@ -4062,11 +4094,13 @@ void scheduler_tick(void)
 
 	sched_clock_tick();
 
+	rcu_read_lock();
 	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
 	update_cpu_load_active(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
 	raw_spin_unlock(&rq->lock);
+	rcu_read_unlock();
 
 	perf_event_task_tick();
 
@@ -4231,6 +4265,7 @@ asmlinkage void __sched schedule(void)
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	rcu_read_lock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
@@ -4291,8 +4326,10 @@ asmlinkage void __sched schedule(void)
 		 */
 		cpu = smp_processor_id();
 		rq = cpu_rq(cpu);
-	} else
+	} else {
 		raw_spin_unlock_irq(&rq->lock);
+		rcu_read_unlock();
+	}
 
 	post_schedule(rq);
 
@@ -5136,8 +5173,7 @@ static int __sched_setscheduler(struct task_struct *p, int policy,
 	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
 			param->sched_priority == p->rt_priority))) {
 
-		__task_rq_unlock(rq);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
 
@@ -5508,9 +5544,9 @@ SYSCALL_DEFINE0(sched_yield)
 	 * Since we are going to call schedule() anyway, there's
 	 * no need to preempt or enable interrupts:
 	 */
-	__release(rq->lock);
-	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
-	do_raw_spin_unlock(&rq->lock);
+	preempt_disable();
+	raw_spin_unlock(&rq->lock);
+	rcu_read_unlock();
 	preempt_enable_no_resched();
 
 	schedule();
@@ -5869,6 +5905,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	__sched_fork(idle);
@@ -5876,25 +5913,14 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	idle->se.exec_start = sched_clock();
 
 	do_set_cpus_allowed(idle, cpumask_of(cpu));
-	/*
-	 * We're having a chicken and egg problem, even though we are
-	 * holding rq->lock, the cpu isn't yet set to this cpu so the
-	 * lockdep check in task_group() will fail.
-	 *
-	 * Similar case to sched_fork(). / Alternatively we could
-	 * use task_rq_lock() here and obtain the other rq->lock.
-	 *
-	 * Silence PROVE_RCU
-	 */
-	rcu_read_lock();
 	__set_task_cpu(idle, cpu);
-	rcu_read_unlock();
 
 	rq->curr = rq->idle = idle;
 #if defined(CONFIG_SMP)
 	idle->on_cpu = 1;
 #endif
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 
 	/* Set the preempt count _outside_ the spinlocks! */
 	task_thread_info(idle)->preempt_count = 0;
@@ -6062,6 +6088,7 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 	rq_src = cpu_rq(src_cpu);
 	rq_dest = cpu_rq(dest_cpu);
 
+	rcu_read_lock();
 	raw_spin_lock(&p->pi_lock);
 	double_rq_lock(rq_src, rq_dest);
 	/* Already moved. */
@@ -6086,6 +6113,7 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 fail:
 	double_rq_unlock(rq_src, rq_dest);
 	raw_spin_unlock(&p->pi_lock);
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -6415,6 +6443,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 
 	case CPU_ONLINE:
 		/* Update our root-domain */
+		rcu_read_lock();
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
 			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
@@ -6422,12 +6451,14 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 			set_rq_online(rq);
 		}
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		rcu_read_unlock();
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_DYING:
 		sched_ttwu_pending();
 		/* Update our root-domain */
+		rcu_read_lock();
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
 			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
@@ -6436,6 +6467,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		migrate_tasks(cpu);
 		BUG_ON(rq->nr_running != 1); /* the migration thread */
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		rcu_read_unlock();
 
 		migrate_nr_uninterruptible(rq);
 		calc_global_load_remove(rq);
@@ -6694,6 +6726,7 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	if (rq->rd) {
@@ -6721,6 +6754,7 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 		set_rq_online(rq);
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 
 	if (old_rd)
 		call_rcu_sched(&old_rd->rcu, free_rootdomain);
@@ -8116,6 +8150,7 @@ void normalize_rt_tasks(void)
 			continue;
 		}
 
+		rcu_read_lock();
 		raw_spin_lock(&p->pi_lock);
 		rq = __task_rq_lock(p);
 
@@ -8123,6 +8158,7 @@ void normalize_rt_tasks(void)
 
 		__task_rq_unlock(rq);
 		raw_spin_unlock(&p->pi_lock);
+		rcu_read_unlock();
 	} while_each_thread(g, p);
 
 	read_unlock_irqrestore(&tasklist_lock, flags);
@@ -8463,10 +8499,12 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
+		rcu_read_lock();
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		for_each_sched_entity(se)
 			update_cfs_shares(group_cfs_rq(se));
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		rcu_read_unlock();
 	}
 
 done:
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 433491c..8a5131c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2209,6 +2209,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 	rq = cpu_rq(cpu);
 	cfs_rq = tg->cfs_rq[cpu];
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
@@ -2221,6 +2222,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 	update_cfs_shares(cfs_rq);
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -3501,10 +3503,10 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
 	 */
+	rcu_read_lock();
 	raw_spin_unlock(&this_rq->lock);
 
 	update_shares(this_cpu);
-	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int balance = 1;
@@ -3526,9 +3528,9 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 			break;
 		}
 	}
-	rcu_read_unlock();
 
 	raw_spin_lock(&this_rq->lock);
+	rcu_read_unlock();
 
 	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
 		/*
@@ -3553,6 +3555,7 @@ static int active_load_balance_cpu_stop(void *data)
 	struct rq *target_rq = cpu_rq(target_cpu);
 	struct sched_domain *sd;
 
+	rcu_read_lock();
 	raw_spin_lock_irq(&busiest_rq->lock);
 
 	/* make sure the requested cpu hasn't gone down in the meantime */
@@ -3575,7 +3578,6 @@ static int active_load_balance_cpu_stop(void *data)
 	double_lock_balance(busiest_rq, target_rq);
 
 	/* Search for an sd spanning us and the target CPU. */
-	rcu_read_lock();
 	for_each_domain(target_cpu, sd) {
 		if ((sd->flags & SD_LOAD_BALANCE) &&
 		    cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
@@ -3591,11 +3593,11 @@ static int active_load_balance_cpu_stop(void *data)
 		else
 			schedstat_inc(sd, alb_failed);
 	}
-	rcu_read_unlock();
 	double_unlock_balance(busiest_rq, target_rq);
 out_unlock:
 	busiest_rq->active_balance = 0;
 	raw_spin_unlock_irq(&busiest_rq->lock);
+	rcu_read_unlock();
 	return 0;
 }
 
@@ -3982,10 +3984,12 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
 			break;
 		}
 
+		rcu_read_lock();
 		raw_spin_lock_irq(&this_rq->lock);
 		update_rq_clock(this_rq);
 		update_cpu_load(this_rq);
 		raw_spin_unlock_irq(&this_rq->lock);
+		rcu_read_unlock();
 
 		rebalance_domains(balance_cpu, CPU_IDLE);
 
@@ -4135,6 +4139,7 @@ static void task_fork_fair(struct task_struct *p)
 	struct rq *rq = this_rq();
 	unsigned long flags;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
@@ -4163,6 +4168,7 @@ static void task_fork_fair(struct task_struct *p)
 	se->vruntime -= cfs_rq->min_vruntime;
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 10d0182..6e8a375 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -494,9 +494,11 @@ static void disable_runtime(struct rq *rq)
 {
 	unsigned long flags;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&rq->lock, flags);
 	__disable_runtime(rq);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 }
 
 static void __enable_runtime(struct rq *rq)
@@ -527,9 +529,11 @@ static void enable_runtime(struct rq *rq)
 {
 	unsigned long flags;
 
+	rcu_read_lock();
 	raw_spin_lock_irqsave(&rq->lock, flags);
 	__enable_runtime(rq);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	rcu_read_unlock();
 }
 
 static int balance_runtime(struct rt_rq *rt_rq)
@@ -565,6 +569,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
 		struct rq *rq = rq_of_rt_rq(rt_rq);
 
+		rcu_read_lock();
 		raw_spin_lock(&rq->lock);
 		if (rt_rq->rt_time) {
 			u64 runtime;
@@ -597,6 +602,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
 		raw_spin_unlock(&rq->lock);
+		rcu_read_unlock();
 	}
 
 	return idle;




* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-12 21:44     ` Peter Zijlstra
@ 2011-07-12 22:54       ` Paul E. McKenney
  2011-07-13  8:24         ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2011-07-12 22:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, Ingo Molnar

On Tue, Jul 12, 2011 at 11:44:07PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-07-12 at 13:51 -0700, Paul E. McKenney wrote:
> > 
> > So I am hoping that your idea of doing rcu_read_lock() before acquiring
> > rq locks (and pi locks) and doing rcu_read_unlock() after releasing them
> > does work out! 
> 
> It would look something like the below, except that it needs some
> performance testing; it does add a lot of RCU fiddling to hot paths.
> 
> Need sleep now... will poke more in the morning.

Very cool!!!

Just a couple of comments below, one on rcu_read_acquire() and one
on rcu_read_release().

							Thanx, Paul

> ---
>  kernel/sched.c      |   76 ++++++++++++++++++++++++++++++++++++++-------------
>  kernel/sched_fair.c |   14 +++++++---
>  kernel/sched_rt.c   |    6 ++++
>  3 files changed, 73 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 9769c75..4bf0951 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
>  	 * prev into current:
>  	 */
>  	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> +	rcu_read_acquire();

Oooh...  This is a tricky one.  Hmmm...

The !PREEMPT_RCU case seems straightforward, but the PREEMPT_RCU case
needs some thought.

OK, so the outgoing task did an rcu_read_lock(), but it did so -after-
the scheduler invoked rcu_note_context_switch(), correct?  This means
that the outgoing task has its current->rcu_read_lock_nesting set to 1
(assuming that the task didn't enter schedule() within an RCU read-side
critical section), but that the task is not queued on the rcu_node
structure's blkd_tasks list.  This means that the outgoing task's
RCU read-side critical section is being ignored, because only the
->rcu_read_lock_nesting variables of tasks actually running at the
time are paid attention to.

When that task is scheduled once again, its RCU read-side critical
section will suddenly be paid attention to once again.  So maybe
we can just rely on this behavior.

One complication would be newly created tasks that never have run,
because they would not be in RCU read-side critical sections.  They could
be initialized to be born in RCU read-side critical sections, which again
RCU would be ignoring until such time as the task actually started running.

Another complication would be at task-exit time -- it seems like it would
be necessary to avoid the current logic that screams if a task exits
while in an RCU read-side critical section.  But maybe not -- after all,
doesn't the task make those checks on its own, before its final context
switch into oblivion?  If so, it just might all work out without changes.

OK, back to the case where the task called schedule() from within
an RCU read-side critical section.  In this case, the task was queued,
but gets an extra increment of current->rcu_read_lock_nesting after
being queued.  And this extra increment is matched by the extra
decrement when it is scheduled back in, right?  So this should work
out naturally.

Yeah, I know, famous last words.

So I believe that the rcu_read_acquire() you have above is not needed.
Instead, some code to initialize tasks to be born in RCU read-side
critical sections is required.  Which will in turn mean that boot-time
RCU callbacks get deferred until the scheduler is doing context switches
for PREEMPT_RCU kernels, though this can be hacked around if required.
(Boot-time synchronize_rcu() invocations will continue to take their
fast paths, so will continue to work normally, from what I can see at
the moment.)
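
Concretely, "born in an RCU read-side critical section" would be
something like initializing ->rcu_read_lock_nesting to 1 at fork time
(untested sketch):

	static inline void rcu_copy_process(struct task_struct *p)
	{
		p->rcu_read_lock_nesting = 1;	/* born inside a critical section */
		p->rcu_read_unlock_special = 0;
		/* ... */
	}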

Does any of this make sense?

>  	raw_spin_unlock_irq(&rq->lock);
>  }
> @@ -960,6 +961,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
>  	struct rq *rq;
> 
>  	for (;;) {
> +		rcu_read_lock();
>  		raw_spin_lock_irqsave(&p->pi_lock, *flags);
>  		rq = task_rq(p);
>  		raw_spin_lock(&rq->lock);

[ . . . ]

> @@ -3055,10 +3082,12 @@ static inline void post_schedule(struct rq *rq)
>  	if (rq->post_schedule) {
>  		unsigned long flags;
> 
> +		rcu_read_lock();
>  		raw_spin_lock_irqsave(&rq->lock, flags);
>  		if (rq->curr->sched_class->post_schedule)
>  			rq->curr->sched_class->post_schedule(rq);
>  		raw_spin_unlock_irqrestore(&rq->lock, flags);
> +		rcu_read_unlock();
> 
>  		rq->post_schedule = 0;
>  	}
> @@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 */
>  #ifndef __ARCH_WANT_UNLOCKED_CTXSW
>  	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> +	rcu_read_release();

My guess is that we don't need the rcu_read_release() -- the arch shouldn't
care that we have a non-atomic field in task_struct incremented, right?

Or am I confused about what this is doing?

>  #endif
> 
>  	/* Here we just switch the register state and the stack. */
> @@ -3607,6 +3637,7 @@ void sched_exec(void)
>  	unsigned long flags;
>  	int dest_cpu;
> 
> +	rcu_read_lock();
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
>  	if (dest_cpu == smp_processor_id())
> @@ -3621,6 +3652,7 @@ void sched_exec(void)

[ . . . ]


* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-12 22:54       ` Paul E. McKenney
@ 2011-07-13  8:24         ` Peter Zijlstra
  2011-07-13 16:33           ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2011-07-13  8:24 UTC (permalink / raw)
  To: paulmck; +Cc: Dave Jones, Linux Kernel, Ingo Molnar

On Tue, 2011-07-12 at 15:54 -0700, Paul E. McKenney wrote:
> > @@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> >        * prev into current:
> >        */
> >       spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> > +     rcu_read_acquire();
> 
> Oooh...  This is a tricky one.  Hmmm...

<snip>

> Does any of this make sense?

No?

> > @@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >        */
> >  #ifndef __ARCH_WANT_UNLOCKED_CTXSW
> >       spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> > +     rcu_read_release();
> 
> My guess is that we don't need the rcu_read_release() -- the arch shouldn't
> care that we have a non-atomic field in task_struct incremented, right?
> 
> Or am I confused about what this is doing?

Either that or I used the wrong primitives for what I was meaning to do.

So from looking at rcupdate.h, rcu_read_{acquire,release} are the lockdep
annotations of rcu_read_lock(). The thing we're doing above is context
switching without making lockdep upset.

The problem is that held-lock state is tracked per task, and we're, well,
switching tasks, so we need to transfer the held-lock state from the old
task to the new one, since the new task will be unlocking the thing; if
at that point lockdep finds we're not actually holding it, it screams
bloody murder.

So what we do is release the lock (annotation only, the actual lock stays
locked) on prev before switching and acquire the lock on next right after
switching.
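
For reference, both annotations are lockdep-only; in that era's
rcupdate.h they expand to roughly the following under
CONFIG_DEBUG_LOCK_ALLOC (and to nothing otherwise), so they only move
lockdep's notion of who holds the RCU read lock and never touch
->rcu_read_lock_nesting or any other RCU state:

	extern struct lockdep_map rcu_lock_map;

	#define rcu_read_acquire() \
		lock_acquire(&rcu_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
	#define rcu_read_release() \
		lock_release(&rcu_lock_map, 1, _THIS_IP_)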




* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-13  8:24         ` Peter Zijlstra
@ 2011-07-13 16:33           ` Paul E. McKenney
  2011-07-14  4:29             ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2011-07-13 16:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, Ingo Molnar

On Wed, Jul 13, 2011 at 10:24:08AM +0200, Peter Zijlstra wrote:
> On Tue, 2011-07-12 at 15:54 -0700, Paul E. McKenney wrote:
> > > @@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> > >        * prev into current:
> > >        */
> > >       spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> > > +     rcu_read_acquire();
> > 
> > Oooh...  This is a tricky one.  Hmmm...
> 
> <snip>
> 
> > Does any of this make sense?
> 
> No?
> 
> > > @@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> > >        */
> > >  #ifndef __ARCH_WANT_UNLOCKED_CTXSW
> > >       spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> > > +     rcu_read_release();
> > 
> > My guess is that we don't need the rcu_read_release() -- the arch shouldn't
> > care that we have a non-atomic field in task_struct incremented, right?
> > 
> > Or am I confused about what this is doing?
> 
> Either that or I used the wrong primitives for what I was meaning to do.
> 
> So from looking at rcupdate.h, rcu_read_{acquire,release} are the lockdep
> annotations of rcu_read_lock(). The thing we're doing above is context
> switching without making lockdep upset.
> 
> The problem is that held-lock state is tracked per task, and we're, well,
> switching tasks, so we need to transfer the held-lock state from the old
> task to the new one, since the new task will be unlocking the thing; if
> at that point lockdep finds we're not actually holding it, it screams
> bloody murder.
> 
> So what we do is release the lock (annotation only, the actual lock stays
> locked) on prev before switching and acquire the lock on next right after
> switching.

OK, got it.  I think, anyway.

So preemptible RCU and context switches are handled as follows by
your patch, correct?

1.	The outgoing task is in an RCU read-side critical section
	due to the rcu_read_lock() that you added at the beginning
	of schedule().

2.	In context_switch(), the lockdep state is set so that lockdep
	believes that the task has exited its RCU read-side critical
	section (thus suppressing the lockdep splat that would otherwise
	appear).

3.	Once the context switch has happened, RCU no longer pays
	attention to the outgoing task.  So the outgoing task's
	current->rcu_read_lock_nesting still indicates that the task is in
	an RCU read-side critical section, but RCU does not know or care.

4.	If the incoming task has done at least one prior context switch,
	its current->rcu_read_lock_nesting will indicate that it is
	already in an RCU read-side critical section.  Since there is
	no opportunity for RCU to report a quiescent state during the
	context switch itself (because holding the runqueue lock keeps
	interrupts disabled), RCU sees the whole context switch event
	as being one big RCU read-side critical section.

	If interrupts were enabled at any point, RCU might see this as
	two back-to-back read-side critical sections.  This might be
	OK for all I know, but it would open the door more widely to
	heavyweight RCU processing at rcu_read_unlock() time.

5.	Yes, context switch is a quiescent state, but that quiescent
	state happens when rcu_note_context_switch() is called from
	schedule, which is before the runqueue lock is acquired, and
	thus before the beginning of the RCU read-side critical section
	that covers the context switch.

6.	In finish_lock_switch(), the rcu_read_acquire() restores
	lockdep state to match that of the incoming task.

	But if the task is newly created, then its value of
	current->rcu_read_lock_nesting will be zero, which will leave
	the incoming task unprotected by RCU.  I believe that you
	need the (untested) patch below to make this work properly.
	But this patch needs to be included in your set, as I don't
	immediately see a reasonable way to make it work independently.
	(Yes, we could have a special CONFIG var for the transition,
	but it sounds a lot easier to just have it be part of your patch.)

7.	Exit processing is done by the task itself before a last
	call to schedule().  From what I can see, it should not matter
	that the value of current->rcu_read_lock_nesting gets incremented
	on this last TASK_DEAD call to schedule().

Or am I still confused?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 580f70c..617a323 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -106,7 +106,7 @@ extern struct group_info init_groups;
 #endif
 #ifdef CONFIG_PREEMPT_RCU
 #define INIT_TASK_RCU_PREEMPT(tsk)					\
-	.rcu_read_lock_nesting = 0,					\
+	.rcu_read_lock_nesting = 1,					\
 	.rcu_read_unlock_special = 0,					\
 	.rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry),		\
 	INIT_TASK_RCU_TREE_PREEMPT()					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 496770a..6baa51bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1822,7 +1822,7 @@ extern void task_clear_group_stop_pending(struct task_struct *task);
 
 static inline void rcu_copy_process(struct task_struct *p)
 {
-	p->rcu_read_lock_nesting = 0;
+	p->rcu_read_lock_nesting = 1;
 	p->rcu_read_unlock_special = 0;
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	p->rcu_blocked_node = NULL;


* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-13 16:33           ` Paul E. McKenney
@ 2011-07-14  4:29             ` Paul E. McKenney
  2011-07-14  5:26               ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2011-07-14  4:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, Ingo Molnar

On Wed, Jul 13, 2011 at 09:33:09AM -0700, Paul E. McKenney wrote:
[ . . . ]
> Or am I still confused?

Hmmm...  Something is confused.  I get boot time hangs with the
occasional stack overflow, whether or not I add my patch on top.
Left to myself, I would try applying your patch incrementally, and
also disabling irqs before rcu_read_lock() and enabling them after
rcu_read_unlock(), as this prevents rcu_read_unlock() from ever getting
to the rcu_read_unlock_special() slowpath.
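
Applied to, say, task_rq_lock() from the patch, that debug variant would
look something like this (hypothetical, untested sketch):

	for (;;) {
		local_irq_save(*flags);	/* irqs off before entering the read side */
		rcu_read_lock();
		raw_spin_lock(&p->pi_lock);
		rq = task_rq(p);
		raw_spin_lock(&rq->lock);
		if (likely(rq == task_rq(p)))
			return rq;
		raw_spin_unlock(&rq->lock);
		raw_spin_unlock(&p->pi_lock);
		rcu_read_unlock();	/* still irqs-off, so per the above this
					 * never reaches rcu_read_unlock_special() */
		local_irq_restore(*flags);
	}

with the matching task_rq_unlock() doing rcu_read_unlock() before
local_irq_restore().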

							Thanx, Paul


* Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)
  2011-07-14  4:29             ` Paul E. McKenney
@ 2011-07-14  5:26               ` Paul E. McKenney
  0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2011-07-14  5:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linux Kernel, Ingo Molnar

On Wed, Jul 13, 2011 at 09:29:32PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 13, 2011 at 09:33:09AM -0700, Paul E. McKenney wrote:

[ . . . ]

> Hmmm...  Something is confused.  I get boot time hangs with the
> occasional stack overflow, whether or not I add my patch on top.
> Left to myself, I would try applying your patch incrementally, and
> also disabling irqs before rcu_read_lock() and enabling them after
> rcu_read_unlock(), as this prevents rcu_read_unlock() from ever getting
> to the rcu_read_unlock_special() slowpath.

And the real question...  Why did the rcu_read_unlock() called from
cpuacct_charge() enter rcu_read_unlock_special() in the first place?
If the runqueue lock is held, irqs should be disabled, so how did the
task get blocked or interrupted?

I need to get some sleep then figure out what is going on here.

							Thanx, Paul


