All of lore.kernel.org
 help / color / mirror / Atom feed
* 3.11-rc6 genetlink locking fix offends lockdep
@ 2013-08-19  5:04 Hugh Dickins
  2013-08-19  8:00 ` Johannes Berg
  0 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2013-08-19  5:04 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Greg KH, David S. Miller, Andrei Otcheretianski,
	linux-kernel, netdev, stable

3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race")
gives me the lockdep trace below at startup.

I think it needs to be reverted until you can refine it.  And it has
already gone into today's stable review series, as 04/12 for 3.0.92,
26/34 for 3.4.59, 18/45 for 3.10.8: I raise an objection to those.

Hugh

[    4.004286] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    4.105671] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    4.106123] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[    4.110096] 
[    4.110113] ======================================================
[    4.110146] [ INFO: possible circular locking dependency detected ]
[    4.110180] 3.11.0-rc6 #1 Not tainted
[    4.110201] -------------------------------------------------------
[    4.110234] NetworkManager/358 is trying to acquire lock:
[    4.110262]  (genl_mutex){+.+.+.}, at: [<ffffffff8148204d>] genl_lock+0x12/0x14
[    4.110315] 
[    4.110315] but task is already holding lock:
[    4.110346]  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7
[    4.110400] 
[    4.110400] which lock already depends on the new lock.
[    4.110400] 
[    4.110442] 
[    4.110442] the existing dependency chain (in reverse order) is:
[    4.110482] 
[    4.110482] -> #1 (nlk->cb_mutex){+.+.+.}:
[    4.110517]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.110555]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.110589]        [<ffffffff81583e42>] mutex_lock_nested+0x5e/0x345
[    4.110627]        [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e
[    4.110665]        [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252
[    4.110699]        [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c
[    4.110734]        [<ffffffff8148199b>] genl_rcv+0x24/0x34
[    4.110766]        [<ffffffff814811ca>] netlink_unicast+0xed/0x17a
[    4.110801]        [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345
[    4.110838]        [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e
[    4.110871]        [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be
[    4.110907]        [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e
[    4.110942]        [<ffffffff81453256>] SyS_sendmsg+0xd/0x19
[    4.110975]        [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    4.111012] 
[    4.111012] -> #0 (genl_mutex){+.+.+.}:
[    4.111047]        [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e
[    4.111086]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.111122]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.111157]        [<ffffffff81583e42>] mutex_lock_nested+0x5e/0x345
[    4.111193]        [<ffffffff8148204d>] genl_lock+0x12/0x14
[    4.111226]        [<ffffffff814822d2>] ctrl_dumpfamily+0x31/0xfa
[    4.111260]        [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7
[    4.111295]        [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1
[    4.111331]        [<ffffffff81450328>] sock_recvmsg+0x83/0x98
[    4.111365]        [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207
[    4.111400]        [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e
[    4.111434]        [<ffffffff81453425>] SyS_recvmsg+0xd/0x19
[    4.111467]        [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    4.111504] 
[    4.111504] other info that might help us debug this:
[    4.111504] 
[    4.111545]  Possible unsafe locking scenario:
[    4.111545] 
[    4.111577]        CPU0                    CPU1
[    4.111601]        ----                    ----
[    4.111625]   lock(nlk->cb_mutex);
[    4.112865]                                lock(genl_mutex);
[    4.114216]                                lock(nlk->cb_mutex);
[    4.115315]   lock(genl_mutex);
[    4.116500] 
[    4.116500]  *** DEADLOCK ***
[    4.116500] 
[    4.119670] 1 lock held by NetworkManager/358:
[    4.120906]  #0:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7
[    4.122196] 
[    4.122196] stack backtrace:
[    4.124533] CPU: 0 PID: 358 Comm: NetworkManager Not tainted 3.11.0-rc6 #1
[    4.125779] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
[    4.126979]  ffffffff81d0a0f0 ffff88022b91d8c8 ffffffff8157cf80 0000000000000006
[    4.128274]  ffffffff81cc8750 ffff88022b91d918 ffffffff8157a898 ffff88022d798080
[    4.129472]  ffff88022d798080 ffff88022d798080 ffff88022d798750 ffff88022d798080
[    4.130645] Call Trace:
[    4.131801]  [<ffffffff8157cf80>] dump_stack+0x4f/0x84
[    4.132817]  [<ffffffff8157a898>] print_circular_bug+0x2ad/0x2be
[    4.133839]  [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e
[    4.134821]  [<ffffffff8145471a>] ? sock_def_write_space+0x1b5/0x1b5
[    4.135800]  [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.136842]  [<ffffffff810b40bb>] ? mark_held_locks+0xce/0xfa
[    4.137828]  [<ffffffff8148204d>] ? genl_lock+0x12/0x14
[    4.138876]  [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.139856]  [<ffffffff8148204d>] ? genl_lock+0x12/0x14
[    4.141027]  [<ffffffff81583e42>] mutex_lock_nested+0x5e/0x345
[    4.142194]  [<ffffffff8148204d>] ? genl_lock+0x12/0x14
[    4.143219]  [<ffffffff8111c594>] ? __kmalloc_node_track_caller+0x26/0x2d
[    4.144340]  [<ffffffff8148204d>] genl_lock+0x12/0x14
[    4.145387]  [<ffffffff814822d2>] ctrl_dumpfamily+0x31/0xfa
[    4.146387]  [<ffffffff8145ac41>] ? __alloc_skb+0x97/0x1a0
[    4.147454]  [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7
[    4.148448]  [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1
[    4.149475]  [<ffffffff81450328>] sock_recvmsg+0x83/0x98
[    4.150494]  [<ffffffff810f86fa>] ? might_fault+0x52/0xa2
[    4.151471]  [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207
[    4.152516]  [<ffffffff810b34d2>] ? __lock_acquire+0x865/0x956
[    4.153501]  [<ffffffff81148b2b>] ? fget_light+0x35c/0x377
[    4.154550]  [<ffffffff81148933>] ? fget_light+0x164/0x377
[    4.155521]  [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e
[    4.156568]  [<ffffffff8145471a>] ? sock_def_write_space+0x1b5/0x1b5
[    4.157552]  [<ffffffff81453425>] SyS_recvmsg+0xd/0x19
[    4.158607]  [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    4.160507] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    4.160709] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-19  5:04 3.11-rc6 genetlink locking fix offends lockdep Hugh Dickins
@ 2013-08-19  8:00 ` Johannes Berg
  2013-08-19 11:00   ` Ding Tianhong
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Berg @ 2013-08-19  8:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei,
	linux-kernel, netdev, stable, Pravin B Shelar, Thomas Graf


> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race")
> gives me the lockdep trace below at startup.

Hmm. Yes, I see now how this happens, not sure why I didn't run into it.

The problem is that genl_family_rcv_msg() is called with the genl_lock
held, and then calls netlink_dump_start() with it held, creating a
genl_lock->cb_mutex dependency, but obviously the dump continuation is
the other way around.

We could use the semaphore instead, I believe, but I don't really
understand the mutex vs. semaphore well enough to be sure that's
correct.

johannes


diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f85f8a2..6cfa646 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -792,7 +792,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 	bool need_locking = chains_to_skip || fams_to_skip;
 
 	if (need_locking)
-		genl_lock();
+		down_read(&cb_lock);
 
 	for (i = chains_to_skip; i < GENL_FAM_TAB_SIZE; i++) {
 		n = 0;
@@ -815,7 +815,7 @@ errout:
 	cb->args[1] = n;
 
 	if (need_locking)
-		genl_unlock();
+		up_read(&cb_lock);
 
 	return skb->len;
 }



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-19  8:00 ` Johannes Berg
@ 2013-08-19 11:00   ` Ding Tianhong
  2013-08-19 11:22     ` Johannes Berg
  0 siblings, 1 reply; 10+ messages in thread
From: Ding Tianhong @ 2013-08-19 11:00 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Hugh Dickins, Linus Torvalds, Greg KH, David S. Miller,
	Otcheretianski, Andrei, linux-kernel, netdev, stable,
	Pravin B Shelar, Thomas Graf

On 2013/8/19 16:00, Johannes Berg wrote:
> 
>> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race")
>> gives me the lockdep trace below at startup.
> 
> Hmm. Yes, I see now how this happens, not sure why I didn't run into it.
> 
> The problem is that genl_family_rcv_msg() is called with the genl_lock
> held, and then calls netlink_dump_start() with it held, creating a
> genl_lock->cb_mutex dependency, but obviously the dump continuation is
> the other way around.
> 
> We could use the semaphore instead, I believe, but I don't really
> understand the mutex vs. semaphore well enough to be sure that's
> correct.
> 
> johannes
> 
it is useless, the logic need to modify or otherwise it will still call lockdep trace.
maybe i could send a patch for it, if you wish.

> 
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index f85f8a2..6cfa646 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -792,7 +792,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
>  	bool need_locking = chains_to_skip || fams_to_skip;
>  
>  	if (need_locking)
> -		genl_lock();
> +		down_read(&cb_lock);
>  
>  	for (i = chains_to_skip; i < GENL_FAM_TAB_SIZE; i++) {
>  		n = 0;
> @@ -815,7 +815,7 @@ errout:
>  	cb->args[1] = n;
>  
>  	if (need_locking)
> -		genl_unlock();
> +		up_read(&cb_lock);
>  
>  	return skb->len;
>  }
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> .
> 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-19 11:00   ` Ding Tianhong
@ 2013-08-19 11:22     ` Johannes Berg
  2013-08-19 18:52       ` Hugh Dickins
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Berg @ 2013-08-19 11:22 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: Hugh Dickins, Linus Torvalds, Greg KH, David S. Miller,
	Otcheretianski, Andrei, linux-kernel, netdev, stable,
	Pravin B Shelar, Thomas Graf

On Mon, 2013-08-19 at 19:00 +0800, Ding Tianhong wrote:
> On 2013/8/19 16:00, Johannes Berg wrote:
> > 
> >> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race")
> >> gives me the lockdep trace below at startup.
> > 
> > Hmm. Yes, I see now how this happens, not sure why I didn't run into it.
> > 
> > The problem is that genl_family_rcv_msg() is called with the genl_lock
> > held, and then calls netlink_dump_start() with it held, creating a
> > genl_lock->cb_mutex dependency, but obviously the dump continuation is
> > the other way around.
> > 
> > We could use the semaphore instead, I believe, but I don't really
> > understand the mutex vs. semaphore well enough to be sure that's
> > correct.
> > 
> > johannes
> > 
> it is useless, the logic need to modify or otherwise it will still call lockdep trace.

I don't believe so, the semaphore and cb_mutex don't have a dependency
yet, afaict.

> maybe i could send a patch for it, if you wish.

What do you mean?

johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-19 11:22     ` Johannes Berg
@ 2013-08-19 18:52       ` Hugh Dickins
  2013-08-20  8:28         ` Johannes Berg
  0 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2013-08-19 18:52 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller,
	Otcheretianski, Andrei, linux-kernel, netdev, stable,
	Pravin B Shelar, Thomas Graf

On Mon, 19 Aug 2013, Johannes Berg wrote:
> On Mon, 2013-08-19 at 19:00 +0800, Ding Tianhong wrote:
> > On 2013/8/19 16:00, Johannes Berg wrote:
> > > 
> > >> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race")
> > >> gives me the lockdep trace below at startup.
> > > 
> > > Hmm. Yes, I see now how this happens, not sure why I didn't run into it.
> > > 
> > > The problem is that genl_family_rcv_msg() is called with the genl_lock
> > > held, and then calls netlink_dump_start() with it held, creating a
> > > genl_lock->cb_mutex dependency, but obviously the dump continuation is
> > > the other way around.
> > > 
> > > We could use the semaphore instead, I believe, but I don't really
> > > understand the mutex vs. semaphore well enough to be sure that's
> > > correct.
> > > 
> > > johannes
> > > 
> > it is useless, the logic need to modify or otherwise it will still call lockdep trace.
> 
> I don't believe so, the semaphore and cb_mutex don't have a dependency
> yet, afaict.

The down_read(&cb_lock) patch you suggested gives the lockdep trace below.

> 
> > maybe i could send a patch for it, if you wish.
> 
> What do you mean?
> 
> johannes

[    4.027797] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    4.129749] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    4.130179] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[    4.134629] 
[    4.134646] ======================================================
[    4.134680] [ INFO: possible circular locking dependency detected ]
[    4.134714] 3.11.0-rc6 #3 Not tainted
[    4.134735] -------------------------------------------------------
[    4.134767] NetworkManager/357 is trying to acquire lock:
[    4.134797]  (cb_lock){++++++}, at: [<ffffffff8148204a>] ctrl_dumpfamily+0x38/0x108
[    4.134853] 
[    4.134853] but task is already holding lock:
[    4.134883]  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7
[    4.134938] 
[    4.134938] which lock already depends on the new lock.
[    4.134938] 
[    4.134981] 
[    4.134981] the existing dependency chain (in reverse order) is:
[    4.135020] 
[    4.135020] -> #2 (nlk->cb_mutex){+.+.+.}:
[    4.135056]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.135094]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.135129]        [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345
[    4.135167]        [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e
[    4.135205]        [<ffffffff8148224b>] genl_rcv_msg+0xf4/0x252
[    4.135239]        [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c
[    4.135275]        [<ffffffff8148199b>] genl_rcv+0x24/0x34
[    4.135307]        [<ffffffff814811ca>] netlink_unicast+0xed/0x17a
[    4.135342]        [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345
[    4.135378]        [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e
[    4.135412]        [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be
[    4.135448]        [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e
[    4.135483]        [<ffffffff81453256>] SyS_sendmsg+0xd/0x19
[    4.135517]        [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    4.135555] 
[    4.135555] -> #1 (genl_mutex){+.+.+.}:
[    4.135589]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.135626]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.135661]        [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345
[    4.135697]        [<ffffffff81482155>] genl_lock+0x12/0x14
[    4.135730]        [<ffffffff8148249d>] genl_lock_all+0x15/0x17
[    4.135763]        [<ffffffff81482b2a>] genl_register_family+0x51/0x142
[    4.135801]        [<ffffffff8148305f>] genl_register_family_with_ops+0x23/0x70
[    4.135842]        [<ffffffff8195e610>] genl_init+0x41/0x80
[    4.135876]        [<ffffffff81000267>] do_one_initcall+0x7f/0x108
[    4.135912]        [<ffffffff81930e29>] kernel_init_freeable+0x106/0x195
[    4.135951]        [<ffffffff81574621>] kernel_init+0x9/0xd1
[    4.135985]        [<ffffffff81587b6c>] ret_from_fork+0x7c/0xb0
[    4.136019] 
[    4.136019] -> #0 (cb_lock){++++++}:
[    4.136052]        [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e
[    4.136091]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.136127]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.136161]        [<ffffffff81584629>] down_read+0x42/0x57
[    4.136194]        [<ffffffff8148204a>] ctrl_dumpfamily+0x38/0x108
[    4.136230]        [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7
[    4.136264]        [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1
[    4.137410]        [<ffffffff81450328>] sock_recvmsg+0x83/0x98
[    4.138459]        [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207
[    4.139692]        [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e
[    4.140918]        [<ffffffff81453425>] SyS_recvmsg+0xd/0x19
[    4.141975]        [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    4.143042] 
[    4.143042] other info that might help us debug this:
[    4.143042] 
[    4.146007] Chain exists of:
[    4.146007]   cb_lock --> genl_mutex --> nlk->cb_mutex
[    4.146007] 
[    4.148919]  Possible unsafe locking scenario:
[    4.148919] 
[    4.150955]        CPU0                    CPU1
[    4.152060]        ----                    ----
[    4.153030]   lock(nlk->cb_mutex);
[    4.153950]                                lock(genl_mutex);
[    4.154912]                                lock(nlk->cb_mutex);
[    4.155832]   lock(cb_lock);
[    4.156732] 
[    4.156732]  *** DEADLOCK ***
[    4.156732] 
[    4.159477] 1 lock held by NetworkManager/357:
[    4.160354]  #0:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7
[    4.161411] 
[    4.161411] stack backtrace:
[    4.163489] CPU: 0 PID: 357 Comm: NetworkManager Not tainted 3.11.0-rc6 #3
[    4.164527] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
[    4.165491]  ffffffff81d0a450 ffff88022bd61938 ffffffff8157cf90 0000000000000006
[    4.166520]  ffffffff81cc85a0 ffff88022bd61988 ffffffff8157a8a8 ffff880200000001
[    4.167486]  ffff8802313ae080 ffff8802313ae080 ffff8802313ae750 ffff8802313ae080
[    4.168532] Call Trace:
[    4.169491]  [<ffffffff8157cf90>] dump_stack+0x4f/0x84
[    4.170532]  [<ffffffff8157a8a8>] print_circular_bug+0x2ad/0x2be
[    4.171507]  [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e
[    4.172548]  [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    4.173525]  [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    4.174536]  [<ffffffff8148204a>] ? ctrl_dumpfamily+0x38/0x108
[    4.175535]  [<ffffffff81584629>] down_read+0x42/0x57
[    4.176523]  [<ffffffff8148204a>] ? ctrl_dumpfamily+0x38/0x108
[    4.177564]  [<ffffffff8148204a>] ctrl_dumpfamily+0x38/0x108
[    4.178556]  [<ffffffff8145ac41>] ? __alloc_skb+0x97/0x1a0
[    4.179611]  [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7
[    4.180594]  [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1
[    4.181645]  [<ffffffff81450328>] sock_recvmsg+0x83/0x98
[    4.182628]  [<ffffffff810f86fa>] ? might_fault+0x52/0xa2
[    4.183683]  [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207
[    4.184668]  [<ffffffff810b34d2>] ? __lock_acquire+0x865/0x956
[    4.185731]  [<ffffffff81148b2b>] ? fget_light+0x35c/0x377
[    4.186707]  [<ffffffff81148933>] ? fget_light+0x164/0x377
[    4.187744]  [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e
[    4.188709]  [<ffffffff8145471a>] ? sock_def_write_space+0x1b5/0x1b5
[    4.189755]  [<ffffffff81453425>] SyS_recvmsg+0xd/0x19
[    4.190729]  [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    4.192674] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    4.192887] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-19 18:52       ` Hugh Dickins
@ 2013-08-20  8:28         ` Johannes Berg
  2013-08-20  9:04           ` Borislav Petkov
  2013-08-20 19:02           ` Johannes Berg
  0 siblings, 2 replies; 10+ messages in thread
From: Johannes Berg @ 2013-08-20  8:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller,
	Otcheretianski, Andrei, linux-kernel, netdev, stable,
	Pravin B Shelar, Thomas Graf

On Mon, 2013-08-19 at 11:52 -0700, Hugh Dickins wrote:

> > > > We could use the semaphore instead, I believe, but I don't really
> > > > understand the mutex vs. semaphore well enough to be sure that's
> > > > correct.

> > I don't believe so, the semaphore and cb_mutex don't have a dependency
> > yet, afaict.
> 
> The down_read(&cb_lock) patch you suggested gives the lockdep trace below.

Clearly I was wrong then, not sure what I missed in the code though.
>From the lockdep trace it looks like the dependency comes via the
genl_lock, I didn't consider that.

David, can you please just revert it for now and tag the revert for
stable as well, in case Greg already took it somewhere? I think that
some stable trees may need a different fix anyway since they don't have
parallel_ops.

We never saw a problem due to this, and though it's probably fairly easy
to trigger by hand-rolling the dump loop in userspace, one does need to
be able to rmmod to crash the kernel with it.

The only way to fix this that I see right now (that doesn't rewrite the
locking completely) would be to make genetlink use parallel_ops itself,
thereby removing the genl_lock() in genl_rcv_msg() and breaking all
those lock chains that lockdep reported. After that, it should be safe
to use genl_lock() inside all the operations. Something like the patch
below, perhaps? Completely untested so far.

johannes


diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f85f8a2..dea9586 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -665,6 +665,7 @@ static struct genl_family genl_ctrl = {
 	.version = 0x2,
 	.maxattr = CTRL_ATTR_MAX,
 	.netnsok = true,
+	.parallel_ops = true,
 };
 
 static int ctrl_fill_info(struct genl_family *family, u32 portid, u32 seq,
@@ -789,10 +790,8 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net *net = sock_net(skb->sk);
 	int chains_to_skip = cb->args[0];
 	int fams_to_skip = cb->args[1];
-	bool need_locking = chains_to_skip || fams_to_skip;
 
-	if (need_locking)
-		genl_lock();
+	genl_lock();
 
 	for (i = chains_to_skip; i < GENL_FAM_TAB_SIZE; i++) {
 		n = 0;
@@ -814,8 +813,7 @@ errout:
 	cb->args[0] = i;
 	cb->args[1] = n;
 
-	if (need_locking)
-		genl_unlock();
+	genl_unlock();
 
 	return skb->len;
 }
@@ -870,6 +868,8 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
 	struct genl_family *res = NULL;
 	int err = -EINVAL;
 
+	genl_lock();
+
 	if (info->attrs[CTRL_ATTR_FAMILY_ID]) {
 		u16 id = nla_get_u16(info->attrs[CTRL_ATTR_FAMILY_ID]);
 		res = genl_family_find_byid(id);
@@ -896,19 +896,25 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	if (res == NULL)
-		return err;
+		goto out_unlock;
 
 	if (!res->netnsok && !net_eq(genl_info_net(info), &init_net)) {
 		/* family doesn't exist here */
-		return -ENOENT;
+		err = -ENOENT;
+		goto out_unlock;
 	}
 
 	msg = ctrl_build_family_msg(res, info->snd_portid, info->snd_seq,
 				    CTRL_CMD_NEWFAMILY);
-	if (IS_ERR(msg))
-		return PTR_ERR(msg);
+	if (IS_ERR(msg)) {
+		err = PTR_ERR(msg);
+		goto out_unlock;
+	}
 
-	return genlmsg_reply(msg, info);
+	err = genlmsg_reply(msg, info);
+out_unlock:
+	genl_unlock();
+	return err;
 }
 
 static int genl_ctrl_event(int event, void *data)



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-20  8:28         ` Johannes Berg
@ 2013-08-20  9:04           ` Borislav Petkov
  2013-08-20 15:49             ` Hugh Dickins
  2013-08-20 19:02           ` Johannes Berg
  1 sibling, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2013-08-20  9:04 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Hugh Dickins, Ding Tianhong, Linus Torvalds, Greg KH,
	David S. Miller, Otcheretianski, Andrei, linux-kernel, netdev,
	stable, Pravin B Shelar, Thomas Graf

On Tue, Aug 20, 2013 at 10:28:58AM +0200, Johannes Berg wrote:
> Something like the patch below, perhaps? Completely untested so far.

Yeah, this one seems to fix it here (I was seeing the same lockdep splat
as Hugh).

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-20  9:04           ` Borislav Petkov
@ 2013-08-20 15:49             ` Hugh Dickins
  0 siblings, 0 replies; 10+ messages in thread
From: Hugh Dickins @ 2013-08-20 15:49 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Borislav Petkov, Ding Tianhong, Linus Torvalds, Greg KH,
	David S. Miller, Otcheretianski, Andrei, linux-kernel, netdev,
	stable, Pravin B Shelar, Thomas Graf

On Tue, 20 Aug 2013, Borislav Petkov wrote:
> On Tue, Aug 20, 2013 at 10:28:58AM +0200, Johannes Berg wrote:
> > Something like the patch below, perhaps? Completely untested so far.
> 
> Yeah, this one seems to fix it here (I was seeing the same lockdep splat
> as Hugh).

Not so good for me: it fixed the original splat, but a few seconds later:

[    4.073542] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    4.175849] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
[    4.176223] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[    4.182322] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    4.182537] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1
[    4.405766] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    4.405973] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1
[    4.504441] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[    6.204569] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/serio2/input/input8
[    7.662343] 
[    7.662361] ======================================================
[    7.662393] [ INFO: possible circular locking dependency detected ]
[    7.662426] 3.11.0-rc6 #5 Not tainted
[    7.662462] -------------------------------------------------------
[    7.662500] wpa_supplicant/418 is trying to acquire lock:
[    7.662533]  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e
[    7.662603] 
[    7.662603] but task is already holding lock:
[    7.662638]  (genl_mutex){+.+.+.}, at: [<ffffffff8148204d>] genl_lock+0x12/0x14
[    7.662695] 
[    7.662695] which lock already depends on the new lock.
[    7.662695] 
[    7.662743] 
[    7.662743] the existing dependency chain (in reverse order) is:
[    7.662788] 
[    7.662788] -> #1 (genl_mutex){+.+.+.}:
[    7.662827]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    7.662870]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    7.662909]        [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345
[    7.662951]        [<ffffffff8148204d>] genl_lock+0x12/0x14
[    7.662989]        [<ffffffff814822cc>] ctrl_dumpfamily+0x2b/0xea
[    7.663029]        [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7
[    7.663069]        [<ffffffff81480187>] __netlink_dump_start+0x113/0x14e
[    7.663113]        [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252
[    7.663152]        [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c
[    7.663192]        [<ffffffff8148199b>] genl_rcv+0x24/0x34
[    7.663228]        [<ffffffff814811ca>] netlink_unicast+0xed/0x17a
[    7.663270]        [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345
[    7.663311]        [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e
[    7.663351]        [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be
[    7.663391]        [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e
[    7.663431]        [<ffffffff81453256>] SyS_sendmsg+0xd/0x19
[    7.663469]        [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    7.663513] 
[    7.663513] -> #0 (nlk->cb_mutex){+.+.+.}:
[    7.663554]        [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e
[    7.663599]        [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    7.663640]        [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    7.663679]        [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345
[    7.663722]        [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e
[    7.663765]        [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252
[    7.663804]        [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c
[    7.663844]        [<ffffffff8148199b>] genl_rcv+0x24/0x34
[    7.663881]        [<ffffffff814811ca>] netlink_unicast+0xed/0x17a
[    7.663921]        [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345
[    7.663961]        [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e
[    7.664000]        [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be
[    7.664041]        [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e
[    7.664081]        [<ffffffff81453256>] SyS_sendmsg+0xd/0x19
[    7.664119]        [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[    7.664161] 
[    7.664161] other info that might help us debug this:
[    7.664161] 
[    7.666543]  Possible unsafe locking scenario:
[    7.666543] 
[    7.668877]        CPU0                    CPU1
[    7.670032]        ----                    ----
[    7.671165]   lock(genl_mutex);
[    7.672298]                                lock(nlk->cb_mutex);
[    7.673424]                                lock(genl_mutex);
[    7.674532]   lock(nlk->cb_mutex);
[    7.675607] 
[    7.675607]  *** DEADLOCK ***
[    7.675607] 
[    7.678696] 2 locks held by wpa_supplicant/418:
[    7.679704]  #0:  (cb_lock){++++++}, at: [<ffffffff8148198c>] genl_rcv+0x15/0x34
[    7.680772]  #1:  (genl_mutex){+.+.+.}, at: [<ffffffff8148204d>] genl_lock+0x12/0x14
[    7.681840] 
[    7.681840] stack backtrace:
[    7.683933] CPU: 0 PID: 418 Comm: wpa_supplicant Not tainted 3.11.0-rc6 #5
[    7.685022] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
[    7.686118]  ffffffff81cc8750 ffff88022b8117b8 ffffffff8157cf90 0000000000000006
[    7.687230]  ffffffff81d0a450 ffff88022b811808 ffffffff8157a8a8 0000000000000001
[    7.688347]  ffff880230d0a080 ffff880230d0a080 ffff880230d0a778 ffff880230d0a080
[    7.689461] Call Trace:
[    7.690551]  [<ffffffff8157cf90>] dump_stack+0x4f/0x84
[    7.691657]  [<ffffffff8157a8a8>] print_circular_bug+0x2ad/0x2be
[    7.692775]  [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e
[    7.693893]  [<ffffffff810b34d2>] __lock_acquire+0x865/0x956
[    7.695012]  [<ffffffff81480122>] ? __netlink_dump_start+0xae/0x14e
[    7.696137]  [<ffffffff810b39fc>] lock_acquire+0x57/0x6d
[    7.697275]  [<ffffffff81480122>] ? __netlink_dump_start+0xae/0x14e
[    7.698404]  [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345
[    7.699541]  [<ffffffff81480122>] ? __netlink_dump_start+0xae/0x14e
[    7.700696]  [<ffffffff81586c32>] ? _raw_read_unlock+0x2d/0x4a
[    7.701840]  [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e
[    7.702991]  [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252
[    7.704143]  [<ffffffff81536380>] ? nl80211_dump_mpath+0x10d/0x10d
[    7.705317]  [<ffffffff8148204f>] ? genl_lock+0x14/0x14
[    7.706464]  [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c
[    7.707614]  [<ffffffff8148199b>] genl_rcv+0x24/0x34
[    7.708756]  [<ffffffff814811ca>] netlink_unicast+0xed/0x17a
[    7.709890]  [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345
[    7.711018]  [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e
[    7.712144]  [<ffffffff810f86fa>] ? might_fault+0x52/0xa2
[    7.713280]  [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be
[    7.714400]  [<ffffffff810fd37b>] ? handle_mm_fault+0x47d/0x49d
[    7.715525]  [<ffffffff81087aa5>] ? up_read+0x1b/0x32
[    7.716663]  [<ffffffff81053c34>] ? __do_page_fault+0x370/0x414
[    7.717789]  [<ffffffff811488e4>] ? fget_light+0x115/0x377
[    7.718922]  [<ffffffff810f86fa>] ? might_fault+0x52/0xa2
[    7.720052]  [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e
[    7.721192]  [<ffffffff81453256>] SyS_sendmsg+0xd/0x19
[    7.722309]  [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b
[   11.044520] wlan1: authenticate with c0:3f:0e:ad:ff:ee
[   11.096583] wlan1: send auth to c0:3f:0e:ad:ff:ee (try 1/3)
[   11.099448] wlan1: authenticated
[   11.100813] wlan1: associate with c0:3f:0e:ad:ff:ee (try 1/3)
[   11.105771] wlan1: RX AssocResp from c0:3f:0e:ad:ff:ee (capab=0x411 status=0 aid=6)
[   11.114801] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[   11.115884] wlan1: associated

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-20  8:28         ` Johannes Berg
  2013-08-20  9:04           ` Borislav Petkov
@ 2013-08-20 19:02           ` Johannes Berg
  2013-08-20 19:10             ` Johannes Berg
  1 sibling, 1 reply; 10+ messages in thread
From: Johannes Berg @ 2013-08-20 19:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller,
	Otcheretianski, Andrei, linux-kernel, netdev, stable,
	Pravin B Shelar, Thomas Graf

On Tue, 2013-08-20 at 10:28 +0200, Johannes Berg wrote:

> The only way to fix this that I see right now (that doesn't rewrite the
> locking completely) would be to make genetlink use parallel_ops itself,
> thereby removing the genl_lock() in genl_rcv_msg() and breaking all
> those lock chains that lockdep reported. After that, it should be safe
> to use genl_lock() inside all the operations. Something like the patch
> below, perhaps? Completely untested so far.

Tested now, and it still causes lockdep to complain, though that's a
lockdep issue I believe, it thinks that genl_mutex and nlk->cb_mutex can
be inverted although nlk->cb_mutex exists per family, so we need to
annotate lockdep there.

johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 3.11-rc6 genetlink locking fix offends lockdep
  2013-08-20 19:02           ` Johannes Berg
@ 2013-08-20 19:10             ` Johannes Berg
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Berg @ 2013-08-20 19:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller,
	Otcheretianski, Andrei, linux-kernel, netdev, stable,
	Pravin B Shelar, Thomas Graf

On Tue, 2013-08-20 at 21:02 +0200, Johannes Berg wrote:
> On Tue, 2013-08-20 at 10:28 +0200, Johannes Berg wrote:
> 
> > The only way to fix this that I see right now (that doesn't rewrite the
> > locking completely) would be to make genetlink use parallel_ops itself,
> > thereby removing the genl_lock() in genl_rcv_msg() and breaking all
> > those lock chains that lockdep reported. After that, it should be safe
> > to use genl_lock() inside all the operations. Something like the patch
> > below, perhaps? Completely untested so far.
> 
> Tested now, and it still causes lockdep to complain, though that's a
> lockdep issue I believe, it thinks that genl_mutex and nlk->cb_mutex can
> be inverted although nlk->cb_mutex exists per family, so we need to
> annotate lockdep there.

No, lockdep is correct - generic netlink uses the same cb_mutex for all
families, obviously, since it's all the same netlink family.

I'll just convert it to RCU.

johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-08-20 19:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-19  5:04 3.11-rc6 genetlink locking fix offends lockdep Hugh Dickins
2013-08-19  8:00 ` Johannes Berg
2013-08-19 11:00   ` Ding Tianhong
2013-08-19 11:22     ` Johannes Berg
2013-08-19 18:52       ` Hugh Dickins
2013-08-20  8:28         ` Johannes Berg
2013-08-20  9:04           ` Borislav Petkov
2013-08-20 15:49             ` Hugh Dickins
2013-08-20 19:02           ` Johannes Berg
2013-08-20 19:10             ` Johannes Berg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.