netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: hung task while plugging in cfg80211
       [not found] <20130712074911.01b5945a@corrin.poochiereds.net>
@ 2013-07-24  9:59 ` Stanislaw Gruszka
       [not found]   ` <20130724095911.GA1637-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Stanislaw Gruszka @ 2013-07-24  9:59 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-wireless, linville, netdev, Thomas Graf, Stephen Hemminger,
	rjones, Marcel Holtmann

On Fri, Jul 12, 2013 at 07:49:11AM -0400, Jeff Layton wrote:
> I updated the kernel on my rawhide KVM guest, and noticed that the
> ethernet interface wasn't coming up at all. While poking around, I saw
> this stack trace pop up:
> 
> Jul 12 07:29:25 rawhide kernel: [  241.156196] INFO: task modprobe:501 blocked for more than 120 seconds.
> Jul 12 07:29:25 rawhide kernel: [  241.157222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jul 12 07:29:25 rawhide kernel: [  241.158379] modprobe        D ffff880118c526e0  4584   501    500 0x00000080
> Jul 12 07:29:25 rawhide kernel: [  241.159458]  ffff8800d10c3ba8 0000000000000046 00000000001d5300 ffff8800d10c3fd8
> Jul 12 07:29:25 rawhide kernel: [  241.160644]  ffff8800d10c3fd8 00000000001d5300 ffff8800d2ae4dc0 ffff8800d2ae4dc0
> Jul 12 07:29:25 rawhide kernel: [  241.161875]  ffffffff81d0bfa0 ffffffff81d0bfa8 ffffffff00000000 ffffffff81d0bff0
> Jul 12 07:29:25 rawhide kernel: [  241.163096] Call Trace:
> Jul 12 07:29:25 rawhide kernel: [  241.163460]  [<ffffffff817377e9>] schedule+0x29/0x70
> Jul 12 07:29:25 rawhide kernel: [  241.164235]  [<ffffffff817392ad>] rwsem_down_write_failed+0xed/0x1a0
> Jul 12 07:29:25 rawhide kernel: [  241.165489]  [<ffffffff810bb600>] ? update_cpu_load_active+0xb0/0xb0
> Jul 12 07:29:25 rawhide kernel: [  241.166267]  [<ffffffff8137d503>] call_rwsem_down_write_failed+0x13/0x20
> Jul 12 07:29:25 rawhide kernel: [  241.166760]  [<ffffffff81736aad>] ? down_write+0x9d/0xb2
> Jul 12 07:29:25 rawhide kernel: [  241.167189]  [<ffffffff8162d975>] ? genl_lock_all+0x15/0x30
> Jul 12 07:29:25 rawhide kernel: [  241.167603]  [<ffffffff8162d975>] genl_lock_all+0x15/0x30

I reproduced that on qemu-kvm with rawhide LiveCD. Below more
information gathered using sysrq:

(qemu) [ 2458.110724] SysRq : Show Locks Held
[ 2458.111347] 
[ 2458.111347] Showing all locks held in the system:
[ 2458.111347] 1 lock held by NetworkManager/582:
[ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162bc79>] genl_rcv+0x19/0x40
[ 2458.111347] 1 lock held by modprobe/603:
[ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
[ 2458.111347] 
[ 2458.111347] =============================================
[ 2458.111347] 

(qemu) sendkey alt-sysrq-w
(qemu) [ 2461.579457] SysRq : Show Blocked State
[ 2461.580103]   task                        PC stack   pid father
[ 2461.580103] NetworkManager  D ffff880034b84500  4040   582      1 0x00000080
[ 2461.580103]  ffff8800197ff720 0000000000000046 00000000001d5340 ffff8800197fffd8
[ 2461.580103]  ffff8800197fffd8 00000000001d5340 ffff880019631700 7fffffffffffffff
[ 2461.580103]  ffff8800197ff880 ffff8800197ff878 ffff880019631700 ffff880019631700
[ 2461.580103] Call Trace:
[ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
[ 2461.580103]  [<ffffffff81731ad1>] schedule_timeout+0x1c1/0x360
[ 2461.580103]  [<ffffffff810e69eb>] ? mark_held_locks+0xbb/0x140
[ 2461.580103]  [<ffffffff817377ac>] ? _raw_spin_unlock_irq+0x2c/0x50
[ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 2461.580103]  [<ffffffff81736398>] wait_for_completion_killable+0xe8/0x170
[ 2461.580103]  [<ffffffff810b7fa0>] ? wake_up_state+0x20/0x20
[ 2461.580103]  [<ffffffff81095825>] call_usermodehelper_exec+0x1a5/0x210
[ 2461.580103]  [<ffffffff817362ed>] ? wait_for_completion_killable+0x3d/0x170
[ 2461.580103]  [<ffffffff81095cc3>] __request_module+0x1b3/0x370
[ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 2461.580103]  [<ffffffff8162c5c9>] ctrl_getfamily+0x159/0x190
[ 2461.580103]  [<ffffffff8162d8a4>] genl_family_rcv_msg+0x1f4/0x2e0
[ 2461.580103]  [<ffffffff8162d990>] ? genl_family_rcv_msg+0x2e0/0x2e0
[ 2461.580103]  [<ffffffff8162da1e>] genl_rcv_msg+0x8e/0xd0
[ 2461.580103]  [<ffffffff8162b729>] netlink_rcv_skb+0xa9/0xc0
[ 2461.580103]  [<ffffffff8162bc88>] genl_rcv+0x28/0x40
[ 2461.580103]  [<ffffffff8162ad6d>] netlink_unicast+0xdd/0x190
[ 2461.580103]  [<ffffffff8162b149>] netlink_sendmsg+0x329/0x750
[ 2461.580103]  [<ffffffff815db849>] sock_sendmsg+0x99/0xd0
[ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
[ 2461.580103]  [<ffffffff810e96e8>] ? lock_release_non_nested+0x308/0x350
[ 2461.580103]  [<ffffffff815dbc6e>] ___sys_sendmsg+0x39e/0x3b0
[ 2461.580103]  [<ffffffff810565af>] ? kvm_clock_read+0x2f/0x50
[ 2461.580103]  [<ffffffff810218b9>] ? sched_clock+0x9/0x10
[ 2461.580103]  [<ffffffff810bb2bd>] ? sched_clock_local+0x1d/0x80
[ 2461.580103]  [<ffffffff810bb448>] ? sched_clock_cpu+0xa8/0x100
[ 2461.580103]  [<ffffffff810e33ad>] ? trace_hardirqs_off+0xd/0x10
[ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
[ 2461.580103]  [<ffffffff810e3f7f>] ? lock_release_holdtime.part.28+0xf/0x1a0
[ 2461.580103]  [<ffffffff8120fec9>] ? fget_light+0xf9/0x510
[ 2461.580103]  [<ffffffff8120fe0c>] ? fget_light+0x3c/0x510
[ 2461.580103]  [<ffffffff815dd1d2>] __sys_sendmsg+0x42/0x80
[ 2461.580103]  [<ffffffff815dd222>] SyS_sendmsg+0x12/0x20
[ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
[ 2461.580103] modprobe        D ffff88000f2c8000  4632   603    602 0x00000080
[ 2461.580103]  ffff88000f04fba8 0000000000000046 00000000001d5340 ffff88000f04ffd8
[ 2461.580103]  ffff88000f04ffd8 00000000001d5340 ffff8800377d4500 ffff8800377d4500
[ 2461.580103]  ffffffff81d0b260 ffffffff81d0b268 ffffffff00000000 ffffffff81d0b2b0
[ 2461.580103] Call Trace:
[ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
[ 2461.580103]  [<ffffffff81736d4d>] rwsem_down_write_failed+0xed/0x1a0
[ 2461.580103]  [<ffffffff810bb200>] ? update_cpu_load_active+0x10/0xb0
[ 2461.580103]  [<ffffffff8137b473>] call_rwsem_down_write_failed+0x13/0x20
[ 2461.580103]  [<ffffffff8173492d>] ? down_write+0x9d/0xb2
[ 2461.580103]  [<ffffffff8162baa5>] ? genl_lock_all+0x15/0x30
[ 2461.580103]  [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
[ 2461.580103]  [<ffffffff8162cbb3>] genl_register_family+0x53/0x1f0
[ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
[ 2461.580103]  [<ffffffff8162d650>] genl_register_family_with_ops+0x20/0x80
[ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
[ 2461.580103]  [<ffffffffa017fe84>] nl80211_init+0x24/0xf0 [cfg80211]
[ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
[ 2461.580103]  [<ffffffffa01dc043>] cfg80211_init+0x43/0xdb [cfg80211]
[ 2461.580103]  [<ffffffff810020fa>] do_one_initcall+0xfa/0x1b0
[ 2461.580103]  [<ffffffff8105cb93>] ? set_memory_nx+0x43/0x50
[ 2461.580103]  [<ffffffff810f75af>] load_module+0x1c6f/0x27f0
[ 2461.580103]  [<ffffffff810f2c90>] ? store_uevent+0x40/0x40
[ 2461.580103]  [<ffffffff810f82c6>] SyS_finit_module+0x86/0xb0
[ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
[ 2461.580103] Sched Debug Version: v0.10, 3.11.0-0.rc1.git4.1.fc20.x86_64 #1

So call user space to load cfg80211 module (alias
net-pf-16-proto-16-family-nl80211) from ctrl_getfamily() with read taken
cb_lock semaphore. That deadlock in modprobe process in
genl_register_family() when try to down_write cb_lock.

This is consequence of commit:

commit fb4e156886ce6e8309e912d8b370d192330d19d3
Author: Marcel Holtmann <marcel@holtmann.org>
Date:   Sun Apr 28 16:22:06 2013 -0700

    nl80211: Add generic netlink module alias for cfg80211/nl80211

I prepared a possible fix for that (patch below), but I'm not sure if it
do right thing, it can cause other oddity, please test (with
CONFIG_LOCKDEP compiled). I think I would prefer rather remove
autoloading genetlink modules from kernel or use just one lock
in instead of two cb_lock & genl_mutex .

Stanislaw

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 2fd6dbe..1076fe1 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -877,8 +877,10 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
 #ifdef CONFIG_MODULES
 		if (res == NULL) {
 			genl_unlock();
+			up_read(&cb_lock);
 			request_module("net-pf-%d-proto-%d-family-%s",
 				       PF_NETLINK, NETLINK_GENERIC, name);
+			down_read(&cb_lock);
 			genl_lock();
 			res = genl_family_find_byname(name);
 		}

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: hung task while plugging in cfg80211
       [not found]   ` <20130724095911.GA1637-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-07-25 14:51     ` Jeff Layton
  2013-07-26  9:00       ` [PATCH 3.11] genetlink: release cb_lock before requesting additional module Stanislaw Gruszka
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2013-07-25 14:51 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Thomas Graf, Stephen Hemminger, rjones-H+wXaHxf7aLQT0dZR+AlfA,
	Marcel Holtmann

On Wed, 24 Jul 2013 11:59:12 +0200
Stanislaw Gruszka <sgruszka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Fri, Jul 12, 2013 at 07:49:11AM -0400, Jeff Layton wrote:
> > I updated the kernel on my rawhide KVM guest, and noticed that the
> > ethernet interface wasn't coming up at all. While poking around, I saw
> > this stack trace pop up:
> > 
> > Jul 12 07:29:25 rawhide kernel: [  241.156196] INFO: task modprobe:501 blocked for more than 120 seconds.
> > Jul 12 07:29:25 rawhide kernel: [  241.157222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Jul 12 07:29:25 rawhide kernel: [  241.158379] modprobe        D ffff880118c526e0  4584   501    500 0x00000080
> > Jul 12 07:29:25 rawhide kernel: [  241.159458]  ffff8800d10c3ba8 0000000000000046 00000000001d5300 ffff8800d10c3fd8
> > Jul 12 07:29:25 rawhide kernel: [  241.160644]  ffff8800d10c3fd8 00000000001d5300 ffff8800d2ae4dc0 ffff8800d2ae4dc0
> > Jul 12 07:29:25 rawhide kernel: [  241.161875]  ffffffff81d0bfa0 ffffffff81d0bfa8 ffffffff00000000 ffffffff81d0bff0
> > Jul 12 07:29:25 rawhide kernel: [  241.163096] Call Trace:
> > Jul 12 07:29:25 rawhide kernel: [  241.163460]  [<ffffffff817377e9>] schedule+0x29/0x70
> > Jul 12 07:29:25 rawhide kernel: [  241.164235]  [<ffffffff817392ad>] rwsem_down_write_failed+0xed/0x1a0
> > Jul 12 07:29:25 rawhide kernel: [  241.165489]  [<ffffffff810bb600>] ? update_cpu_load_active+0xb0/0xb0
> > Jul 12 07:29:25 rawhide kernel: [  241.166267]  [<ffffffff8137d503>] call_rwsem_down_write_failed+0x13/0x20
> > Jul 12 07:29:25 rawhide kernel: [  241.166760]  [<ffffffff81736aad>] ? down_write+0x9d/0xb2
> > Jul 12 07:29:25 rawhide kernel: [  241.167189]  [<ffffffff8162d975>] ? genl_lock_all+0x15/0x30
> > Jul 12 07:29:25 rawhide kernel: [  241.167603]  [<ffffffff8162d975>] genl_lock_all+0x15/0x30
> 
> I reproduced that on qemu-kvm with rawhide LiveCD. Below more
> information gathered using sysrq:
> 
> (qemu) [ 2458.110724] SysRq : Show Locks Held
> [ 2458.111347] 
> [ 2458.111347] Showing all locks held in the system:
> [ 2458.111347] 1 lock held by NetworkManager/582:
> [ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162bc79>] genl_rcv+0x19/0x40
> [ 2458.111347] 1 lock held by modprobe/603:
> [ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
> [ 2458.111347] 
> [ 2458.111347] =============================================
> [ 2458.111347] 
> 
> (qemu) sendkey alt-sysrq-w
> (qemu) [ 2461.579457] SysRq : Show Blocked State
> [ 2461.580103]   task                        PC stack   pid father
> [ 2461.580103] NetworkManager  D ffff880034b84500  4040   582      1 0x00000080
> [ 2461.580103]  ffff8800197ff720 0000000000000046 00000000001d5340 ffff8800197fffd8
> [ 2461.580103]  ffff8800197fffd8 00000000001d5340 ffff880019631700 7fffffffffffffff
> [ 2461.580103]  ffff8800197ff880 ffff8800197ff878 ffff880019631700 ffff880019631700
> [ 2461.580103] Call Trace:
> [ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
> [ 2461.580103]  [<ffffffff81731ad1>] schedule_timeout+0x1c1/0x360
> [ 2461.580103]  [<ffffffff810e69eb>] ? mark_held_locks+0xbb/0x140
> [ 2461.580103]  [<ffffffff817377ac>] ? _raw_spin_unlock_irq+0x2c/0x50
> [ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [ 2461.580103]  [<ffffffff81736398>] wait_for_completion_killable+0xe8/0x170
> [ 2461.580103]  [<ffffffff810b7fa0>] ? wake_up_state+0x20/0x20
> [ 2461.580103]  [<ffffffff81095825>] call_usermodehelper_exec+0x1a5/0x210
> [ 2461.580103]  [<ffffffff817362ed>] ? wait_for_completion_killable+0x3d/0x170
> [ 2461.580103]  [<ffffffff81095cc3>] __request_module+0x1b3/0x370
> [ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [ 2461.580103]  [<ffffffff8162c5c9>] ctrl_getfamily+0x159/0x190
> [ 2461.580103]  [<ffffffff8162d8a4>] genl_family_rcv_msg+0x1f4/0x2e0
> [ 2461.580103]  [<ffffffff8162d990>] ? genl_family_rcv_msg+0x2e0/0x2e0
> [ 2461.580103]  [<ffffffff8162da1e>] genl_rcv_msg+0x8e/0xd0
> [ 2461.580103]  [<ffffffff8162b729>] netlink_rcv_skb+0xa9/0xc0
> [ 2461.580103]  [<ffffffff8162bc88>] genl_rcv+0x28/0x40
> [ 2461.580103]  [<ffffffff8162ad6d>] netlink_unicast+0xdd/0x190
> [ 2461.580103]  [<ffffffff8162b149>] netlink_sendmsg+0x329/0x750
> [ 2461.580103]  [<ffffffff815db849>] sock_sendmsg+0x99/0xd0
> [ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
> [ 2461.580103]  [<ffffffff810e96e8>] ? lock_release_non_nested+0x308/0x350
> [ 2461.580103]  [<ffffffff815dbc6e>] ___sys_sendmsg+0x39e/0x3b0
> [ 2461.580103]  [<ffffffff810565af>] ? kvm_clock_read+0x2f/0x50
> [ 2461.580103]  [<ffffffff810218b9>] ? sched_clock+0x9/0x10
> [ 2461.580103]  [<ffffffff810bb2bd>] ? sched_clock_local+0x1d/0x80
> [ 2461.580103]  [<ffffffff810bb448>] ? sched_clock_cpu+0xa8/0x100
> [ 2461.580103]  [<ffffffff810e33ad>] ? trace_hardirqs_off+0xd/0x10
> [ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
> [ 2461.580103]  [<ffffffff810e3f7f>] ? lock_release_holdtime.part.28+0xf/0x1a0
> [ 2461.580103]  [<ffffffff8120fec9>] ? fget_light+0xf9/0x510
> [ 2461.580103]  [<ffffffff8120fe0c>] ? fget_light+0x3c/0x510
> [ 2461.580103]  [<ffffffff815dd1d2>] __sys_sendmsg+0x42/0x80
> [ 2461.580103]  [<ffffffff815dd222>] SyS_sendmsg+0x12/0x20
> [ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
> [ 2461.580103] modprobe        D ffff88000f2c8000  4632   603    602 0x00000080
> [ 2461.580103]  ffff88000f04fba8 0000000000000046 00000000001d5340 ffff88000f04ffd8
> [ 2461.580103]  ffff88000f04ffd8 00000000001d5340 ffff8800377d4500 ffff8800377d4500
> [ 2461.580103]  ffffffff81d0b260 ffffffff81d0b268 ffffffff00000000 ffffffff81d0b2b0
> [ 2461.580103] Call Trace:
> [ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
> [ 2461.580103]  [<ffffffff81736d4d>] rwsem_down_write_failed+0xed/0x1a0
> [ 2461.580103]  [<ffffffff810bb200>] ? update_cpu_load_active+0x10/0xb0
> [ 2461.580103]  [<ffffffff8137b473>] call_rwsem_down_write_failed+0x13/0x20
> [ 2461.580103]  [<ffffffff8173492d>] ? down_write+0x9d/0xb2
> [ 2461.580103]  [<ffffffff8162baa5>] ? genl_lock_all+0x15/0x30
> [ 2461.580103]  [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
> [ 2461.580103]  [<ffffffff8162cbb3>] genl_register_family+0x53/0x1f0
> [ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
> [ 2461.580103]  [<ffffffff8162d650>] genl_register_family_with_ops+0x20/0x80
> [ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
> [ 2461.580103]  [<ffffffffa017fe84>] nl80211_init+0x24/0xf0 [cfg80211]
> [ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
> [ 2461.580103]  [<ffffffffa01dc043>] cfg80211_init+0x43/0xdb [cfg80211]
> [ 2461.580103]  [<ffffffff810020fa>] do_one_initcall+0xfa/0x1b0
> [ 2461.580103]  [<ffffffff8105cb93>] ? set_memory_nx+0x43/0x50
> [ 2461.580103]  [<ffffffff810f75af>] load_module+0x1c6f/0x27f0
> [ 2461.580103]  [<ffffffff810f2c90>] ? store_uevent+0x40/0x40
> [ 2461.580103]  [<ffffffff810f82c6>] SyS_finit_module+0x86/0xb0
> [ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
> [ 2461.580103] Sched Debug Version: v0.10, 3.11.0-0.rc1.git4.1.fc20.x86_64 #1
> 
> So call user space to load cfg80211 module (alias
> net-pf-16-proto-16-family-nl80211) from ctrl_getfamily() with read taken
> cb_lock semaphore. That deadlock in modprobe process in
> genl_register_family() when try to down_write cb_lock.
> 
> This is consequence of commit:
> 
> commit fb4e156886ce6e8309e912d8b370d192330d19d3
> Author: Marcel Holtmann <marcel-kz+m5ild9QBg9hUCZPvPmw@public.gmane.org>
> Date:   Sun Apr 28 16:22:06 2013 -0700
> 
>     nl80211: Add generic netlink module alias for cfg80211/nl80211
> 
> I prepared a possible fix for that (patch below), but I'm not sure if it
> do right thing, it can cause other oddity, please test (with
> CONFIG_LOCKDEP compiled). I think I would prefer rather remove
> autoloading genetlink modules from kernel or use just one lock
> in instead of two cb_lock & genl_mutex .
> 
> Stanislaw
> 
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index 2fd6dbe..1076fe1 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -877,8 +877,10 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
>  #ifdef CONFIG_MODULES
>  		if (res == NULL) {
>  			genl_unlock();
> +			up_read(&cb_lock);
>  			request_module("net-pf-%d-proto-%d-family-%s",
>  				       PF_NETLINK, NETLINK_GENERIC, name);
> +			down_read(&cb_lock);
>  			genl_lock();
>  			res = genl_family_find_byname(name);
>  		}

The patch above fixes the problem for me. I'll leave it to others to
decide whether it's the best approach.

Thanks!
-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 3.11] genetlink: release cb_lock before requesting additional module
  2013-07-25 14:51     ` Jeff Layton
@ 2013-07-26  9:00       ` Stanislaw Gruszka
  2013-07-26 17:08         ` Pravin Shelar
  0 siblings, 1 reply; 5+ messages in thread
From: Stanislaw Gruszka @ 2013-07-26  9:00 UTC (permalink / raw)
  To: David S. Miller
  Cc: linux-wireless, linville, netdev, Thomas Graf, Stephen Hemminger,
	rjones, Marcel Holtmann, Jeff Layton

Requesting external module with cb_lock taken can result in
the deadlock like showed below:

[ 2458.111347] Showing all locks held in the system:
[ 2458.111347] 1 lock held by NetworkManager/582:
[ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162bc79>] genl_rcv+0x19/0x40
[ 2458.111347] 1 lock held by modprobe/603:
[ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162baa5>] genl_lock_all+0x15/0x30

[ 2461.579457] SysRq : Show Blocked State
[ 2461.580103]   task                        PC stack   pid father
[ 2461.580103] NetworkManager  D ffff880034b84500  4040   582      1 0x00000080
[ 2461.580103]  ffff8800197ff720 0000000000000046 00000000001d5340 ffff8800197fffd8
[ 2461.580103]  ffff8800197fffd8 00000000001d5340 ffff880019631700 7fffffffffffffff
[ 2461.580103]  ffff8800197ff880 ffff8800197ff878 ffff880019631700 ffff880019631700
[ 2461.580103] Call Trace:
[ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
[ 2461.580103]  [<ffffffff81731ad1>] schedule_timeout+0x1c1/0x360
[ 2461.580103]  [<ffffffff810e69eb>] ? mark_held_locks+0xbb/0x140
[ 2461.580103]  [<ffffffff817377ac>] ? _raw_spin_unlock_irq+0x2c/0x50
[ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 2461.580103]  [<ffffffff81736398>] wait_for_completion_killable+0xe8/0x170
[ 2461.580103]  [<ffffffff810b7fa0>] ? wake_up_state+0x20/0x20
[ 2461.580103]  [<ffffffff81095825>] call_usermodehelper_exec+0x1a5/0x210
[ 2461.580103]  [<ffffffff817362ed>] ? wait_for_completion_killable+0x3d/0x170
[ 2461.580103]  [<ffffffff81095cc3>] __request_module+0x1b3/0x370
[ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 2461.580103]  [<ffffffff8162c5c9>] ctrl_getfamily+0x159/0x190
[ 2461.580103]  [<ffffffff8162d8a4>] genl_family_rcv_msg+0x1f4/0x2e0
[ 2461.580103]  [<ffffffff8162d990>] ? genl_family_rcv_msg+0x2e0/0x2e0
[ 2461.580103]  [<ffffffff8162da1e>] genl_rcv_msg+0x8e/0xd0
[ 2461.580103]  [<ffffffff8162b729>] netlink_rcv_skb+0xa9/0xc0
[ 2461.580103]  [<ffffffff8162bc88>] genl_rcv+0x28/0x40
[ 2461.580103]  [<ffffffff8162ad6d>] netlink_unicast+0xdd/0x190
[ 2461.580103]  [<ffffffff8162b149>] netlink_sendmsg+0x329/0x750
[ 2461.580103]  [<ffffffff815db849>] sock_sendmsg+0x99/0xd0
[ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
[ 2461.580103]  [<ffffffff810e96e8>] ? lock_release_non_nested+0x308/0x350
[ 2461.580103]  [<ffffffff815dbc6e>] ___sys_sendmsg+0x39e/0x3b0
[ 2461.580103]  [<ffffffff810565af>] ? kvm_clock_read+0x2f/0x50
[ 2461.580103]  [<ffffffff810218b9>] ? sched_clock+0x9/0x10
[ 2461.580103]  [<ffffffff810bb2bd>] ? sched_clock_local+0x1d/0x80
[ 2461.580103]  [<ffffffff810bb448>] ? sched_clock_cpu+0xa8/0x100
[ 2461.580103]  [<ffffffff810e33ad>] ? trace_hardirqs_off+0xd/0x10
[ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
[ 2461.580103]  [<ffffffff810e3f7f>] ? lock_release_holdtime.part.28+0xf/0x1a0
[ 2461.580103]  [<ffffffff8120fec9>] ? fget_light+0xf9/0x510
[ 2461.580103]  [<ffffffff8120fe0c>] ? fget_light+0x3c/0x510
[ 2461.580103]  [<ffffffff815dd1d2>] __sys_sendmsg+0x42/0x80
[ 2461.580103]  [<ffffffff815dd222>] SyS_sendmsg+0x12/0x20
[ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
[ 2461.580103] modprobe        D ffff88000f2c8000  4632   603    602 0x00000080
[ 2461.580103]  ffff88000f04fba8 0000000000000046 00000000001d5340 ffff88000f04ffd8
[ 2461.580103]  ffff88000f04ffd8 00000000001d5340 ffff8800377d4500 ffff8800377d4500
[ 2461.580103]  ffffffff81d0b260 ffffffff81d0b268 ffffffff00000000 ffffffff81d0b2b0
[ 2461.580103] Call Trace:
[ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
[ 2461.580103]  [<ffffffff81736d4d>] rwsem_down_write_failed+0xed/0x1a0
[ 2461.580103]  [<ffffffff810bb200>] ? update_cpu_load_active+0x10/0xb0
[ 2461.580103]  [<ffffffff8137b473>] call_rwsem_down_write_failed+0x13/0x20
[ 2461.580103]  [<ffffffff8173492d>] ? down_write+0x9d/0xb2
[ 2461.580103]  [<ffffffff8162baa5>] ? genl_lock_all+0x15/0x30
[ 2461.580103]  [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
[ 2461.580103]  [<ffffffff8162cbb3>] genl_register_family+0x53/0x1f0
[ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
[ 2461.580103]  [<ffffffff8162d650>] genl_register_family_with_ops+0x20/0x80
[ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
[ 2461.580103]  [<ffffffffa017fe84>] nl80211_init+0x24/0xf0 [cfg80211]
[ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
[ 2461.580103]  [<ffffffffa01dc043>] cfg80211_init+0x43/0xdb [cfg80211]
[ 2461.580103]  [<ffffffff810020fa>] do_one_initcall+0xfa/0x1b0
[ 2461.580103]  [<ffffffff8105cb93>] ? set_memory_nx+0x43/0x50
[ 2461.580103]  [<ffffffff810f75af>] load_module+0x1c6f/0x27f0
[ 2461.580103]  [<ffffffff810f2c90>] ? store_uevent+0x40/0x40
[ 2461.580103]  [<ffffffff810f82c6>] SyS_finit_module+0x86/0xb0
[ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
[ 2461.580103] Sched Debug Version: v0.10, 3.11.0-0.rc1.git4.1.fc20.x86_64 #1

Problem start to happen after adding net-pf-16-proto-16-family-nl80211
alias name to cfg80211 module by below commit (though that commit
itself is perfectly fine):

commit fb4e156886ce6e8309e912d8b370d192330d19d3
Author: Marcel Holtmann <marcel@holtmann.org>
Date:   Sun Apr 28 16:22:06 2013 -0700

    nl80211: Add generic netlink module alias for cfg80211/nl80211

Reported-and-tested-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 net/netlink/genetlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 2fd6dbe..1076fe1 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -877,8 +877,10 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
 #ifdef CONFIG_MODULES
 		if (res == NULL) {
 			genl_unlock();
+			up_read(&cb_lock);
 			request_module("net-pf-%d-proto-%d-family-%s",
 				       PF_NETLINK, NETLINK_GENERIC, name);
+			down_read(&cb_lock);
 			genl_lock();
 			res = genl_family_find_byname(name);
 		}
-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 3.11] genetlink: release cb_lock before requesting additional module
  2013-07-26  9:00       ` [PATCH 3.11] genetlink: release cb_lock before requesting additional module Stanislaw Gruszka
@ 2013-07-26 17:08         ` Pravin Shelar
  2013-07-28  5:20           ` David Miller
  0 siblings, 1 reply; 5+ messages in thread
From: Pravin Shelar @ 2013-07-26 17:08 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: David S. Miller, linux-wireless, linville, netdev, Thomas Graf,
	Stephen Hemminger, rjones, Marcel Holtmann, Jeff Layton

On Fri, Jul 26, 2013 at 2:00 AM, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> Requesting external module with cb_lock taken can result in
> the deadlock like showed below:
>
> [ 2458.111347] Showing all locks held in the system:
> [ 2458.111347] 1 lock held by NetworkManager/582:
> [ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162bc79>] genl_rcv+0x19/0x40
> [ 2458.111347] 1 lock held by modprobe/603:
> [ 2458.111347]  #0:  (cb_lock){++++++}, at: [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
>
> [ 2461.579457] SysRq : Show Blocked State
> [ 2461.580103]   task                        PC stack   pid father
> [ 2461.580103] NetworkManager  D ffff880034b84500  4040   582      1 0x00000080
> [ 2461.580103]  ffff8800197ff720 0000000000000046 00000000001d5340 ffff8800197fffd8
> [ 2461.580103]  ffff8800197fffd8 00000000001d5340 ffff880019631700 7fffffffffffffff
> [ 2461.580103]  ffff8800197ff880 ffff8800197ff878 ffff880019631700 ffff880019631700
> [ 2461.580103] Call Trace:
> [ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
> [ 2461.580103]  [<ffffffff81731ad1>] schedule_timeout+0x1c1/0x360
> [ 2461.580103]  [<ffffffff810e69eb>] ? mark_held_locks+0xbb/0x140
> [ 2461.580103]  [<ffffffff817377ac>] ? _raw_spin_unlock_irq+0x2c/0x50
> [ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [ 2461.580103]  [<ffffffff81736398>] wait_for_completion_killable+0xe8/0x170
> [ 2461.580103]  [<ffffffff810b7fa0>] ? wake_up_state+0x20/0x20
> [ 2461.580103]  [<ffffffff81095825>] call_usermodehelper_exec+0x1a5/0x210
> [ 2461.580103]  [<ffffffff817362ed>] ? wait_for_completion_killable+0x3d/0x170
> [ 2461.580103]  [<ffffffff81095cc3>] __request_module+0x1b3/0x370
> [ 2461.580103]  [<ffffffff810e6b6d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [ 2461.580103]  [<ffffffff8162c5c9>] ctrl_getfamily+0x159/0x190
> [ 2461.580103]  [<ffffffff8162d8a4>] genl_family_rcv_msg+0x1f4/0x2e0
> [ 2461.580103]  [<ffffffff8162d990>] ? genl_family_rcv_msg+0x2e0/0x2e0
> [ 2461.580103]  [<ffffffff8162da1e>] genl_rcv_msg+0x8e/0xd0
> [ 2461.580103]  [<ffffffff8162b729>] netlink_rcv_skb+0xa9/0xc0
> [ 2461.580103]  [<ffffffff8162bc88>] genl_rcv+0x28/0x40
> [ 2461.580103]  [<ffffffff8162ad6d>] netlink_unicast+0xdd/0x190
> [ 2461.580103]  [<ffffffff8162b149>] netlink_sendmsg+0x329/0x750
> [ 2461.580103]  [<ffffffff815db849>] sock_sendmsg+0x99/0xd0
> [ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
> [ 2461.580103]  [<ffffffff810e96e8>] ? lock_release_non_nested+0x308/0x350
> [ 2461.580103]  [<ffffffff815dbc6e>] ___sys_sendmsg+0x39e/0x3b0
> [ 2461.580103]  [<ffffffff810565af>] ? kvm_clock_read+0x2f/0x50
> [ 2461.580103]  [<ffffffff810218b9>] ? sched_clock+0x9/0x10
> [ 2461.580103]  [<ffffffff810bb2bd>] ? sched_clock_local+0x1d/0x80
> [ 2461.580103]  [<ffffffff810bb448>] ? sched_clock_cpu+0xa8/0x100
> [ 2461.580103]  [<ffffffff810e33ad>] ? trace_hardirqs_off+0xd/0x10
> [ 2461.580103]  [<ffffffff810bb58f>] ? local_clock+0x5f/0x70
> [ 2461.580103]  [<ffffffff810e3f7f>] ? lock_release_holdtime.part.28+0xf/0x1a0
> [ 2461.580103]  [<ffffffff8120fec9>] ? fget_light+0xf9/0x510
> [ 2461.580103]  [<ffffffff8120fe0c>] ? fget_light+0x3c/0x510
> [ 2461.580103]  [<ffffffff815dd1d2>] __sys_sendmsg+0x42/0x80
> [ 2461.580103]  [<ffffffff815dd222>] SyS_sendmsg+0x12/0x20
> [ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
> [ 2461.580103] modprobe        D ffff88000f2c8000  4632   603    602 0x00000080
> [ 2461.580103]  ffff88000f04fba8 0000000000000046 00000000001d5340 ffff88000f04ffd8
> [ 2461.580103]  ffff88000f04ffd8 00000000001d5340 ffff8800377d4500 ffff8800377d4500
> [ 2461.580103]  ffffffff81d0b260 ffffffff81d0b268 ffffffff00000000 ffffffff81d0b2b0
> [ 2461.580103] Call Trace:
> [ 2461.580103]  [<ffffffff817355f9>] schedule+0x29/0x70
> [ 2461.580103]  [<ffffffff81736d4d>] rwsem_down_write_failed+0xed/0x1a0
> [ 2461.580103]  [<ffffffff810bb200>] ? update_cpu_load_active+0x10/0xb0
> [ 2461.580103]  [<ffffffff8137b473>] call_rwsem_down_write_failed+0x13/0x20
> [ 2461.580103]  [<ffffffff8173492d>] ? down_write+0x9d/0xb2
> [ 2461.580103]  [<ffffffff8162baa5>] ? genl_lock_all+0x15/0x30
> [ 2461.580103]  [<ffffffff8162baa5>] genl_lock_all+0x15/0x30
> [ 2461.580103]  [<ffffffff8162cbb3>] genl_register_family+0x53/0x1f0
> [ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
> [ 2461.580103]  [<ffffffff8162d650>] genl_register_family_with_ops+0x20/0x80
> [ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
> [ 2461.580103]  [<ffffffffa017fe84>] nl80211_init+0x24/0xf0 [cfg80211]
> [ 2461.580103]  [<ffffffffa01dc000>] ? 0xffffffffa01dbfff
> [ 2461.580103]  [<ffffffffa01dc043>] cfg80211_init+0x43/0xdb [cfg80211]
> [ 2461.580103]  [<ffffffff810020fa>] do_one_initcall+0xfa/0x1b0
> [ 2461.580103]  [<ffffffff8105cb93>] ? set_memory_nx+0x43/0x50
> [ 2461.580103]  [<ffffffff810f75af>] load_module+0x1c6f/0x27f0
> [ 2461.580103]  [<ffffffff810f2c90>] ? store_uevent+0x40/0x40
> [ 2461.580103]  [<ffffffff810f82c6>] SyS_finit_module+0x86/0xb0
> [ 2461.580103]  [<ffffffff81741ad9>] system_call_fastpath+0x16/0x1b
> [ 2461.580103] Sched Debug Version: v0.10, 3.11.0-0.rc1.git4.1.fc20.x86_64 #1
>
> Problem start to happen after adding net-pf-16-proto-16-family-nl80211
> alias name to cfg80211 module by below commit (though that commit
> itself is perfectly fine):
>
> commit fb4e156886ce6e8309e912d8b370d192330d19d3
> Author: Marcel Holtmann <marcel@holtmann.org>
> Date:   Sun Apr 28 16:22:06 2013 -0700
>
>     nl80211: Add generic netlink module alias for cfg80211/nl80211
>
> Reported-and-tested-by: Jeff Layton <jlayton@redhat.com>
> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
> ---
>  net/netlink/genetlink.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index 2fd6dbe..1076fe1 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -877,8 +877,10 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info)
>  #ifdef CONFIG_MODULES
>                 if (res == NULL) {
>                         genl_unlock();
> +                       up_read(&cb_lock);
>                         request_module("net-pf-%d-proto-%d-family-%s",
>                                        PF_NETLINK, NETLINK_GENERIC, name);
> +                       down_read(&cb_lock);
>                         genl_lock();
>                         res = genl_family_find_byname(name);
>                 }
This is genl issue and it was introduced by commit def3117493eafd
(genl: Allow concurrent genl callbacks.).

Reviewed-by: Pravin B Shelar <pshelar@nicira.com>

> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 3.11] genetlink: release cb_lock before requesting additional module
  2013-07-26 17:08         ` Pravin Shelar
@ 2013-07-28  5:20           ` David Miller
  0 siblings, 0 replies; 5+ messages in thread
From: David Miller @ 2013-07-28  5:20 UTC (permalink / raw)
  To: pshelar
  Cc: sgruszka, linux-wireless, linville, netdev, tgraf, shemminger,
	rjones, marcel, jlayton

From: Pravin Shelar <pshelar@nicira.com>
Date: Fri, 26 Jul 2013 10:08:25 -0700

> On Fri, Jul 26, 2013 at 2:00 AM, Stanislaw Gruszka <sgruszka@redhat.com> wrote:
>> Requesting external module with cb_lock taken can result in
>> the deadlock like showed below:
 ...
>> Problem start to happen after adding net-pf-16-proto-16-family-nl80211
>> alias name to cfg80211 module by below commit (though that commit
>> itself is perfectly fine):
>>
>> commit fb4e156886ce6e8309e912d8b370d192330d19d3
>> Author: Marcel Holtmann <marcel@holtmann.org>
>> Date:   Sun Apr 28 16:22:06 2013 -0700
>>
>>     nl80211: Add generic netlink module alias for cfg80211/nl80211
>>
>> Reported-and-tested-by: Jeff Layton <jlayton@redhat.com>
>> Reported-by: Richard W.M. Jones <rjones@redhat.com>
>> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
 ...
> This is genl issue and it was introduced by commit def3117493eafd
> (genl: Allow concurrent genl callbacks.).
> 
> Reviewed-by: Pravin B Shelar <pshelar@nicira.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-07-28  5:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20130712074911.01b5945a@corrin.poochiereds.net>
2013-07-24  9:59 ` hung task while plugging in cfg80211 Stanislaw Gruszka
     [not found]   ` <20130724095911.GA1637-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-07-25 14:51     ` Jeff Layton
2013-07-26  9:00       ` [PATCH 3.11] genetlink: release cb_lock before requesting additional module Stanislaw Gruszka
2013-07-26 17:08         ` Pravin Shelar
2013-07-28  5:20           ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).