netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] Deadlock in _cfg80211_unregister_wdev()
@ 2021-05-13 23:07 Maximilian Luz
  2021-05-14  8:26 ` Johannes Berg
  2021-05-14 13:46 ` Maximilian Luz
  0 siblings, 2 replies; 6+ messages in thread
From: Maximilian Luz @ 2021-05-13 23:07 UTC (permalink / raw)
  To: Johannes Berg, linux-wireless
  Cc: Marek Szyprowski, Kalle Valo, David S. Miller, Jakub Kicinski,
	Jakub Kicinski, netdev, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu

Following commit a05829a7222e ("cfg80211: avoid holding the RTNL when
calling the driver"), the mwifiex_pcie module fails to unload. This also
prevents the device from rebooting / shutting down.

Attempting to unload the module produces the log pasted below. Upon
further investigation, this looks like a deadlock inside
_cfg80211_unregister_wdev():

- According to [1], this function expects the rdev->wiphy.mtx to be
   held.
- Down the line, this function (through some indirections, see third
   trace in log below) calls call_netdevice_notifiers(NETDEV_GOING_DOWN,
   ...) [2].
- One of the registered notifiers seems to be
   cfg80211_netdev_notifier_call(), which attempts to lock
   rdev->wiphy.mtx again [3], completing the deadlock.

Regards,
Max


[1]: https://elixir.bootlin.com/linux/v5.13-rc1/source/net/wireless/core.c#L1130
[2]: https://elixir.bootlin.com/linux/v5.13-rc1/source/net/core/dev.c#L1667
[3]: https://elixir.bootlin.com/linux/v5.13-rc1/source/net/wireless/core.c#L1428

[  245.504760] INFO: task kworker/u16:1:107 blocked for more than 122 seconds.
[  245.504764]       Tainted: G         C OE     5.11.0-1-surface-dev #2
[  245.504765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  245.504766] task:kworker/u16:1   state:D stack:    0 pid:  107 ppid:     2 flags:0x00004000
[  245.504771] Workqueue: events_power_efficient reg_check_chans_work [cfg80211]
[  245.504817] Call Trace:
[  245.504820]  __schedule+0x2dd/0x8b0
[  245.504826]  schedule+0x5b/0xc0
[  245.504829]  schedule_preempt_disabled+0x11/0x20
[  245.504831]  __mutex_lock.constprop.0+0x317/0x500
[  245.504835]  reg_check_chans_work+0x2d/0x3c0 [cfg80211]
[  245.504867]  process_one_work+0x214/0x3e0
[  245.504870]  worker_thread+0x4d/0x3d0
[  245.504872]  ? rescuer_thread+0x410/0x410
[  245.504874]  kthread+0x133/0x150
[  245.504877]  ? __kthread_bind_mask+0x60/0x60
[  245.504880]  ret_from_fork+0x22/0x30
[  245.504900] INFO: task wpa_supplicant:903 blocked for more than 122 seconds.
[  245.504901]       Tainted: G         C OE     5.11.0-1-surface-dev #2
[  245.504902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  245.504903] task:wpa_supplicant  state:D stack:    0 pid:  903 ppid:     1 flags:0x00000000
[  245.504906] Call Trace:
[  245.504907]  __schedule+0x2dd/0x8b0
[  245.504910]  schedule+0x5b/0xc0
[  245.504912]  schedule_preempt_disabled+0x11/0x20
[  245.504914]  __mutex_lock.constprop.0+0x317/0x500
[  245.504917]  nl80211_pre_doit+0x16/0x130 [cfg80211]
[  245.504951]  genl_family_rcv_msg_doit+0xe7/0x160
[  245.504956]  genl_rcv_msg+0xef/0x1e0
[  245.504959]  ? nl80211_send_scan_start+0x90/0x90 [cfg80211]
[  245.504993]  ? genl_get_cmd+0xd0/0xd0
[  245.504996]  netlink_rcv_skb+0x5b/0x100
[  245.504999]  genl_rcv+0x24/0x40
[  245.505002]  netlink_unicast+0x242/0x340
[  245.505004]  netlink_sendmsg+0x243/0x480
[  245.505007]  sock_sendmsg+0x5e/0x60
[  245.505011]  ____sys_sendmsg+0x25a/0x2a0
[  245.505013]  ? copy_msghdr_from_user+0x6e/0xa0
[  245.505017]  ___sys_sendmsg+0x97/0xe0
[  245.505022]  __sys_sendmsg+0x81/0xd0
[  245.505025]  do_syscall_64+0x33/0x40
[  245.505028]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  245.505032] RIP: 0033:0x7fa6606cd737
[  245.505034] RSP: 002b:00007ffddff52178 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  245.505036] RAX: ffffffffffffffda RBX: 0000559697c3e780 RCX: 00007fa6606cd737
[  245.505038] RDX: 0000000000000000 RSI: 00007ffddff521b0 RDI: 0000000000000006
[  245.505039] RBP: 0000559697c3e690 R08: 0000000000000004 R09: 00007fa66078ea60
[  245.505040] R10: 00007ffddff52284 R11: 0000000000000246 R12: 0000559697c7a9a0
[  245.505042] R13: 00007ffddff521b0 R14: 00007ffddff52284 R15: 0000559697c71100
[  245.505072] INFO: task modprobe:1930 blocked for more than 122 seconds.
[  245.505073]       Tainted: G         C OE     5.11.0-1-surface-dev #2
[  245.505074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  245.505075] task:modprobe        state:D stack:    0 pid: 1930 ppid:  1879 flags:0x00004004
[  245.505078] Call Trace:
[  245.505079]  __schedule+0x2dd/0x8b0
[  245.505082]  schedule+0x5b/0xc0
[  245.505084]  schedule_preempt_disabled+0x11/0x20
[  245.505086]  __mutex_lock.constprop.0+0x317/0x500
[  245.505088]  ? vprintk_emit+0x13a/0x270
[  245.505093]  cfg80211_netdev_notifier_call+0x12a/0x4e0 [cfg80211]
[  245.505125]  raw_notifier_call_chain+0x44/0x60
[  245.505127]  __dev_close_many+0x6b/0x120
[  245.505130]  dev_close_many+0x92/0x140
[  245.505132]  unregister_netdevice_many+0x150/0x6f0
[  245.505135]  unregister_netdevice_queue+0x96/0xd0
[  245.505138]  _cfg80211_unregister_wdev+0x135/0x1d0 [cfg80211]
[  245.505167]  mwifiex_del_virtual_intf+0x178/0x1a0 [mwifiex]
[  245.505181]  mwifiex_uninit_sw+0x1cf/0x1f0 [mwifiex]
[  245.505188]  mwifiex_remove_card+0x7b/0x80 [mwifiex]
[  245.505196]  pci_device_remove+0x3b/0xa0
[  245.505200]  __device_release_driver+0x17a/0x230
[  245.505204]  driver_detach+0xc9/0x110
[  245.505206]  bus_remove_driver+0x58/0xd0
[  245.505208]  pci_unregister_driver+0x3b/0x90
[  245.505211]  __do_sys_delete_module+0x19e/0x2a0
[  245.505216]  do_syscall_64+0x33/0x40
[  245.505218]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  245.505221] RIP: 0033:0x7fd803e30cab
[  245.505223] RSP: 002b:00007fffa7589d58 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  245.505224] RAX: ffffffffffffffda RBX: 0000559b9ae6ce80 RCX: 00007fd803e30cab
[  245.505226] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000559b9ae6cee8
[  245.505227] RBP: 0000559b9ae6ce80 R08: 0000000000000000 R09: 0000000000000000
[  245.505228] R10: 00007fd803ea4ac0 R11: 0000000000000206 R12: 0000559b9ae6cee8
[  245.505229] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b9ae6c5d0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] Deadlock in _cfg80211_unregister_wdev()
  2021-05-13 23:07 [BUG] Deadlock in _cfg80211_unregister_wdev() Maximilian Luz
@ 2021-05-14  8:26 ` Johannes Berg
  2021-05-14 11:40   ` Maximilian Luz
  2021-05-15  2:44   ` Brian Norris
  2021-05-14 13:46 ` Maximilian Luz
  1 sibling, 2 replies; 6+ messages in thread
From: Johannes Berg @ 2021-05-14  8:26 UTC (permalink / raw)
  To: Maximilian Luz, linux-wireless
  Cc: Marek Szyprowski, Kalle Valo, David S. Miller, Jakub Kicinski,
	netdev, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu

On Fri, 2021-05-14 at 01:07 +0200, Maximilian Luz wrote:
> Following commit a05829a7222e ("cfg80211: avoid holding the RTNL when
> calling the driver"), the mwifiex_pcie module fails to unload. This also
> prevents the device from rebooting / shutting down.
> 
> Attempting to unload the module
> 

I'm *guessing* that you're attempting to unload the module while the
interface is still up, i.e. you didn't "ip link set wlan0 down" first?

If so, that is likely fixed by this commit as well:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ea6b2098dd02789f68770fd3d5a373732207be2f

However, your log says:

> [  245.504764]       Tainted: G         C OE     5.11.0-1-surface-dev #2

so I have no idea what kernel you're using, because 5.11 did *not*
contain commit a05829a7222e ("cfg80211: avoid holding the RTNL when
calling the driver"). If you backported the bug you get to be
responsible for backporting the fixes too?


If that's all not solving the issue then please try to resolve with gdb
what line of code "cfg80211_netdev_notifier_call+0x12a" is, and please
also clarify exactly what (upstream!) kernel you're using.

johannes



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] Deadlock in _cfg80211_unregister_wdev()
  2021-05-14  8:26 ` Johannes Berg
@ 2021-05-14 11:40   ` Maximilian Luz
  2021-05-15  2:44   ` Brian Norris
  1 sibling, 0 replies; 6+ messages in thread
From: Maximilian Luz @ 2021-05-14 11:40 UTC (permalink / raw)
  To: Johannes Berg, linux-wireless
  Cc: Marek Szyprowski, Kalle Valo, David S. Miller, Jakub Kicinski,
	netdev, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu

On 5/14/21 10:26 AM, Johannes Berg wrote:
> On Fri, 2021-05-14 at 01:07 +0200, Maximilian Luz wrote:
>> Following commit a05829a7222e ("cfg80211: avoid holding the RTNL when
>> calling the driver"), the mwifiex_pcie module fails to unload. This also
>> prevents the device from rebooting / shutting down.
>>
>> Attempting to unload the module
>>
> 
> I'm *guessing* that you're attempting to unload the module while the
> interface is still up, i.e. you didn't "ip link set wlan0 down" first?

I did not. Doing so indeed allows unloading of the module.

> If so, that is likely fixed by this commit as well:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ea6b2098dd02789f68770fd3d5a373732207be2f

Thanks for pointing this out. That's backported to 5.12.2, right?
Unfortunately the error still persists there (as on all other post 5.12
kernels, as mentioned below).

> However, your log says:
> 
>> [  245.504764]       Tainted: G         C OE     5.11.0-1-surface-dev #2
> 
> so I have no idea what kernel you're using, because 5.11 did *not*
> contain commit a05829a7222e ("cfg80211: avoid holding the RTNL when
> calling the driver"). If you backported the bug you get to be
> responsible for backporting the fixes too?

This is a log from a05829a7222e checked out, which was pre 5.12-rc1, so the
last tag is still 5.11 which sets the package version in my packaging scripts
to that as well. It's not an actual 5.11 kernel, sorry for the confusion.
There is no backporting going on, just bisecting.

> If that's all not solving the issue then please try to resolve with gdb
> what line of code "cfg80211_netdev_notifier_call+0x12a" is, and please
> also clarify exactly what (upstream!) kernel you're using.

The issue exists on 5.12 through 5.12.4, as well as 5.13-rc1 (which is
why I didn't bother specifying a version, sorry again).

If you really want me to, I'll try to find some time to learn GDB kernel
debugging (never done that before, so might take a bit) however, I think
it's fairly clear what's going wrong and why the fix you've linked below
doesn't apply in this case:

Your fix is fixing

     cfg80211_destroy_iface_wk() takes wiphy_lock
      -> cfg80211_destroy_ifaces()
       -> ieee80211_del_iface
        -> ieeee80211_if_remove
         -> cfg80211_unregister_wdev
          -> unregister_netdevice_queue
           -> dev_close_many
            -> __dev_close_many
             -> raw_notifier_call_chain
              -> cfg80211_netdev_notifier_call

by addressing this in cfg80211_destroy_iface{s,_wk}(). The trace from my
log shows

     mwifiex_uninit_sw() takes wiphy_lock
      -> mwifiex_del_virtual_intf
       -> cfg80211_unregister_netdevice()
        -> cfg80211_unregister_wdev()
         -> _cfg80211_unregister_wdev() has lockdep_assert_held(&rdev->wiphy.mtx)
          -> unregister_netdevice_queue
           -> dev_close_many
            -> __dev_close_many
             -> raw_notifier_call_chain
              -> cfg80211_netdev_notifier_call attempts to take wiphy_lock again

So your fix does not address this particular issue. It doesn't even
touch any of the affected code path. I believe it is instead fixing one
symptom of the same underlying problem.

While the last parts of the trace are the same (specifically following
cfg80211_unregister_wdev()), the lock is initially taken in different
functions. Your fix addresses this by changing cfg80211_destroy_ifaces(),
and cfg80211_destroy_iface_wk() which, however, were never called on the
path that's causing _this_ issue.

Furthermore, if you go through that trace, there's only one notifier
call in __dev_close_many(), which is

   call_netdevice_notifiers(NETDEV_GOING_DOWN, dev)

which I believe I have linked in my previous mail. Since the state value
is NETDEV_GOING_DOWN, this has to be case [3] (unless I'm missing
something).

Regards,
Max

[3]: https://elixir.bootlin.com/linux/v5.13-rc1/source/net/wireless/core.c#L1428

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] Deadlock in _cfg80211_unregister_wdev()
  2021-05-13 23:07 [BUG] Deadlock in _cfg80211_unregister_wdev() Maximilian Luz
  2021-05-14  8:26 ` Johannes Berg
@ 2021-05-14 13:46 ` Maximilian Luz
  1 sibling, 0 replies; 6+ messages in thread
From: Maximilian Luz @ 2021-05-14 13:46 UTC (permalink / raw)
  To: Johannes Berg, linux-wireless
  Cc: Marek Szyprowski, Kalle Valo, David S. Miller, Jakub Kicinski,
	netdev, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, dave

On 5/14/21 1:07 AM, Maximilian Luz wrote:
> Following commit a05829a7222e ("cfg80211: avoid holding the RTNL when
> calling the driver"), the mwifiex_pcie module fails to unload. This also
> prevents the device from rebooting / shutting down.
> 
> Attempting to unload the module produces the log pasted below. Upon
> further investigation, this looks like a deadlock inside
> _cfg80211_unregister_wdev():
> 
> - According to [1], this function expects the rdev->wiphy.mtx to be
>    held.
> - Down the line, this function (through some indirections, see third
>    trace in log below) calls call_netdevice_notifiers(NETDEV_GOING_DOWN,
>    ...) [2].
> - One of the registered notifiers seems to be
>    cfg80211_netdev_notifier_call(), which attempts to lock
>    rdev->wiphy.mtx again [3], completing the deadlock.

Looks like the underlying issue also leads to

   https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/

Regards,
Max

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] Deadlock in _cfg80211_unregister_wdev()
  2021-05-14  8:26 ` Johannes Berg
  2021-05-14 11:40   ` Maximilian Luz
@ 2021-05-15  2:44   ` Brian Norris
  2021-05-15 11:24     ` Maximilian Luz
  1 sibling, 1 reply; 6+ messages in thread
From: Brian Norris @ 2021-05-15  2:44 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Maximilian Luz, linux-wireless, Marek Szyprowski, Kalle Valo,
	David S. Miller, Jakub Kicinski, netdev, Amitkumar Karwar,
	Ganapathi Bhat, Xinming Hu

On Fri, May 14, 2021 at 10:26:25AM +0200, Johannes Berg wrote:
> If that's all not solving the issue then please try to resolve with gdb
> what line of code "cfg80211_netdev_notifier_call+0x12a" is, and please
> also clarify exactly what (upstream!) kernel you're using.

I can reproduce on v5.12 and v5.12.4 as well. With v5.12.4, I'm at:

net/wireless/core.c:
1428			wiphy_lock(&rdev->wiphy);
include/net/cfg80211.h:
5269		mutex_lock(&wiphy->mtx);


i.e.,

static int cfg80211_netdev_notifier_call(struct notifier_block *nb,
					 unsigned long state, void *ptr)
{
...
	case NETDEV_GOING_DOWN:
		wiphy_lock(&rdev->wiphy);  <--- right here
		cfg80211_leave(rdev, wdev);
		wiphy_unlock(&rdev->wiphy);
...

It would seem like _anyone_ that calls cfg80211_unregister_wdev() with
an interface up will hit this -- not unique to mwifiex. In fact, apart
from the fact that all his line numbers are wrong, Maximilian's original
email points out exactly where the deadlock is.
cfg80211_unregister_wdev() holds the wiphy lock, and the GOING_DOWN
notification also tries to grab it.

It does happen that in many other paths, you've already ensured that you
bring the interface down, so e.g., mac80211 drivers don't tend to hit
this. But I wouldn't be surprised if a few other cfg80211 drivers hit
this too.

The best solution I could figure was to do a similar lock dance done in
nl80211_del_interface() -- close the netdev without holding the wiphy
lock. I'll send out a patch shortly.

Brian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] Deadlock in _cfg80211_unregister_wdev()
  2021-05-15  2:44   ` Brian Norris
@ 2021-05-15 11:24     ` Maximilian Luz
  0 siblings, 0 replies; 6+ messages in thread
From: Maximilian Luz @ 2021-05-15 11:24 UTC (permalink / raw)
  To: Brian Norris, Johannes Berg
  Cc: linux-wireless, Marek Szyprowski, Kalle Valo, David S. Miller,
	Jakub Kicinski, netdev, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu

On 5/15/21 4:44 AM, Brian Norris wrote:
> It would seem like _anyone_ that calls cfg80211_unregister_wdev() with
> an interface up will hit this -- not unique to mwifiex. In fact, apart
> from the fact that all his line numbers are wrong, Maximilian's original
> email points out exactly where the deadlock is.
> cfg80211_unregister_wdev() holds the wiphy lock, and the GOING_DOWN
> notification also tries to grab it.
> 
> It does happen that in many other paths, you've already ensured that you
> bring the interface down, so e.g., mac80211 drivers don't tend to hit
> this. But I wouldn't be surprised if a few other cfg80211 drivers hit
> this too.
> 
> The best solution I could figure was to do a similar lock dance done in
> nl80211_del_interface() -- close the netdev without holding the wiphy
> lock. I'll send out a patch shortly.

I believe that if we're going to fix that in the individual drivers,
there should be at least some sort of warning/documentation on
cfg80211_unregister_wdev().

Also someone might want to look at other WiFi drivers calling
cfg80211_unregister_wdev(). For example, I can see a locked call in the
brcm80211 driver, but no previous dev_close() call (see [1]). Haven't
looked in detail though, so I might just be wrong.

I can't help but think that this should maybe be addressed in that
common part instead. I know too little of that subsystem to tell if that
might be infeasible though.

Regards,
Max

[1]: https://elixir.bootlin.com/linux/v5.13-rc1/source/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c#L2445

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-05-15 11:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-13 23:07 [BUG] Deadlock in _cfg80211_unregister_wdev() Maximilian Luz
2021-05-14  8:26 ` Johannes Berg
2021-05-14 11:40   ` Maximilian Luz
2021-05-15  2:44   ` Brian Norris
2021-05-15 11:24     ` Maximilian Luz
2021-05-14 13:46 ` Maximilian Luz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).