[5.13] mwifiex: bring down link before deleting interface
diff mbox series

Message ID 20210515024227.2159311-1-briannorris@chromium.org
State New, archived
Headers show
Series
  • [5.13] mwifiex: bring down link before deleting interface
Related show

Commit Message

Brian Norris May 15, 2021, 2:42 a.m. UTC
We can deadlock when rmmod'ing the driver or going through firmware
reset, because the cfg80211_unregister_wdev() has to bring down the link
for us, ... which then grab the same wiphy lock.

nl80211_del_interface() already handles a very similar case, with a nice
description:

        /*
         * We hold RTNL, so this is safe, without RTNL opencount cannot
         * reach 0, and thus the rdev cannot be deleted.
         *
         * We need to do it for the dev_close(), since that will call
         * the netdev notifiers, and we need to acquire the mutex there
         * but don't know if we get there from here or from some other
         * place (e.g. "ip link set ... down").
         */
        mutex_unlock(&rdev->wiphy.mtx);
...

Do similarly for mwifiex teardown, by ensuring we bring the link down
first.

Sample deadlock trace:

[  247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds.
[  247.110630]       Not tainted 5.12.4 #5
[  247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.124557] task:rmmod           state:D stack:    0 pid: 2119 ppid:  2114 flags:0x00400208
[  247.133905] Call trace:
[  247.136644]  __switch_to+0x130/0x170
[  247.140643]  __schedule+0x714/0xa0c
[  247.144548]  schedule_preempt_disabled+0x88/0xf4
[  247.149714]  __mutex_lock_common+0x43c/0x750
[  247.154496]  mutex_lock_nested+0x5c/0x68
[  247.158884]  cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211]
[  247.165769]  raw_notifier_call_chain+0x4c/0x78
[  247.170742]  call_netdevice_notifiers_info+0x68/0xa4
[  247.176305]  __dev_close_many+0x7c/0x138
[  247.180693]  dev_close_many+0x7c/0x10c
[  247.184893]  unregister_netdevice_many+0xfc/0x654
[  247.190158]  unregister_netdevice_queue+0xb4/0xe0
[  247.195424]  _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211]
[  247.201816]  cfg80211_unregister_wdev+0x20/0x2c [cfg80211]
[  247.208016]  mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex]
[  247.214174]  mwifiex_uninit_sw+0x158/0x1b0 [mwifiex]
[  247.219747]  mwifiex_remove_card+0x38/0xa0 [mwifiex]
[  247.225316]  mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie]
[  247.231451]  pci_device_remove+0x50/0xe0
[  247.235849]  device_release_driver_internal+0x110/0x1b0
[  247.241701]  driver_detach+0x5c/0x9c
[  247.245704]  bus_remove_driver+0x84/0xb8
[  247.250095]  driver_unregister+0x3c/0x60
[  247.254486]  pci_unregister_driver+0x2c/0x90
[  247.259267]  cleanup_module+0x18/0xcdc [mwifiex_pcie]

Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/
Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/
Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
Reported-by: dave@bewaar.me
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 drivers/net/wireless/marvell/mwifiex/main.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--

Comments

Maximilian Luz May 15, 2021, 11:36 a.m. UTC | #1
On 5/15/21 4:42 AM, Brian Norris wrote:
> We can deadlock when rmmod'ing the driver or going through firmware
> reset, because the cfg80211_unregister_wdev() has to bring down the link
> for us, ... which then grab the same wiphy lock.
> 
> nl80211_del_interface() already handles a very similar case, with a nice
> description:
> 
>          /*
>           * We hold RTNL, so this is safe, without RTNL opencount cannot
>           * reach 0, and thus the rdev cannot be deleted.
>           *
>           * We need to do it for the dev_close(), since that will call
>           * the netdev notifiers, and we need to acquire the mutex there
>           * but don't know if we get there from here or from some other
>           * place (e.g. "ip link set ... down").
>           */
>          mutex_unlock(&rdev->wiphy.mtx);
> ...
> 
> Do similarly for mwifiex teardown, by ensuring we bring the link down
> first.
> 
> Sample deadlock trace:
> 
> [  247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds.
> [  247.110630]       Not tainted 5.12.4 #5
> [  247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  247.124557] task:rmmod           state:D stack:    0 pid: 2119 ppid:  2114 flags:0x00400208
> [  247.133905] Call trace:
> [  247.136644]  __switch_to+0x130/0x170
> [  247.140643]  __schedule+0x714/0xa0c
> [  247.144548]  schedule_preempt_disabled+0x88/0xf4
> [  247.149714]  __mutex_lock_common+0x43c/0x750
> [  247.154496]  mutex_lock_nested+0x5c/0x68
> [  247.158884]  cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211]
> [  247.165769]  raw_notifier_call_chain+0x4c/0x78
> [  247.170742]  call_netdevice_notifiers_info+0x68/0xa4
> [  247.176305]  __dev_close_many+0x7c/0x138
> [  247.180693]  dev_close_many+0x7c/0x10c
> [  247.184893]  unregister_netdevice_many+0xfc/0x654
> [  247.190158]  unregister_netdevice_queue+0xb4/0xe0
> [  247.195424]  _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211]
> [  247.201816]  cfg80211_unregister_wdev+0x20/0x2c [cfg80211]
> [  247.208016]  mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex]
> [  247.214174]  mwifiex_uninit_sw+0x158/0x1b0 [mwifiex]
> [  247.219747]  mwifiex_remove_card+0x38/0xa0 [mwifiex]
> [  247.225316]  mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie]
> [  247.231451]  pci_device_remove+0x50/0xe0
> [  247.235849]  device_release_driver_internal+0x110/0x1b0
> [  247.241701]  driver_detach+0x5c/0x9c
> [  247.245704]  bus_remove_driver+0x84/0xb8
> [  247.250095]  driver_unregister+0x3c/0x60
> [  247.254486]  pci_unregister_driver+0x2c/0x90
> [  247.259267]  cleanup_module+0x18/0xcdc [mwifiex_pcie]
> 
> Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/
> Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/
> Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
> Reported-by: dave@bewaar.me
> Cc: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Brian Norris <briannorris@chromium.org>

Thanks!

Tested-by: Maximilian Luz <luzmaximilian@gmail.com>

> ---
>   drivers/net/wireless/marvell/mwifiex/main.c | 13 ++++++++++---
>   1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c
> index 529dfd8b7ae8..17399d4aa129 100644
> --- a/drivers/net/wireless/marvell/mwifiex/main.c
> +++ b/drivers/net/wireless/marvell/mwifiex/main.c
> @@ -1445,11 +1445,18 @@ static void mwifiex_uninit_sw(struct mwifiex_adapter *adapter)
>   		if (!priv)
>   			continue;
>   		rtnl_lock();
> -		wiphy_lock(adapter->wiphy);
>   		if (priv->netdev &&
> -		    priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED)
> +		    priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) {
> +			/*
> +			 * Close the netdev now, because if we do it later, the
> +			 * netdev notifiers will need to acquire the wiphy lock
> +			 * again --> deadlock.
> +			 */
> +			dev_close(priv->wdev.netdev);
> +			wiphy_lock(adapter->wiphy);
>   			mwifiex_del_virtual_intf(adapter->wiphy, &priv->wdev);
> -		wiphy_unlock(adapter->wiphy);
> +			wiphy_unlock(adapter->wiphy);
> +		}
>   		rtnl_unlock();
>   	}
>   
>
Dave Olsthoorn May 15, 2021, 3:10 p.m. UTC | #2
On 2021-05-15 04:42, Brian Norris wrote:
> We can deadlock when rmmod'ing the driver or going through firmware
> reset, because the cfg80211_unregister_wdev() has to bring down the 
> link
> for us, ... which then grab the same wiphy lock.
> 
> nl80211_del_interface() already handles a very similar case, with a 
> nice
> description:
> 
>         /*
>          * We hold RTNL, so this is safe, without RTNL opencount cannot
>          * reach 0, and thus the rdev cannot be deleted.
>          *
>          * We need to do it for the dev_close(), since that will call
>          * the netdev notifiers, and we need to acquire the mutex there
>          * but don't know if we get there from here or from some other
>          * place (e.g. "ip link set ... down").
>          */
>         mutex_unlock(&rdev->wiphy.mtx);
> ...
> 
> Do similarly for mwifiex teardown, by ensuring we bring the link down
> first.
> 
> Sample deadlock trace:
> 
> [  247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds.
> [  247.110630]       Not tainted 5.12.4 #5
> [  247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  247.124557] task:rmmod           state:D stack:    0 pid: 2119
> ppid:  2114 flags:0x00400208
> [  247.133905] Call trace:
> [  247.136644]  __switch_to+0x130/0x170
> [  247.140643]  __schedule+0x714/0xa0c
> [  247.144548]  schedule_preempt_disabled+0x88/0xf4
> [  247.149714]  __mutex_lock_common+0x43c/0x750
> [  247.154496]  mutex_lock_nested+0x5c/0x68
> [  247.158884]  cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211]
> [  247.165769]  raw_notifier_call_chain+0x4c/0x78
> [  247.170742]  call_netdevice_notifiers_info+0x68/0xa4
> [  247.176305]  __dev_close_many+0x7c/0x138
> [  247.180693]  dev_close_many+0x7c/0x10c
> [  247.184893]  unregister_netdevice_many+0xfc/0x654
> [  247.190158]  unregister_netdevice_queue+0xb4/0xe0
> [  247.195424]  _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211]
> [  247.201816]  cfg80211_unregister_wdev+0x20/0x2c [cfg80211]
> [  247.208016]  mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex]
> [  247.214174]  mwifiex_uninit_sw+0x158/0x1b0 [mwifiex]
> [  247.219747]  mwifiex_remove_card+0x38/0xa0 [mwifiex]
> [  247.225316]  mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie]
> [  247.231451]  pci_device_remove+0x50/0xe0
> [  247.235849]  device_release_driver_internal+0x110/0x1b0
> [  247.241701]  driver_detach+0x5c/0x9c
> [  247.245704]  bus_remove_driver+0x84/0xb8
> [  247.250095]  driver_unregister+0x3c/0x60
> [  247.254486]  pci_unregister_driver+0x2c/0x90
> [  247.259267]  cleanup_module+0x18/0xcdc [mwifiex_pcie]
> 
> Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the 
> driver")
> Cc: stable@vger.kernel.org
> Link:
> https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/
> Link:
> https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/
> Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
> Reported-by: Dave Olsthoorn <dave@bewaar.me>

Thanks!

The firmware still seems to crash quicker than previously, but that's a 
unrelated problem.

Tested-by: Dave Olsthoorn <dave@bewaar.me>

> Cc: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
>  drivers/net/wireless/marvell/mwifiex/main.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/wireless/marvell/mwifiex/main.c
> b/drivers/net/wireless/marvell/mwifiex/main.c
> index 529dfd8b7ae8..17399d4aa129 100644
> --- a/drivers/net/wireless/marvell/mwifiex/main.c
> +++ b/drivers/net/wireless/marvell/mwifiex/main.c
> @@ -1445,11 +1445,18 @@ static void mwifiex_uninit_sw(struct
> mwifiex_adapter *adapter)
>  		if (!priv)
>  			continue;
>  		rtnl_lock();
> -		wiphy_lock(adapter->wiphy);
>  		if (priv->netdev &&
> -		    priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED)
> +		    priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) {
> +			/*
> +			 * Close the netdev now, because if we do it later, the
> +			 * netdev notifiers will need to acquire the wiphy lock
> +			 * again --> deadlock.
> +			 */
> +			dev_close(priv->wdev.netdev);
> +			wiphy_lock(adapter->wiphy);
>  			mwifiex_del_virtual_intf(adapter->wiphy, &priv->wdev);
> -		wiphy_unlock(adapter->wiphy);
> +			wiphy_unlock(adapter->wiphy);
> +		}
>  		rtnl_unlock();
>  	}
Pali Rohár May 15, 2021, 3:40 p.m. UTC | #3
On Saturday 15 May 2021 17:10:31 Dave Olsthoorn wrote:
> The firmware still seems to crash quicker than previously, but that's a
> unrelated problem.

Hello! Do you have some more details (or links) about mentioned firmware crash?
Dave Olsthoorn May 15, 2021, 4:32 p.m. UTC | #4
Hi,

On 2021-05-15 17:40, Pali Rohár wrote:
> On Saturday 15 May 2021 17:10:31 Dave Olsthoorn wrote:
>> The firmware still seems to crash quicker than previously, but that's 
>> a
>> unrelated problem.
> 
> Hello! Do you have some more details (or links) about mentioned 
> firmware crash?

Sure, firmware crashes have always been a problem on the Surface 
devices. They seem to be related, at least for some of the crashes, to 
power management. For this reason I disabled powersaving in 
NetworkManager which used to make it at least stable enough for me, in 
5.13 this trick does not seem to work.

The dmesg log attached shows a firmware crash happening, the card does 
not work even after a reset or remove & rescan on the pci(e) bus.

There are patches [1] which have not been submitted yet and where 
developed as part of the linux-surface effort [2]. From my experience 
these patches resolve most if not all of the firmware crashes.

Regards,
Dave Olsthoorn

[1]: 
https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch
[2]: https://github.com/linux-surface

dmesg.log:
[ 127.367628]   mwifiex_pcie 0000:01:00.0: info: successfully 
disconnected from xx:xx:xx:xx:xx:xx: reason code 3
[ 127.371059]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.371083]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 127.371086]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.371088]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 127.371091]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.371093]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 127.371096]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.371099]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 127.371101]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.371103]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 127.374001]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.374028]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.374103]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.374113]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.375041]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.375311]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.375445]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.375903]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[ 127.379980]   mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
[ 127.448638]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 127.457009]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 127.494749]   mwifiex_pcie 0000:01:00.0: performing 
cancel_work_sync()...
[ 127.494756]   mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
[ 155.308155]   mwifiex_pcie: PCI memory map Virt0: 00000000ff7d3684 PCI 
memory map Virt2: 000000004aa1ebbb
[ 155.357139]   mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip 
FW dnld
[ 155.357146]   mwifiex_pcie 0000:01:00.0: WLAN FW is active
[ 155.510296]   mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: 
mwifiex 1.0 (15.68.19.p21)
[ 155.510319]   mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 
(15.68.19.p21)
[ 155.560724]   mwifiex_pcie 0000:01:00.0 wlp1s0: renamed from mlan0
[ 161.848939]   mwifiex_pcie 0000:01:00.0: info: trying to associate to 
bssid xx:xx:xx:xx:xx:xx
[ 162.027982]   mwifiex_pcie 0000:01:00.0: info: associated to bssid 
xx:xx:xx:xx:xx:xx successfully
[ 162.063062]   IPv6: ADDRCONF(NETDEV_CHANGE): wlp1s0: link becomes 
ready
[ 173.452786]   mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: 
Timeout cmd id = 0x10, act = 0x1
[ 173.452867]   mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
[ 173.452873]   mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
[ 173.452877]   mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
[ 173.452881]   mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
[ 173.452885]   mwifiex_pcie 0000:01:00.0: last_cmd_index = 1
[ 173.452889]   mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 10 00 28 
00 a4 00 7f 00
[ 173.452894]   mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 01 00 13 
00 00 00 00 00
[ 173.452898]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0
[ 173.452905]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 10 80 
28 80 a4 80 7f 80
[ 173.452911]   mwifiex_pcie 0000:01:00.0: last_event_index = 3
[ 173.452918]   mwifiex_pcie 0000:01:00.0: last_event: 6a 00 17 00 2b 00 
33 00 58 00
[ 173.452925]   mwifiex_pcie 0000:01:00.0: data_sent=1 cmd_sent=1
[ 173.452930]   mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0
[ 173.455878]   mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump 
start===
[ 173.455914]   mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: 
mwifiex 1.0 (15.68.19.p21)
[ 173.455925]   mwifiex_pcie 0000:01:00.0: PCIE register dump start
[ 173.456014]   mwifiex_pcie 0000:01:00.0: pcie scratch register:
[ 173.456026]   mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00
reg:0xcf8, value=0x9141cc
reg:0xcfc, value=0x28c8c00
[ 173.456030]   mwifiex_pcie 0000:01:00.0: PCIE register dump end
[ 173.456173]   mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump 
end===
[ 173.456177]   mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump 
start ==
[ 177.548244]   mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -110
[ 177.548341]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 177.548708]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 177.548712]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 182.816381]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 182.816389]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 183.100502]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 183.100524]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 183.463055]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 183.463059]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 189.465526]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 189.465538]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 195.464760]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 195.464770]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 201.467820]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 201.467826]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 207.463143]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 207.463154]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 213.462237]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 213.462244]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 219.464173]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 219.464180]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 223.116948]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 225.468213]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 225.468219]   mwifiex_pcie 0000:01:00.0: failed to get signal 
information
[ 229.349824]   mwifiex_pcie 0000:01:00.0: Firmware dump Finished!
[ 229.349835]   mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end 
==
[ 229.350968]   mwifiex_pcie 0000:01:00.0: == mwifiex dump information 
to /sys/class/devcoredump start
[ 229.354949]   mwifiex_pcie 0000:01:00.0: == mwifiex dump information 
to /sys/class/devcoredump end
[ 229.354981]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 229.356825]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 229.357175]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.357184]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 229.357190]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.357195]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 229.357200]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.357205]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 229.357210]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.357215]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 229.357219]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.357224]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 229.357228]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.357233]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
[ 229.359375]   mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
[ 229.362422]   mwifiex_pcie 0000:01:00.0: rx_pending=0, tx_pending=1, 
cmd_pending=0
[ 229.493680]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.493700]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 229.509094]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 230.653787]   mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip 
FW dnld
[ 230.653794]   mwifiex_pcie 0000:01:00.0: WLAN FW is active
[ 241.037922]   mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: 
Timeout cmd id = 0xfa, act = 0x4000
[ 241.038005]   mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
[ 241.038031]   mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
[ 241.038048]   mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
[ 241.038067]   mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
[ 241.038086]   mwifiex_pcie 0000:01:00.0: last_cmd_index = 2
[ 241.038104]   mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 10 00 fa 
00 a4 00 7f 00
[ 241.038126]   mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 01 00 00 
40 00 00 00 00
[ 241.038147]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0
[ 241.038168]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 10 80 
28 80 a4 80 7f 80
[ 241.038187]   mwifiex_pcie 0000:01:00.0: last_event_index = 3
[ 241.038207]   mwifiex_pcie 0000:01:00.0: last_event: 6a 00 17 00 2b 00 
33 00 58 00
[ 241.038226]   mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1
[ 241.038246]   mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0
[ 241.040764]   mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: 
unregister device
[ 241.057936]   mwifiex_pcie 0000:01:00.0: skipped cancel_work_sync() 
because we're in card reset failure path
[ 241.060488]   mwifiex: Failed to bring up adapter: -5
[ 241.060504]   mwifiex_pcie 0000:01:00.0: reinit failed: -5
Pali Rohár May 15, 2021, 4:53 p.m. UTC | #5
Hello!

On Saturday 15 May 2021 18:32:30 Dave Olsthoorn wrote:
> Hi,
> 
> On 2021-05-15 17:40, Pali Rohár wrote:
> > On Saturday 15 May 2021 17:10:31 Dave Olsthoorn wrote:
> > > The firmware still seems to crash quicker than previously, but
> > > that's a
> > > unrelated problem.
> > 
> > Hello! Do you have some more details (or links) about mentioned firmware
> > crash?
> 
> Sure, firmware crashes have always been a problem on the Surface devices.

What wifi chip you have on these devices? Because very similar firmware
crashes I see on 88W8997 chip (also with mwifiex) when wifi card is
configured in SDIO mode (not PCIe).

I know that there are new version of firmwares for these 88W8xxx chips,
but they are available only under NXP NDA and only for NXP customers.
So it looks like that end users with NXP wifi chips are out of luck.

> They seem to be related, at least for some of the crashes, to power
> management. For this reason I disabled powersaving in NetworkManager which
> used to make it at least stable enough for me, in 5.13 this trick does not
> seem to work.
> 
> The dmesg log attached shows a firmware crash happening, the card does not
> work even after a reset or remove & rescan on the pci(e) bus.

Similar issue, card start working again only after whole system restart.

So this is something which can be resolved only in NXP.

Sharvari, could you look at this issue? It looks like there are many
mwifiex driver or 88W8xxx firmware issues which are common both PCIe and
SDIO modes.

Also could you please look at way how NXP 88W8xxx firmware files could
be updated in linux-firmware repository?

> There are patches [1] which have not been submitted yet and where developed
> as part of the linux-surface effort [2]. From my experience these patches
> resolve most if not all of the firmware crashes.

Is somebody going to cleanup these patches and send them for inclusion
into mainline kernel? I see that most of them are PCIe related, but due
to seeing same issues also on SDIO bus, I guess adding similar hooks
also for SDIO could make also SDIO more stable...

> Regards,
> Dave Olsthoorn
> 
> [1]: https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch
> [2]: https://github.com/linux-surface
> 
> dmesg.log:
> [ 127.367628]   mwifiex_pcie 0000:01:00.0: info: successfully disconnected
> from xx:xx:xx:xx:xx:xx: reason code 3
> [ 127.371059]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.371083]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 127.371086]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.371088]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 127.371091]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.371093]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 127.371096]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.371099]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 127.371101]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.371103]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 127.374001]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.374028]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.374103]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.374113]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.375041]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.375311]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.375445]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.375903]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
> [ 127.379980]   mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
> [ 127.448638]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 127.457009]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 127.494749]   mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()...
> [ 127.494756]   mwifiex_pcie 0000:01:00.0: cancel_work_sync() done
> [ 155.308155]   mwifiex_pcie: PCI memory map Virt0: 00000000ff7d3684 PCI
> memory map Virt2: 000000004aa1ebbb
> [ 155.357139]   mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW
> dnld
> [ 155.357146]   mwifiex_pcie 0000:01:00.0: WLAN FW is active
> [ 155.510296]   mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex
> 1.0 (15.68.19.p21)
> [ 155.510319]   mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0
> (15.68.19.p21)
> [ 155.560724]   mwifiex_pcie 0000:01:00.0 wlp1s0: renamed from mlan0
> [ 161.848939]   mwifiex_pcie 0000:01:00.0: info: trying to associate to
> bssid xx:xx:xx:xx:xx:xx
> [ 162.027982]   mwifiex_pcie 0000:01:00.0: info: associated to bssid
> xx:xx:xx:xx:xx:xx successfully
> [ 162.063062]   IPv6: ADDRCONF(NETDEV_CHANGE): wlp1s0: link becomes ready
> [ 173.452786]   mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout
> cmd id = 0x10, act = 0x1
> [ 173.452867]   mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
> [ 173.452873]   mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
> [ 173.452877]   mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
> [ 173.452881]   mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
> [ 173.452885]   mwifiex_pcie 0000:01:00.0: last_cmd_index = 1
> [ 173.452889]   mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 10 00 28 00 a4
> 00 7f 00
> [ 173.452894]   mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 01 00 13 00
> 00 00 00 00
> [ 173.452898]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0
> [ 173.452905]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 10 80 28
> 80 a4 80 7f 80
> [ 173.452911]   mwifiex_pcie 0000:01:00.0: last_event_index = 3
> [ 173.452918]   mwifiex_pcie 0000:01:00.0: last_event: 6a 00 17 00 2b 00 33
> 00 58 00
> [ 173.452925]   mwifiex_pcie 0000:01:00.0: data_sent=1 cmd_sent=1
> [ 173.452930]   mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0
> [ 173.455878]   mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump
> start===
> [ 173.455914]   mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex
> 1.0 (15.68.19.p21)
> [ 173.455925]   mwifiex_pcie 0000:01:00.0: PCIE register dump start
> [ 173.456014]   mwifiex_pcie 0000:01:00.0: pcie scratch register:
> [ 173.456026]   mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00
> reg:0xcf8, value=0x9141cc
> reg:0xcfc, value=0x28c8c00
> [ 173.456030]   mwifiex_pcie 0000:01:00.0: PCIE register dump end
> [ 173.456173]   mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end===
> [ 173.456177]   mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start ==
> [ 177.548244]   mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -110
> [ 177.548341]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 177.548708]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 177.548712]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 182.816381]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 182.816389]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 183.100502]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 183.100524]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 183.463055]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 183.463059]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 189.465526]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 189.465538]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 195.464760]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 195.464770]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 201.467820]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 201.467826]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 207.463143]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 207.463154]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 213.462237]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 213.462244]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 219.464173]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 219.464180]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 223.116948]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 225.468213]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 225.468219]   mwifiex_pcie 0000:01:00.0: failed to get signal information
> [ 229.349824]   mwifiex_pcie 0000:01:00.0: Firmware dump Finished!
> [ 229.349835]   mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end ==
> [ 229.350968]   mwifiex_pcie 0000:01:00.0: == mwifiex dump information to
> /sys/class/devcoredump start
> [ 229.354949]   mwifiex_pcie 0000:01:00.0: == mwifiex dump information to
> /sys/class/devcoredump end
> [ 229.354981]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 229.356825]   mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 229.357175]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.357184]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 229.357190]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.357195]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 229.357200]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.357205]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 229.357210]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.357215]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 229.357219]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.357224]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 229.357228]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.357233]   mwifiex_pcie 0000:01:00.0: deleting the crypto keys
> [ 229.359375]   mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
> [ 229.362422]   mwifiex_pcie 0000:01:00.0: rx_pending=0, tx_pending=1,
> cmd_pending=0
> [ 229.493680]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.493700]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 229.509094]   mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 230.653787]   mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW
> dnld
> [ 230.653794]   mwifiex_pcie 0000:01:00.0: WLAN FW is active
> [ 241.037922]   mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout
> cmd id = 0xfa, act = 0x4000
> [ 241.038005]   mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
> [ 241.038031]   mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
> [ 241.038048]   mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
> [ 241.038067]   mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
> [ 241.038086]   mwifiex_pcie 0000:01:00.0: last_cmd_index = 2
> [ 241.038104]   mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 10 00 fa 00 a4
> 00 7f 00
> [ 241.038126]   mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 01 00 00 40
> 00 00 00 00
> [ 241.038147]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0
> [ 241.038168]   mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 10 80 28
> 80 a4 80 7f 80
> [ 241.038187]   mwifiex_pcie 0000:01:00.0: last_event_index = 3
> [ 241.038207]   mwifiex_pcie 0000:01:00.0: last_event: 6a 00 17 00 2b 00 33
> 00 58 00
> [ 241.038226]   mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1
> [ 241.038246]   mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0
> [ 241.040764]   mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister
> device
> [ 241.057936]   mwifiex_pcie 0000:01:00.0: skipped cancel_work_sync()
> because we're in card reset failure path
> [ 241.060488]   mwifiex: Failed to bring up adapter: -5
> [ 241.060504]   mwifiex_pcie 0000:01:00.0: reinit failed: -5
Dave Olsthoorn May 19, 2021, 7:20 p.m. UTC | #6
Hi,

I'll drop some of the people since this is a sub-thread of the original, 
I'll keep the lists for access to this using lore.kernel.org.

On 2021-05-15 18:53, Pali Rohár wrote:
> Hello!
> 
> On Saturday 15 May 2021 18:32:30 Dave Olsthoorn wrote:
>> Hi,
>> 
>> On 2021-05-15 17:40, Pali Rohár wrote:
>> > On Saturday 15 May 2021 17:10:31 Dave Olsthoorn wrote:
>> > > The firmware still seems to crash quicker than previously, but
>> > > that's a
>> > > unrelated problem.
>> >
>> > Hello! Do you have some more details (or links) about mentioned firmware
>> > crash?
>> 
>> Sure, firmware crashes have always been a problem on the Surface 
>> devices.
> 
> What wifi chip you have on these devices? Because very similar firmware
> crashes I see on 88W8997 chip (also with mwifiex) when wifi card is
> configured in SDIO mode (not PCIe).
> 

The Surface Pro 2017 has an 88W8897.

> I know that there are new version of firmwares for these 88W8xxx chips,
> but they are available only under NXP NDA and only for NXP customers.
> So it looks like that end users with NXP wifi chips are out of luck.
> 
>> They seem to be related, at least for some of the crashes, to power
>> management. For this reason I disabled powersaving in NetworkManager 
>> which
>> used to make it at least stable enough for me, in 5.13 this trick does 
>> not
>> seem to work.
>> 
>> The dmesg log attached shows a firmware crash happening, the card does 
>> not
>> work even after a reset or remove & rescan on the pci(e) bus.
> 
> Similar issue, card start working again only after whole system 
> restart.
> 
> So this is something which can be resolved only in NXP.

After a conversation with the author of the patches, the problem is not 
the power management itself (for most hardware revisions [1]) but a race 
where pci commands are being written while the device is being put to 
sleep. A fix for this problem is included in the patches which make all 
pci commands synchronous instead of asynchronous [2].

After that a the wakeup patch seems relevant [3].

<snip>
>> There are patches [1] which have not been submitted yet and where 
>> developed
>> as part of the linux-surface effort [2]. From my experience these 
>> patches
>> resolve most if not all of the firmware crashes.
> 
> Is somebody going to cleanup these patches and send them for inclusion
> into mainline kernel? I see that most of them are PCIe related, but due
> to seeing same issues also on SDIO bus, I guess adding similar hooks
> also for SDIO could make also SDIO more stable...

The author plans to upstream them, he just hasn't gotten around to it.

Regards,
Dave

[1]: 
https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch#L2237-L2338
[2]: 
https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch#L1152-L1207
[3]: 
https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch#L1992-L2079
Brian Norris June 11, 2021, 12:18 a.m. UTC | #7
On Fri, May 14, 2021 at 7:45 PM Brian Norris <briannorris@chromium.org> wrote:
>
> We can deadlock when rmmod'ing the driver or going through firmware
> reset, because the cfg80211_unregister_wdev() has to bring down the link
> for us, ... which then grab the same wiphy lock.
...
> Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/
> Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/
> Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
> Reported-by: dave@bewaar.me
> Cc: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Brian Norris <briannorris@chromium.org>

Ping - is this going to get merged? It's a 5.12 regression, and we
have multiple people complaining about it (and they tested the fix
too!).

Thanks,
Brian
Kalle Valo June 11, 2021, 9:53 a.m. UTC | #8
Brian Norris <briannorris@chromium.org> writes:

> On Fri, May 14, 2021 at 7:45 PM Brian Norris <briannorris@chromium.org> wrote:
>>
>> We can deadlock when rmmod'ing the driver or going through firmware
>> reset, because the cfg80211_unregister_wdev() has to bring down the link
>> for us, ... which then grab the same wiphy lock.
> ...
>> Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
>> Cc: stable@vger.kernel.org
>> Link:
>> https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/
>> Link:
>> https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/
>> Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
>> Reported-by: dave@bewaar.me
>> Cc: Johannes Berg <johannes@sipsolutions.net>
>> Signed-off-by: Brian Norris <briannorris@chromium.org>
>
> Ping - is this going to get merged? It's a 5.12 regression, and we
> have multiple people complaining about it (and they tested the fix
> too!).

Thanks for the ping, this got piled up under all the -next patches and I
missed it. I'll look at it now.
Kalle Valo June 11, 2021, 10:03 a.m. UTC | #9
Brian Norris <briannorris@chromium.org> wrote:

> We can deadlock when rmmod'ing the driver or going through firmware
> reset, because the cfg80211_unregister_wdev() has to bring down the link
> for us, ... which then grab the same wiphy lock.
> 
> nl80211_del_interface() already handles a very similar case, with a nice
> description:
> 
>         /*
>          * We hold RTNL, so this is safe, without RTNL opencount cannot
>          * reach 0, and thus the rdev cannot be deleted.
>          *
>          * We need to do it for the dev_close(), since that will call
>          * the netdev notifiers, and we need to acquire the mutex there
>          * but don't know if we get there from here or from some other
>          * place (e.g. "ip link set ... down").
>          */
>         mutex_unlock(&rdev->wiphy.mtx);
> ...
> 
> Do similarly for mwifiex teardown, by ensuring we bring the link down
> first.
> 
> Sample deadlock trace:
> 
> [  247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds.
> [  247.110630]       Not tainted 5.12.4 #5
> [  247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  247.124557] task:rmmod           state:D stack:    0 pid: 2119 ppid:  2114 flags:0x00400208
> [  247.133905] Call trace:
> [  247.136644]  __switch_to+0x130/0x170
> [  247.140643]  __schedule+0x714/0xa0c
> [  247.144548]  schedule_preempt_disabled+0x88/0xf4
> [  247.149714]  __mutex_lock_common+0x43c/0x750
> [  247.154496]  mutex_lock_nested+0x5c/0x68
> [  247.158884]  cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211]
> [  247.165769]  raw_notifier_call_chain+0x4c/0x78
> [  247.170742]  call_netdevice_notifiers_info+0x68/0xa4
> [  247.176305]  __dev_close_many+0x7c/0x138
> [  247.180693]  dev_close_many+0x7c/0x10c
> [  247.184893]  unregister_netdevice_many+0xfc/0x654
> [  247.190158]  unregister_netdevice_queue+0xb4/0xe0
> [  247.195424]  _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211]
> [  247.201816]  cfg80211_unregister_wdev+0x20/0x2c [cfg80211]
> [  247.208016]  mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex]
> [  247.214174]  mwifiex_uninit_sw+0x158/0x1b0 [mwifiex]
> [  247.219747]  mwifiex_remove_card+0x38/0xa0 [mwifiex]
> [  247.225316]  mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie]
> [  247.231451]  pci_device_remove+0x50/0xe0
> [  247.235849]  device_release_driver_internal+0x110/0x1b0
> [  247.241701]  driver_detach+0x5c/0x9c
> [  247.245704]  bus_remove_driver+0x84/0xb8
> [  247.250095]  driver_unregister+0x3c/0x60
> [  247.254486]  pci_unregister_driver+0x2c/0x90
> [  247.259267]  cleanup_module+0x18/0xcdc [mwifiex_pcie]
> 
> Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/
> Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/
> Reported-by: Maximilian Luz <luzmaximilian@gmail.com>
> Reported-by: dave@bewaar.me
> Cc: Johannes Berg <johannes@sipsolutions.net>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> Tested-by: Maximilian Luz <luzmaximilian@gmail.com>
> Tested-by: Dave Olsthoorn <dave@bewaar.me>

Patch applied to wireless-drivers.git, thanks.

1f9482aa8d41 mwifiex: bring down link before deleting interface

Patch
diff mbox series

diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c
index 529dfd8b7ae8..17399d4aa129 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1445,11 +1445,18 @@  static void mwifiex_uninit_sw(struct mwifiex_adapter *adapter)
 		if (!priv)
 			continue;
 		rtnl_lock();
-		wiphy_lock(adapter->wiphy);
 		if (priv->netdev &&
-		    priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED)
+		    priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) {
+			/*
+			 * Close the netdev now, because if we do it later, the
+			 * netdev notifiers will need to acquire the wiphy lock
+			 * again --> deadlock.
+			 */
+			dev_close(priv->wdev.netdev);
+			wiphy_lock(adapter->wiphy);
 			mwifiex_del_virtual_intf(adapter->wiphy, &priv->wdev);
-		wiphy_unlock(adapter->wiphy);
+			wiphy_unlock(adapter->wiphy);
+		}
 		rtnl_unlock();
 	}