Re: [BUG] deadlock in nl80211_vendor_cmd

* Re: [BUG] deadlock in nl80211_vendor_cmd
       [not found] ` <CABYd82Z=YXmZPTQhf0K1M4nS2wk3dPBSqx91D8SoUd59AUzpHg@mail.gmail.com>
@ 2022-03-21 17:00   ` William McVicker
  0 siblings, 0 replies; 19+ messages in thread
From: William McVicker @ 2022-03-21 17:00 UTC (permalink / raw)
  To: Johannes Berg, linux-wireless
  Cc: Marek Szyprowski, Kalle Valo, David S. Miller, Jakub Kicinski,
	netdev, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Cc: Android Kernel

On 03/21/2022, Will McVicker wrote:
> On Thu, Mar 17, 2022 at 10:09 AM <willmcvicker@google.com> wrote:
> 
> > Hi,
> >
> > I wanted to report a deadlock that I'm hitting as a result of the upstream
> > commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the
> > driver"). I'm using the Pixel 6 with downstream version of the 5.15 kernel,
> > but I'm pretty sure this will happen on the upstream tip-of-tree kernel as
> > well.
> >
> > Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> > a number of vendor specific operations. One of them in particular deletes
> > a cfg80211 interface. The deadlock happens when thread 1 tries to take the
> > RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> > wiphy_lock().
> >
> > Here is the call flow:
> >
> > Thread 1:                         Thread 2:
> >
> > nl80211_pre_doit():
> >   -> rtnl_lock()
> >                                       nl80211_pre_doit():
> >                                        -> rtnl_lock()
> >                                        -> <blocked by Thread 1>
> >   -> wiphy_lock()
> >   -> rtnl_unlock()
> >   -> <unblock Thread 1>
> > exit nl80211_pre_doit()
> >                                        <Thread 2 got the RTNL lock>
> >                                        -> wiphy_lock()
> >                                        -> <blocked by Thread 1>
> > nl80211_doit()
> >   -> nl80211_vendor_cmd()
> >       -> rtnl_lock() <DEADLOCK>
> >       -> cfg80211_unregister_device()
> >       -> rtnl_unlock()
> >
> >
> > To be complete, here are the kernel call traces when the deadlock occurs:
> >
> > Thread 1 Call trace:
> >     <Take rtnl before calling cfg80211_unregister_device()>
> >     nl80211_vendor_cmd+0x210/0x218
> >     genl_rcv_msg+0x3ac/0x45c
> >     netlink_rcv_skb+0x130/0x168
> >     genl_rcv+0x38/0x54
> >     netlink_unicast_kernel+0xe4/0x1f4
> >     netlink_unicast+0x128/0x21c
> >     netlink_sendmsg+0x2d8/0x3d8
> >
> > Thread 2 Call trace:
> >     <Take wiphy_lock>
> >     nl80211_pre_doit+0x1b0/0x250
> >     genl_rcv_msg+0x37c/0x45c
> >     netlink_rcv_skb+0x130/0x168
> >     genl_rcv+0x38/0x54
> >     netlink_unicast_kernel+0xe4/0x1f4
> >     netlink_unicast+0x128/0x21c
> >     netlink_sendmsg+0x2d8/0x3d8
> >
> > I'm not an networking expert. So my main question is if I'm allowed to take
> > the RTNL lock inside the nl80211_vendor_cmd callbacks? If so, then
> > regardless of why I take it, we shouldn't be allowing this deadlock
> > situation, right?
> >
> > I hope that helps explain the issue. Let me know if you need any more
> > details.
> >
> > Thanks,
> > Will
> >
> 
> Sorry my CC list got dropped. Adding the following:
> 
> Kalle Valo <kvalo@codeaurora.org>
> "David S. Miller" <davem@davemloft.net>
> Jakub Kicinski <kuba@kernel.org>
> netdev@vger.kernel.org
> Amitkumar Karwar <amitkarwar@gmail.com>
> Ganapathi Bhat <ganapathi.bhat@nxp.com>
> Xinming Hu <huxinming820@gmail.com>
> kernel-team@android.com

Sorry for the noise. The lists bounced due to html. Resending with mutt to make
sure everyone gets this message.

As an update, I was able to fix the deadlock by updating nl80211_pre_doit() to
not hold the RTNL lock while waiting to get the wiphy_lock. This allows us to
take the RTNL lock within nl80211_doit() and have parallel calls to
nl80211_doit(). Below is the logic I tested. Please let me know if I'm heading
in the right direction.

Thanks,
Will

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 686a69381731..bb4ad746509b 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -15227,7 +15227,24 @@ static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb,
 	}
 
 	if (rdev && !(ops->internal_flags & NL80211_FLAG_NO_WIPHY_MTX)) {
-		wiphy_lock(&rdev->wiphy);
+		while (!mutex_trylock(&rdev->wiphy.mtx)) {
+			/* Holding the RTNL lock while waiting for the wiphy lock can lead to
+			 * a deadlock within doit() ops that don't hold the RTNL in pre_doit. So
+			 * we need to release the RTNL lock first while we wait for the wiphy
+			 * lock.
+			 */
+			rtnl_unlock();
+			wiphy_lock(&rdev->wiphy);
+
+			/* Once we get the wiphy_lock, we need to grab the RTNL lock. If we can't
+			 * get it, then we need to unlock the wiphy to avoid a deadlock in
+			 * pre_doit and then retry taking the locks again. */
+			if (!rtnl_trylock()) {
+				wiphy_unlock(&rdev->wiphy);
+				rtnl_lock();
+			} else
+				break;
+		}
 		/* we keep the mutex locked until post_doit */
 		__release(&rdev->wiphy.mtx);
 	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread