All of lore.kernel.org
 help / color / mirror / Atom feed
From: William McVicker <willmcvicker@google.com>
To: Johannes Berg <johannes@sipsolutions.net>,
	linux-wireless@vger.kernel.org
Cc: Marek Szyprowski <m.szyprowski@samsung.com>,
	Kalle Valo <kvalo@codeaurora.org>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org, Amitkumar Karwar <amitkarwar@gmail.com>,
	Ganapathi Bhat <ganapathi.bhat@nxp.com>,
	Xinming Hu <huxinming820@gmail.com>,
	kernel-team@android.com, Paolo Abeni <pabeni@redhat.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd
Date: Thu, 24 Mar 2022 21:58:59 +0000	[thread overview]
Message-ID: <Yjzpo3TfZxtKPMAG@google.com> (raw)
In-Reply-To: <dc556455-51a2-06e8-8ec5-b807c2901b7e@quicinc.com>

On 03/23/2022, Jeff Johnson wrote:
> On 3/22/2022 2:58 PM, William McVicker wrote:
> > On 03/22/2022, Jeff Johnson wrote:
> > > On 3/21/2022 1:07 PM, Johannes Berg wrote:
> > > [..snip..]
> > > 
> > > > > I'm not an networking expert. So my main question is if I'm allowed to take
> > > > > the RTNL lock inside the nl80211_vendor_cmd callbacks?
> > > > 
> > > > Evidently, you're not. It's interesting though, it used to be that we
> > > > called these with the RTNL held, now we don't, and the driver you're
> > > > using somehow "got fixed" to take it, but whoever fixed it didn't take
> > > > into account that this is not possible?
> > > 
> > > On this point I just want to remind that prior to the locking change that a
> > > driver would specify on a per-vendor command basis whether or not it wanted
> > > the rtnl_lock to be held via NL80211_FLAG_NEED_RTNL. I'm guessing for the
> > > command in question the driver did not set this flag since the driver wanted
> > > to explicitly take the lock itself, otherwise it would have deadlocked on
> > > itself with the 5.10 kernel.
> > > 
> > > /jeff
> > 
> > On the 5.10 kernel, the core kernel sets NL80211_FLAG_NEED_RTNL as part of
> > the internal_flags for NL80211_CMD_VENDOR:
> > 
> > net/wireless/nl80211.c:
> >     {
> >        .cmd = NL80211_CMD_VENDOR,
> >        .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> >        .doit = nl80211_vendor_cmd,
> >        .dumpit = nl80211_vendor_cmd_dump,
> >        .flags = GENL_UNS_ADMIN_PERM,
> >        .internal_flags = NL80211_FLAG_NEED_WIPHY |
> >                NL80211_FLAG_NEED_RTNL |
> >                NL80211_FLAG_CLEAR_SKB,
> >     },
> > 
> > So the 5.10 version of this driver doesn't need to directly call rtnl_lock()
> > within the vendor command doit() functions since pre_doit() handles the RTNL
> > locking.
> > 
> > It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
> > requested via the vendor flags. That would require moving the wiphy lock to
> > nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
> > correct order. Is that something you'd be open to Johannes?
> > 
> > --Will
> 
> Thanks for correcting my understanding. I concur that it would be useful for
> vendor commands to be able to specify that a given command needs the RTNL
> lock to be held.
> 
> 

Hi Johannes,

I found that we can hit this same ABBA deadlock within the nl80211 code
before ever even calling into the vendor doit() function. The issue I found
is caused by the way we unlock the RTNL mutex. Here is the call flow that
leads to the deadlock:

Thread 1                         Thread 2
 nl80211_pre_doit():
   rtnl_lock()
   wiphy_lock()                   nl80211_pre_doit():
                                    rtnl_lock() // blocked by Thread 1
   rtnl_unlock():
     netdev_run_todo():
       __rtnl_unlock()
                                    <got RTNL lock>
                                    wiphy_lock() // blocked by Thread 1
       rtnl_lock(); // DEADLOCK
 doit()
 nl80211_post_doit():
   wiphy_unlock();

Basically, unlocking the RTNL within netdev_run_todo() gives another thread
that is waiting for the RTNL in nl80211_pre_doit() a chance to grab the
RTNL lock leading to the deadlock. I found that there are multiple
instances where rtnl_lock() is called within netdev_run_todo(): a couple of
times inside netdev_wait_allrefs() and directly by netdev_run_todo().

Since I'm not really familiar with all the RNTL locking requirements, I was
hoping you could take a look at netdev_run_todo() to see if it's possible
to refactor it to avoid this deadlock. If not, then I don't think we can
call rtnl_unlock() while still holding the wiphy mutex.

Thanks,
Will

  reply	other threads:[~2022-03-24 21:59 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-17 17:09 [BUG] deadlock in nl80211_vendor_cmd willmcvicker
     [not found] ` <CABYd82Z=YXmZPTQhf0K1M4nS2wk3dPBSqx91D8SoUd59AUzpHg@mail.gmail.com>
2022-03-21 17:00   ` William McVicker
2022-03-21 20:07 ` Johannes Berg
2022-03-22  0:15   ` William McVicker
2022-03-22 21:31   ` Jeff Johnson
2022-03-22 21:58     ` William McVicker
2022-03-23 16:06       ` Jeff Johnson
2022-03-24 21:58         ` William McVicker [this message]
2022-03-25 12:04           ` Johannes Berg
2022-03-25 12:06             ` Johannes Berg
2022-03-25 16:49             ` Jakub Kicinski
2022-03-25 17:01               ` Johannes Berg
2022-03-25 18:08                 ` William McVicker
2022-03-25 20:21                   ` Johannes Berg
2022-03-25 20:36                   ` William McVicker
2022-03-25 21:16                     ` Johannes Berg
2022-03-25 21:54                       ` Johannes Berg
2022-03-25 22:18                       ` Jeff Johnson
2022-03-25 23:57                       ` William McVicker
2022-03-26  0:07                         ` Jakub Kicinski
2022-03-26  0:12                           ` William McVicker
2022-03-25 20:40                 ` Jakub Kicinski
2022-03-25 21:25                   ` Johannes Berg
2022-03-25 21:48                     ` Jakub Kicinski
2022-03-25 21:50                       ` Johannes Berg
2022-03-25 12:09       ` Johannes Berg
2022-03-25 15:59         ` Jeff Johnson
2022-03-25 16:04           ` Johannes Berg
2022-03-25 17:14             ` Jeff Johnson
2022-03-25 12:08     ` Johannes Berg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yjzpo3TfZxtKPMAG@google.com \
    --to=willmcvicker@google.com \
    --cc=amitkarwar@gmail.com \
    --cc=davem@davemloft.net \
    --cc=ganapathi.bhat@nxp.com \
    --cc=huxinming820@gmail.com \
    --cc=johannes@sipsolutions.net \
    --cc=kernel-team@android.com \
    --cc=kuba@kernel.org \
    --cc=kvalo@codeaurora.org \
    --cc=linux-wireless@vger.kernel.org \
    --cc=m.szyprowski@samsung.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.