From: Jeff Layton <jlayton@kernel.org>
To: Trond Myklebust <trondmy@hammerspace.com>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH RFC] NFSD: Fix possible sleep during nfsd4_release_lockowner()
Date: Mon, 23 May 2022 14:04:39 -0400
Message-ID: <a719ae7e8fb8b46f84b00b27d800330712486f40.camel@kernel.org>
In-Reply-To: <93d11e12532f5a10153d3702100271f70373bce6.camel@hammerspace.com>

On Mon, 2022-05-23 at 17:43 +0000, Trond Myklebust wrote:
> On Mon, 2022-05-23 at 12:37 -0400, Jeff Layton wrote:
> > On Mon, 2022-05-23 at 15:41 +0000, Chuck Lever III wrote:
> > > 
> > > > On May 23, 2022, at 11:26 AM, Jeff Layton <jlayton@kernel.org>
> > > > wrote:
> > > > 
> > > > On Mon, 2022-05-23 at 15:00 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On May 23, 2022, at 9:40 AM, Jeff Layton <jlayton@kernel.org>
> > > > > > wrote:
> > > > > > 
> > > > > > On Sun, 2022-05-22 at 11:38 -0400, Chuck Lever wrote:
> > > > > > > nfsd4_release_lockowner() holds clp->cl_lock when it
> > > > > > > calls check_for_locks(). However, check_for_locks() calls
> > > > > > > nfsd_file_get() / nfsd_file_put() to access the backing
> > > > > > > inode's flc_posix list, and nfsd_file_put() can sleep if
> > > > > > > the inode was recently removed.
> > > > > > > 
> > > > > > 
> > > > > > It might be good to add a might_sleep() to nfsd_file_put?
> > > > > 
> > > > > I intend to include the patch you reviewed last week that
> > > > > adds the might_sleep(), as part of this series.
> > > > > 
> > > > > 
> > > > > > > Let's instead rely on the stateowner's reference count
> > > > > > > to gate whether the release is permitted. This should be
> > > > > > > a reliable indication of locks-in-use since file lock
> > > > > > > operations and ->lm_get_owner take appropriate references,
> > > > > > > which are released appropriately when file locks are
> > > > > > > removed.
> > > > > > > 
> > > > > > > Reported-by: Dai Ngo <dai.ngo@oracle.com>
> > > > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > > > Cc: stable@vger.kernel.org
> > > > > > > ---
> > > > > > > fs/nfsd/nfs4state.c |    9 +++------
> > > > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > > > > 
> > > > > > > This might be a naive approach, but let's start with it.
> > > > > > > 
> > > > > > > This passes light testing, but it's not clear how much
> > > > > > > our existing fleet of tests exercises this area. I've
> > > > > > > locally built a couple of pynfs tests (one is based on
> > > > > > > the one Dai posted last week) and they pass too.
> > > > > > > 
> > > > > > > I don't believe that FREE_STATEID needs the same
> > > > > > > simplification.
> > > > > > > 
> > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > index a280256cbb03..b77894e668a4 100644
> > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > @@ -7559,12 +7559,9 @@ nfsd4_release_lockowner(struct svc_rqst *rqstp,
> > > > > > > 
> > > > > > >  		/* see if there are still any locks associated with it */
> > > > > > >  		lo = lockowner(sop);
> > > > > > > -		list_for_each_entry(stp, &sop->so_stateids, st_perstateowner) {
> > > > > > > -			if (check_for_locks(stp->st_stid.sc_file, lo)) {
> > > > > > > -				status = nfserr_locks_held;
> > > > > > > -				spin_unlock(&clp->cl_lock);
> > > > > > > -				return status;
> > > > > > > -			}
> > > > > > > +		if (atomic_read(&sop->so_count) > 1) {
> > > > > > > +			spin_unlock(&clp->cl_lock);
> > > > > > > +			return nfserr_locks_held;
> > > > > > >  		}
> > > > > > > 
> > > > > > >  		nfs4_get_stateowner(sop);
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > lm_get_owner is called from locks_copy_conflock, so if
> > > > > > someone else happens to be doing a LOCKT or F_GETLK call at
> > > > > > the same time that RELEASE_LOCKOWNER gets called, then this
> > > > > > may end up returning an error inappropriately.
> > > > > 
> > > > > IMO releasing the lockowner while it's being used for
> > > > > _anything_ seems risky and surprising. If RELEASE_LOCKOWNER
> > > > > succeeds while the client is still using the lockowner for any
> > > > > reason, a subsequent error will occur if the client tries to
> > > > > use it again. Heck, I can see the server failing in
> > > > > mid-COMPOUND with this kind of race. Better I think to just
> > > > > leave the lockowner in place if there's any ambiguity.
> > > > > 
> > > > 
> > > > The problem here is not the client itself calling
> > > > RELEASE_LOCKOWNER while it's still in use, but rather a
> > > > different client altogether calling LOCKT (or a local process
> > > > doing an F_GETLK) on an inode where a lock is held by a client.
> > > > The LOCKT gets a reference to the lockowner (for the conflock),
> > > > while the client that has the lockowner releases the lock and
> > > > then the lockowner while the refcount is still high.
> > > > 
> > > > The race window for this is probably quite small, but I think
> > > > it's theoretically possible. The point is that an elevated
> > > > refcount on the lockowner doesn't necessarily mean that locks
> > > > are actually being held by it.
> > > 
> > > Sure, I get that the lockowner's reference count is not 100%
> > > reliable. The question is whether it's good enough.
> > > 
> > > We are looking for a mechanism that can simply count the number
> > > of locks held by a lockowner. It sounds like you believe that
> > > lm_get_owner / put_owner might not be a reliable way to do
> > > that.
> > > 
> > > 
> > > > > The spec language does not say RELEASE_LOCKOWNER must not
> > > > > return LOCKS_HELD for other reasons, and it does say that
> > > > > there is no choice of using another NFSERR value (RFC 7530,
> > > > > Section 13.2).
> > > > > 
> > > > 
> > > > What recourse does the client have if this happens? It released
> > > > all of its locks and tried to release the lockowner, but the
> > > > server says "locks held". Should it just give up at that point?
> > > > RELEASE_LOCKOWNER is sort of a courtesy by the client, I
> > > > suppose...
> > > 
> > > RELEASE_LOCKOWNER is a courtesy for the server. Most clients
> > > ignore the return code IIUC.
> > > 
> > > So the hazard caused by this race would be a small resource
> > > leak on the server that would go away once the client's lease
> > > was purged.
> > > 
> > > 
> > > > > > My guess is that it would be pretty hard to hit the timing
> > > > > > right, but not impossible.
> > > > > > 
> > > > > > What we may want to do is have the kernel do this check and,
> > > > > > only if it comes back >1, do the actual check for locks.
> > > > > > That won't fix the original problem, though.
> > > > > > 
> > > > > > In other places in nfsd, we've plumbed in a dispose_list
> > > > > > head and deferred the sleeping functions until the spinlock
> > > > > > can be dropped. I haven't looked closely at whether that's
> > > > > > possible here, but it may be a more reliable approach.
> > > > > 
> > > > > That was proposed by Dai last week.
> > > > > 
> > > > > https://lore.kernel.org/linux-nfs/1653079929-18283-1-git-send-email-dai.ngo@oracle.com/T/#u
> > > > > 
> > > > > Trond pointed out that if two separate clients were releasing
> > > > > a lockowner on the same inode, there is nothing that protects
> > > > > the dispose_list, and it would get corrupted.
> > > > > 
> > > > > https://lore.kernel.org/linux-nfs/31E87CEF-C83D-4FA8-A774-F2C389011FCE@oracle.com/T/#mf1fc1ae0503815c0a36ae75a95086c3eff892614
> > > > > 
> > > > 
> > > > Yeah, that doesn't look like what's needed.
> > > > 
> > > > What I was going to suggest is an nfsd_file_put variant that
> > > > takes a list_head. If the refcount goes to zero and the thing
> > > > ends up being unhashed, then you put it on the dispose list
> > > > rather than doing the blocking operations, and then clean it up
> > > > later.
> > > 
> > > Trond doesn't like that approach; see the e-mail thread.
> > > 
> > 
> > I didn't see him saying that that would be wrong per se, but the
> > initial implementation was racy.
> > 
> > His suggestion was just to keep a counter in the lockowner of how
> > many locks are associated with it. That seems like a good
> > suggestion, though you'd probably need to add a parameter to
> > lm_get_owner to indicate whether you were adding a new lock or just
> > doing a conflock copy.
> 
> I don't think this should be necessary. The posix_lock code doesn't
> ever use a struct file_lock that it hasn't allocated itself. We should
> always be calling conflock to copy from whatever struct file_lock
> the caller passed as an argument.
> 
> IOW: the number of lm_get_owner and lm_put_owner calls should always be
> 100% balanced once all the locks belonging to a specific lock owner are
> removed.
> 

We take references to the owner when we go to add a lock record, or when
copying a conflicting lock. You want to keep a count of the former
without counting the latter.
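
For reference, the current nfsd hooks look something like this (from
memory, so treat it as approximate):

static fl_owner_t
nfsd4_fl_get_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	/* taken for new lock records and conflock copies alike */
	nfs4_get_stateowner(&lo->lo_owner);
	return owner;
}

static void
nfsd4_fl_put_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (lo)
		nfs4_put_stateowner(&lo->lo_owner);
}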

lm_get_owner gets called for both, though, and I don't see how you can
disambiguate the two situations without some way to indicate which one
it is. Adding a bool argument to the lm_get_owner/lm_put_owner ops
would be pretty simple to implement, I think.
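
Something like this rough, untested sketch. The "conflock" argument,
the FL_CONFLOCK flag, and the lo_lock_cnt counter are all made-up
names, only there to illustrate the shape of the change:

/* include/linux/fs.h: both ops learn why the reference is taken */
fl_owner_t (*lm_get_owner)(fl_owner_t owner, bool conflock);
void (*lm_put_owner)(fl_owner_t owner, bool conflock);

/*
 * fs/locks.c would pass conflock = true from locks_copy_conflock()
 * (and would need to mark the copy, say with an FL_CONFLOCK flag in
 * fl_flags, so that locks_release_private() can pass the same value
 * on the way out), and conflock = false when recording a new lock.
 */

/* fs/nfsd/nfs4state.c: count only references that represent a real
 * lock record */
static fl_owner_t
nfsd4_fl_get_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (!conflock)
		atomic_inc(&lo->lo_lock_cnt);
	nfs4_get_stateowner(&lo->lo_owner);
	return owner;
}

static void
nfsd4_fl_put_owner(fl_owner_t owner, bool conflock)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	if (lo) {
		if (!conflock)
			atomic_dec(&lo->lo_lock_cnt);
		nfs4_put_stateowner(&lo->lo_owner);
	}
}

nfsd4_release_lockowner() could then test atomic_read(&lo->lo_lock_cnt)
under clp->cl_lock instead of walking the stateids, and a racing
conflock copy wouldn't inflate the count.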

-- 
Jeff Layton <jlayton@kernel.org>
