From: Olga Kornievskaia <aglo@umich.edu>
To: Trond Myklebust <trondmy@primarydata.com>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"tibbs@math.uh.edu" <tibbs@math.uh.edu>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: NFS: nfs4_reclaim_open_state: Lock reclaim failed! log spew
Date: Thu, 17 Nov 2016 17:27:00 -0500
Message-ID: <CAN-5tyFazrkcBVor7YvbOPKGWSCDPH1-NLUjZBKJPoKBrzoy-g@mail.gmail.com>
In-Reply-To: <1479420942.33885.19.camel@primarydata.com>

On Thu, Nov 17, 2016 at 5:15 PM, Trond Myklebust
<trondmy@primarydata.com> wrote:
> On Thu, 2016-11-17 at 16:53 -0500, Olga Kornievskaia wrote:
>> On Thu, Nov 17, 2016 at 4:45 PM, Trond Myklebust
>> <trondmy@primarydata.com> wrote:
>> >
>> > On Thu, 2016-11-17 at 16:26 -0500, bfields@fieldses.org wrote:
>> > >
>> > > On Thu, Nov 17, 2016 at 04:05:32PM -0500, Olga Kornievskaia
>> > > wrote:
>> > > >
>> > > >
>> > > > On Thu, Nov 17, 2016 at 3:46 PM, bfields@fieldses.org
>> > > > <bfields@fieldses.org> wrote:
>> > > > >
>> > > > >
>> > > > > On Thu, Nov 17, 2016 at 03:29:11PM -0500, Olga Kornievskaia
>> > > > > wrote:
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Nov 17, 2016 at 3:17 PM, bfields@fieldses.org
>> > > > > > <bfields@fieldses.org> wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, Nov 17, 2016 at 02:58:12PM -0500, Olga
>> > > > > > > Kornievskaia
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Thu, Nov 17, 2016 at 2:32 PM, bfields@fieldses.org
>> > > > > > > > <bfields@fieldses.org> wrote:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 17, 2016 at 05:45:52PM +0000, Trond
>> > > > > > > > > Myklebust
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Thu, 2016-11-17 at 11:31 -0500, J. Bruce Fields
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Nov 16, 2016 at 02:55:05PM -0600, Jason L
>> > > > > > > > > > > Tibbitts III wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I'm replying to a rather old message, but the issue has just now
>> > > > > > > > > > > > popped back up again.
>> > > > > > > > > > > >
>> > > > > > > > > > > > To recap, a client stops being able to access _any_ mount on a
>> > > > > > > > > > > > particular server, and "NFS: nfs4_reclaim_open_state: Lock reclaim
>> > > > > > > > > > > > failed!" appears several hundred times per second in the kernel log.
>> > > > > > > > > > > > The load goes up by one for every process attempting to access any
>> > > > > > > > > > > > mount from that particular server.  Mounts to other servers are fine,
>> > > > > > > > > > > > and other clients can mount things from that one server without
>> > > > > > > > > > > > problems.
>> > > > > > > > > > > >
>> > > > > > > > > > > > When I kill every process keeping that particular mount active and
>> > > > > > > > > > > > then umount it, I see:
>> > > > > > > > > > > >
>> > > > > > > > > > > > NFS: nfs4_reclaim_open_state: unhandled error -10068
>> > > > > > > > > > >
>> > > > > > > > > > > NFS4ERR_RETRY_UNCACHED_REP.
>> > > > > > > > > > >
>> > > > > > > > > > > So, you're using NFSv4.1 or 4.2, and the server thinks that the
>> > > > > > > > > > > client has reused a (slot, sequence number) pair, but the server
>> > > > > > > > > > > doesn't have a cached response to return.
>> > > > > > > > > > >
>> > > > > > > > > > > Hard to know how that happened, and it's not shown in the below.
>> > > > > > > > > > > Sounds like a bug, though.
>> > > > > > > > > >
>> > > > > > > > > > ...or a Ctrl-C....
>> > > > > > > > >
>> > > > > > > > > How does that happen?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > If I may chime in...
>> > > > > > > >
>> > > > > > > > Bruce, when an application gets a Ctrl-C and the client's session slot
>> > > > > > > > has sent out an RPC but hasn't processed the reply, the client doesn't
>> > > > > > > > know whether the server processed that sequence ID or not. In that
>> > > > > > > > case, the client doesn't increment the sequence number. Normally the
>> > > > > > > > client would handle getting such an error by retrying (and resetting
>> > > > > > > > the slots), but I think during a recovery operation the client handles
>> > > > > > > > errors differently (by just erroring out). I believe the reasoning is
>> > > > > > > > that we don't want to be stuck trying to recover from the recovery
>> > > > > > > > from the recovery etc...
>> > > > > > >
>> > > > > > > So in that case the client can end up sending a different rpc reusing
>> > > > > > > the old slot and sequence number?
>> > > > > >
>> > > > > > Correct.
>> > > > >
>> > > > > So that could get UNCACHED_REP as the response.  But if you're very
>> > > > > unlucky, couldn't this also happen?:
>> > > > >
>> > > > >         1) the compound previously sent on that slot was processed by
>> > > > >         the server and cached
>> > > > >         2) the compound you're sending now happens to have the same set
>> > > > >         of operations
>> > > > >
>> > > > > with the result that the client doesn't detect that the reply was
>> > > > > actually to some other rpc, and instead it returns bad data to the
>> > > > > application?
>> > > >
>> > > > If you are sending exactly the same operations and arguments, then
>> > > > why would a reply from the cache lead to bad data?
>> > >
>> > > That would probably be fine, I was wondering what would happen if you
>> > > sent the same operation but different arguments.
>> >
>> > >
>> > > So the original cancelled operation is something like
>> > > PUTFH(fh1)+OPEN("foo")+GETFH, and the new one is
>> > > PUTFH(fh2)+OPEN("bar")+GETFH.  In theory couldn't the second one succeed
>> > > and leave the client thinking it had opened (fh2, bar) when the
>> > > filehandle it got back was really for (fh1, foo)?
>> > >
>> >
>> > The client would receive a filehandle for fh1/"foo", so it would apply
>> > any state it thought it had received to that file. However, normally,
>> > I'd expect to see a NFS4ERR_SEQ_FALSE_RETRY in this case.
>>
>> I see Bruce's point that if the server only looks up the cache based
>> on the seqid and slot# and doesn't keep something like a hash of the
>> content (which I could see being expensive), then the client in this
>> case would get into trouble by thinking it opened "bar" when really
>> it's "foo". Spec says:
>>
>> Section 18.46.3
>> If the client reuses a slot ID and sequence ID for a completely
>>    different request, the server MAY treat the request as if it is a
>>    retry of what it has already executed.  The server MAY however
>>    detect the client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY.
>>
>> What is "a completely different request"? From the client's point of
>> view sending different args would constitute a different request. But
>> in any case it's a "MAY", so the client can't depend on this being
>> implemented.
>>
>
> What's the alternative? Assume the client pre-emptively bumps the seqid
> instead of retrying, then the user presses Ctrl-C again. Repeat a few
> more times. How do I now resync the seqids between the client and
> server other than by trashing the session?

I don't see any alternative to resetting in that case. But I think
that's better than the possibility of accidentally opening the wrong file?
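
To make the scenario concrete, here is a rough Python sketch of the
per-slot replay check discussed above. It is a toy model, not the
Linux server or client code: the Slot class, handle_sequence(), the
check_false_retry flag and the string "replies" are all made up for
illustration, and only the NFS4ERR_* names come from the spec. It also
ignores seqid wraparound, and the order of the checks is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple

# Status names follow the spec; everything else here is hypothetical.
NFS4_OK = "NFS4_OK"
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"
NFS4ERR_RETRY_UNCACHED_REP = "NFS4ERR_RETRY_UNCACHED_REP"
NFS4ERR_SEQ_FALSE_RETRY = "NFS4ERR_SEQ_FALSE_RETRY"


@dataclass
class Slot:
    seqid: int = 0                      # last sequence ID executed on this slot
    cached_reply: Optional[str] = None  # None models "reply was not cached"
    digest: Optional[bytes] = None      # hash of the last request, if kept


def digest_of(request: str) -> bytes:
    return hashlib.sha256(request.encode()).digest()


def handle_sequence(slot: Slot, seqid: int, request: str,
                    check_false_retry: bool,
                    cache_reply: bool = True) -> Tuple[str, Optional[str]]:
    """Toy model of a server deciding between execute, replay, and error."""
    if seqid == slot.seqid + 1:
        # New request: "execute" it and remember the result.
        reply = "reply-to:" + request
        slot.seqid = seqid
        slot.cached_reply = reply if cache_reply else None
        slot.digest = digest_of(request)
        return NFS4_OK, reply
    if seqid != slot.seqid:
        return NFS4ERR_SEQ_MISORDERED, None
    # Same (slot, seqid) as last time: the server treats this as a retry.
    if check_false_retry and slot.digest != digest_of(request):
        return NFS4ERR_SEQ_FALSE_RETRY, None
    if slot.cached_reply is None:
        return NFS4ERR_RETRY_UNCACHED_REP, None
    return NFS4_OK, slot.cached_reply


FOO = 'PUTFH(fh1)+OPEN("foo")+GETFH'
BAR = 'PUTFH(fh2)+OPEN("bar")+GETFH'

lenient = Slot()
handle_sequence(lenient, 1, FOO, check_false_retry=False)
# The reply is lost to a Ctrl-C; the client reuses seqid 1 for a new compound.
status, reply = handle_sequence(lenient, 1, BAR, check_false_retry=False)
print(status, reply)
```

In this model the last call returns NFS4_OK together with the cached
reply for "foo", which is the silent mix-up Bruce describes. Passing
check_false_retry=True makes the same call return
NFS4ERR_SEQ_FALSE_RETRY instead, and a slot filled with
cache_reply=False returns NFS4ERR_RETRY_UNCACHED_REP, which matches
the -10068 in the original report.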
