All of lore.kernel.org
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@hammerspace.com>
To: "aglo@umich.edu" <aglo@umich.edu>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"jiufei.xue@linux.alibaba.com" <jiufei.xue@linux.alibaba.com>,
	"Anna.Schumaker@netapp.com" <Anna.Schumaker@netapp.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"joseph.qi@linux.alibaba.com" <joseph.qi@linux.alibaba.com>
Subject: Re: [bug report] task hang while testing xfstests generic/323
Date: Mon, 11 Mar 2019 15:28:53 +0000	[thread overview]
Message-ID: <7ffd7594113b9d7e3105aef61752caa1f01e61e3.camel@hammerspace.com> (raw)
In-Reply-To: <CAN-5tyH52dv7zuCkQoUS04rpbqotWn+SM1rJoxqBgsWG2bKLcg@mail.gmail.com>

On Mon, 2019-03-11 at 11:14 -0400, Olga Kornievskaia wrote:
> On Mon, Mar 11, 2019 at 11:12 AM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > On Mon, 2019-03-11 at 14:30 +0000, Trond Myklebust wrote:
> > > Hi Olga,
> > > 
> > > On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote:
> > > > There are a bunch of cases where multiple operations are using
> > > > the
> > > > same seqid and slot.
> > > > 
> > > > Example of such weirdness is (nfs.seqid == 0x000002f4) &&
> > > > (nfs.slotid
> > > > == 0) and the one leading to the hang.
> > > > 
> > > > In frame 415870, there is an OPEN using that seqid and slot for
> > > > the
> > > > first time (but this slot will be re-used a bunch of times
> > > > before
> > > > it
> > > > gets a reply in frame 415908 with the open stateid
> > > > seq=40).  (also
> > > > in
> > > > this packet there is an example of reuse
> > > > slot=1+seqid=0x000128f7 by
> > > > both TEST_STATEID and OPEN but let's set that aside).
> > > > 
> > > > In frame 415874 (in the same packet), client sends 5 opens on
> > > > the
> > > > SAME
> > > > seqid and slot (all have distinct xids). In a ways that's end
> > > > up
> > > > being
> > > > alright since opens are for the same file and thus reply out of
> > > > the
> > > > cache and the reply is ERR_DELAY. But in frame 415876, client
> > > > sends
> > > > again uses the same seqid and slot  and in this case it's used
> > > > by
> > > > 3opens and a test_stateid.
> > 
> > This should result in exactly 1 bump of the stateid seqid.
> > 
> > > > Client in all this mess never processes the open stateid seq=40
> > > > and
> > > > keeps on resending CLOSE with seq=37 (also to note client
> > > > "missed"
> > > > processing seqid=38 and 39 as well. 39 probably because it was
> > > > a
> > > > reply
> > > > on the same kind of "Reused" slot=1 and seq=0x000128f7. I
> > > > haven't
> > > > tracked 38 but i'm assuming it's the same). I don't know how
> > > > many
> > > > times but after 5mins, I see a TEST_STATEID that again uses the
> > > > same
> > > > seqid+slot (which gets a reply from the cache matching OPEN).
> > > > Also
> > > > open + close (still with seq=37) open is replied to but after
> > > > this
> > > > client goes into a soft lockup logs have
> > > > "nfs4_schedule_state_manager:
> > > > kthread_ruan: -4" over and over again . then a soft lockup.
> > > > 
> > > > Looking back on slot 0. nfs.seqid=0x000002f3 was used in
> > > > frame=415866
> > > > by the TEST_STATEID. This is replied to in frame 415877 (with
> > > > an
> > > > ERR_DELAY). But before the client got a reply, it used the slot
> > > > and
> > > > the seq by frame 415874. TEST_STATEID is a synchronous and
> > > > interruptible operation. I'm suspecting that somehow it was
> > > > interrupted and that's who the slot was able to be re-used by
> > > > the
> > > > frame 415874. But how the several opens were able to get the
> > > > same
> > > > slot
> > > > I don't know..
> > > 
> > > Is this still true with the current linux-next? I would expect
> > > this
> > > patch
> > > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971
> > > to change the Linux client behaviour in the above regard.
> > > 
> > 
> > Note also that what you describe would appear to indicate a serious
> > server bug. If the client is reusing a slot+seqid, then the correct
> > server behaviour is to either return one of the errors
> > NFS4ERR_DELAY,
> > NFS4ERR_SEQ_FALSE_RETRY, NFS4ERR_RETRY_UNCACHED_REP,
> > NFS4ERR_SEQ_MISORDERED, or else it must replay the exact same reply
> > that it had cached for the original request.
> > 
> > It is protocol violation for the server to execute new requests on
> > a
> > slot+seqid combination that has already been used. For that reason,
> > it
> > is also impossible for a replay to cause further state changes on
> > the
> > server; a replay means that the server belts out the exact same
> > reply
> > that was attempted sent earlier with no changes (with stateids that
> > match bit for bit what would have been sent earlier).
> > 
> 
> But it is the same requests because all of them are opens on the same
> file same everything.

That is irrelevant. The whole point of the session slot mechanism is to
provide reliable only once semantics as defined in section 2.10.6 of
RFC5661: https://tools.ietf.org/html/rfc5661#section-2.10.6

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



      reply	other threads:[~2019-03-11 15:29 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-28 10:10 [bug report] task hang while testing xfstests generic/323 Jiufei Xue
2019-02-28 22:26 ` Olga Kornievskaia
2019-02-28 23:56   ` Trond Myklebust
2019-03-01  5:19     ` Jiufei Xue
2019-03-01  5:08   ` Jiufei Xue
2019-03-01  8:49     ` Jiufei Xue
2019-03-01 13:08       ` Trond Myklebust
2019-03-02 16:34         ` Jiufei Xue
2019-03-04 15:20         ` Jiufei Xue
2019-03-04 15:50           ` Trond Myklebust
2019-03-05  5:09             ` Jiufei Xue
2019-03-05 14:45               ` Trond Myklebust
2019-03-06  9:59                 ` Jiufei Xue
2019-03-06 16:09                   ` bfields
2019-03-10 22:20                     ` Olga Kornievskaia
2019-03-11 14:30                       ` Trond Myklebust
2019-03-11 15:07                         ` Olga Kornievskaia
2019-03-11 15:13                           ` Olga Kornievskaia
2019-03-15  6:30                             ` Jiufei Xue
2019-03-15 20:33                               ` Olga Kornievskaia
2019-03-15 20:55                                 ` Trond Myklebust
2019-03-16 14:11                                 ` Jiufei Xue
2019-03-19 15:33                                   ` Olga Kornievskaia
2019-03-11 15:12                         ` Trond Myklebust
2019-03-11 15:14                           ` Olga Kornievskaia
2019-03-11 15:28                             ` Trond Myklebust [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7ffd7594113b9d7e3105aef61752caa1f01e61e3.camel@hammerspace.com \
    --to=trondmy@hammerspace.com \
    --cc=Anna.Schumaker@netapp.com \
    --cc=aglo@umich.edu \
    --cc=bfields@fieldses.org \
    --cc=jiufei.xue@linux.alibaba.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.