All of lore.kernel.org
 help / color / mirror / Atom feed
From: Olga Kornievskaia <aglo@umich.edu>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"jiufei.xue@linux.alibaba.com" <jiufei.xue@linux.alibaba.com>,
	"Anna.Schumaker@netapp.com" <Anna.Schumaker@netapp.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"joseph.qi@linux.alibaba.com" <joseph.qi@linux.alibaba.com>
Subject: Re: [bug report] task hang while testing xfstests generic/323
Date: Mon, 11 Mar 2019 11:14:41 -0400	[thread overview]
Message-ID: <CAN-5tyH52dv7zuCkQoUS04rpbqotWn+SM1rJoxqBgsWG2bKLcg@mail.gmail.com> (raw)
In-Reply-To: <7824abe2b39fef6c5e1cd9cc53732fc91728871c.camel@hammerspace.com>

On Mon, Mar 11, 2019 at 11:12 AM Trond Myklebust
<trondmy@hammerspace.com> wrote:
>
> On Mon, 2019-03-11 at 14:30 +0000, Trond Myklebust wrote:
> > Hi Olga,
> >
> > On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote:
> > > There are a bunch of cases where multiple operations are using the
> > > same seqid and slot.
> > >
> > > Example of such weirdness is (nfs.seqid == 0x000002f4) &&
> > > (nfs.slotid
> > > == 0) and the one leading to the hang.
> > >
> > > In frame 415870, there is an OPEN using that seqid and slot for the
> > > first time (but this slot will be re-used a bunch of times before
> > > it
> > > gets a reply in frame 415908 with the open stateid seq=40).  (also
> > > in
> > > this packet there is an example of reuse slot=1+seqid=0x000128f7 by
> > > both TEST_STATEID and OPEN but let's set that aside).
> > >
> > > In frame 415874 (in the same packet), client sends 5 opens on the
> > > SAME
> > > seqid and slot (all have distinct xids). In a ways that's end up
> > > being
> > > alright since opens are for the same file and thus reply out of the
> > > cache and the reply is ERR_DELAY. But in frame 415876, client sends
> > > again uses the same seqid and slot  and in this case it's used by
> > > 3opens and a test_stateid.
>
> This should result in exactly 1 bump of the stateid seqid.
>
> > > Client in all this mess never processes the open stateid seq=40 and
> > > keeps on resending CLOSE with seq=37 (also to note client "missed"
> > > processing seqid=38 and 39 as well. 39 probably because it was a
> > > reply
> > > on the same kind of "Reused" slot=1 and seq=0x000128f7. I haven't
> > > tracked 38 but i'm assuming it's the same). I don't know how many
> > > times but after 5mins, I see a TEST_STATEID that again uses the
> > > same
> > > seqid+slot (which gets a reply from the cache matching OPEN). Also
> > > open + close (still with seq=37) open is replied to but after this
> > > client goes into a soft lockup logs have
> > > "nfs4_schedule_state_manager:
> > > kthread_ruan: -4" over and over again . then a soft lockup.
> > >
> > > Looking back on slot 0. nfs.seqid=0x000002f3 was used in
> > > frame=415866
> > > by the TEST_STATEID. This is replied to in frame 415877 (with an
> > > ERR_DELAY). But before the client got a reply, it used the slot and
> > > the seq by frame 415874. TEST_STATEID is a synchronous and
> > > interruptible operation. I'm suspecting that somehow it was
> > > interrupted and that's who the slot was able to be re-used by the
> > > frame 415874. But how the several opens were able to get the same
> > > slot
> > > I don't know..
> >
> > Is this still true with the current linux-next? I would expect this
> > patch
> > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971
> > to change the Linux client behaviour in the above regard.
> >
>
>
> Note also that what you describe would appear to indicate a serious
> server bug. If the client is reusing a slot+seqid, then the correct
> server behaviour is to either return one of the errors NFS4ERR_DELAY,
> NFS4ERR_SEQ_FALSE_RETRY, NFS4ERR_RETRY_UNCACHED_REP,
> NFS4ERR_SEQ_MISORDERED, or else it must replay the exact same reply
> that it had cached for the original request.
>
> It is protocol violation for the server to execute new requests on a
> slot+seqid combination that has already been used. For that reason, it
> is also impossible for a replay to cause further state changes on the
> server; a replay means that the server belts out the exact same reply
> that was attempted sent earlier with no changes (with stateids that
> match bit for bit what would have been sent earlier).
>

But it is the same requests because all of them are opens on the same
file same everything.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>

  reply	other threads:[~2019-03-11 15:14 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-28 10:10 [bug report] task hang while testing xfstests generic/323 Jiufei Xue
2019-02-28 22:26 ` Olga Kornievskaia
2019-02-28 23:56   ` Trond Myklebust
2019-03-01  5:19     ` Jiufei Xue
2019-03-01  5:08   ` Jiufei Xue
2019-03-01  8:49     ` Jiufei Xue
2019-03-01 13:08       ` Trond Myklebust
2019-03-02 16:34         ` Jiufei Xue
2019-03-04 15:20         ` Jiufei Xue
2019-03-04 15:50           ` Trond Myklebust
2019-03-05  5:09             ` Jiufei Xue
2019-03-05 14:45               ` Trond Myklebust
2019-03-06  9:59                 ` Jiufei Xue
2019-03-06 16:09                   ` bfields
2019-03-10 22:20                     ` Olga Kornievskaia
2019-03-11 14:30                       ` Trond Myklebust
2019-03-11 15:07                         ` Olga Kornievskaia
2019-03-11 15:13                           ` Olga Kornievskaia
2019-03-15  6:30                             ` Jiufei Xue
2019-03-15 20:33                               ` Olga Kornievskaia
2019-03-15 20:55                                 ` Trond Myklebust
2019-03-16 14:11                                 ` Jiufei Xue
2019-03-19 15:33                                   ` Olga Kornievskaia
2019-03-11 15:12                         ` Trond Myklebust
2019-03-11 15:14                           ` Olga Kornievskaia [this message]
2019-03-11 15:28                             ` Trond Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAN-5tyH52dv7zuCkQoUS04rpbqotWn+SM1rJoxqBgsWG2bKLcg@mail.gmail.com \
    --to=aglo@umich.edu \
    --cc=Anna.Schumaker@netapp.com \
    --cc=bfields@fieldses.org \
    --cc=jiufei.xue@linux.alibaba.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.