From: Olga Kornievskaia <aglo@umich.edu>
To: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
	"bfields@fieldses.org" <bfields@fieldses.org>,
	"Anna.Schumaker@netapp.com" <Anna.Schumaker@netapp.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"joseph.qi@linux.alibaba.com" <joseph.qi@linux.alibaba.com>
Subject: Re: [bug report] task hang while testing xfstests generic/323
Date: Tue, 19 Mar 2019 11:33:10 -0400
Message-ID: <CAN-5tyHxQ7q9aR=dmS4_-b0269vcepjcv7r6W9-soDhp4McAKA@mail.gmail.com>
In-Reply-To: <63f00c10-28f5-b1e6-dcf9-f85f5edbfabf@linux.alibaba.com>

On Sat, Mar 16, 2019 at 10:11 AM Jiufei Xue
<jiufei.xue@linux.alibaba.com> wrote:
>
>
> Hi Olga,
> On 2019/3/16 4:33 AM, Olga Kornievskaia wrote:
> > On Fri, Mar 15, 2019 at 2:31 AM Jiufei Xue <jiufei.xue@linux.alibaba.com> wrote:
> >>
> >> Hi Olga,
> >>
> >> On 2019/3/11 11:13 PM, Olga Kornievskaia wrote:
> >>> Let me double check that. I have reproduced the "infinite loop" of
> >>> CLOSE on the upstream (I'm looking thru the trace points from Friday).
> >>
> >> Did you try to capture the packets when you reproduced this issue on
> >> the upstream? I still lose kernel packets after some adjustments
> >> according to bfields' suggestion :(
> >
> > Hi Jiufei,
> >
> > Yes, I have network trace captures, but they are too big to post to
> > the mailing list. I have reproduced the problem on the latest upstream
> > origin/testing branch, at commit "SUNRPC: Take the transport send lock
> > before binding+connecting". As you have noted before, the infinite
> > loop is due to the client "losing" an update to the seqid.
> >
> > One packet would send out a (recovery) OPEN with slot=0 seqid=Y. The
> > tracepoint (nfs4_open_file) would log status=ERESTARTSYS. The rpc
> > task would be sent and would receive a reply, but there is nobody
> > there to receive it... The OPEN that got a reply carries an updated
> > stateid seqid which the client never records. When CLOSE is sent,
> > it's sent with the "old" stateid, which puts the client in an
> > infinite loop. Btw, CLOSE is sent on the interrupted slot, which
> > should get FALSE_RETRY, which causes the client to terminate the
> > session. But it would still keep sending the CLOSE with the old
> > stateid.
> >
> > One thing I've noticed is that the TEST_STATEID op (as a part of
> > nfs41_test_and_free_expired_stateid()) for some reason always has a
> > signal set even before issuing an RPC task, so the task never
> > completes (ever).
> >
> > I always thought that OPENs can't be interrupted, but I guess they can
> > be, since they call rpc_wait_for_completion_task() and that's a
> > killable wait. But I don't know how to find out what's sending a
> > signal to the process, so I'm rather stuck here. I'm still trying to
> > figure out what's causing the signal, or else how to recover from it
> > so that the client doesn't lose that seqid.
> >
> >>
> Thank you for your quick reply.
>
> I have also noticed the ERESTARTSYS status for OPEN op, but I think it
> is returned by the open process which is woken in nfs4_run_open_task().
> I found that the decode routine nfs4_xdr_dec_open returned -121, which
> I thought is the root cause of this problem. Could you please post the
> content of the last OPEN message?


Hi Jiufei,

Yes, I think that's why the update isn't happening: the rpc_status
isn't 0.

Trond,

The rpc tasks that were interrupted but are still finishing can't
complete with a successful rpc_status, because when they try to
decode_sequence the res->sr_slot is NULL. The SEQUENCE op is not
decoded, and then when the decoder tries to decode the PUTFH it throws
an unexpected op error (expecting PUTFH but SEQUENCE is still there
instead).
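
To make the failure concrete, here is a small userspace model of that
decode path. It is only a sketch and not the kernel's XDR code: every
toy_* name is made up for illustration, and the assumption is simply
that the decoder skips SEQUENCE when it has no slot and returns -121
(EREMOTEIO on Linux, matching the value Jiufei saw) on an unexpected op.

```c
#include <stddef.h>

#define TOY_EREMOTEIO 121	/* numeric value of EREMOTEIO on Linux */

/* Hypothetical stand-ins for the ops in a compound OPEN reply. */
enum toy_op { TOY_OP_SEQUENCE, TOY_OP_PUTFH, TOY_OP_OPEN };

struct toy_reply {
	const enum toy_op *ops;	/* ops in the order the server sent them */
	size_t nops;
	size_t pos;		/* next op to be decoded */
};

/* Consume the next op only if it is the one the decoder expects. */
static int toy_decode_op(struct toy_reply *r, enum toy_op expected)
{
	if (r->pos >= r->nops || r->ops[r->pos] != expected)
		return -TOY_EREMOTEIO;
	r->pos++;
	return 0;
}

/*
 * "slot" models res->sr_slot. With a NULL slot the SEQUENCE op is left
 * unconsumed, so the PUTFH check below finds SEQUENCE and fails.
 */
static int toy_decode_open_reply(struct toy_reply *r, const void *slot)
{
	int status;

	if (slot != NULL) {
		status = toy_decode_op(r, TOY_OP_SEQUENCE);
		if (status)
			return status;
	}
	status = toy_decode_op(r, TOY_OP_PUTFH);
	if (status)
		return status;
	return toy_decode_op(r, TOY_OP_OPEN);
}
```

With a reply of { SEQUENCE, PUTFH, OPEN }, passing a non-NULL slot
decodes cleanly, while passing NULL fails at the PUTFH step.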

res->sr_slot is going away because after the open(s) were interrupted
and _nfs4_proc_open() returned an error (interrupted), the caller,
_nfs4_open_and_get_state(), goes and frees the slot.

Is it perhaps appropriate to free the slot there only if
(!data->cancelled)? Otherwise, let the async RPC task continue, and
once it's done it'll free the slot.
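
The ownership rule I have in mind can be modelled in userspace like
this. Again just a sketch with invented toy_* names, not kernel code; it
assumes the async completion path releases the slot once the reply has
been handled.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_slot {
	int in_use;
};

struct toy_opendata {
	struct toy_slot *slot;	/* models o_res.seq_res.sr_slot */
	bool cancelled;		/* caller interrupted, RPC still in flight */
};

/* Release the slot and drop the reference to it; safe to call twice. */
static void toy_free_slot(struct toy_opendata *d)
{
	if (d->slot != NULL) {
		d->slot->in_use = 0;
		d->slot = NULL;
	}
}

/* Synchronous exit path: keep the slot if the RPC may still finish. */
static void toy_open_exit(struct toy_opendata *d)
{
	if (!d->cancelled)
		toy_free_slot(d);
}

/*
 * Async completion path: the reply can only be decoded while the slot
 * is still attached; afterwards the slot is released here.
 */
static bool toy_rpc_done(struct toy_opendata *d)
{
	bool decoded = d->slot != NULL;

	toy_free_slot(d);
	return decoded;
}
```

In the interrupted case (cancelled == true) toy_open_exit() leaves the
slot alone, so toy_rpc_done() can still decode and then frees it. In the
normal case the slot is freed on exit as before.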

How's this for a proposed fix:
Subject: [PATCH 1/1] NFSv4.1 don't free interrupted slot on open

Allow the async rpc task to finish and update the open state if needed,
then free the slot. Otherwise, the async rpc is unable to decode the reply.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
---
 fs/nfs/nfs4proc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 4dbb0ee..96c2499 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2933,7 +2933,8 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 	}

 out:
-	nfs4_sequence_free_slot(&opendata->o_res.seq_res);
+	if (!opendata->cancelled)
+		nfs4_sequence_free_slot(&opendata->o_res.seq_res);
 	return ret;
 }




>
> Thanks,
> Jiufei.
>
>
>
> >> Thanks,
> >> Jiufei
> >
