Re: Spurious instability with NFSoRDMA under moderate load

From: Olga Kornievskaia <aglo@umich.edu>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Bruce Fields <bfields@redhat.com>,
	Timo Rothenpieler <timo@rothenpieler.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Dai Ngo <dai.ngo@oracle.com>
Subject: Re: Spurious instability with NFSoRDMA under moderate load
Date: Wed, 11 Aug 2021 16:51:41 -0400	[thread overview]
Message-ID: <CAN-5tyGcPiZ83H1-NU67n1jPKrpkrk3E0xXV+d11rQJX6_MbKA@mail.gmail.com> (raw)
In-Reply-To: <D417C606-9E27-431E-B80E-EE927E62A316@oracle.com>

On Wed, Aug 11, 2021 at 4:01 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Aug 11, 2021, at 3:46 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Wed, Aug 11, 2021 at 2:52 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Aug 11, 2021, at 2:38 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >>>
> >>> On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> >>>>>
> >>>>> resulting dmesg and trace logs of both client and server are attached.
> >>>>>
> >>>>> Test procedure:
> >>>>>
> >>>>> - start tracing on client and server
> >>>>> - mount NFS on client
> >>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
> >>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
> >>>>> - run xfs_io command again, getting stuck now
> >>>>> - let it sit there stuck for a minute, then cancel it
> >>>>> - run the command again
> >>>>> - while it's still stuck, finished recording the logs and traces
> >>>>
> >>>> The server tries to send CB_OFFLOAD when the offloaded copy
> >>>> completes, but finds the backchannel transport is not connected.
> >>>>
> >>>> The server can't report the problem until the client sends a
> >>>> SEQUENCE operation, but there's really no other traffic going
> >>>> on, so it just waits.
> >>>>
> >>>> The client eventually sends a singleton SEQUENCE to renew its
> >>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
> >>>> flag set at that point. Client's recovery is to destroy that
> >>>> session and create a new one. That appears to be successful.
> >>>>
> >>>> But the server doesn't send another CB_OFFLOAD to let the client
> >>>> know the copy is complete, so the client hangs.
> >>>>
> >>>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
> >>>> other CB operations suffer from the same "failed to retransmit
> >>>> after the CB path is restored" issue. It might not matter for
> >>>> some of them, but for others like CB_RECALL, that could be
> >>>> important.
> >>>
> >>> Thank you for the analysis Chuck (btw I haven't seen any attachments
> >>> with Timo's posts so I'm assuming some offline communication must have
> >>> happened).
> >>> ?
> >>> I'm looking at the code and wouldn't the mentioned flags be set on the
> >>> CB_SEQUENCE operation?
> >>
> >> CB_SEQUENCE is sent from server to client, and that can't work if
> >> the callback channel is down.
> >>
> >> So the server waits for the client to send a SEQUENCE and it sets
> >> the SEQ4_STATUS_BACKCHANNEL_FAULT in its reply.
> >
> > yes scratch that, this is for when CB_SEQUENCE has it in its reply.
> >
> >>> nfsd4_cb_done() has code to mark the channel
> >>> and retry (or another way of saying this, this code should generically
> >>> handle retrying whatever operation it is be it CB_OFFLOAD or
> >>> CB_RECALL)?
> >>
> >> cb_done() marks the callback fault, but as far as I can tell the
> >> RPC is terminated at that point and there is no subsequent retry.
> >> The RPC_TASK flags on the CB_OFFLOAD operation cause that RPC to
> >> fail immediately if there's no connection.
> >>
> >> And in the BACKCHANNEL_FAULT case, the bc_xprt is destroyed as
> >> part of recovery. I think that would kill all pending RPC tasks.
> >>
> >>
> >>> Is that not working (not sure if this is  a question or a
> >>> statement).... I would think that would be the place to handle this
> >>> problem.
> >>
> >> IMHO the OFFLOAD code needs to note that the CB_OFFLOAD RPC
> >> failed and then try the call again once the new backchannel is
> >> available.
> >
> > I still argue that this needs to be done generically not per operation
> > as CB_RECALL has the same problem.
>
> Probably not just CB_RECALL, but agreed, there doesn't seem to
> be any mechanism that can re-drive callback operations when the
> backchannel is replaced.
>
>
> > If CB_RECALL call is never
> > answered, rpc would fail with ETIMEOUT.
>
> Well we have a mechanism already to deal with that case, which is
> delegation revocation. That's not optimal, but at least it won't
> leave the client hanging forever.
>
>
> > nfsd4_cb_recall_done() returns
> > 1 back to the nfsd4_cb_done() which then based on the rpc task status
> > would set the callback path down but will do nothing else.
> > nfsd4_cb_offload_one() just always returns 1.
> >
> > Now given that you say during the backchannel fault case, we'll have
> > to destroy that backchannel and create a new one.
>
> BACKCHANNEL_FAULT is the particular case that Timo's reproducer
> exercises. I haven't looked closely at the CB_PATH_DOWN and
> CB_PATH_DOWN_SESSION cases, which use BIND_CONN_TO_SESSION,
> to see if those also destroy the backchannel transport.
>
> However I suspect they are just as broken. There isn't much
> regression testing around these scenarios.
>
>
> > Are we losing all those RPCs that flagged it as faulty?
>
> I believe the RPC tasks are already gone at that point. I'm
> just saying that /hypothetically/ if the RPCs were not set up
> to exit immediately, that wouldn't matter because the transport
> and client are destroyed as part of restoring the backchannel.
>
> Perhaps instead of destroying the rpc_clnt, the server should
> retain it and simply recycle the underlying transport data
> structures.
>
>
> > At this point, nothing obvious
> > comes to mind how to engineer the recovery but it should be done.
>
> Well I think we need to:
>
> 1. Review the specs to see if there's any indication of how to
> recover from this situation

I guess one fix we can consider on the client is to make waiting for
the CB_OFFLOAD time based and then use OFFLOAD_STATUS operation to
check on the copy status.

> 2. Perhaps construct a way to move CB RPCs to a workqueue so
> they can be retried indefinitely (or figure out when they should
> stop being retried: COPY_OFFLOAD_CANCEL comes to mind).
>
> 3. Keep an eye peeled for cases where a malicious or broken
> client could cause CB operations to tie up all the server's
> nfsd threads or other resources.
>
> --
> Chuck Lever
>
>
>