Re: interrupted rpcs problem

From: Trond Myklebust <trondmy@hammerspace.com>
To: "aglo@umich.edu" <aglo@umich.edu>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: interrupted rpcs problem
Date: Tue, 14 Jan 2020 22:17:40 +0000	[thread overview]
Message-ID: <754a2f8ea604d137f93bc37080d7bf296882d40f.camel@hammerspace.com> (raw)
In-Reply-To: <1f7833103abdeb9976d5c0aa2f4c1f7c5f06edb4.camel@hammerspace.com>

On Tue, 2020-01-14 at 15:52 -0500, Trond Myklebust wrote:
> On Tue, 2020-01-14 at 13:43 -0500, Olga Kornievskaia wrote:
> > On Mon, Jan 13, 2020 at 4:51 PM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > > On Mon, 2020-01-13 at 16:05 -0500, Olga Kornievskaia wrote:
> > > > On Mon, Jan 13, 2020 at 1:24 PM Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > > On Mon, 2020-01-13 at 13:09 -0500, Olga Kornievskaia wrote:
> > > > > > On Mon, Jan 13, 2020 at 11:49 AM Trond Myklebust
> > > > > > <trondmy@hammerspace.com> wrote:
> > > > > > > On Mon, 2020-01-13 at 11:08 -0500, Olga Kornievskaia
> > > > > > > wrote:
> > > > > > > > On Fri, Jan 10, 2020 at 4:03 PM Trond Myklebust <
> > > > > > > > trondmy@hammerspace.com> wrote:
> > > > > > > > > On Fri, 2020-01-10 at 14:29 -0500, Olga Kornievskaia
> > > > > > > > > wrote:
> > > > > > > > > > Hi folks,
> > > > > > > > > > 
> > > > > > > > > > We are having an issue with an interrupted RPCs
> > > > > > > > > > again.
> > > > > > > > > > Here's
> > > > > > > > > > what I
> > > > > > > > > > see when xfstests were ctrl-c-ed.
> > > > > > > > > > 
> > > > > > > > > > frame 332 SETATTR call slot=0 seqid=0x000013ca (I'm
> > > > > > > > > > assuming
> > > > > > > > > > this
> > > > > > > > > > is
> > > > > > > > > > interrupted and released)
> > > > > > > > > > frame 333 CLOSE call slot=0 seqid=0x000013cb  (only
> > > > > > > > > > way
> > > > > > > > > > the
> > > > > > > > > > slot
> > > > > > > > > > could
> > > > > > > > > > be free before the reply if it was interrupted,
> > > > > > > > > > right?
> > > > > > > > > > Otherwise
> > > > > > > > > > we
> > > > > > > > > > should never have the slot used by more than one
> > > > > > > > > > outstanding
> > > > > > > > > > RPC)
> > > > > > > > > > frame 334 reply to 333 with SEQ_MIS_ORDERED (I'm
> > > > > > > > > > assuming
> > > > > > > > > > server
> > > > > > > > > > received frame 333 before 332)
> > > > > > > > > > frame 336 CLOSE call slot=0 seqid=0x000013ca (???
> > > > > > > > > > why
> > > > > > > > > > did
> > > > > > > > > > we
> > > > > > > > > > decremented it. I mean I know why it's in the
> > > > > > > > > > current
> > > > > > > > > > code :-
> > > > > > > > > > / )
> > > > > > > > > > frame 337 reply to 336 SEQUENCE with ERR_DELAY
> > > > > > > > > > frame 339 reply to 332 SETATTR which nobody is
> > > > > > > > > > waiting
> > > > > > > > > > for
> > > > > > > > > > frame 543 CLOSE call slot=0 seqid=0x000013ca (retry
> > > > > > > > > > after
> > > > > > > > > > waiting
> > > > > > > > > > for
> > > > > > > > > > err_delay)
> > > > > > > > > > frame 544 reply to 543 with SETATTR (out of the
> > > > > > > > > > cache).
> > > > > > > > > > 
> > > > > > > > > > What this leads to is: file is never closed on the
> > > > > > > > > > server.
> > > > > > > > > > Can't
> > > > > > > > > > remove it. Unmount fails with CLID_BUSY.
> > > > > > > > > > 
> > > > > > > > > > I believe that's the result of commit
> > > > > > > > > > 3453d5708b33efe76f40eca1c0ed60923094b971.
> > > > > > > > > > We used to have code that bumped the sequence up
> > > > > > > > > > when
> > > > > > > > > > the
> > > > > > > > > > slot
> > > > > > > > > > was
> > > > > > > > > > interrupted but after the commit "NFSv4.1: Avoid
> > > > > > > > > > false
> > > > > > > > > > retries
> > > > > > > > > > when
> > > > > > > > > > RPC calls are interrupted".
> > > > > > > > > > 
> > > > > > > > > > Commit has this "The obvious fix is to bump the
> > > > > > > > > > sequence
> > > > > > > > > > number
> > > > > > > > > > pre-emptively if an
> > > > > > > > > >     RPC call is interrupted, but in order to deal
> > > > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > corner
> > > > > > > > > > cases
> > > > > > > > > >     where the interrupted call is not actually
> > > > > > > > > > received
> > > > > > > > > > and
> > > > > > > > > > processed
> > > > > > > > > > by
> > > > > > > > > >     the server, we need to interpret the error
> > > > > > > > > > NFS4ERR_SEQ_MISORDERED
> > > > > > > > > >     as a sign that we need to either wait or locate
> > > > > > > > > > a
> > > > > > > > > > correct
> > > > > > > > > > sequence
> > > > > > > > > >     number that lies between the value we sent, and
> > > > > > > > > > the
> > > > > > > > > > last
> > > > > > > > > > value
> > > > > > > > > > that
> > > > > > > > > >     was acked by a SEQUENCE call on that slot."
> > > > > > > > > > 
> > > > > > > > > > If we can't no longer just bump the sequence up, I
> > > > > > > > > > don't
> > > > > > > > > > think
> > > > > > > > > > the
> > > > > > > > > > correct action is to automatically bump it down (as
> > > > > > > > > > per
> > > > > > > > > > example
> > > > > > > > > > here)?
> > > > > > > > > > The commit doesn't describe the corner case where
> > > > > > > > > > it
> > > > > > > > > > was
> > > > > > > > > > necessary to
> > > > > > > > > > bump the sequence up. I wonder if we can return the
> > > > > > > > > > knowledge
> > > > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > interrupted slot and make a decision based on that
> > > > > > > > > > as
> > > > > > > > > > well as
> > > > > > > > > > whatever
> > > > > > > > > > the other corner case is.
> > > > > > > > > > 
> > > > > > > > > > I guess what I'm getting is, can somebody (Trond)
> > > > > > > > > > provide
> > > > > > > > > > the
> > > > > > > > > > info
> > > > > > > > > > for
> > > > > > > > > > the corner case for this that patch was created. I
> > > > > > > > > > can
> > > > > > > > > > see if
> > > > > > > > > > I
> > > > > > > > > > can
> > > > > > > > > > fix the "common" case which is now broken and not
> > > > > > > > > > break
> > > > > > > > > > the
> > > > > > > > > > corner
> > > > > > > > > > case....
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > There is no pure client side solution for this
> > > > > > > > > problem.
> > > > > > > > > 
> > > > > > > > > The change was made because if you have multiple
> > > > > > > > > interruptions
> > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > RPC call, then the client has to somehow figure out
> > > > > > > > > what
> > > > > > > > > the
> > > > > > > > > correct
> > > > > > > > > slot number is. If it starts low, and then goes high,
> > > > > > > > > and
> > > > > > > > > the
> > > > > > > > > server is
> > > > > > > > > not caching the arguments for the RPC call that is in
> > > > > > > > > the
> > > > > > > > > session
> > > > > > > > > cache, then we will _always_ hit this bug because we
> > > > > > > > > will
> > > > > > > > > always
> > > > > > > > > hit
> > > > > > > > > the replay of the last entry.
> > > > > > > > > 
> > > > > > > > > At least if we start high, and iterate by low, then
> > > > > > > > > we
> > > > > > > > > reduce
> > > > > > > > > the
> > > > > > > > > problem to being a race with the processing of the
> > > > > > > > > interrupted
> > > > > > > > > request
> > > > > > > > > as it is in this case.
> > > > > > > > > 
> > > > > > > > > However, as I said, the real solution here has to
> > > > > > > > > involve
> > > > > > > > > the
> > > > > > > > > server.
> > > > > > > > 
> > > > > > > > Ok I see your point that if the server cached the
> > > > > > > > arguments,
> > > > > > > > then
> > > > > > > > the
> > > > > > > > server would tell that 2nd rpc using the same
> > > > > > > > slot+seqid
> > > > > > > > has
> > > > > > > > different
> > > > > > > > args and would not use the replay cache.
> > > > > > > > 
> > > > > > > > However, I wonder if the client can do better. Can't we
> > > > > > > > be
> > > > > > > > more
> > > > > > > > aware
> > > > > > > > of when we are interrupting the rpc? For instance, if
> > > > > > > > we
> > > > > > > > are
> > > > > > > > interrupted after we started to wait on the RPC,
> > > > > > > > doesn't
> > > > > > > > it
> > > > > > > > mean
> > > > > > > > the
> > > > > > > > rpc is sent on the network and since network is
> > > > > > > > reliable
> > > > > > > > then
> > > > > > > > server
> > > > > > > > must have consumed the seqid for that slot (in this
> > > > > > > > case
> > > > > > > > increment
> > > > > > > > seqid)? That's the case that's failing now.
> > > > > > > > 
> > > > > > > 
> > > > > > > "Reliable transport" does not mean that a client knows
> > > > > > > what
> > > > > > > got
> > > > > > > received and processed by the server and what didn't. All
> > > > > > > the
> > > > > > > client
> > > > > > > knows is that if the connection is still up, then the TCP
> > > > > > > layer
> > > > > > > will
> > > > > > > keep retrying transmission of the request. There are
> > > > > > > plenty
> > > > > > > of
> > > > > > > error
> > > > > > > scenarios where the client gets no information back as to
> > > > > > > whether
> > > > > > > or
> > > > > > > not the data was received by the server (e.g. due to lost
> > > > > > > ACKs).
> > > > > > > 
> > > > > > > Furthermore, if a RPC call is interrupted on the client,
> > > > > > > either
> > > > > > > due
> > > > > > > to
> > > > > > > a timeout or a signal,
> > > > > > 
> > > > > > What timeout are you referring to here since 4.1 rcp can't
> > > > > > timeout. I
> > > > > > think it only leaves a signal.
> > > > > 
> > > > > If you use 'soft' or 'softerr' mount options, then NFSv4.1
> > > > > will
> > > > > time
> > > > > out when the server is being unresponsive. That behaviour is
> > > > > different
> > > > > to the behaviour under a signal, but has the same effect of
> > > > > interrupting the RPC call without us being able to know if
> > > > > the
> > > > > server
> > > > > received the data.
> > > > > 
> > > > > > > then it almost always ends up breaking the
> > > > > > > connection in order to avoid corruption of the data
> > > > > > > stream
> > > > > > > (by
> > > > > > > interrupting the transmission before the entire RPC call
> > > > > > > has
> > > > > > > been
> > > > > > > sent). You generally have to be lucky to see the
> > > > > > > timeout/signal
> > > > > > > occur
> > > > > > > only when all the RPC calls being cancelled have exactly
> > > > > > > fit
> > > > > > > into
> > > > > > > the
> > > > > > > socket buffer.
> > > > > > 
> > > > > > Wouldn't a retransmission (due to a connection reset for
> > > > > > whatever
> > > > > > reason) be different and doesn't involve reprocessing of
> > > > > > the
> > > > > > slot.
> > > > > 
> > > > > I'm not talking about retransmissions here. I'm talking only
> > > > > about
> > > > > NFSv4.x RPC calls that suffer a fatal interruption (i.e. no
> > > > > retransmission).
> > > > > 
> > > > > > > Finally, just because the server's TCP layer ACKed
> > > > > > > receipt
> > > > > > > of
> > > > > > > the
> > > > > > > RPC
> > > > > > > call data, that does not mean that it will process that
> > > > > > > call.
> > > > > > > The
> > > > > > > connection could break before the call is read out of the
> > > > > > > receiving
> > > > > > > socket, or the server may later decide to drop it on the
> > > > > > > floor
> > > > > > > and
> > > > > > > break the connection.
> > > > > > > 
> > > > > > > IOW: the RPC protocol here is not that "reliable
> > > > > > > transport
> > > > > > > implies
> > > > > > > processing is guaranteed". It is rather that "connection
> > > > > > > is
> > > > > > > still
> > > > > > > up
> > > > > > > implies processing may eventually occur".
> > > > > > 
> > > > > > "eventually occur" means that its process of the rpc is
> > > > > > guaranteed
> > > > > > "in
> > > > > > time". Again unless the client is broken, we can't have
> > > > > > more
> > > > > > than
> > > > > > an
> > > > > > interrupted rpc (that has nothing waiting) and the next rpc
> > > > > > (both
> > > > > > of
> > > > > > which will be re-transmitted if connection is dropped)
> > > > > > going
> > > > > > to
> > > > > > the
> > > > > > server.
> > > > > > 
> > > > > > Can we distinguish between interrupted due to re-
> > > > > > transmission 
> > > > > > and
> > > > > > interrupted due to ctrl-c of the thread? If we can't, then
> > > > > > I'll
> > > > > > stop
> > > > > > arguing that client can do better.
> > > > > 
> > > > > There is no "interrupted due to re-transmission" case. We
> > > > > only
> > > > > retransmit NFSv4 requests if the TCP connection breaks.
> > > > > 
> > > > > As far as I'm concerned, this discussion is only about
> > > > > interruptions
> > > > > that cause the RPC call to be abandoned (i.e. fatal timeouts
> > > > > and
> > > > > signals).
> > > > > 
> > > > > > But right now we are left in a bad state. Client leaves
> > > > > > opened
> > > > > > state
> > > > > > on the server and will not allow for files to be deleted. I
> > > > > > think
> > > > > > in
> > > > > > case the "next rpc" is the write that will never be
> > > > > > completed
> > > > > > it
> > > > > > would
> > > > > > leave the machine in a hung state. I just don't see how can
> > > > > > you
> > > > > > justify that having the current code is any better than
> > > > > > having
> > > > > > the
> > > > > > solution that was there before.
> > > > > 
> > > > > That's a general problem with allowing interruptions that is
> > > > > largely
> > > > > orthogonal to the question of which strategy we choose when
> > > > > resynchronising the slot numbers after an interruption has
> > > > > occurred.
> > > > > 
> > > > 
> > > > I'm re-reading the spec and in section 2.10.6.2 we have "A
> > > > requester
> > > > MUST wait for a reply to a request before using the slot for
> > > > another
> > > > request". Are we even legally using the slot when we have an
> > > > interrupted slot?
> > > > 
> > > 
> > > You can certainly argue that. However the fact that the spec
> > > fails
> > > to
> > > address the issue doesn't imply lack of need. I have workloads on
> > > my
> > > own systems that would cause major disruption if I did not allow
> > > them
> > > to time out when the server is unavailable (e.g. with memory
> > > filling up
> > > with dirty pages that can't be cleaned).
> > > 
> > > IOW: I'm quite happy to make a best effort attempt to meet that
> > > requirement, by making 'hard' mounts the default, and by making
> > > signalling be a fatal operation. However I'm unwilling to make it
> > > impossible to fix up my system when the server is unresponsive
> > > just
> > > because the protocol is lazy about providing for that ability.
> > 
> > I'm just trying to think of options.
> > 
> > Here's what's bothering me. Yes my server list is limited but
> > neither
> > Linux not Netapp servers implements argument caching for replay
> > cache
> > due to performance hit. Yes perhaps they should but the problem is
> > they currently don't and customer probably deserve the best
> > solution
> > given existing constraints. Do we have such a solution? Can you
> > argue
> > that the number of problems solved by the current solution is
> > higher
> > than by the other solution. With the current solution, we have
> > (silent) data corruption and resource leakage on the server. I
> > think
> > that's a pretty big problem. What's worse silent data corruption or
> > a
> > hung client because it keeps sending same ops and keeps getting
> > SEQ_MISORDERED (that's a rather big problem too but work arounds
> > exist
> > to prevent data corruption/loss. like forcing the client to
> > re-negotiate the session).
> > 
> > Until servers catch up with addressing false retries, what's the
> > best
> > solution the client can have.
> 
> If I knew of a better client side solution, I would already have
> implemented it.
> 
> > Silent data corruption is in pnfs because the writes are allowed to
> > leave interrupted slot due to a timeout. Alternatively, I propose
> > to
> > then make pnfs writes same as the normal writes. It will remove the
> > data corruption problem. It would still have resource leakage but
> > that
> > seems like a bit better than data corruption.
> 
> How often does this problem actually occur in the field?
> 
> The point is that if the interruption is due to a signal, then that
> is
> a signal that is fatal to the application. That's not silent
> corruption; writes are expected not to complete in that kind of
> situation.
> 
> If the interruption is due to a soft mount timing out, then it is
> because the user deliberately chose that non-default setting, and
> should be aware of the consequences.
> Note that in newer kernels, these soft timeouts do not trigger unless
> the network connection is also down, so there is already a recovery
> process required to re-establish the network connection before any
> further requests can be sent.
> Furthermore, the application should also be seeing a POSIX error when
> fsync() is called, so again, there should be no silent corruption.
> 

Actually, come to think of it, the easiest way to deal with writes is
probably just to turn off the sa_cachethis in the SEQUENCE operation so
that any attempt to replay gets a NFS4ERR_RETRY_UNCACHED_REP.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com