All of lore.kernel.org
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@hammerspace.com>
To: "aglo@umich.edu" <aglo@umich.edu>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: interrupted rpcs problem
Date: Tue, 14 Jan 2020 22:17:40 +0000	[thread overview]
Message-ID: <754a2f8ea604d137f93bc37080d7bf296882d40f.camel@hammerspace.com> (raw)
In-Reply-To: <1f7833103abdeb9976d5c0aa2f4c1f7c5f06edb4.camel@hammerspace.com>

On Tue, 2020-01-14 at 15:52 -0500, Trond Myklebust wrote:
> On Tue, 2020-01-14 at 13:43 -0500, Olga Kornievskaia wrote:
> > On Mon, Jan 13, 2020 at 4:51 PM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > > On Mon, 2020-01-13 at 16:05 -0500, Olga Kornievskaia wrote:
> > > > On Mon, Jan 13, 2020 at 1:24 PM Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > > On Mon, 2020-01-13 at 13:09 -0500, Olga Kornievskaia wrote:
> > > > > > On Mon, Jan 13, 2020 at 11:49 AM Trond Myklebust
> > > > > > <trondmy@hammerspace.com> wrote:
> > > > > > > On Mon, 2020-01-13 at 11:08 -0500, Olga Kornievskaia
> > > > > > > wrote:
> > > > > > > > On Fri, Jan 10, 2020 at 4:03 PM Trond Myklebust <
> > > > > > > > trondmy@hammerspace.com> wrote:
> > > > > > > > > On Fri, 2020-01-10 at 14:29 -0500, Olga Kornievskaia
> > > > > > > > > wrote:
> > > > > > > > > > Hi folks,
> > > > > > > > > > 
> > > > > > > > > > We are having an issue with an interrupted RPCs
> > > > > > > > > > again.
> > > > > > > > > > Here's
> > > > > > > > > > what I
> > > > > > > > > > see when xfstests were ctrl-c-ed.
> > > > > > > > > > 
> > > > > > > > > > frame 332 SETATTR call slot=0 seqid=0x000013ca (I'm
> > > > > > > > > > assuming
> > > > > > > > > > this
> > > > > > > > > > is
> > > > > > > > > > interrupted and released)
> > > > > > > > > > frame 333 CLOSE call slot=0 seqid=0x000013cb  (only
> > > > > > > > > > way
> > > > > > > > > > the
> > > > > > > > > > slot
> > > > > > > > > > could
> > > > > > > > > > be free before the reply if it was interrupted,
> > > > > > > > > > right?
> > > > > > > > > > Otherwise
> > > > > > > > > > we
> > > > > > > > > > should never have the slot used by more than one
> > > > > > > > > > outstanding
> > > > > > > > > > RPC)
> > > > > > > > > > frame 334 reply to 333 with SEQ_MIS_ORDERED (I'm
> > > > > > > > > > assuming
> > > > > > > > > > server
> > > > > > > > > > received frame 333 before 332)
> > > > > > > > > > frame 336 CLOSE call slot=0 seqid=0x000013ca (???
> > > > > > > > > > why
> > > > > > > > > > did
> > > > > > > > > > we
> > > > > > > > > > decremented it. I mean I know why it's in the
> > > > > > > > > > current
> > > > > > > > > > code :-
> > > > > > > > > > / )
> > > > > > > > > > frame 337 reply to 336 SEQUENCE with ERR_DELAY
> > > > > > > > > > frame 339 reply to 332 SETATTR which nobody is
> > > > > > > > > > waiting
> > > > > > > > > > for
> > > > > > > > > > frame 543 CLOSE call slot=0 seqid=0x000013ca (retry
> > > > > > > > > > after
> > > > > > > > > > waiting
> > > > > > > > > > for
> > > > > > > > > > err_delay)
> > > > > > > > > > frame 544 reply to 543 with SETATTR (out of the
> > > > > > > > > > cache).
> > > > > > > > > > 
> > > > > > > > > > What this leads to is: file is never closed on the
> > > > > > > > > > server.
> > > > > > > > > > Can't
> > > > > > > > > > remove it. Unmount fails with CLID_BUSY.
> > > > > > > > > > 
> > > > > > > > > > I believe that's the result of commit
> > > > > > > > > > 3453d5708b33efe76f40eca1c0ed60923094b971.
> > > > > > > > > > We used to have code that bumped the sequence up
> > > > > > > > > > when
> > > > > > > > > > the
> > > > > > > > > > slot
> > > > > > > > > > was
> > > > > > > > > > interrupted but after the commit "NFSv4.1: Avoid
> > > > > > > > > > false
> > > > > > > > > > retries
> > > > > > > > > > when
> > > > > > > > > > RPC calls are interrupted".
> > > > > > > > > > 
> > > > > > > > > > Commit has this "The obvious fix is to bump the
> > > > > > > > > > sequence
> > > > > > > > > > number
> > > > > > > > > > pre-emptively if an
> > > > > > > > > >     RPC call is interrupted, but in order to deal
> > > > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > corner
> > > > > > > > > > cases
> > > > > > > > > >     where the interrupted call is not actually
> > > > > > > > > > received
> > > > > > > > > > and
> > > > > > > > > > processed
> > > > > > > > > > by
> > > > > > > > > >     the server, we need to interpret the error
> > > > > > > > > > NFS4ERR_SEQ_MISORDERED
> > > > > > > > > >     as a sign that we need to either wait or locate
> > > > > > > > > > a
> > > > > > > > > > correct
> > > > > > > > > > sequence
> > > > > > > > > >     number that lies between the value we sent, and
> > > > > > > > > > the
> > > > > > > > > > last
> > > > > > > > > > value
> > > > > > > > > > that
> > > > > > > > > >     was acked by a SEQUENCE call on that slot."
> > > > > > > > > > 
> > > > > > > > > > If we can't no longer just bump the sequence up, I
> > > > > > > > > > don't
> > > > > > > > > > think
> > > > > > > > > > the
> > > > > > > > > > correct action is to automatically bump it down (as
> > > > > > > > > > per
> > > > > > > > > > example
> > > > > > > > > > here)?
> > > > > > > > > > The commit doesn't describe the corner case where
> > > > > > > > > > it
> > > > > > > > > > was
> > > > > > > > > > necessary to
> > > > > > > > > > bump the sequence up. I wonder if we can return the
> > > > > > > > > > knowledge
> > > > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > interrupted slot and make a decision based on that
> > > > > > > > > > as
> > > > > > > > > > well as
> > > > > > > > > > whatever
> > > > > > > > > > the other corner case is.
> > > > > > > > > > 
> > > > > > > > > > I guess what I'm getting is, can somebody (Trond)
> > > > > > > > > > provide
> > > > > > > > > > the
> > > > > > > > > > info
> > > > > > > > > > for
> > > > > > > > > > the corner case for this that patch was created. I
> > > > > > > > > > can
> > > > > > > > > > see if
> > > > > > > > > > I
> > > > > > > > > > can
> > > > > > > > > > fix the "common" case which is now broken and not
> > > > > > > > > > break
> > > > > > > > > > the
> > > > > > > > > > corner
> > > > > > > > > > case....
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > There is no pure client side solution for this
> > > > > > > > > problem.
> > > > > > > > > 
> > > > > > > > > The change was made because if you have multiple
> > > > > > > > > interruptions
> > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > RPC call, then the client has to somehow figure out
> > > > > > > > > what
> > > > > > > > > the
> > > > > > > > > correct
> > > > > > > > > slot number is. If it starts low, and then goes high,
> > > > > > > > > and
> > > > > > > > > the
> > > > > > > > > server is
> > > > > > > > > not caching the arguments for the RPC call that is in
> > > > > > > > > the
> > > > > > > > > session
> > > > > > > > > cache, then we will _always_ hit this bug because we
> > > > > > > > > will
> > > > > > > > > always
> > > > > > > > > hit
> > > > > > > > > the replay of the last entry.
> > > > > > > > > 
> > > > > > > > > At least if we start high, and iterate by low, then
> > > > > > > > > we
> > > > > > > > > reduce
> > > > > > > > > the
> > > > > > > > > problem to being a race with the processing of the
> > > > > > > > > interrupted
> > > > > > > > > request
> > > > > > > > > as it is in this case.
> > > > > > > > > 
> > > > > > > > > However, as I said, the real solution here has to
> > > > > > > > > involve
> > > > > > > > > the
> > > > > > > > > server.
> > > > > > > > 
> > > > > > > > Ok I see your point that if the server cached the
> > > > > > > > arguments,
> > > > > > > > then
> > > > > > > > the
> > > > > > > > server would tell that 2nd rpc using the same
> > > > > > > > slot+seqid
> > > > > > > > has
> > > > > > > > different
> > > > > > > > args and would not use the replay cache.
> > > > > > > > 
> > > > > > > > However, I wonder if the client can do better. Can't we
> > > > > > > > be
> > > > > > > > more
> > > > > > > > aware
> > > > > > > > of when we are interrupting the rpc? For instance, if
> > > > > > > > we
> > > > > > > > are
> > > > > > > > interrupted after we started to wait on the RPC,
> > > > > > > > doesn't
> > > > > > > > it
> > > > > > > > mean
> > > > > > > > the
> > > > > > > > rpc is sent on the network and since network is
> > > > > > > > reliable
> > > > > > > > then
> > > > > > > > server
> > > > > > > > must have consumed the seqid for that slot (in this
> > > > > > > > case
> > > > > > > > increment
> > > > > > > > seqid)? That's the case that's failing now.
> > > > > > > > 
> > > > > > > 
> > > > > > > "Reliable transport" does not mean that a client knows
> > > > > > > what
> > > > > > > got
> > > > > > > received and processed by the server and what didn't. All
> > > > > > > the
> > > > > > > client
> > > > > > > knows is that if the connection is still up, then the TCP
> > > > > > > layer
> > > > > > > will
> > > > > > > keep retrying transmission of the request. There are
> > > > > > > plenty
> > > > > > > of
> > > > > > > error
> > > > > > > scenarios where the client gets no information back as to
> > > > > > > whether
> > > > > > > or
> > > > > > > not the data was received by the server (e.g. due to lost
> > > > > > > ACKs).
> > > > > > > 
> > > > > > > Furthermore, if a RPC call is interrupted on the client,
> > > > > > > either
> > > > > > > due
> > > > > > > to
> > > > > > > a timeout or a signal,
> > > > > > 
> > > > > > What timeout are you referring to here since 4.1 rcp can't
> > > > > > timeout. I
> > > > > > think it only leaves a signal.
> > > > > 
> > > > > If you use 'soft' or 'softerr' mount options, then NFSv4.1
> > > > > will
> > > > > time
> > > > > out when the server is being unresponsive. That behaviour is
> > > > > different
> > > > > to the behaviour under a signal, but has the same effect of
> > > > > interrupting the RPC call without us being able to know if
> > > > > the
> > > > > server
> > > > > received the data.
> > > > > 
> > > > > > > then it almost always ends up breaking the
> > > > > > > connection in order to avoid corruption of the data
> > > > > > > stream
> > > > > > > (by
> > > > > > > interrupting the transmission before the entire RPC call
> > > > > > > has
> > > > > > > been
> > > > > > > sent). You generally have to be lucky to see the
> > > > > > > timeout/signal
> > > > > > > occur
> > > > > > > only when all the RPC calls being cancelled have exactly
> > > > > > > fit
> > > > > > > into
> > > > > > > the
> > > > > > > socket buffer.
> > > > > > 
> > > > > > Wouldn't a retransmission (due to a connection reset for
> > > > > > whatever
> > > > > > reason) be different and doesn't involve reprocessing of
> > > > > > the
> > > > > > slot.
> > > > > 
> > > > > I'm not talking about retransmissions here. I'm talking only
> > > > > about
> > > > > NFSv4.x RPC calls that suffer a fatal interruption (i.e. no
> > > > > retransmission).
> > > > > 
> > > > > > > Finally, just because the server's TCP layer ACKed
> > > > > > > receipt
> > > > > > > of
> > > > > > > the
> > > > > > > RPC
> > > > > > > call data, that does not mean that it will process that
> > > > > > > call.
> > > > > > > The
> > > > > > > connection could break before the call is read out of the
> > > > > > > receiving
> > > > > > > socket, or the server may later decide to drop it on the
> > > > > > > floor
> > > > > > > and
> > > > > > > break the connection.
> > > > > > > 
> > > > > > > IOW: the RPC protocol here is not that "reliable
> > > > > > > transport
> > > > > > > implies
> > > > > > > processing is guaranteed". It is rather that "connection
> > > > > > > is
> > > > > > > still
> > > > > > > up
> > > > > > > implies processing may eventually occur".
> > > > > > 
> > > > > > "eventually occur" means that its process of the rpc is
> > > > > > guaranteed
> > > > > > "in
> > > > > > time". Again unless the client is broken, we can't have
> > > > > > more
> > > > > > than
> > > > > > an
> > > > > > interrupted rpc (that has nothing waiting) and the next rpc
> > > > > > (both
> > > > > > of
> > > > > > which will be re-transmitted if connection is dropped)
> > > > > > going
> > > > > > to
> > > > > > the
> > > > > > server.
> > > > > > 
> > > > > > Can we distinguish between interrupted due to re-
> > > > > > transmission 
> > > > > > and
> > > > > > interrupted due to ctrl-c of the thread? If we can't, then
> > > > > > I'll
> > > > > > stop
> > > > > > arguing that client can do better.
> > > > > 
> > > > > There is no "interrupted due to re-transmission" case. We
> > > > > only
> > > > > retransmit NFSv4 requests if the TCP connection breaks.
> > > > > 
> > > > > As far as I'm concerned, this discussion is only about
> > > > > interruptions
> > > > > that cause the RPC call to be abandoned (i.e. fatal timeouts
> > > > > and
> > > > > signals).
> > > > > 
> > > > > > But right now we are left in a bad state. Client leaves
> > > > > > opened
> > > > > > state
> > > > > > on the server and will not allow for files to be deleted. I
> > > > > > think
> > > > > > in
> > > > > > case the "next rpc" is the write that will never be
> > > > > > completed
> > > > > > it
> > > > > > would
> > > > > > leave the machine in a hung state. I just don't see how can
> > > > > > you
> > > > > > justify that having the current code is any better than
> > > > > > having
> > > > > > the
> > > > > > solution that was there before.
> > > > > 
> > > > > That's a general problem with allowing interruptions that is
> > > > > largely
> > > > > orthogonal to the question of which strategy we choose when
> > > > > resynchronising the slot numbers after an interruption has
> > > > > occurred.
> > > > > 
> > > > 
> > > > I'm re-reading the spec and in section 2.10.6.2 we have "A
> > > > requester
> > > > MUST wait for a reply to a request before using the slot for
> > > > another
> > > > request". Are we even legally using the slot when we have an
> > > > interrupted slot?
> > > > 
> > > 
> > > You can certainly argue that. However the fact that the spec
> > > fails
> > > to
> > > address the issue doesn't imply lack of need. I have workloads on
> > > my
> > > own systems that would cause major disruption if I did not allow
> > > them
> > > to time out when the server is unavailable (e.g. with memory
> > > filling up
> > > with dirty pages that can't be cleaned).
> > > 
> > > IOW: I'm quite happy to make a best effort attempt to meet that
> > > requirement, by making 'hard' mounts the default, and by making
> > > signalling be a fatal operation. However I'm unwilling to make it
> > > impossible to fix up my system when the server is unresponsive
> > > just
> > > because the protocol is lazy about providing for that ability.
> > 
> > I'm just trying to think of options.
> > 
> > Here's what's bothering me. Yes my server list is limited but
> > neither
> > Linux not Netapp servers implements argument caching for replay
> > cache
> > due to performance hit. Yes perhaps they should but the problem is
> > they currently don't and customer probably deserve the best
> > solution
> > given existing constraints. Do we have such a solution? Can you
> > argue
> > that the number of problems solved by the current solution is
> > higher
> > than by the other solution. With the current solution, we have
> > (silent) data corruption and resource leakage on the server. I
> > think
> > that's a pretty big problem. What's worse silent data corruption or
> > a
> > hung client because it keeps sending same ops and keeps getting
> > SEQ_MISORDERED (that's a rather big problem too but work arounds
> > exist
> > to prevent data corruption/loss. like forcing the client to
> > re-negotiate the session).
> > 
> > Until servers catch up with addressing false retries, what's the
> > best
> > solution the client can have.
> 
> If I knew of a better client side solution, I would already have
> implemented it.
> 
> > Silent data corruption is in pnfs because the writes are allowed to
> > leave interrupted slot due to a timeout. Alternatively, I propose
> > to
> > then make pnfs writes same as the normal writes. It will remove the
> > data corruption problem. It would still have resource leakage but
> > that
> > seems like a bit better than data corruption.
> 
> How often does this problem actually occur in the field?
> 
> The point is that if the interruption is due to a signal, then that
> is
> a signal that is fatal to the application. That's not silent
> corruption; writes are expected not to complete in that kind of
> situation.
> 
> If the interruption is due to a soft mount timing out, then it is
> because the user deliberately chose that non-default setting, and
> should be aware of the consequences.
> Note that in newer kernels, these soft timeouts do not trigger unless
> the network connection is also down, so there is already a recovery
> process required to re-establish the network connection before any
> further requests can be sent.
> Furthermore, the application should also be seeing a POSIX error when
> fsync() is called, so again, there should be no silent corruption.
> 

Actually, come to think of it, the easiest way to deal with writes is
probably just to turn off the sa_cachethis in the SEQUENCE operation so
that any attempt to replay gets a NFS4ERR_RETRY_UNCACHED_REP.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



  reply	other threads:[~2020-01-14 22:17 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-10 19:29 interrupted rpcs problem Olga Kornievskaia
2020-01-10 21:03 ` Trond Myklebust
2020-01-13 16:08   ` Olga Kornievskaia
2020-01-13 16:49     ` Trond Myklebust
2020-01-13 18:09       ` Olga Kornievskaia
2020-01-13 18:24         ` Trond Myklebust
2020-01-13 21:05           ` Olga Kornievskaia
2020-01-13 21:51             ` Trond Myklebust
2020-01-14 18:43               ` Olga Kornievskaia
2020-01-14 20:52                 ` Trond Myklebust
2020-01-14 22:17                   ` Trond Myklebust [this message]
2020-01-14 23:37                   ` Olga Kornievskaia
2020-01-15  1:34                     ` Trond Myklebust
2020-02-12 20:09               ` Olga Kornievskaia

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=754a2f8ea604d137f93bc37080d7bf296882d40f.camel@hammerspace.com \
    --to=trondmy@hammerspace.com \
    --cc=aglo@umich.edu \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.