interrupted rpcs problem

* interrupted rpcs problem
@ 2020-01-10 19:29 Olga Kornievskaia
  2020-01-10 21:03 ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Olga Kornievskaia @ 2020-01-10 19:29 UTC
  To: linux-nfs

Hi folks,

We are having an issue with interrupted RPCs again. Here's what I
see when xfstests is interrupted with Ctrl-C.

frame 332 SETATTR call slot=0 seqid=0x000013ca (I'm assuming this is
interrupted and released)
frame 333 CLOSE call slot=0 seqid=0x000013cb (the only way the slot could
be free before the reply is if the RPC was interrupted, right? Otherwise we
should never have the slot used by more than one outstanding RPC)
frame 334 reply to 333 with SEQ_MISORDERED (I'm assuming the server
received frame 333 before 332)
frame 336 CLOSE call slot=0 seqid=0x000013ca (??? why did we
decrement it? I mean, I know why it's in the current code :-/ )
frame 337 reply to 336 SEQUENCE with ERR_DELAY
frame 339 reply to 332 SETATTR, which nobody is waiting for
frame 543 CLOSE call slot=0 seqid=0x000013ca (retry after waiting for ERR_DELAY)
frame 544 reply to 543 with the SETATTR reply (out of the cache).

What this leads to: the file is never closed on the server, so we can't
remove it, and unmount fails with CLID_BUSY.
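
To make the sequence above easier to reason about, here is a minimal
sketch of the server side of slot 0 as I understand it from the trace. It
is a toy model, not code from any real server: the class ServerSlot, its
methods, and the replay_trace() helper are all hypothetical names, the
starting seqid 0x13c9 is inferred from the trace, and the arrival order
(333 before 332) is the assumption stated in the frame list. The model
also assumes the server answers a retry of a still-running request with
NFS4ERR_DELAY and does not compare the retried operation with the
original one.

```python
class ServerSlot:
    """Toy model of one server-side session slot with a one-entry reply cache."""

    def __init__(self, last_seqid):
        self.seqid = last_seqid    # seqid of the last request accepted on this slot
        self.pending = None        # operation still being processed, if any
        self.cached_reply = None   # reply cached for self.seqid
        self.executed = []         # operations the server actually ran

    def receive(self, seqid, op):
        if seqid == self.seqid:
            # Same seqid as the last accepted request: treated as a retry.
            # Assumption: the operation itself is not compared.
            if self.pending is not None:
                return "NFS4ERR_DELAY"
            return self.cached_reply
        if seqid == self.seqid + 1 and self.pending is None:
            # Next seqid in order: accept it as a new request.
            self.seqid = seqid
            self.pending = op
            return None            # reply is sent later, by complete()
        return "NFS4ERR_SEQ_MISORDERED"

    def complete(self):
        # Finish the pending operation and cache its reply for this seqid.
        op, self.pending = self.pending, None
        self.executed.append(op)
        self.cached_reply = op + " reply"
        return self.cached_reply


def replay_trace():
    """Feed the frames to the slot in the order the server is assumed to see them."""
    slot = ServerSlot(last_seqid=0x13C9)
    replies = {}
    replies[334] = slot.receive(0x13CB, "CLOSE")    # frame 333 arrives first
    slot.receive(0x13CA, "SETATTR")                 # frame 332 arrives, starts running
    replies[337] = slot.receive(0x13CA, "CLOSE")    # frame 336, seqid decremented
    replies[339] = slot.complete()                  # SETATTR finishes
    replies[544] = slot.receive(0x13CA, "CLOSE")    # frame 543, retry after the delay
    return replies, slot


if __name__ == "__main__":
    replies, slot = replay_trace()
    for frame in sorted(replies):
        print(frame, replies[frame])
    print("executed on server:", slot.executed)
```

Under those assumptions the CLOSE in frame 543 is answered from the
reply cache with the SETATTR result, and CLOSE never appears in the list
of operations the server executed.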

I believe that's the result of commit 3453d5708b33efe76f40eca1c0ed60923094b971
("NFSv4.1: Avoid false retries when RPC calls are interrupted").
We used to have code that bumped the sequence up when the slot was
interrupted, but we no longer do that after this commit.

The commit message says:

    "The obvious fix is to bump the sequence number pre-emptively if an
    RPC call is interrupted, but in order to deal with the corner cases
    where the interrupted call is not actually received and processed by
    the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED
    as a sign that we need to either wait or locate a correct sequence
    number that lies between the value we sent, and the last value that
    was acked by a SEQUENCE call on that slot."

If we can no longer just bump the sequence up, I don't think the
correct action is to automatically bump it down (as in the example here).
The commit doesn't describe the corner case where it was necessary to
bump the sequence up. I wonder if we can return the knowledge of the
interrupted slot and make a decision based on that as well as whatever
the other corner case is.

I guess what I'm getting at is: can somebody (Trond) describe the
corner case for which that patch was created? Then I can see if I can
fix the "common" case, which is now broken, without breaking the corner
case.

Thank you
