question about handling off an unresponsive server during lease renewal

* question about handling off an unresponsive server during lease renewal
@ 2020-07-13 17:59 Olga Kornievskaia
  2020-07-13 18:15 ` Trond Myklebust
  0 siblings, 1 reply; 3+ messages in thread
From: Olga Kornievskaia @ 2020-07-13 17:59 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

Hi Trond,

To the best of your knowledge, does the client implement the part of
the spec that deals with the case where the server isn't responding and
the lease is timing out?

RFC5661 section 8.3 talks about:

      Transport retransmission delays might become so large as to
      approach or exceed the length of the lease period.  This may be
      particularly likely when the server is unresponsive due to a
      restart; see Section 8.4.2.1.  If the client implementation is not
      careful, transport retransmission delays can result in the client
      failing to detect a server restart before the grace period ends.
      The scenario is that the client is using a transport with
      exponential backoff, such that the maximum retransmission timeout
      exceeds both the grace period and the lease_time attribute.  A
      network partition causes the client's connection's retransmission
      interval to back off, and even after the partition heals, the next
      transport-level retransmission is sent after the server has
      restarted and its grace period ends.

      The client MUST either recover from the ensuing NFS4ERR_NO_GRACE
      errors or it MUST ensure that, despite transport-level
      retransmission intervals that exceed the lease_time, a SEQUENCE
      operation is sent that renews the lease before expiration.  The
      client can achieve this by associating a new connection with the
      session, and sending a SEQUENCE operation on it.  However, if the
      attempt to establish a new connection is delayed for some reason
      (e.g., exponential backoff of the connection establishment
      packets), the client will have to abort the connection
      establishment attempt before the lease expires, and attempt to
      reconnect.

The scenario: a SEQUENCE op is sent and the server has rebooted; it's
coming up (but not responding). At the TCP layer, TCP is exponentially
backing off before retrying. At some point the retransmission timeout
grows past 100s, which means that by the time the client resends, the
server is up and out of grace.
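
To make the timing concrete, here is a rough sketch of the backoff
arithmetic. The 1s initial retransmission timeout, the 120s cap, and the
90s lease are illustrative assumptions, not values taken from any
particular client or server configuration, and the function names are
hypothetical.

```python
def retransmit_schedule(initial_rto=1.0, max_rto=120.0, horizon=600.0):
    """Return (send_time, wait_before_next) pairs for a doubling backoff.

    All defaults are illustrative assumptions, not measured values.
    """
    schedule = []
    now, rto = 0.0, initial_rto
    while now <= horizon:
        schedule.append((now, rto))
        now += rto
        rto = min(rto * 2, max_rto)  # double the wait, up to the cap
    return schedule


def first_gap_exceeding(schedule, lease_time):
    """Return the first (send_time, wait) whose wait exceeds lease_time."""
    for send_time, wait in schedule:
        if wait > lease_time:
            return send_time, wait
    return None  # no single wait is longer than the lease


if __name__ == "__main__":
    LEASE_TIME = 90.0  # assumed lease period, in seconds
    print(first_gap_exceeding(retransmit_schedule(), LEASE_TIME))
```

Under these assumptions, once a single wait between retransmissions is
longer than the lease, nothing is sent on that connection for a whole
lease period, which is the window the RFC text is warning about.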

Does the client have any control over not letting TCP wait for longer
than the lease period, so that it instead aborts the connection and
starts a new one? I mean, I sort of find the 2nd paragraph in
contradiction with the fact that the client must never give up on
waiting for a reply from the server. But maybe this is a special case
where the client is supposed to know its lease hasn't been renewed and
it's OK to give up?
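
As a userspace illustration of the kind of control being asked about:
on Linux, a TCP socket can bound how long transmitted data may remain
unacknowledged via the TCP_USER_TIMEOUT socket option. Whether the
in-kernel NFS client does anything equivalent is the open question
here; this is only a sketch of the idea. The helper name, the 90s lease
and the 10s safety margin are hypothetical.

```python
import socket


def cap_unacked_time(sock, lease_time, margin=10.0):
    """Ask TCP to abort the connection if data stays unacknowledged too long.

    lease_time and margin are in seconds (both assumed values).
    Returns the timeout applied in milliseconds, or None if the option
    is unavailable on this platform or could not be set.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)  # Linux-specific
    if opt is None:
        return None
    # Give up on the connection some margin before the lease would expire.
    timeout_ms = int(max(lease_time - margin, 1.0) * 1000)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    except OSError:
        return None
    return timeout_ms


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(cap_unacked_time(s, lease_time=90.0))
    s.close()
```

With a bound like this, the stalled connection would be torn down
before the lease runs out, leaving time to open a new connection and
send a SEQUENCE on it, as the second RFC paragraph suggests.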
