Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection

From: Dave Wysochanski <dwysocha@redhat.com>
To: Olga Kornievskaia <olga.kornievskaia@gmail.com>
Cc: trond.myklebust@hammerspace.com,
	Anna Schumaker <anna.schumaker@netapp.com>,
	linux-nfs <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
Date: Fri, 22 Feb 2019 14:17:29 -0500	[thread overview]
Message-ID: <1550863049.9958.3.camel@redhat.com> (raw)
In-Reply-To: <CAN-5tyEpWevmXJb8weecFBm_V=FZm4Zc=b+rrwJFtMgUYhm2tg@mail.gmail.com>

On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@redhat.co
> m> wrote:
> > 
> > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > Hi Dave,
> > > 
> > > A re-producer is a server that sends an ACK to the client's
> > > FIN/ACK
> > > request and does nothing afterwards (I can reproduce it 100% with
> > > a
> > > hacked up server. It was discovered with a "broken" server that
> > > doesn't fully closes a connection). It leave this client unable
> > > to
> > > connect to this server again indefinitely/forever/reboot required
> > > kind
> > > of state. Once it was considered that doing something like that
> > > to
> > > the
> > > client is a form of an attack (denial-of-server) and thus the
> > > kernel
> > > has a tcp_fin_timeout option after which the kernel will abort
> > > the
> > > connection. However this only applies to the sockets that have
> > > been
> > > closed by the client. This is NOT the case. NFS does not close
> > > the
> > > connection and it ignores kernel's notification of FIN_WAIT2
> > > state.
> > > 
> > 
> > Interesting.  I had the same reproducer but eventually an RST would
> > get
> > sent from the NFS client due to the TCP keepalives.  It sounds like
> > that is not the case anymore.  When I did my testing was over a
> > year
> > and a half ago though.
> 
> after the keepalives the TCP also sends a FIN/ACK if the server
> properly sent a FIN/ACK back, then keep alive will indeed be
> sufficient. Can you check if in your trace server in one time sent
> just an ACK but in another case sent the FIN/ACK?
> 

I had two reproducers actually.  In both cases the NFS server would
never send a FIN.

The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421) was
about a NFS server crash after being in the half-close state.  An NFS4
client without that keepalive patch would hang.
This is a valid test case and we check for that now in all releases. 
Here's the outline:
# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
# Step 3. On NFS server, block outgoing FIN packets from the NFS server"
# Step 4. On NFS server, using systemtap script, to delay NFS4OPEN_CONFIRM response for a few seconds
# Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
# Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
# Step 7. On NFS client, wait for NFS server to reboot
# Step 8. On NFS client, wait for the NFS to reconnect TCP connection
# Step 9. On NFS client, wait for NFS4 grace period
# Step 10. On NFS client, try a 'ls' of the file created earlier
# Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
# Step 12. On NFS client, ensure all operations have completed
# Step 13. Cleanup

The second test case (after the keepalive patch) was arguably invalid
test as I used systemtap on the NFS server to never close the socket so
it would never send a FIN.  This obviously should never happen but at
the time I wanted to see if the NFS client would recover.
I did that with this:
probe module("sunrpc").function("svc_xprt_enqueue") {
	printf("%s: %s removing XPT_CLOSE from xprt = %p\n", tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
	$xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
}
Here's the full outline for this test:

# This test checks automatic NFS TCP recovery after a specific failure scenario.
# The failure scenario is with an NFSv3 mount that goes idle and the NFS client closes the
# TCP connection, and the final FIN from the NFS server gets lost.
# This can be a tricky scenario since the NFS client's TCP may get stuck in FIN-WAIT-2 indefinitely.
# The NFS client should be able to recover the NFS TCP connection automatically without unmount/mount
# cycle or reboot, and the TCP connection should not be stuck in FIN-WAIT-2 or any other non-connected
# state.
# 
# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS3 server export
# Step 3. On NFS server, using systemtap script, force the NFS server to never send a FIN (avoid close)
# Step 4. On NFS client, verify the NFS mount with 'ls' and TCP connection in TCP_ESTABLISHED
# Step 5. On NFS client, wait up to $time_limit seconds to allow NFS TCP connection to transition to TIME_WAIT state
# Step 6. On NFS client, wait up to $time_limit seconds to allow NFS TCP connection to disappear
# Step 7. Cleanup
#
# PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc)
# FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2 indefinitely

> > > One can argue that this is a broken server and we shouldn't
> > > bother.
> > > But this patch is an attempt to argue that the client still
> > > should
> > > care and deal with this condition. However, if the community
> > > feels
> > > that a broken server is a broken server and this form of an
> > > attack is
> > > not interested, this patch can live will be an archive for later
> > > or
> > > never.
> > > 
> > 
> > This isn't IPoIB is it?
> 
> No, just normal TCP.
> 
> 
> > Actually, fwiw, looking back I had speculated on changes in this
> > area.
> > I'm adding you to the CC list of this bug which had some of my
> > musings
> > on it:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > That bug I ended up closing when we could no longer prove there was
> > any
> > state where the NFS client could get stuck in FIN_WAIT2 after the
> > keepalive patch.
> 
> I can reproduce current behavior with the current upstream code.
> 
> 
> > It can happen that the server only sends the ACK back to the
> > clients
> > FIN,ACK so in general the testcase is valid.  But then the question
> > is
> > how long should one wait for the final data and FIN from the
> > server, or
> > are there ever instances where you shouldn't wait forever?  Is
> > there a
> > way for us to know for sure there is no data left to receive so
> > it's
> > safe to timeout?  No RPCs outstanding?
> 
> Yep all good questions which I don't have answers to. I realize that
> for instance, that a server might send an ACK and then send a FIN/ACK
> after that. But why is it bad for the client to proactively send an
> RST (by shutting down the connection in TCP_FIN_WAIT2 if the client
> was shutting down the connection anyway.
> 
I think you could lose data from the server if you RST and don't wait
but other than that I don't know.  We may be splitting hairs here
though if there's no demonstrable harm to the application (NFS).  As
Trond points out elsewhere reconnect / port reuse may be an issue.

When I looked at this in the second bug I was almost convinced that
there was a problem with the 'close' method in xs_tcp_ops - we needed
to be calling xs_close(), but I ran into some problems and I wasn't
sure about the test case.

> > I don't claim to know many of the subtleties here as far as would
> > the
> > server wait forever in LAST_ACK or do implementations eventually
> > timeout after some time?  Seems like if these last packets get lost
> > there is likely a bug somewhere (either firewall or TCP stack,
> > etc).
> > https://tools.ietf.org/html/rfc793#page-22
> > 
> > It looks like at least some people are putting timeouts into their
> > stacks though I'm not sure that's a good idea or not.
> 
> I saw only one other driver in the kernel that does have handling of
> the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...

One final thought - is it possible that we could have a network that
does not support TCP keepalives?  In that instance, I'll bet we could
get an indefinite hang in FIN_WAIT2 on my first test case (server
crashed after the half-close state).

It does not look like TCP keepalives have to be implemented and maybe
some routers / NATs / firewalls disallow them (due to extra traffic)?
https://tools.ietf.org/html/rfc1122#page-101