linux-nfs.vger.kernel.org archive mirror
From: Trond Myklebust <trondmy@hammerspace.com>
To: "olga.kornievskaia@gmail.com" <olga.kornievskaia@gmail.com>,
	"dwysocha@redhat.com" <dwysocha@redhat.com>
Cc: "anna.schumaker@netapp.com" <anna.schumaker@netapp.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
Date: Fri, 22 Feb 2019 20:00:18 +0000	[thread overview]
Message-ID: <cac1ddc0f4f2e70e1bfc2dc142f9675b68073440.camel@hammerspace.com> (raw)
In-Reply-To: <1550863049.9958.3.camel@redhat.com>

On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski
> > <dwysocha@redhat.com> wrote:
> > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > Hi Dave,
> > > > 
> > > > A reproducer is a server that sends an ACK to the client's
> > > > FIN/ACK and does nothing afterwards (I can reproduce it 100% with
> > > > a hacked-up server; it was discovered with a "broken" server that
> > > > doesn't fully close a connection). It leaves this client unable
> > > > to connect to this server again indefinitely, in a
> > > > reboot-required kind of state. At one point, doing something like
> > > > that to the client was considered a form of attack
> > > > (denial-of-service), and thus the kernel has a tcp_fin_timeout
> > > > option after which the kernel will abort the connection. However,
> > > > this only applies to sockets that have been closed by the client.
> > > > That is NOT the case here: NFS does not close the connection, and
> > > > it ignores the kernel's notification of the FIN_WAIT2 state.
> > > > 
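For illustration, the half-close mechanics Olga describes can be sketched at the plain socket level (a minimal Python demo on loopback, not the kernel-level reproducer; the "broken" server is one that would simply skip the final close()):

```python
import socket
import threading

# TCP half-close semantics: the client sends FIN via shutdown(SHUT_WR)
# and enters FIN_WAIT1/FIN_WAIT2; the server sees EOF but its end stays
# open, and the client stays in FIN_WAIT2 until the server closes too.
def serve(listener, result):
    conn, _ = listener.accept()
    result['eof'] = conn.recv(1024)  # b'' means the peer's FIN arrived
    conn.sendall(b'still open')      # the server side can still transmit
    conn.close()                     # a "broken" server never does this

listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)
result = {}
t = threading.Thread(target=serve, args=(listener, result))
t.start()

client = socket.socket()
client.connect(listener.getsockname())
client.shutdown(socket.SHUT_WR)      # send FIN; our write side is done
data = client.recv(1024)             # half-closed: reading still works
t.join()
client.close()
listener.close()
```

Until the server's close() (its FIN), the client socket sits in FIN_WAIT2, which is exactly the state the kernel's tcp_fin_timeout only cleans up for sockets the application has closed.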
> > > 
> > > Interesting.  I had the same reproducer, but eventually an RST
> > > would get sent from the NFS client due to the TCP keepalives.  It
> > > sounds like that is not the case anymore.  When I did my testing it
> > > was over a year and a half ago, though.
> > 
> > After the keepalives, TCP also sends a FIN/ACK. If the server
> > properly sent a FIN/ACK back, then keepalive will indeed be
> > sufficient. Can you check whether in your trace the server in one
> > case sent just an ACK but in another case sent the FIN/ACK?
> > 
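The keepalive behaviour being discussed depends on the socket options the client sets. A sketch of the relevant knobs at the socket level (Linux-specific option names; the values here are illustrative, not what sunrpc actually uses):

```python
import socket

# Without SO_KEEPALIVE, a peer that silently ACKs our FIN and then goes
# quiet is never probed, so nothing ever forces an RST or a timeout.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # unanswered probes before drop

keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
idle_secs = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
s.close()
```

Note that keepalive probes are only sent on an established connection; as discussed below, they do not necessarily help a socket already parked in FIN_WAIT2.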
> 
> I had two reproducers, actually.  In both cases the NFS server would
> never send a FIN.
> 
> The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421)
> was about an NFS server crash after being in the half-close state.  An
> NFS4 client without that keepalive patch would hang.
> This is a valid test case and we check for that now in all releases.
> Here's the outline:
> # Outline
> # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> # Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
> # Step 3. On NFS server, block outgoing FIN packets from the NFS server
> # Step 4. On NFS server, use a systemtap script to delay the NFS4OPEN_CONFIRM response for a few seconds
> # Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
> # Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
> # Step 7. On NFS client, wait for the NFS server to reboot
> # Step 8. On NFS client, wait for NFS to reconnect the TCP connection
> # Step 9. On NFS client, wait for the NFS4 grace period
> # Step 10. On NFS client, try an 'ls' of the file created earlier
> # Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
> # Step 12. On NFS client, ensure all operations have completed
> # Step 13. Cleanup
> 
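A check like Step 8 above can be scripted by reading the client's /proc/net/tcp. A hypothetical helper (the original test presumably used shell tools; the parsing below is just a sketch of the format):

```python
# /proc/net/tcp lists "sl local_address rem_address st ..." with the
# address as hex IP:PORT and the state as a two-digit hex code.
# Port 2049 (nfs) is 0x0801; state 01 is TCP_ESTABLISHED.
def is_established_nfs(line):
    fields = line.split()
    if len(fields) < 4 or ':' not in fields[2]:
        return False
    remote_port = int(fields[2].split(':')[1], 16)
    return remote_port == 2049 and fields[3] == '01'

# Crafted sample lines for illustration:
established = "1: 0E01A8C0:03E8 0101A8C0:0801 01 00000000:00000000"
fin_wait2   = "1: 0E01A8C0:03E8 0101A8C0:0801 05 00000000:00000000"
print(is_established_nfs(established))  # True
print(is_established_nfs(fin_wait2))    # False
```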
> 
> The second test case (after the keepalive patch) was arguably an
> invalid test, as I used systemtap on the NFS server to never close the
> socket so it would never send a FIN.  This obviously should never
> happen, but at the time I wanted to see if the NFS client would
> recover.  I did that with this:
> 
> probe module("sunrpc").function("svc_xprt_enqueue") {
> 	printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
> 	       tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
> 	$xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> }
> Here's the full outline for this test:
> 
> # This test checks automatic NFS TCP recovery after a specific failure scenario.
> # The failure scenario is with an NFSv3 mount that goes idle, the NFS client closes the
> # TCP connection, and the final FIN from the NFS server gets lost.
> # This can be a tricky scenario, since the NFS client's TCP may get stuck in FIN-WAIT-2 indefinitely.
> # The NFS client should be able to recover the NFS TCP connection automatically, without an unmount/mount
> # cycle or reboot, and the TCP connection should not be stuck in FIN-WAIT-2 or any other non-connected
> # state.
> # 
> # Outline
> # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> # Step 2. On NFS client, mount NFS3 server export
> # Step 3. On NFS server, using a systemtap script, force the NFS server to never send a FIN (avoid close)
> # Step 4. On NFS client, verify the NFS mount with 'ls' and the TCP connection in TCP_ESTABLISHED
> # Step 5. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to transition to TIME_WAIT state
> # Step 6. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to disappear
> # Step 7. Cleanup
> #
> # PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc)
> # FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2 indefinitely
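For reference when scripting a FAIL check like the one above, these are the hex state codes /proc/net/tcp reports (matching the kernel's tcp_states.h); FIN-WAIT-2 shows up as 05:

```python
# Map of /proc/net/tcp state codes to names, so "stuck in FIN-WAIT-2"
# can be detected mechanically from the st column.
TCP_STATES = {
    0x01: 'ESTABLISHED', 0x02: 'SYN_SENT',   0x03: 'SYN_RECV',
    0x04: 'FIN_WAIT1',   0x05: 'FIN_WAIT2',  0x06: 'TIME_WAIT',
    0x07: 'CLOSE',       0x08: 'CLOSE_WAIT', 0x09: 'LAST_ACK',
    0x0A: 'LISTEN',      0x0B: 'CLOSING',
}

def state_name(hex_field):
    """Translate the two-digit hex 'st' field, e.g. '05' -> FIN_WAIT2."""
    return TCP_STATES.get(int(hex_field, 16), 'UNKNOWN')

print(state_name('05'))  # FIN_WAIT2
print(state_name('01'))  # ESTABLISHED
```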
> 
> 
> > > > One can argue that this is a broken server and we shouldn't
> > > > bother. But this patch is an attempt to argue that the client
> > > > should still care about and deal with this condition. However,
> > > > if the community feels that a broken server is a broken server
> > > > and this form of attack is not interesting, this patch can live
> > > > in the archive for later, or never.
> > > > 
> > > 
> > > This isn't IPoIB is it?
> > 
> > No, just normal TCP.
> > 
> > 
> > > Actually, fwiw, looking back I had speculated on changes in this
> > > area.  I'm adding you to the CC list of this bug, which had some
> > > of my musings on it:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > That bug I ended up closing when we could no longer prove there
> > > was any state where the NFS client could get stuck in FIN_WAIT2
> > > after the keepalive patch.
> > 
> > I can reproduce this behavior with the current upstream code.
> > 
> > 
> > > It can happen that the server only sends the ACK back to the
> > > client's FIN/ACK, so in general the test case is valid.  But then
> > > the question is how long should one wait for the final data and
> > > FIN from the server, and are there ever instances where you
> > > shouldn't wait forever?  Is there a way for us to know for sure
> > > there is no data left to receive, so it's safe to time out?  No
> > > RPCs outstanding?
> > 
> > Yep, all good questions which I don't have answers to. I realize,
> > for instance, that a server might send an ACK and then send a
> > FIN/ACK after that. But why is it bad for the client to proactively
> > send an RST (by shutting down the connection in TCP_FIN_WAIT2) if
> > the client was shutting down the connection anyway?
> > 
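At the BSD socket API level, the "proactive RST" Olga describes corresponds to an abortive close via SO_LINGER (illustrative only; the sunrpc code manipulates the kernel socket directly rather than using this userspace API):

```python
import socket
import struct

# SO_LINGER with l_onoff=1, l_linger=0 makes close() abort the
# connection with an RST instead of the normal FIN handshake,
# discarding any unsent data.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))

onoff, linger = struct.unpack(
    'ii', s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
s.close()  # on a connected socket this would now emit RST, not FIN
```

The trade-off Dave raises next is inherent to the abortive close: an RST discards anything still in flight from the peer.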
> I think you could lose data from the server if you RST and don't
> wait, but other than that I don't know.  We may be splitting hairs
> here, though, if there's no demonstrable harm to the application
> (NFS).  As Trond points out elsewhere, reconnect / port reuse may be
> an issue.
> 
> When I looked at this in the second bug I was almost convinced that
> there was a problem with the 'close' method in xs_tcp_ops - we needed
> to be calling xs_close(), but I ran into some problems and I wasn't
> sure about the test case.
> 
> 
> > > I don't claim to know many of the subtleties here, as far as
> > > whether the server would wait forever in LAST_ACK or whether
> > > implementations eventually time out after some time.  It seems
> > > like if these last packets get lost there is likely a bug
> > > somewhere (either firewall or TCP stack, etc).
> > > https://tools.ietf.org/html/rfc793#page-22
> > > 
> > > It looks like at least some people are putting timeouts into
> > > their stacks, though I'm not sure whether that's a good idea.
> > 
> > I saw only one other driver in the kernel that does have handling
> > of the TCP_FIN_WAIT2 state (drivers/crypto/chelsio) ...
> 
> One final thought - is it possible that we could have a network that
> does not support TCP keepalives?  In that instance, I'll bet we could
> get an indefinite hang in FIN_WAIT2 in my first test case (the server
> crashed after the half-close state).
> 
> It does not look like TCP keepalives have to be implemented, and
> maybe some routers / NATs / firewalls disallow them (due to extra
> traffic)?
> https://tools.ietf.org/html/rfc1122#page-101


Again, that's a corner case of a corner case. I've never seen any
reports of this occurring.

On the other hand, I _have_ seen lots of reports of problems due to the
reconnection issues when the server still holds the socket open. Those
reports were the main reason for the current design.

-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space



Thread overview: 16+ messages
2019-02-20 14:56 [PATCH 1/1] SUNRPC: fix handling of half-closed connection Olga Kornievskaia
2019-02-22 12:12 ` Dave Wysochanski
2019-02-22 13:45   ` Trond Myklebust
2019-02-22 14:46     ` Olga Kornievskaia
2019-02-22 15:05       ` Trond Myklebust
2019-02-22 15:11         ` Olga Kornievskaia
2019-02-22 15:50           ` Trond Myklebust
2019-02-22 17:02             ` Olga Kornievskaia
2019-02-22 17:09               ` Trond Myklebust
2019-02-22 14:44   ` Olga Kornievskaia
2019-02-22 16:32     ` Dave Wysochanski
2019-02-22 17:10       ` Olga Kornievskaia
2019-02-22 19:17         ` Dave Wysochanski
2019-02-22 20:00           ` Trond Myklebust [this message]
2019-02-23 17:31             ` Dave Wysochanski
2019-03-05 17:22             ` Olga Kornievskaia
