linux-nfs.vger.kernel.org archive mirror
From: Olga Kornievskaia <olga.kornievskaia@gmail.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "dwysocha@redhat.com" <dwysocha@redhat.com>,
	"anna.schumaker@netapp.com" <anna.schumaker@netapp.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
Date: Tue, 5 Mar 2019 12:22:55 -0500	[thread overview]
Message-ID: <CAN-5tyHd8QS+etzRjo_=ikg4smzL8T+xe6a3FMdge871VFKKsg@mail.gmail.com> (raw)
In-Reply-To: <cac1ddc0f4f2e70e1bfc2dc142f9675b68073440.camel@hammerspace.com>

On Fri, Feb 22, 2019 at 3:00 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> > On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@redhat.com> wrote:
> > > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > > Hi Dave,
> > > > >
> > > > > A reproducer is a server that sends an ACK to the client's
> > > > > FIN/ACK and then does nothing afterwards. (I can reproduce it
> > > > > 100% with a hacked-up server; it was discovered with a "broken"
> > > > > server that doesn't fully close a connection.) It leaves the
> > > > > client unable to connect to this server again indefinitely, in
> > > > > a reboot-required kind of state. Doing something like that to a
> > > > > client was once considered a form of attack (denial of
> > > > > service), and thus the kernel has a tcp_fin_timeout option
> > > > > after which it will abort the connection. However, this only
> > > > > applies to sockets that have been closed by the client. That is
> > > > > NOT the case here: NFS does not close the connection, and it
> > > > > ignores the kernel's notification of the FIN_WAIT2 state.
> > > > >
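[The half-close mechanics above are easy to demonstrate from user space. Below is a minimal sketch; the server address is a placeholder and this is an illustration, not the actual reproducer. The client sends its FIN with shutdown(); if the peer only ACKs it, the socket sits in FIN_WAIT2, and tcp_fin_timeout only starts to matter once close() orphans the socket.

/* Minimal sketch of a client-initiated half-close.  The address is a
 * placeholder (TEST-NET-1); this illustrates the scenario described
 * above, it is not the hacked-up server reproducer. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in srv;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	memset(&srv, 0, sizeof(srv));
	srv.sin_family = AF_INET;
	srv.sin_port = htons(2049);		/* NFS */
	inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);

	if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
		perror("connect");
		return 1;
	}

	/* Send our FIN but keep the socket open: the half-close.  If the
	 * server only ACKs and never sends its own FIN, we stay in
	 * FIN_WAIT2.  tcp_fin_timeout does NOT apply yet, because the
	 * socket is not orphaned. */
	shutdown(fd, SHUT_WR);

	sleep(300);	/* observe with: ss -tan state fin-wait-2 */

	/* Only after close() is the socket orphaned and subject to
	 * net.ipv4.tcp_fin_timeout. */
	close(fd);
	return 0;
}
]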
> > > >
> > > > Interesting.  I had the same reproducer, but eventually an RST
> > > > would get sent from the NFS client due to the TCP keepalives.  It
> > > > sounds like that is not the case anymore.  My testing was over a
> > > > year and a half ago, though.
> > >
> > > After the keepalives, TCP also sends a FIN/ACK; if the server
> > > properly sent a FIN/ACK back, then keepalive will indeed be
> > > sufficient. Can you check whether, in your trace, the server in one
> > > case sent just an ACK but in the other sent the FIN/ACK?
> > >
> >
> > I had two reproducers actually.  In both cases the NFS server would
> > never send a FIN.
> >
> > The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421)
> > was about an NFS server crash after being in the half-close state.
> > An NFSv4 client without that keepalive patch would hang.
> > This is a valid test case, and we check for it now in all releases.
> > Here's the outline:
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
> > # Step 3. On NFS server, block outgoing FIN packets from the NFS server (an example rule is shown after the outline)
> > # Step 4. On NFS server, use a systemtap script to delay the NFS4OPEN_CONFIRM response for a few seconds
> > # Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
> > # Step 6. On NFS server, after a short delay, crash the server with 'echo c > /proc/sysrq-trigger'
> > # Step 7. On NFS client, wait for the NFS server to reboot
> > # Step 8. On NFS client, wait for NFS to reconnect the TCP connection
> > # Step 9. On NFS client, wait for the NFSv4 grace period
> > # Step 10. On NFS client, try an 'ls' of the file created earlier
> > # Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
> > # Step 12. On NFS client, ensure all operations have completed
> > # Step 13. Cleanup
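[For Step 3, one way to drop the server's outgoing FIN packets, assuming iptables and the standard NFS port (the exact rule used was not given in the original outline), is:

    iptables -A OUTPUT -p tcp --sport 2049 --tcp-flags FIN FIN -j DROP
]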
> >
> >
> > The second test case (after the keepalive patch) was arguably an
> > invalid test, as I used systemtap on the NFS server to keep it from
> > ever closing the socket, so it would never send a FIN.  This
> > obviously should never happen, but at the time I wanted to see if
> > the NFS client would recover.
> > I did that with this:
> > # Clear XPT_CLOSE (bit 2 of xpt_flags) whenever the transport is
> > # enqueued, so the server never closes (and never FINs) the socket.
> > probe module("sunrpc").function("svc_xprt_enqueue") {
> >       printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
> >              tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
> >       $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> > }
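[Note that writing to a target variable such as $xprt->xpt_flags requires running the script in systemtap guru mode, i.e. stap -g.]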
> > Here's the full outline for this test:
> >
> > # This test checks automatic NFS TCP recovery after a specific failure scenario.
> > # The failure scenario is an NFSv3 mount that goes idle; the NFS client closes the
> > # TCP connection, and the final FIN from the NFS server gets lost.
> > # This can be a tricky scenario since the NFS client's TCP may get stuck in FIN_WAIT2 indefinitely.
> > # The NFS client should be able to recover the NFS TCP connection automatically, without an
> > # unmount/mount cycle or reboot, and the TCP connection should not be stuck in FIN_WAIT2
> > # or any other non-connected state.
> > #
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS3 server export
> > # Step 3. On NFS server, using a systemtap script, force the NFS server to never send a FIN (avoid close)
> > # Step 4. On NFS client, verify the NFS mount with 'ls' and the TCP connection in TCP_ESTABLISHED (see the ss example below)
> > # Step 5. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to transition to TIME_WAIT state
> > # Step 6. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to disappear
> > # Step 7. Cleanup
> > #
> > # PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc.)
> > # FAIL: The state of the NFS TCP connection remains in FIN_WAIT2 indefinitely
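[For the state checks in these tests, something along these lines (assuming the standard NFS port) shows the client-side socket state:

    ss -tan '( dport = :2049 )'                       # all NFS connections and their states
    ss -tan state fin-wait-2 '( dport = :2049 )'      # any sockets stuck in FIN_WAIT2
]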
> >
> >
> > > > > One can argue that this is a broken server and we shouldn't
> > > > > bother. But this patch is an attempt to argue that the client
> > > > > should still care and deal with this condition. However, if the
> > > > > community feels that a broken server is a broken server, and
> > > > > this form of attack is not interesting, this patch can live in
> > > > > the archive for later, or never.
> > > > >
> > > >
> > > > This isn't IPoIB is it?
> > >
> > > No, just normal TCP.
> > >
> > >
> > > > Actually, fwiw, looking back I had speculated on changes in this
> > > > area.  I'm adding you to the CC list of this bug, which had some
> > > > of my musings on it:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > > I ended up closing that bug when we could no longer prove there
> > > > was any state where the NFS client could get stuck in FIN_WAIT2
> > > > after the keepalive patch.
> > >
> > > I can reproduce this behavior with the current upstream code.
> > >
> > >
> > > > It can happen that the server only sends the ACK back to the
> > > > client's FIN/ACK, so in general the test case is valid.  But then
> > > > the question is how long one should wait for the final data and
> > > > FIN from the server, or are there instances where you shouldn't
> > > > wait forever?  Is there a way for us to know for sure there is no
> > > > data left to receive, so it's safe to time out?  No outstanding
> > > > RPCs?
> > >
> > > Yep, all good questions which I don't have answers to. I realize,
> > > for instance, that a server might send an ACK and then send a
> > > FIN/ACK after that. But why is it bad for the client to proactively
> > > send an RST (by shutting down the connection in TCP_FIN_WAIT2) if
> > > the client was shutting down the connection anyway?
> > >
> > I think you could lose data from the server if you RST and don't
> > wait, but other than that I don't know.  We may be splitting hairs
> > here, though, if there's no demonstrable harm to the application
> > (NFS).  As Trond points out elsewhere, reconnect / port reuse may be
> > an issue.
> >
> > When I looked at this in the second bug, I was almost convinced that
> > there was a problem with the 'close' method in xs_tcp_ops - we needed
> > to be calling xs_close() - but I ran into some problems and I wasn't
> > sure about the test case.
> >
> >
> > > > I don't claim to know many of the subtleties here, such as
> > > > whether the server would wait forever in LAST_ACK or whether
> > > > implementations eventually time out.  It seems like if these last
> > > > packets get lost, there is likely a bug somewhere (either a
> > > > firewall or a TCP stack, etc.).
> > > > https://tools.ietf.org/html/rfc793#page-22
> > > >
> > > > It looks like at least some people are putting timeouts into
> > > > their stacks, though I'm not sure whether that's a good idea.
> > >
> > > I saw only one other driver in the kernel that handles the
> > > TCP_FIN_WAIT2 state (drivers/crypto/chelsio) ...
> >
> > One final thought - is it possible that we could have a network that
> > does not support TCP keepalives?  In that instance, I'll bet we could
> > get an indefinite hang in FIN_WAIT2 in my first test case (server
> > crashed after the half-close state).
> >
> > It does not look like TCP keepalives have to be implemented, and
> > maybe some routers / NATs / firewalls disallow them (due to the extra
> > traffic)?
> > https://tools.ietf.org/html/rfc1122#page-101
>
>
> Again, that's a corner case of a corner case. I've never seen any
> reports of this occurring.
>
> On the other hand, I _have_ seen lots of reports of problems due to the
> reconnection issues when the server still holds the socket open. Those
> reports were the main reason for the current design.

Continuing on with this thread: here is what I can gather so far.

NFS has an idle timer of its own; the transport is set up so that when
the idle timer fires, the connection is autoclosed:

        INIT_WORK(&xprt->task_cleanup, xprt_autoclose);
        if (xprt_has_timer(xprt))
                timer_setup(&xprt->timer, xprt_init_autodisconnect, 0);
        else
                timer_setup(&xprt->timer, NULL, 0);

xprt_autoclose() is what triggers the TCP client to send the FIN
(tcp_send_fin()).

xprt_autoclose() calls kernel_sock_shutdown(), but I don't believe
that's sufficient to "close the socket". What is needed is a call to
sock_release() so the socket is not orphaned.

As a reply to its FIN/ACK, the client gets back an ACK, putting the
socket in FIN_WAIT2. TCP notifies NFS of the state change, but NFS
ignores it instead of closing the socket as it does for other states
like CLOSE_WAIT. The socket is never closed, so no tcp_fin_timeout is
ever set.
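[One hypothetical way to deal with this, sketched here for illustration only (this is not the submitted patch, and placing the case in xs_tcp_state_change() is an assumption), would be to treat FIN_WAIT2 the way CLOSE_WAIT is treated and force a full close:

/* Hypothetical sketch, not the submitted patch: an extra case in
 * xs_tcp_state_change() in net/sunrpc/xprtsock.c. */
case TCP_FIN_WAIT2:
        /* We sent a FIN and the peer only ACKed it.  Rather than
         * waiting indefinitely for the peer's FIN, force a full
         * close so the socket can be released and the transport
         * can reconnect. */
        xprt->connect_cookie++;
        clear_bit(XPRT_CONNECTED, &xprt->state);
        xs_tcp_force_close(xprt);
        break;
]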

>
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
>
>

