linux-nfs.vger.kernel.org archive mirror
* [PATCH 1/1] SUNRPC: fix handling of half-closed connection
@ 2019-02-20 14:56 Olga Kornievskaia
  2019-02-22 12:12 ` Dave Wysochanski
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2019-02-20 14:56 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs

From: Olga Kornievskaia <kolga@netapp.com>

When the server replies with an ACK to the client's FIN/ACK, the client
ends up stuck in the TCP_FIN_WAIT2 state and the client's mount hangs.
Instead, make sure to close and reset the client's socket and transport
when it transitions into that state.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
---
 net/sunrpc/xprtsock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 618e9c2..812e5e3 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1502,6 +1502,7 @@ static void xs_tcp_state_change(struct sock *sk)
 		clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
 		smp_mb__after_atomic();
 		break;
+	case TCP_FIN_WAIT2:
 	case TCP_CLOSE_WAIT:
 		/* The server initiated a shutdown of the socket */
 		xprt->connect_cookie++;
@@ -2152,6 +2153,7 @@ static void xs_tcp_shutdown(struct rpc_xprt *xprt)
 		kernel_sock_shutdown(sock, SHUT_RDWR);
 		trace_rpc_socket_shutdown(xprt, sock);
 		break;
+	case TCP_FIN_WAIT2:
 	case TCP_CLOSE:
 	case TCP_TIME_WAIT:
 		xs_reset_transport(transport);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-20 14:56 [PATCH 1/1] SUNRPC: fix handling of half-closed connection Olga Kornievskaia
@ 2019-02-22 12:12 ` Dave Wysochanski
  2019-02-22 13:45   ` Trond Myklebust
  2019-02-22 14:44   ` Olga Kornievskaia
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Wysochanski @ 2019-02-22 12:12 UTC (permalink / raw)
  To: Olga Kornievskaia, trond.myklebust, anna.schumaker; +Cc: linux-nfs

Hi Olga,

Do you have a reproducer for this?  A number of months ago I did a
significant amount of testing with half-closed connections, after we
had reports of connections stuck in FIN_WAIT2 in some older kernels. 
What I found was with kernels that had the tcp keepalives (commit
7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a
hang of a few minutes, after which time the tcp keepalive code would
reset the connection.

That said it was a while ago and something subtle may have changed. 
Also I'm not sure if your header implies an indefinite hang or just
a few minutes.

Thanks.


On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <kolga@netapp.com>
> 
> When server replies with an ACK to client's FIN/ACK, client ends
> up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> Instead, make sure to close and reset client's socket and transport
> when transitioned into that state.
> 
> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> ---
>  net/sunrpc/xprtsock.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 618e9c2..812e5e3 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -1502,6 +1502,7 @@ static void xs_tcp_state_change(struct sock
> *sk)
>  		clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
>  		smp_mb__after_atomic();
>  		break;
> +	case TCP_FIN_WAIT2:
>  	case TCP_CLOSE_WAIT:
>  		/* The server initiated a shutdown of the socket */
>  		xprt->connect_cookie++;
> @@ -2152,6 +2153,7 @@ static void xs_tcp_shutdown(struct rpc_xprt
> *xprt)
>  		kernel_sock_shutdown(sock, SHUT_RDWR);
>  		trace_rpc_socket_shutdown(xprt, sock);
>  		break;
> +	case TCP_FIN_WAIT2:
>  	case TCP_CLOSE:
>  	case TCP_TIME_WAIT:
>  		xs_reset_transport(transport);


* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 12:12 ` Dave Wysochanski
@ 2019-02-22 13:45   ` Trond Myklebust
  2019-02-22 14:46     ` Olga Kornievskaia
  2019-02-22 14:44   ` Olga Kornievskaia
  1 sibling, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2019-02-22 13:45 UTC (permalink / raw)
  To: anna.schumaker, olga.kornievskaia, dwysocha; +Cc: linux-nfs

On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> Hi Olga,
> 
> Do you have a reproducer for this?  A number of months ago I did a
> significant amount of testing with half-closed connections, after we
> had reports of connections stuck in FIN_WAIT2 in some older kernels. 
> What I found was with kernels that had the tcp keepalives (commit
> 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a
> hang of a few minutes, after which time the tcp keepalive code would
> reset the connection.
> 
> That said it was a while ago and something subtle may have changed. 
> Also I'm not not sure if your header implies an indefinite hang or
> just
> a few minutes.
> 
> Thanks.
> 
> 
> On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <kolga@netapp.com>
> > 
> > When server replies with an ACK to client's FIN/ACK, client ends
> > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > Instead, make sure to close and reset client's socket and transport
> > when transitioned into that state.

So, please do note that we do not want to ignore the FIN_WAIT2 state
because it implies that the server has not closed the socket on its
side. That again means that we cannot re-establish a connection using
the same source IP+port to the server, which is problematic for
protocols such as NFSv3 which rely on standard duplicate reply cache
for correct replay semantics.

This is why we don't just set the TCP_LINGER2 socket option and call
sock_release(). The choice to try to wait it out is deliberate because
the alternative is that we end up with busy-waiting re-connection
attempts.

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 12:12 ` Dave Wysochanski
  2019-02-22 13:45   ` Trond Myklebust
@ 2019-02-22 14:44   ` Olga Kornievskaia
  2019-02-22 16:32     ` Dave Wysochanski
  1 sibling, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2019-02-22 14:44 UTC (permalink / raw)
  To: Dave Wysochanski; +Cc: trond.myklebust, Anna Schumaker, linux-nfs

Hi Dave,

A reproducer is a server that sends an ACK to the client's FIN/ACK and
then does nothing afterwards (I can reproduce it 100% with a hacked-up
server; it was discovered with a "broken" server that doesn't fully
close a connection). It leaves the client unable to connect to this
server again, indefinitely/forever/reboot-required kind of state. At
one point doing something like that to the client was considered a
form of attack (denial-of-service), and thus the kernel has a
tcp_fin_timeout option after which the kernel will abort the
connection. However, that only applies to sockets that have been
closed by the client, which is NOT the case here: NFS does not close
the connection, and it ignores the kernel's notification of the
FIN_WAIT2 state.

One can argue that this is a broken server and we shouldn't bother.
But this patch is an attempt to argue that the client should still
care and deal with this condition. However, if the community feels
that a broken server is a broken server and this form of attack is
not interesting, this patch can live in the archive for later, or
never.

On Fri, Feb 22, 2019 at 7:12 AM Dave Wysochanski <dwysocha@redhat.com> wrote:
>
> Hi Olga,
>
> Do you have a reproducer for this?  A number of months ago I did a
> significant amount of testing with half-closed connections, after we
> had reports of connections stuck in FIN_WAIT2 in some older kernels.
> What I found was with kernels that had the tcp keepalives (commit
> 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a
> hang of a few minutes, after which time the tcp keepalive code would
> reset the connection.
>
> That said it was a while ago and something subtle may have changed.
> Also I'm not not sure if your header implies an indefinite hang or just
> a few minutes.
>
> Thanks.
>
>
> On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <kolga@netapp.com>
> >
> > When server replies with an ACK to client's FIN/ACK, client ends
> > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > Instead, make sure to close and reset client's socket and transport
> > when transitioned into that state.
> >
> > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > ---
> >  net/sunrpc/xprtsock.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index 618e9c2..812e5e3 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -1502,6 +1502,7 @@ static void xs_tcp_state_change(struct sock
> > *sk)
> >               clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
> >               smp_mb__after_atomic();
> >               break;
> > +     case TCP_FIN_WAIT2:
> >       case TCP_CLOSE_WAIT:
> >               /* The server initiated a shutdown of the socket */
> >               xprt->connect_cookie++;
> > @@ -2152,6 +2153,7 @@ static void xs_tcp_shutdown(struct rpc_xprt
> > *xprt)
> >               kernel_sock_shutdown(sock, SHUT_RDWR);
> >               trace_rpc_socket_shutdown(xprt, sock);
> >               break;
> > +     case TCP_FIN_WAIT2:
> >       case TCP_CLOSE:
> >       case TCP_TIME_WAIT:
> >               xs_reset_transport(transport);


* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 13:45   ` Trond Myklebust
@ 2019-02-22 14:46     ` Olga Kornievskaia
  2019-02-22 15:05       ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2019-02-22 14:46 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: anna.schumaker, dwysocha, linux-nfs

On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> > Hi Olga,
> >
> > Do you have a reproducer for this?  A number of months ago I did a
> > significant amount of testing with half-closed connections, after we
> > had reports of connections stuck in FIN_WAIT2 in some older kernels.
> > What I found was with kernels that had the tcp keepalives (commit
> > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce a
> > hang of a few minutes, after which time the tcp keepalive code would
> > reset the connection.
> >
> > That said it was a while ago and something subtle may have changed.
> > Also I'm not not sure if your header implies an indefinite hang or
> > just
> > a few minutes.
> >
> > Thanks.
> >
> >
> > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > > From: Olga Kornievskaia <kolga@netapp.com>
> > >
> > > When server replies with an ACK to client's FIN/ACK, client ends
> > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > > Instead, make sure to close and reset client's socket and transport
> > > when transitioned into that state.
>

Hi Trond,

> So, please do note that we do not want to ignore the FIN_WAIT2 state

But we do ignore the FIN_WAIT2 state.

> because it implies that the server has not closed the socket on its
> side.

That's correct.

> That again means that we cannot re-establish a connection using
> the same source IP+port to the server, which is problematic for
> protocols such as NFSv3 which rely on standard duplicate reply cache
> for correct replay semantics.

That's exactly what's happening: the client is unable to establish a
new connection to the server. With the patch, the client sends an RST,
re-uses the port, and all is well for NFSv3.

> This is why we don't just set the TCP_LINGER2 socket option and call
> sock_release(). The choice to try to wait it out is deliberate because
> the alternative is that we end up with busy-waiting re-connection
> attempts.

Why would it busy-wait? In my testing, an RST happens and a new
connection is established.


>
> Cheers
>   Trond
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>


* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 14:46     ` Olga Kornievskaia
@ 2019-02-22 15:05       ` Trond Myklebust
  2019-02-22 15:11         ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2019-02-22 15:05 UTC (permalink / raw)
  To: olga.kornievskaia; +Cc: anna.schumaker, linux-nfs, dwysocha

On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote:
> On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <
> trondmy@hammerspace.com> wrote:
> > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> > > Hi Olga,
> > > 
> > > Do you have a reproducer for this?  A number of months ago I did
> > > a
> > > significant amount of testing with half-closed connections, after
> > > we
> > > had reports of connections stuck in FIN_WAIT2 in some older
> > > kernels.
> > > What I found was with kernels that had the tcp keepalives (commit
> > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce
> > > a
> > > hang of a few minutes, after which time the tcp keepalive code
> > > would
> > > reset the connection.
> > > 
> > > That said it was a while ago and something subtle may have
> > > changed.
> > > Also I'm not not sure if your header implies an indefinite hang
> > > or
> > > just
> > > a few minutes.
> > > 
> > > Thanks.
> > > 
> > > 
> > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > > > From: Olga Kornievskaia <kolga@netapp.com>
> > > > 
> > > > When server replies with an ACK to client's FIN/ACK, client
> > > > ends
> > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > > > Instead, make sure to close and reset client's socket and
> > > > transport
> > > > when transitioned into that state.
> 
> Hi Trond,
> 
> > So, please do note that we do not want to ignore the FIN_WAIT2
> > state
> 
> But we do ignore the FIN_WAIT2 state.

We do not. We wait for the server to send a FIN, which is precisely the
reason for which FIN_WAIT2 exists.

> 
> > because it implies that the server has not closed the socket on its
> > side.
> 
> That's correct.
> 
> > That again means that we cannot re-establish a connection using
> > the same source IP+port to the server, which is problematic for
> > protocols such as NFSv3 which rely on standard duplicate reply
> > cache
> > for correct replay semantics.
> 
> that's exactly what's happening that a client is unable to establish
> a
> new connection to the server. With the patch, the client does an RST
> and it re-uses the port and all is well for NFSv3.

RST is not guaranteed to be delivered to the recipient. That's why the
TCP protocol defines FIN: it is guaranteed to be delivered because it
is ACKed.

> > This is why we don't just set the TCP_LINGER2 socket option and
> > call
> > sock_release(). The choice to try to wait it out is deliberate
> > because
> > the alternative is that we end up with busy-waiting re-connection
> > attempts.
> 
> Why would it busy-wait? In my testing, RST happens and new connection
> is established?

Only if the server has dropped the connection without notifying the
client.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 15:05       ` Trond Myklebust
@ 2019-02-22 15:11         ` Olga Kornievskaia
  2019-02-22 15:50           ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2019-02-22 15:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: anna.schumaker, linux-nfs, dwysocha

On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust
<trondmy@hammerspace.com> wrote:
>
> On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote:
> > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> > > > Hi Olga,
> > > >
> > > > Do you have a reproducer for this?  A number of months ago I did
> > > > a
> > > > significant amount of testing with half-closed connections, after
> > > > we
> > > > had reports of connections stuck in FIN_WAIT2 in some older
> > > > kernels.
> > > > What I found was with kernels that had the tcp keepalives (commit
> > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only reproduce
> > > > a
> > > > hang of a few minutes, after which time the tcp keepalive code
> > > > would
> > > > reset the connection.
> > > >
> > > > That said it was a while ago and something subtle may have
> > > > changed.
> > > > Also I'm not not sure if your header implies an indefinite hang
> > > > or
> > > > just
> > > > a few minutes.
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > > > > From: Olga Kornievskaia <kolga@netapp.com>
> > > > >
> > > > > When server replies with an ACK to client's FIN/ACK, client
> > > > > ends
> > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > > > > Instead, make sure to close and reset client's socket and
> > > > > transport
> > > > > when transitioned into that state.
> >
> > Hi Trond,
> >
> > > So, please do note that we do not want to ignore the FIN_WAIT2
> > > state
> >
> > But we do ignore the FIN_WAIT2 state.
>
> We do not. We wait for the server to send a FIN, which is precisely the
> reason for which FIN_WAIT2 exists.
>
> >
> > > because it implies that the server has not closed the socket on its
> > > side.
> >
> > That's correct.
> >
> > > That again means that we cannot re-establish a connection using
> > > the same source IP+port to the server, which is problematic for
> > > protocols such as NFSv3 which rely on standard duplicate reply
> > > cache
> > > for correct replay semantics.
> >
> > that's exactly what's happening that a client is unable to establish
> > a
> > new connection to the server. With the patch, the client does an RST
> > and it re-uses the port and all is well for NFSv3.
>
> RST is not guaranteed to be delivered to the recipient. That's why the
> TCP protocol defines FIN: it is a guaranteed to be delivered because it
> is ACKed.
>
> > > This is why we don't just set the TCP_LINGER2 socket option and
> > > call
> > > sock_release(). The choice to try to wait it out is deliberate
> > > because
> > > the alternative is that we end up with busy-waiting re-connection
> > > attempts.
> >
> > Why would it busy-wait? In my testing, RST happens and new connection
> > is established?
>
> Only if the server has dropped the connection without notifying the
> client.

Yes, the server dropped the connection without notifying the client
(or perhaps something in the middle did it as an attack). Again, I
raise this concern for the sake of dealing with this as an attack; I
have no intention of catering to broken servers. If this is not a
possible attack, then we don't have to deal with it.


* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 15:11         ` Olga Kornievskaia
@ 2019-02-22 15:50           ` Trond Myklebust
  2019-02-22 17:02             ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2019-02-22 15:50 UTC (permalink / raw)
  To: olga.kornievskaia; +Cc: anna.schumaker, linux-nfs, dwysocha

On Fri, 2019-02-22 at 10:11 -0500, Olga Kornievskaia wrote:
> On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote:
> > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <
> > > trondmy@hammerspace.com> wrote:
> > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> > > > > Hi Olga,
> > > > > 
> > > > > Do you have a reproducer for this?  A number of months ago I
> > > > > did
> > > > > a
> > > > > significant amount of testing with half-closed connections,
> > > > > after
> > > > > we
> > > > > had reports of connections stuck in FIN_WAIT2 in some older
> > > > > kernels.
> > > > > What I found was with kernels that had the tcp keepalives
> > > > > (commit
> > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only
> > > > > reproduce
> > > > > a
> > > > > hang of a few minutes, after which time the tcp keepalive
> > > > > code
> > > > > would
> > > > > reset the connection.
> > > > > 
> > > > > That said it was a while ago and something subtle may have
> > > > > changed.
> > > > > Also I'm not not sure if your header implies an indefinite
> > > > > hang
> > > > > or
> > > > > just
> > > > > a few minutes.
> > > > > 
> > > > > Thanks.
> > > > > 
> > > > > 
> > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > > > > > From: Olga Kornievskaia <kolga@netapp.com>
> > > > > > 
> > > > > > When server replies with an ACK to client's FIN/ACK, client
> > > > > > ends
> > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > > > > > Instead, make sure to close and reset client's socket and
> > > > > > transport
> > > > > > when transitioned into that state.
> > > 
> > > Hi Trond,
> > > 
> > > > So, please do note that we do not want to ignore the FIN_WAIT2
> > > > state
> > > 
> > > But we do ignore the FIN_WAIT2 state.
> > 
> > We do not. We wait for the server to send a FIN, which is precisely
> > the
> > reason for which FIN_WAIT2 exists.
> > 
> > > > because it implies that the server has not closed the socket on
> > > > its
> > > > side.
> > > 
> > > That's correct.
> > > 
> > > > That again means that we cannot re-establish a connection using
> > > > the same source IP+port to the server, which is problematic for
> > > > protocols such as NFSv3 which rely on standard duplicate reply
> > > > cache
> > > > for correct replay semantics.
> > > 
> > > that's exactly what's happening that a client is unable to
> > > establish
> > > a
> > > new connection to the server. With the patch, the client does an
> > > RST
> > > and it re-uses the port and all is well for NFSv3.
> > 
> > RST is not guaranteed to be delivered to the recipient. That's why
> > the
> > TCP protocol defines FIN: it is a guaranteed to be delivered
> > because it
> > is ACKed.
> > 
> > > > This is why we don't just set the TCP_LINGER2 socket option and
> > > > call
> > > > sock_release(). The choice to try to wait it out is deliberate
> > > > because
> > > > the alternative is that we end up with busy-waiting re-
> > > > connection
> > > > attempts.
> > > 
> > > Why would it busy-wait? In my testing, RST happens and new
> > > connection
> > > is established?
> > 
> > Only if the server has dropped the connection without notifying the
> > client.
> 
> Yes the server dropped the connection without notifying the client
> (or
> perhaps something in the middle did it as an attack). Again, I raise
> this concern for the sake of dealing with this as an attack. I have
> no
> intentions of catering to broken servers. If this is not a possible
> attack, then we don't have to deal with it.

A man in the middle might be able to intercept the FIN from the server
and ACK it, causing the connection to be closed on that server.
However, as Dave pointed out, why wouldn't the keepalive mechanism then
eventually kick in and close the socket on the client as well?

If the FIN is not ACKed, then the server is supposed to keep
retransmitting it. Until that ACK is received, it cannot close the
socket without violating the TCP protocol.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 14:44   ` Olga Kornievskaia
@ 2019-02-22 16:32     ` Dave Wysochanski
  2019-02-22 17:10       ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Wysochanski @ 2019-02-22 16:32 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: trond.myklebust, Anna Schumaker, linux-nfs

On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> Hi Dave,
> 
> A re-producer is a server that sends an ACK to the client's FIN/ACK
> request and does nothing afterwards (I can reproduce it 100% with a
> hacked up server. It was discovered with a "broken" server that
> doesn't fully closes a connection). It leave this client unable to
> connect to this server again indefinitely/forever/reboot required
> kind
> of state. Once it was considered that doing something like that to
> the
> client is a form of an attack (denial-of-server) and thus the kernel
> has a tcp_fin_timeout option after which the kernel will abort the
> connection. However this only applies to the sockets that have been
> closed by the client. This is NOT the case. NFS does not close the
> connection and it ignores kernel's notification of FIN_WAIT2 state.
> 
Interesting.  I had the same reproducer, but eventually an RST would
get sent from the NFS client due to the TCP keepalives.  It sounds
like that is not the case anymore.  That said, my testing was over a
year and a half ago.


> One can argue that this is a broken server and we shouldn't bother.
> But this patch is an attempt to argue that the client still should
> care and deal with this condition. However, if the community feels
> that a broken server is a broken server and this form of an attack is
> not interested, this patch can live will be an archive for later or
> never.
> 

This isn't IPoIB is it?

Actually, fwiw, looking back I had speculated on changes in this area. 
I'm adding you to the CC list of this bug which had some of my musings
on it:
https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
That bug I ended up closing when we could no longer prove there was any
state where the NFS client could get stuck in FIN_WAIT2 after the
keepalive patch.

It can happen that the server only sends the ACK back to the clients
FIN,ACK so in general the testcase is valid.  But then the question is
how long should one wait for the final data and FIN from the server, or
are there ever instances where you shouldn't wait forever?  Is there a
way for us to know for sure there is no data left to receive so it's
safe to timeout?  No RPCs outstanding?

I don't claim to know many of the subtleties here, such as whether the
server would wait forever in LAST_ACK or whether implementations
eventually time out after some period.  Seems like if these last
packets get lost there is likely a bug somewhere (either firewall or
TCP stack, etc).
https://tools.ietf.org/html/rfc793#page-22

It looks like at least some people are putting timeouts into their
stacks, though I'm not sure whether that's a good idea.



* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 15:50           ` Trond Myklebust
@ 2019-02-22 17:02             ` Olga Kornievskaia
  2019-02-22 17:09               ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2019-02-22 17:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: anna.schumaker, linux-nfs, dwysocha

On Fri, Feb 22, 2019 at 10:50 AM Trond Myklebust
<trondmy@hammerspace.com> wrote:
>
> On Fri, 2019-02-22 at 10:11 -0500, Olga Kornievskaia wrote:
> > On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote:
> > > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> > > > > > Hi Olga,
> > > > > >
> > > > > > Do you have a reproducer for this?  A number of months ago I
> > > > > > did
> > > > > > a
> > > > > > significant amount of testing with half-closed connections,
> > > > > > after
> > > > > > we
> > > > > > had reports of connections stuck in FIN_WAIT2 in some older
> > > > > > kernels.
> > > > > > What I found was with kernels that had the tcp keepalives
> > > > > > (commit
> > > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only
> > > > > > reproduce
> > > > > > a
> > > > > > hang of a few minutes, after which time the tcp keepalive
> > > > > > code
> > > > > > would
> > > > > > reset the connection.
> > > > > >
> > > > > > That said it was a while ago and something subtle may have
> > > > > > changed.
> > > > > > Also I'm not not sure if your header implies an indefinite
> > > > > > hang
> > > > > > or
> > > > > > just
> > > > > > a few minutes.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia wrote:
> > > > > > > From: Olga Kornievskaia <kolga@netapp.com>
> > > > > > >
> > > > > > > When server replies with an ACK to client's FIN/ACK, client
> > > > > > > ends
> > > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount hangs.
> > > > > > > Instead, make sure to close and reset client's socket and
> > > > > > > transport
> > > > > > > when transitioned into that state.
> > > >
> > > > Hi Trond,
> > > >
> > > > > So, please do note that we do not want to ignore the FIN_WAIT2
> > > > > state
> > > >
> > > > But we do ignore the FIN_WAIT2 state.
> > >
> > > We do not. We wait for the server to send a FIN, which is precisely
> > > the
> > > reason for which FIN_WAIT2 exists.
> > >
> > > > > because it implies that the server has not closed the socket on
> > > > > its
> > > > > side.
> > > >
> > > > That's correct.
> > > >
> > > > > That again means that we cannot re-establish a connection using
> > > > > the same source IP+port to the server, which is problematic for
> > > > > protocols such as NFSv3 which rely on standard duplicate reply
> > > > > cache
> > > > > for correct replay semantics.
> > > >
> > > > that's exactly what's happening that a client is unable to
> > > > establish
> > > > a
> > > > new connection to the server. With the patch, the client does an
> > > > RST
> > > > and it re-uses the port and all is well for NFSv3.
> > >
> > > RST is not guaranteed to be delivered to the recipient. That's why
> > > the
> > > TCP protocol defines FIN: it is a guaranteed to be delivered
> > > because it
> > > is ACKed.
> > >
> > > > > This is why we don't just set the TCP_LINGER2 socket option and
> > > > > call
> > > > > sock_release(). The choice to try to wait it out is deliberate
> > > > > because
> > > > > the alternative is that we end up with busy-waiting re-
> > > > > connection
> > > > > attempts.
> > > >
> > > > Why would it busy-wait? In my testing, RST happens and new
> > > > connection
> > > > is established?
> > >
> > > Only if the server has dropped the connection without notifying the
> > > client.
> >
> > Yes the server dropped the connection without notifying the client
> > (or
> > perhaps something in the middle did it as an attack). Again, I raise
> > this concern for the sake of dealing with this as an attack. I have
> > no
> > intentions of catering to broken servers. If this is not a possible
> > attack, then we don't have to deal with it.
>
> A man in the middle might be able to intercept the FIN from the server
> and ACK it, causing the connection to be closed on that server.
> However, as Dave pointed out, why wouldn't the keepalive mechanism then
> eventually kick in and close the socket on the client as well?

The mechanism has already kicked in, and the connection got stuck in
FIN_WAIT2. The NFS connection was idle, so the TCP layer was sending
keep-alives. Then it sent a FIN/ACK, to which the server replied with
just an ACK. The kernel notified NFS that we are in FIN_WAIT2, and I
believe it is NFS's responsibility to act accordingly. The kernel then
keeps sending "keep-alives" forever, because of this code:

        case TCP_FIN_WAIT1:
        case TCP_FIN_WAIT2:
                /* RFC 793 says to queue data in these states,
                 * RFC 1122 says we MUST send a reset.
                 * BSD 4.4 also does reset.
                 */
                if (sk->sk_shutdown & RCV_SHUTDOWN) {
                        if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
                            after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
                                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
                                tcp_reset(sk);   << this is never triggered
                                return 1;
                        }
                }

In our case TCP_SKB_CB(skb)->end_seq always equals
TCP_SKB_CB(skb)->seq. (No, I don't know the meaning of end_seq and seq
:-/).

> If the FIN is not ACKed, then the server is supposed to keep
> retransmitting it. Until that ACK is received, it cannot close the
> socket without violating the TCP protocol.

Something in the middle can keep intercepting the FIN/ACK from the
server and keep sending an ACK back?

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 17:02             ` Olga Kornievskaia
@ 2019-02-22 17:09               ` Trond Myklebust
  0 siblings, 0 replies; 16+ messages in thread
From: Trond Myklebust @ 2019-02-22 17:09 UTC (permalink / raw)
  To: olga.kornievskaia; +Cc: anna.schumaker, linux-nfs, dwysocha

On Fri, 2019-02-22 at 12:02 -0500, Olga Kornievskaia wrote:
> On Fri, Feb 22, 2019 at 10:50 AM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > On Fri, 2019-02-22 at 10:11 -0500, Olga Kornievskaia wrote:
> > > On Fri, Feb 22, 2019 at 10:06 AM Trond Myklebust
> > > <trondmy@hammerspace.com> wrote:
> > > > On Fri, 2019-02-22 at 09:46 -0500, Olga Kornievskaia wrote:
> > > > > On Fri, Feb 22, 2019 at 8:45 AM Trond Myklebust <
> > > > > trondmy@hammerspace.com> wrote:
> > > > > > On Fri, 2019-02-22 at 07:12 -0500, Dave Wysochanski wrote:
> > > > > > > Hi Olga,
> > > > > > > 
> > > > > > > Do you have a reproducer for this?  A number of months
> > > > > > > ago I
> > > > > > > did
> > > > > > > a
> > > > > > > significant amount of testing with half-closed
> > > > > > > connections,
> > > > > > > after
> > > > > > > we
> > > > > > > had reports of connections stuck in FIN_WAIT2 in some
> > > > > > > older
> > > > > > > kernels.
> > > > > > > What I found was with kernels that had the tcp keepalives
> > > > > > > (commit
> > > > > > > 7f260e8575bf53b93b77978c1e39f8e67612759c), I could only
> > > > > > > reproduce
> > > > > > > a
> > > > > > > hang of a few minutes, after which time the tcp keepalive
> > > > > > > code
> > > > > > > would
> > > > > > > reset the connection.
> > > > > > > 
> > > > > > > That said it was a while ago and something subtle may
> > > > > > > have
> > > > > > > changed.
> > > > > > > Also I'm not not sure if your header implies an
> > > > > > > indefinite
> > > > > > > hang
> > > > > > > or
> > > > > > > just
> > > > > > > a few minutes.
> > > > > > > 
> > > > > > > Thanks.
> > > > > > > 
> > > > > > > 
> > > > > > > On Wed, 2019-02-20 at 09:56 -0500, Olga Kornievskaia
> > > > > > > wrote:
> > > > > > > > From: Olga Kornievskaia <kolga@netapp.com>
> > > > > > > > 
> > > > > > > > When server replies with an ACK to client's FIN/ACK,
> > > > > > > > client
> > > > > > > > ends
> > > > > > > > up stuck in a TCP_FIN_WAIT2 state and client's mount
> > > > > > > > hangs.
> > > > > > > > Instead, make sure to close and reset client's socket
> > > > > > > > and
> > > > > > > > transport
> > > > > > > > when transitioned into that state.
> > > > > 
> > > > > Hi Trond,
> > > > > 
> > > > > > So, please do note that we do not want to ignore the
> > > > > > FIN_WAIT2
> > > > > > state
> > > > > 
> > > > > But we do ignore the FIN_WAIT2 state.
> > > > 
> > > > We do not. We wait for the server to send a FIN, which is
> > > > precisely
> > > > the
> > > > reason for which FIN_WAIT2 exists.
> > > > 
> > > > > > because it implies that the server has not closed the
> > > > > > socket on
> > > > > > its
> > > > > > side.
> > > > > 
> > > > > That's correct.
> > > > > 
> > > > > > That again means that we cannot re-establish a connection
> > > > > > using
> > > > > > the same source IP+port to the server, which is problematic
> > > > > > for
> > > > > > protocols such as NFSv3 which rely on standard duplicate
> > > > > > reply
> > > > > > cache
> > > > > > for correct replay semantics.
> > > > > 
> > > > > that's exactly what's happening that a client is unable to
> > > > > establish
> > > > > a
> > > > > new connection to the server. With the patch, the client does
> > > > > an
> > > > > RST
> > > > > and it re-uses the port and all is well for NFSv3.
> > > > 
> > > > RST is not guaranteed to be delivered to the recipient. That's
> > > > why
> > > > the
> > > > TCP protocol defines FIN: it is a guaranteed to be delivered
> > > > because it
> > > > is ACKed.
> > > > 
> > > > > > This is why we don't just set the TCP_LINGER2 socket option
> > > > > > and
> > > > > > call
> > > > > > sock_release(). The choice to try to wait it out is
> > > > > > deliberate
> > > > > > because
> > > > > > the alternative is that we end up with busy-waiting re-
> > > > > > connection
> > > > > > attempts.
> > > > > 
> > > > > Why would it busy-wait? In my testing, RST happens and new
> > > > > connection
> > > > > is established?
> > > > 
> > > > Only if the server has dropped the connection without notifying
> > > > the
> > > > client.
> > > 
> > > Yes the server dropped the connection without notifying the
> > > client
> > > (or
> > > perhaps something in the middle did it as an attack). Again, I
> > > raise
> > > this concern for the sake of dealing with this as an attack. I
> > > have
> > > no
> > > intentions of catering to broken servers. If this is not a
> > > possible
> > > attack, then we don't have to deal with it.
> > 
> > A man in the middle might be able to intercept the FIN from the
> > server
> > and ACK it, causing the connection to be closed on that server.
> > However, as Dave pointed out, why wouldn't the keepalive mechanism
> > then
> > eventually kick in and close the socket on the client as well?
> 
> The mechanism is already kicked in and got stuck in FIN_WAIT2. NFS
> connection was idle, so TCP layer was sending keep-alives. Then it
> sent a FIN/ACK to which the server replied with just an ACK. Kernel
> notified NFS that we are in FIN_WAIT2 and I believe it is NFS
> responsibility to act accordingly. Kernel then keeps sending
> "keep-alives" forever.  Because of this code:
> 
>         case TCP_FIN_WAIT1:
>         case TCP_FIN_WAIT2:
>                 /* RFC 793 says to queue data in these states,
>                  * RFC 1122 says we MUST send a reset.
>                  * BSD 4.4 also does reset.
>                  */
>                 if (sk->sk_shutdown & RCV_SHUTDOWN) {
>                         if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
>                             after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
>                                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
>                                 tcp_reset(sk);   << this is never triggered
>                                 return 1;
>                         }
>                 }
> 
> In our case TCP_SKB_CB(skb)->end_seq always equals
> TCP_SKB_CB(skb)->seq. (No i don't know the meaning of end_seq and seq
> :-/).

Right, but if the connection is closed on the server, then it should be
sending RST replies to all these keepalives, one of which will
presumably eventually reach the client.

> > If the FIN is not ACKed, then the server is supposed to keep
> > retransmitting it. Until that ACK is received, it cannot close the
> > socket without violating the TCP protocol.
> 
> Something in the middle can keep intercepting the the FIN/ACK from
> the
> server and keep sending an ACK back?

Sure, but if it can do that (which would entail being able to guess the
TCP segment sequence numbers) it is also in a position to generally mess
with the TCP connection. Why do we care about that case?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 16:32     ` Dave Wysochanski
@ 2019-02-22 17:10       ` Olga Kornievskaia
  2019-02-22 19:17         ` Dave Wysochanski
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2019-02-22 17:10 UTC (permalink / raw)
  To: Dave Wysochanski; +Cc: trond.myklebust, Anna Schumaker, linux-nfs

On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@redhat.com> wrote:
>
> On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > Hi Dave,
> >
> > A reproducer is a server that sends an ACK to the client's FIN/ACK
> > request and does nothing afterwards (I can reproduce it 100% with a
> > hacked-up server. It was discovered with a "broken" server that
> > doesn't fully close a connection). It leaves this client unable to
> > connect to this server again, in an indefinite/forever/reboot-required
> > kind of state. At one point, doing something like that to the
> > client was considered a form of attack (denial-of-service), and thus
> > the kernel has a tcp_fin_timeout option after which the kernel will
> > abort the connection. However, this only applies to sockets that have
> > been closed by the client. That is NOT the case here: NFS does not
> > close the connection, and it ignores the kernel's notification of the
> > FIN_WAIT2 state.
> >
Interesting.  I had the same reproducer but eventually an RST would get
sent from the NFS client due to the TCP keepalives.  It sounds like
that is not the case anymore.  My testing was over a year and a half
ago, though.

After the keepalives, TCP also sends a FIN/ACK; if the server properly
sent a FIN/ACK back, then keepalive will indeed be sufficient. Can you
check whether in your trace the server in one case sent just an ACK but
in another case sent the FIN/ACK?

> > One can argue that this is a broken server and we shouldn't bother.
> > But this patch is an attempt to argue that the client still should
> > care and deal with this condition. However, if the community feels
> > that a broken server is a broken server and this form of an attack is
> > not interesting, this patch can live in an archive for later or
> > never.
> >
>
> This isn't IPoIB is it?

No, just normal TCP.


> Actually, fwiw, looking back I had speculated on changes in this area.
> I'm adding you to the CC list of this bug which had some of my musings
> on it:
> https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> That bug I ended up closing when we could no longer prove there was any
> state where the NFS client could get stuck in FIN_WAIT2 after the
> keepalive patch.

I can reproduce current behavior with the current upstream code.


> It can happen that the server only sends the ACK back to the clients
> FIN,ACK so in general the testcase is valid.  But then the question is
> how long should one wait for the final data and FIN from the server, or
> are there ever instances where you shouldn't wait forever?  Is there a
> way for us to know for sure there is no data left to receive so it's
> safe to timeout?  No RPCs outstanding?

Yep, all good questions which I don't have answers to. I realize, for
instance, that a server might send an ACK and then send a FIN/ACK
after that. But why is it bad for the client to proactively send an
RST (by shutting down the connection in TCP_FIN_WAIT2) if the client
was shutting down the connection anyway?

> I don't claim to know many of the subtleties here as far as would the
> server wait forever in LAST_ACK or do implementations eventually
> timeout after some time?  Seems like if these last packets get lost
> there is likely a bug somewhere (either firewall or TCP stack, etc).
> https://tools.ietf.org/html/rfc793#page-22
>
> It looks like at least some people are putting timeouts into their
> stacks though I'm not sure that's a good idea or not.

I saw only one other driver in the kernel that handles the
TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...


* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 17:10       ` Olga Kornievskaia
@ 2019-02-22 19:17         ` Dave Wysochanski
  2019-02-22 20:00           ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Wysochanski @ 2019-02-22 19:17 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: trond.myklebust, Anna Schumaker, linux-nfs

On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@redhat.co
> m> wrote:
> > 
> > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > Hi Dave,
> > > 
> > > A re-producer is a server that sends an ACK to the client's
> > > FIN/ACK
> > > request and does nothing afterwards (I can reproduce it 100% with
> > > a
> > > hacked up server. It was discovered with a "broken" server that
> > > doesn't fully closes a connection). It leave this client unable
> > > to
> > > connect to this server again indefinitely/forever/reboot required
> > > kind
> > > of state. Once it was considered that doing something like that
> > > to
> > > the
> > > client is a form of an attack (denial-of-server) and thus the
> > > kernel
> > > has a tcp_fin_timeout option after which the kernel will abort
> > > the
> > > connection. However this only applies to the sockets that have
> > > been
> > > closed by the client. This is NOT the case. NFS does not close
> > > the
> > > connection and it ignores kernel's notification of FIN_WAIT2
> > > state.
> > > 
> > 
> > Interesting.  I had the same reproducer but eventually an RST would
> > get
> > sent from the NFS client due to the TCP keepalives.  It sounds like
> > that is not the case anymore.  When I did my testing was over a
> > year
> > and a half ago though.
> 
> after the keepalives the TCP also sends a FIN/ACK if the server
> properly sent a FIN/ACK back, then keep alive will indeed be
> sufficient. Can you check if in your trace server in one time sent
> just an ACK but in another case sent the FIN/ACK?
> 

I had two reproducers actually.  In both cases the NFS server would
never send a FIN.

The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421) was
about an NFS server crash after being in the half-close state.  An NFS4
client without that keepalive patch would hang.
This is a valid test case and we check for that now in all releases. 
Here's the outline:
# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
# Step 3. On NFS server, block outgoing FIN packets from the NFS server
# Step 4. On NFS server, using a systemtap script, delay the NFS4OPEN_CONFIRM response for a few seconds
# Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
# Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
# Step 7. On NFS client, wait for NFS server to reboot
# Step 8. On NFS client, wait for the NFS to reconnect TCP connection
# Step 9. On NFS client, wait for NFS4 grace period
# Step 10. On NFS client, try a 'ls' of the file created earlier
# Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
# Step 12. On NFS client, ensure all operations have completed
# Step 13. Cleanup


The second test case (after the keepalive patch) was an arguably invalid
test, as I used systemtap on the NFS server so it would never close the
socket and thus never send a FIN.  This obviously should never happen,
but at the time I wanted to see if the NFS client would recover.
I did that with this:
probe module("sunrpc").function("svc_xprt_enqueue") {
	printf("%s: %s removing XPT_CLOSE from xprt = %p\n", tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
	$xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
}
Here's the full outline for this test:

# This test checks automatic NFS TCP recovery after a specific failure scenario.
# The failure scenario is with an NFSv3 mount that goes idle and the NFS client closes the
# TCP connection, and the final FIN from the NFS server gets lost.
# This can be a tricky scenario since the NFS client's TCP may get stuck in FIN-WAIT-2 indefinitely.
# The NFS client should be able to recover the NFS TCP connection automatically without unmount/mount
# cycle or reboot, and the TCP connection should not be stuck in FIN-WAIT-2 or any other non-connected
# state.
# 
# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS3 server export
# Step 3. On NFS server, using systemtap script, force the NFS server to never send a FIN (avoid close)
# Step 4. On NFS client, verify the NFS mount with 'ls' and TCP connection in TCP_ESTABLISHED
# Step 5. On NFS client, wait up to $time_limit seconds to allow NFS TCP connection to transition to TIME_WAIT state
# Step 6. On NFS client, wait up to $time_limit seconds to allow NFS TCP connection to disappear
# Step 7. Cleanup
#
# PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc)
# FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2 indefinitely


> > > One can argue that this is a broken server and we shouldn't
> > > bother.
> > > But this patch is an attempt to argue that the client still
> > > should
> > > care and deal with this condition. However, if the community
> > > feels
> > > that a broken server is a broken server and this form of an
> > > attack is
> > > not interested, this patch can live will be an archive for later
> > > or
> > > never.
> > > 
> > 
> > This isn't IPoIB is it?
> 
> No, just normal TCP.
> 
> 
> > Actually, fwiw, looking back I had speculated on changes in this
> > area.
> > I'm adding you to the CC list of this bug which had some of my
> > musings
> > on it:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > That bug I ended up closing when we could no longer prove there was
> > any
> > state where the NFS client could get stuck in FIN_WAIT2 after the
> > keepalive patch.
> 
> I can reproduce current behavior with the current upstream code.
> 
> 
> > It can happen that the server only sends the ACK back to the
> > clients
> > FIN,ACK so in general the testcase is valid.  But then the question
> > is
> > how long should one wait for the final data and FIN from the
> > server, or
> > are there ever instances where you shouldn't wait forever?  Is
> > there a
> > way for us to know for sure there is no data left to receive so
> > it's
> > safe to timeout?  No RPCs outstanding?
> 
> Yep all good questions which I don't have answers to. I realize that
> for instance, that a server might send an ACK and then send a FIN/ACK
> after that. But why is it bad for the client to proactively send an
> RST (by shutting down the connection in TCP_FIN_WAIT2 if the client
> was shutting down the connection anyway.
> 
I think you could lose data from the server if you RST and don't wait,
but other than that I don't know.  We may be splitting hairs here,
though, if there's no demonstrable harm to the application (NFS).  As
Trond points out elsewhere, reconnect / port reuse may be an issue.

When I looked at this in the second bug I was almost convinced that
there was a problem with the 'close' method in xs_tcp_ops - we needed
to be calling xs_close(), but I ran into some problems and I wasn't
sure about the test case.


> > I don't claim to know many of the subtleties here as far as would
> > the
> > server wait forever in LAST_ACK or do implementations eventually
> > timeout after some time?  Seems like if these last packets get lost
> > there is likely a bug somewhere (either firewall or TCP stack,
> > etc).
> > https://tools.ietf.org/html/rfc793#page-22
> > 
> > It looks like at least some people are putting timeouts into their
> > stacks though I'm not sure that's a good idea or not.
> 
> I saw only one other driver in the kernel that does have handling of
> the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...


One final thought - is it possible that we could have a network that
does not support TCP keepalives?  In that instance, I'll bet we could
get an indefinite hang in FIN_WAIT2 on my first test case (server
crashed after the half-close state).

It does not look like TCP keepalives have to be implemented and maybe
some routers / NATs / firewalls disallow them (due to extra traffic)?
https://tools.ietf.org/html/rfc1122#page-101




* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 19:17         ` Dave Wysochanski
@ 2019-02-22 20:00           ` Trond Myklebust
  2019-02-23 17:31             ` Dave Wysochanski
  2019-03-05 17:22             ` Olga Kornievskaia
  0 siblings, 2 replies; 16+ messages in thread
From: Trond Myklebust @ 2019-02-22 20:00 UTC (permalink / raw)
  To: olga.kornievskaia, dwysocha; +Cc: anna.schumaker, linux-nfs

On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <
> > dwysocha@redhat.co
> > m> wrote:
> > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > Hi Dave,
> > > > 
> > > > A re-producer is a server that sends an ACK to the client's
> > > > FIN/ACK
> > > > request and does nothing afterwards (I can reproduce it 100%
> > > > with
> > > > a
> > > > hacked up server. It was discovered with a "broken" server that
> > > > doesn't fully closes a connection). It leave this client unable
> > > > to
> > > > connect to this server again indefinitely/forever/reboot
> > > > required
> > > > kind
> > > > of state. Once it was considered that doing something like that
> > > > to
> > > > the
> > > > client is a form of an attack (denial-of-server) and thus the
> > > > kernel
> > > > has a tcp_fin_timeout option after which the kernel will abort
> > > > the
> > > > connection. However this only applies to the sockets that have
> > > > been
> > > > closed by the client. This is NOT the case. NFS does not close
> > > > the
> > > > connection and it ignores kernel's notification of FIN_WAIT2
> > > > state.
> > > > 
> > > 
> > > Interesting.  I had the same reproducer but eventually an RST
> > > would
> > > get
> > > sent from the NFS client due to the TCP keepalives.  It sounds
> > > like
> > > that is not the case anymore.  When I did my testing was over a
> > > year
> > > and a half ago though.
> > 
> > after the keepalives the TCP also sends a FIN/ACK if the server
> > properly sent a FIN/ACK back, then keep alive will indeed be
> > sufficient. Can you check if in your trace server in one time sent
> > just an ACK but in another case sent the FIN/ACK?
> > 
> 
> I had two reproducers actually.  In both cases the NFS server would
> never send a FIN.
> 
> The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421)
> was
> about a NFS server crash after being in the half-close state.  An
> NFS4
> client without that keepalive patch would hang.
> This is a valid test case and we check for that now in all releases. 
> Here's the outline:
> # Outline
> # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s
> 0 port 2049 and host $NFS_SERVER
> # Step 2. On NFS client, mount NFS4 server export with
> "vers=4,timeo=5"
> # Step 3. On NFS server, block outgoing FIN packets from the NFS
> server"
> # Step 4. On NFS server, using systemtap script, to delay
> NFS4OPEN_CONFIRM response for a few seconds
> # Step 5. On NFS client, open a file, which should get delayed and
> trigger a FIN from the NFS client
> # Step 6. On NFS server, after a short delay, crash the server by
> 'echo c > /proc/sysrq-trigger'
> # Step 7. On NFS client, wait for NFS server to reboot
> # Step 8. On NFS client, wait for the NFS to reconnect TCP connection
> # Step 9. On NFS client, wait for NFS4 grace period
> # Step 10. On NFS client, try a 'ls' of the file created earlier
> # Step 11. On NFS client, sleep for $DELAY seconds to allow all
> operations to complete
> # Step 12. On NFS client, ensure all operations have completed
> # Step 13. Cleanup
> 
> 
> The second test case (after the keepalive patch) was arguably invalid
> test as I used systemtap on the NFS server to never close the socket
> so
> it would never send a FIN.  This obviously should never happen but at
> the time I wanted to see if the NFS client would recover.
> I did that with this:
> probe module("sunrpc").function("svc_xprt_enqueue") {
> 	printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
> tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
> 	$xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> }
> Here's the full outline for this test:
> 
> # This test checks automatic NFS TCP recovery after a specific
> failure scenario.
> # The failure scenario is with an NFSv3 mount that goes idle and the
> NFS client closes the
> # TCP connection, and the final FIN from the NFS server gets lost.
> # This can be a tricky scenario since the NFS client's TCP may get
> stuck in FIN-WAIT-2 indefinitely.
> # The NFS client should be able to recover the NFS TCP connection
> automatically without unmount/mount
> # cycle or reboot, and the TCP connection should not be stuck in FIN-
> WAIT-2 or any other non-connected
> # state.
> # 
> # Outline
> # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s
> 0 port 2049 and host $NFS_SERVER
> # Step 2. On NFS client, mount NFS3 server export
> # Step 3. On NFS server, using systemtap script, force the NFS server
> to never send a FIN (avoid close)
> # Step 4. On NFS client, verify the NFS mount with 'ls' and TCP
> connection in TCP_ESTABLISHED
> # Step 5. On NFS client, wait up to $time_limit seconds to allow NFS
> TCP connection to transition to TIME_WAIT state
> # Step 6. On NFS client, wait up to $time_limit seconds to allow NFS
> TCP connection to disappear
> # Step 7. Cleanup
> #
> # PASS: The state of NFS's TCP connection is usable by the end of the
> test (commands don't hang, etc)
> # FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2
> indefinitely
> 
> 
> > > > One can argue that this is a broken server and we shouldn't
> > > > bother.
> > > > But this patch is an attempt to argue that the client still
> > > > should
> > > > care and deal with this condition. However, if the community
> > > > feels
> > > > that a broken server is a broken server and this form of an
> > > > attack is
> > > > not interested, this patch can live will be an archive for
> > > > later
> > > > or
> > > > never.
> > > > 
> > > 
> > > This isn't IPoIB is it?
> > 
> > No, just normal TCP.
> > 
> > 
> > > Actually, fwiw, looking back I had speculated on changes in this
> > > area.
> > > I'm adding you to the CC list of this bug which had some of my
> > > musings
> > > on it:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > That bug I ended up closing when we could no longer prove there
> > > was
> > > any
> > > state where the NFS client could get stuck in FIN_WAIT2 after the
> > > keepalive patch.
> > 
> > I can reproduce current behavior with the current upstream code.
> > 
> > 
> > > It can happen that the server only sends the ACK back to the
> > > clients
> > > FIN,ACK so in general the testcase is valid.  But then the
> > > question
> > > is
> > > how long should one wait for the final data and FIN from the
> > > server, or
> > > are there ever instances where you shouldn't wait forever?  Is
> > > there a
> > > way for us to know for sure there is no data left to receive so
> > > it's
> > > safe to timeout?  No RPCs outstanding?
> > 
> > Yep all good questions which I don't have answers to. I realize
> > that
> > for instance, that a server might send an ACK and then send a
> > FIN/ACK
> > after that. But why is it bad for the client to proactively send an
> > RST (by shutting down the connection in TCP_FIN_WAIT2 if the client
> > was shutting down the connection anyway.
> > 
> I think you could lose data from the server if you RST and don't wait
> but other than that I don't know.  We may be splitting hairs here
> though if there's no demonstrable harm to the application (NFS).  As
> Trond points out elsewhere reconnect / port reuse may be an issue.
> 
> When I looked at this in the second bug I was almost convinced that
> there was a problem with the 'close' method in xs_tcp_ops - we needed
> to be calling xs_close(), but I ran into some problems and I wasn't
> sure about the test case.
> 
> 
> > > I don't claim to know many of the subtleties here as far as would
> > > the
> > > server wait forever in LAST_ACK or do implementations eventually
> > > timeout after some time?  Seems like if these last packets get
> > > lost
> > > there is likely a bug somewhere (either firewall or TCP stack,
> > > etc).
> > > https://tools.ietf.org/html/rfc793#page-22
> > > 
> > > It looks like at least some people are putting timeouts into
> > > their
> > > stacks though I'm not sure that's a good idea or not.
> > 
> > I saw only one other driver in the kernel that does have handling
> > of
> > the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...
> 
> One final thought - is it possible that we could have a network that
> does not support TCP keepalives?  In that instance, I'll bet we could
> get an indefinite hang in FIN_WAIT2 on my first test case (server
> crashed after the half-close state).
> 
> It does not look like TCP keepalives have to be implemented and maybe
> some routers / NATs / firewalls disallow them (due to extra traffic)?
> https://tools.ietf.org/html/rfc1122#page-101


Again, that's a corner case of a corner case. I've never seen any
reports of this occurring.

On the other hand, I _have_ seen lots of reports of problems due to the
reconnection issues when the server still holds the socket open. Those
reports were the main reason for the current design.

-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space




* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 20:00           ` Trond Myklebust
@ 2019-02-23 17:31             ` Dave Wysochanski
  2019-03-05 17:22             ` Olga Kornievskaia
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Wysochanski @ 2019-02-23 17:31 UTC (permalink / raw)
  To: Trond Myklebust, olga.kornievskaia; +Cc: anna.schumaker, linux-nfs

On Fri, 2019-02-22 at 20:00 +0000, Trond Myklebust wrote:
> On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> > On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <
> > > dwysocha@redhat.co
> > > m> wrote:
> > > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > > Hi Dave,
> > > > > 
> > > > > A re-producer is a server that sends an ACK to the client's
> > > > > FIN/ACK
> > > > > request and does nothing afterwards (I can reproduce it 100%
> > > > > with
> > > > > a
> > > > > hacked up server. It was discovered with a "broken" server
> > > > > that
> > > > > doesn't fully closes a connection). It leave this client
> > > > > unable
> > > > > to
> > > > > connect to this server again indefinitely/forever/reboot
> > > > > required
> > > > > kind
> > > > > of state. Once it was considered that doing something like
> > > > > that
> > > > > to
> > > > > the
> > > > > client is a form of an attack (denial-of-server) and thus the
> > > > > kernel
> > > > > has a tcp_fin_timeout option after which the kernel will
> > > > > abort
> > > > > the
> > > > > connection. However this only applies to the sockets that
> > > > > have
> > > > > been
> > > > > closed by the client. This is NOT the case. NFS does not
> > > > > close
> > > > > the
> > > > > connection and it ignores kernel's notification of FIN_WAIT2
> > > > > state.
> > > > > 
> > > > 
> > > > Interesting.  I had the same reproducer but eventually an RST
> > > > would
> > > > get
> > > > sent from the NFS client due to the TCP keepalives.  It sounds
> > > > like
> > > > that is not the case anymore.  When I did my testing was over a
> > > > year
> > > > and a half ago though.
> > > 
> > > after the keepalives the TCP also sends a FIN/ACK if the server
> > > properly sent a FIN/ACK back, then keep alive will indeed be
> > > sufficient. Can you check if in your trace server in one time
> > > sent
> > > just an ACK but in another case sent the FIN/ACK?
> > > 
> > 
> > I had two reproducers actually.  In both cases the NFS server would
> > never send a FIN.
> > 
> > The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421)
> > was
> > about a NFS server crash after being in the half-close state.  An
> > NFS4
> > client without that keepalive patch would hang.
> > This is a valid test case and we check for that now in all
> > releases. 
> > Here's the outline:
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE
> > -s
> > 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS4 server export with
> > "vers=4,timeo=5"
> > # Step 3. On NFS server, block outgoing FIN packets from the NFS
> > server"
> > # Step 4. On NFS server, using systemtap script, to delay
> > NFS4OPEN_CONFIRM response for a few seconds
> > # Step 5. On NFS client, open a file, which should get delayed and
> > trigger a FIN from the NFS client
> > # Step 6. On NFS server, after a short delay, crash the server by
> > 'echo c > /proc/sysrq-trigger'
> > # Step 7. On NFS client, wait for NFS server to reboot
> > # Step 8. On NFS client, wait for the NFS to reconnect TCP
> > connection
> > # Step 9. On NFS client, wait for NFS4 grace period
> > # Step 10. On NFS client, try a 'ls' of the file created earlier
> > # Step 11. On NFS client, sleep for $DELAY seconds to allow all
> > operations to complete
> > # Step 12. On NFS client, ensure all operations have completed
> > # Step 13. Cleanup
> > 
> > 
> > The second test case (after the keepalive patch) was arguably
> > invalid
> > test as I used systemtap on the NFS server to never close the
> > socket
> > so
> > it would never send a FIN.  This obviously should never happen but
> > at
> > the time I wanted to see if the NFS client would recover.
> > I did that with this:
> > probe module("sunrpc").function("svc_xprt_enqueue") {
> > 	printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
> > tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
> > 	$xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> > }
> > Here's the full outline for this test:
> > 
> > # This test checks automatic NFS TCP recovery after a specific
> > failure scenario.
> > # The failure scenario is with an NFSv3 mount that goes idle and
> > the
> > NFS client closes the
> > # TCP connection, and the final FIN from the NFS server gets lost.
> > # This can be a tricky scenario since the NFS client's TCP may get
> > stuck in FIN-WAIT-2 indefinitely.
> > # The NFS client should be able to recover the NFS TCP connection
> > automatically without unmount/mount
> > # cycle or reboot, and the TCP connection should not be stuck in
> > FIN-
> > WAIT-2 or any other non-connected
> > # state.
> > # 
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE
> > -s
> > 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS3 server export
> > # Step 3. On NFS server, using systemtap script, force the NFS
> > server
> > to never send a FIN (avoid close)
> > # Step 4. On NFS client, verify the NFS mount with 'ls' and TCP
> > connection in TCP_ESTABLISHED
> > # Step 5. On NFS client, wait up to $time_limit seconds to allow
> > NFS
> > TCP connection to transition to TIME_WAIT state
> > # Step 6. On NFS client, wait up to $time_limit seconds to allow
> > NFS
> > TCP connection to disappear
> > # Step 7. Cleanup
> > #
> > # PASS: The state of NFS's TCP connection is usable by the end of
> > the
> > test (commands don't hang, etc)
> > # FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2
> > indefinitely
> > 
> > 
> > > > > One can argue that this is a broken server and we shouldn't
> > > > > bother.
> > > > > But this patch is an attempt to argue that the client still
> > > > > should
> > > > > care and deal with this condition. However, if the community
> > > > > feels
> > > > > that a broken server is a broken server and this form of an
> > > > > attack is
> > > > > not interested, this patch can live will be an archive for
> > > > > later
> > > > > or
> > > > > never.
> > > > > 
> > > > 
> > > > This isn't IPoIB is it?
> > > 
> > > No, just normal TCP.
> > > 
> > > 
> > > > Actually, fwiw, looking back I had speculated on changes in
> > > > this
> > > > area.
> > > > I'm adding you to the CC list of this bug which had some of my
> > > > musings
> > > > on it:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > > That bug I ended up closing when we could no longer prove there
> > > > was
> > > > any
> > > > state where the NFS client could get stuck in FIN_WAIT2 after
> > > > the
> > > > keepalive patch.
> > > 
> > > I can reproduce current behavior with the current upstream code.
> > > 
> > > 
> > > > It can happen that the server only sends the ACK back to the
> > > > clients
> > > > FIN,ACK so in general the testcase is valid.  But then the
> > > > question
> > > > is
> > > > how long should one wait for the final data and FIN from the
> > > > server, or
> > > > are there ever instances where you shouldn't wait forever?  Is
> > > > there a
> > > > way for us to know for sure there is no data left to receive so
> > > > it's
> > > > safe to timeout?  No RPCs outstanding?
> > > 
> > > Yep all good questions which I don't have answers to. I realize
> > > that
> > > for instance, that a server might send an ACK and then send a
> > > FIN/ACK
> > > after that. But why is it bad for the client to proactively send
> > > an
> > > RST (by shutting down the connection in TCP_FIN_WAIT2 if the
> > > client
> > > was shutting down the connection anyway.
> > > 
> > 
> > I think you could lose data from the server if you RST and don't
> > wait
> > but other than that I don't know.  We may be splitting hairs here
> > though if there's no demonstrable harm to the application
> > (NFS).  As
> > Trond points out elsewhere reconnect / port reuse may be an issue.
> > 
> > When I looked at this in the second bug I was almost convinced that
> > there was a problem with the 'close' method in xs_tcp_ops - we
> > needed
> > to be calling xs_close(), but I ran into some problems and I wasn't
> > sure about the test case.
> > 
> > 
> > > > I don't claim to know many of the subtleties here as far as
> > > > would
> > > > the
> > > > server wait forever in LAST_ACK or do implementations
> > > > eventually
> > > > timeout after some time?  Seems like if these last packets get
> > > > lost
> > > > there is likely a bug somewhere (either firewall or TCP stack,
> > > > etc).
> > > > https://tools.ietf.org/html/rfc793#page-22
> > > > 
> > > > It looks like at least some people are putting timeouts into
> > > > their
> > > > stacks though I'm not sure that's a good idea or not.
> > > 
> > > I saw only one other driver in the kernel that does have handling
> > > of
> > > the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...
> > 
> > One final thought - is it possible that we could have a network
> > that
> > does not support TCP keepalives?  In that instance, I'll bet we
> > could
> > get an indefinite hang in FIN_WAIT2 on my first test case (server
> > crashed after the half-close state).
> > 
> > It does not look like TCP keepalives have to be implemented and
> > maybe
> > some routers / NATs / firewalls disallow them (due to extra
> > traffic)?
> > https://tools.ietf.org/html/rfc1122#page-101
> 
> 
> Again, that's a corner case of a corner case. I've never seen any
> reports of this occurring.
> 

I agree, neither have I, and given the issue from about a year and a half
ago I'm confident I would have heard about any NFS mounts stuck due to
FIN_WAIT2 problems.  Though my experience of course centers around
RHEL7 (3.10 based) and RHEL8 beta (4.18 based), so it's not the same
bits as you and others on the list are running.

I thought about the "network that does not support keepalives" some
more and for various reasons (I'm not sure such a network exists, and
even if it does it would be asking for hangs, etc.) I don't think it is
an important case to consider.
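
For reference, the keepalive tuning being discussed can be sketched from
userspace like this (a sketch only -- the numbers and the helper name
are illustrative, not what the kernel RPC client actually computes):

```python
import socket

# Illustrative values only, not the ones sunrpc derives from the RPC timeouts.
KEEPIDLE, KEEPINTVL, KEEPCNT = 60, 10, 6

def make_keepalive_socket():
    """Return a TCP socket with keepalive probes armed (Linux-specific options)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first probe is sent.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
    # Seconds between unanswered probes.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
    # Unanswered probes before the connection is reset.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)
    return s

sock = make_keepalive_socket()
# A dead peer is detected after roughly KEEPIDLE + KEEPINTVL * KEEPCNT seconds,
# after which the stack tears the connection down with an RST.
deadline = KEEPIDLE + KEEPINTVL * sock.getsockopt(socket.IPPROTO_TCP,
                                                  socket.TCP_KEEPCNT)
sock.close()
```

This is the mechanism that eventually rescued the FIN_WAIT2 hang in the
first reproducer -- which is exactly why a path with keepalives filtered
out would be left with no way out.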


> On the other hand, I _have_ seen lots of reports of problems due to
> the
> reconnection issues when the server still holds the socket open.
> Those
> reports were the main reason for the current design.
> 
> 

Yes, I agree reconnect issues happen a lot more often than I would have
thought, so I would never want a change for a corner case of a corner
case that wasn't truly valid.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/1] SUNRPC: fix handling of half-closed connection
  2019-02-22 20:00           ` Trond Myklebust
  2019-02-23 17:31             ` Dave Wysochanski
@ 2019-03-05 17:22             ` Olga Kornievskaia
  1 sibling, 0 replies; 16+ messages in thread
From: Olga Kornievskaia @ 2019-03-05 17:22 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: dwysocha, anna.schumaker, linux-nfs

On Fri, Feb 22, 2019 at 3:00 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> > On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <
> > > dwysocha@redhat.co
> > > m> wrote:
> > > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > > Hi Dave,
> > > > >
> > > > > A re-producer is a server that sends an ACK to the client's
> > > > > FIN/ACK
> > > > > request and does nothing afterwards (I can reproduce it 100%
> > > > > with
> > > > > a
> > > > > hacked up server. It was discovered with a "broken" server that
> > > > > doesn't fully closes a connection). It leave this client unable
> > > > > to
> > > > > connect to this server again indefinitely/forever/reboot
> > > > > required
> > > > > kind
> > > > > of state. Once it was considered that doing something like that
> > > > > to
> > > > > the
> > > > > client is a form of an attack (denial-of-server) and thus the
> > > > > kernel
> > > > > has a tcp_fin_timeout option after which the kernel will abort
> > > > > the
> > > > > connection. However this only applies to the sockets that have
> > > > > been
> > > > > closed by the client. This is NOT the case. NFS does not close
> > > > > the
> > > > > connection and it ignores kernel's notification of FIN_WAIT2
> > > > > state.
> > > > >
> > > >
> > > > Interesting.  I had the same reproducer but eventually an RST
> > > > would
> > > > get
> > > > sent from the NFS client due to the TCP keepalives.  It sounds
> > > > like
> > > > that is not the case anymore.  When I did my testing was over a
> > > > year
> > > > and a half ago though.
> > >
> > > after the keepalives the TCP also sends a FIN/ACK if the server
> > > properly sent a FIN/ACK back, then keep alive will indeed be
> > > sufficient. Can you check if in your trace server in one time sent
> > > just an ACK but in another case sent the FIN/ACK?
> > >
> >
> > I had two reproducers actually.  In both cases the NFS server would
> > never send a FIN.
> >
> > The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421)
> > was
> > about a NFS server crash after being in the half-close state.  An
> > NFS4
> > client without that keepalive patch would hang.
> > This is a valid test case and we check for that now in all releases.
> > Here's the outline:
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s
> > 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS4 server export with
> > "vers=4,timeo=5"
> > # Step 3. On NFS server, block outgoing FIN packets from the NFS
> > server"
> > # Step 4. On NFS server, using systemtap script, to delay
> > NFS4OPEN_CONFIRM response for a few seconds
> > # Step 5. On NFS client, open a file, which should get delayed and
> > trigger a FIN from the NFS client
> > # Step 6. On NFS server, after a short delay, crash the server by
> > 'echo c > /proc/sysrq-trigger'
> > # Step 7. On NFS client, wait for NFS server to reboot
> > # Step 8. On NFS client, wait for the NFS to reconnect TCP connection
> > # Step 9. On NFS client, wait for NFS4 grace period
> > # Step 10. On NFS client, try a 'ls' of the file created earlier
> > # Step 11. On NFS client, sleep for $DELAY seconds to allow all
> > operations to complete
> > # Step 12. On NFS client, ensure all operations have completed
> > # Step 13. Cleanup
> >
> >
> > The second test case (after the keepalive patch) was arguably invalid
> > test as I used systemtap on the NFS server to never close the socket
> > so
> > it would never send a FIN.  This obviously should never happen but at
> > the time I wanted to see if the NFS client would recover.
> > I did that with this:
> > probe module("sunrpc").function("svc_xprt_enqueue") {
> >       printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
> > tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
> >       $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> > }
> > Here's the full outline for this test:
> >
> > # This test checks automatic NFS TCP recovery after a specific
> > failure scenario.
> > # The failure scenario is with an NFSv3 mount that goes idle and the
> > NFS client closes the
> > # TCP connection, and the final FIN from the NFS server gets lost.
> > # This can be a tricky scenario since the NFS client's TCP may get
> > stuck in FIN-WAIT-2 indefinitely.
> > # The NFS client should be able to recover the NFS TCP connection
> > automatically without unmount/mount
> > # cycle or reboot, and the TCP connection should not be stuck in FIN-
> > WAIT-2 or any other non-connected
> > # state.
> > #
> > # Outline
> > # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s
> > 0 port 2049 and host $NFS_SERVER
> > # Step 2. On NFS client, mount NFS3 server export
> > # Step 3. On NFS server, using systemtap script, force the NFS server
> > to never send a FIN (avoid close)
> > # Step 4. On NFS client, verify the NFS mount with 'ls' and TCP
> > connection in TCP_ESTABLISHED
> > # Step 5. On NFS client, wait up to $time_limit seconds to allow NFS
> > TCP connection to transition to TIME_WAIT state
> > # Step 6. On NFS client, wait up to $time_limit seconds to allow NFS
> > TCP connection to disappear
> > # Step 7. Cleanup
> > #
> > # PASS: The state of NFS's TCP connection is usable by the end of the
> > test (commands don't hang, etc)
> > # FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2
> > indefinitely
> >
> >
> > > > > One can argue that this is a broken server and we shouldn't
> > > > > bother.
> > > > > But this patch is an attempt to argue that the client still
> > > > > should
> > > > > care and deal with this condition. However, if the community
> > > > > feels
> > > > > that a broken server is a broken server and this form of an
> > > > > attack is
> > > > > not interested, this patch can live will be an archive for
> > > > > later
> > > > > or
> > > > > never.
> > > > >
> > > >
> > > > This isn't IPoIB is it?
> > >
> > > No, just normal TCP.
> > >
> > >
> > > > Actually, fwiw, looking back I had speculated on changes in this
> > > > area.
> > > > I'm adding you to the CC list of this bug which had some of my
> > > > musings
> > > > on it:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > > That bug I ended up closing when we could no longer prove there
> > > > was
> > > > any
> > > > state where the NFS client could get stuck in FIN_WAIT2 after the
> > > > keepalive patch.
> > >
> > > I can reproduce current behavior with the current upstream code.
> > >
> > >
> > > > It can happen that the server only sends the ACK back to the
> > > > clients
> > > > FIN,ACK so in general the testcase is valid.  But then the
> > > > question
> > > > is
> > > > how long should one wait for the final data and FIN from the
> > > > server, or
> > > > are there ever instances where you shouldn't wait forever?  Is
> > > > there a
> > > > way for us to know for sure there is no data left to receive so
> > > > it's
> > > > safe to timeout?  No RPCs outstanding?
> > >
> > > Yep all good questions which I don't have answers to. I realize
> > > that
> > > for instance, that a server might send an ACK and then send a
> > > FIN/ACK
> > > after that. But why is it bad for the client to proactively send an
> > > RST (by shutting down the connection in TCP_FIN_WAIT2 if the client
> > > was shutting down the connection anyway.
> > >
> > I think you could lose data from the server if you RST and don't wait
> > but other than that I don't know.  We may be splitting hairs here
> > though if there's no demonstrable harm to the application (NFS).  As
> > Trond points out elsewhere reconnect / port reuse may be an issue.
> >
> > When I looked at this in the second bug I was almost convinced that
> > there was a problem with the 'close' method in xs_tcp_ops - we needed
> > to be calling xs_close(), but I ran into some problems and I wasn't
> > sure about the test case.
> >
> >
> > > > I don't claim to know many of the subtleties here as far as would
> > > > the
> > > > server wait forever in LAST_ACK or do implementations eventually
> > > > timeout after some time?  Seems like if these last packets get
> > > > lost
> > > > there is likely a bug somewhere (either firewall or TCP stack,
> > > > etc).
> > > > https://tools.ietf.org/html/rfc793#page-22
> > > >
> > > > It looks like at least some people are putting timeouts into
> > > > their
> > > > stacks though I'm not sure that's a good idea or not.
> > >
> > > I saw only one other driver in the kernel that does have handling
> > > of
> > > the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...
> >
> > One final thought - is it possible that we could have a network that
> > does not support TCP keepalives?  In that instance, I'll bet we could
> > get an indefinite hang in FIN_WAIT2 on my first test case (server
> > crashed after the half-close state).
> >
> > It does not look like TCP keepalives have to be implemented and maybe
> > some routers / NATs / firewalls disallow them (due to extra traffic)?
> > https://tools.ietf.org/html/rfc1122#page-101
>
>
> Again, that's a corner case of a corner case. I've never seen any
> reports of this occurring.
>
> On the other hand, I _have_ seen lots of reports of problems due to the
> reconnection issues when the server still holds the socket open. Those
> reports were the main reason for the current design.

Continuing on with this thread: what I can gather so far.

NFS itself has an idle timer which, when it expires, schedules autoclose:
        INIT_WORK(&xprt->task_cleanup, xprt_autoclose);
        if (xprt_has_timer(xprt))
               timer_setup(&xprt->timer, xprt_init_autodisconnect, 0);
        else
               timer_setup(&xprt->timer, NULL, 0);

xprt_autoclose() is what triggers the TCP layer to send the FIN (tcp_send_fin()).

xprt_autoclose() calls kernel_sock_shutdown(), but I don't believe
that's sufficient to "close the socket". What is needed is a call to
sock_release() so the socket is not left dangling.
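
In userspace socket terms the distinction looks roughly like this (an
illustrative analogy, not the kernel code paths -- shutdown() sends our
FIN but keeps the endpoint alive, while only close(), the analogue of
sock_release() in the kernel, actually gives it up):

```python
import socket

# Loopback connection: the listener plays the server, the client half-closes.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

client = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()

# shutdown(SHUT_WR) sends our FIN but the endpoint stays alive: this is
# the analogue of kernel_sock_shutdown() leaving the transport half-closed
# (and potentially parked in FIN_WAIT2 if the peer only ACKs).
client.shutdown(socket.SHUT_WR)

eof = peer.recv(100)          # server sees EOF from the client's FIN
peer.sendall(b"late reply")   # but the server can still send data...
late = client.recv(100)       # ...and the half-closed client receives it

# Only close() actually releases the endpoint; in the kernel it is
# sock_release() that lets an orphaned FIN_WAIT2 socket be reaped.
client.close()
peer.close()
srv.close()
```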

In reply to its FIN/ACK the client gets back only an ACK, putting the
socket in FIN_WAIT2. TCP notifies NFS of that state change, but NFS
ignores it instead of closing the socket as it does for other states
like CLOSE_WAIT. Since the socket is never closed, tcp_fin_timeout
never applies.
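
A toy model of that state-change dispatch (purely illustrative, not the
kernel code) shows what the patch changes -- before it, FIN_WAIT2 fell
through to the ignore case:

```python
# Hypothetical model of the xs_tcp_state_change() switch; state values
# and action names are made up for illustration.
TCP_ESTABLISHED, TCP_FIN_WAIT2, TCP_CLOSE_WAIT, TCP_CLOSE = range(4)

def state_change(state, patched=True):
    """Return the action the transport takes on a socket state event."""
    shutdown_states = {TCP_CLOSE_WAIT}
    if patched:
        # The proposed fix: treat FIN_WAIT2 like a server-initiated shutdown.
        shutdown_states.add(TCP_FIN_WAIT2)
    if state == TCP_ESTABLISHED:
        return "connected"
    if state in shutdown_states:
        return "close-and-reset"   # xprt_force_disconnect() path
    if state == TCP_CLOSE:
        return "wake-pending-tasks"
    return "ignored"               # half-closed socket lingers forever
```

With `patched=False` an ACK-only peer leaves the transport stuck, which
is the behaviour the reproducer exposes.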

>
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-03-05 17:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-20 14:56 [PATCH 1/1] SUNRPC: fix handling of half-closed connection Olga Kornievskaia
2019-02-22 12:12 ` Dave Wysochanski
2019-02-22 13:45   ` Trond Myklebust
2019-02-22 14:46     ` Olga Kornievskaia
2019-02-22 15:05       ` Trond Myklebust
2019-02-22 15:11         ` Olga Kornievskaia
2019-02-22 15:50           ` Trond Myklebust
2019-02-22 17:02             ` Olga Kornievskaia
2019-02-22 17:09               ` Trond Myklebust
2019-02-22 14:44   ` Olga Kornievskaia
2019-02-22 16:32     ` Dave Wysochanski
2019-02-22 17:10       ` Olga Kornievskaia
2019-02-22 19:17         ` Dave Wysochanski
2019-02-22 20:00           ` Trond Myklebust
2019-02-23 17:31             ` Dave Wysochanski
2019-03-05 17:22             ` Olga Kornievskaia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).