Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-26  8:05 [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS Vijay Subramanian
@ 2012-10-26  8:03 ` Eric Dumazet
  2012-10-26 21:30 ` Julian Anastasov
  2012-10-27 11:57 ` Eric Dumazet
  2 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-10-26  8:03 UTC (permalink / raw)
  To: Vijay Subramanian
  Cc: netdev, davem, edumazet, ncardwell, Venkat Venkatsubra, Elliott Hughes

On Fri, 2012-10-26 at 01:05 -0700, Vijay Subramanian wrote:
> Elliott Hughes <enh@google.com> saw strange behavior when server socket was not
> calling accept(). Client was receiving SYN-ACK back even when socket on server
> side was not yet available. Eric noted server sockets kept resending SYN_ACKS
> and further investigation revealed the following problem.
> 
> If server socket is slow to accept() connections, request_socks can represent
> connections for which the three-way handshake is already done.  From client's
> point of view, the connection is in ESTABLISHED state but on server side, socket
> is not in accept_queue or ESTABLISHED state.  When the syn-ack timer expires,
> because of the order in which tests are performed, server can retransmit the
> synack repeatedly. Following patch prevents the server from retransmitting the
> synack needlessly (and prevents client from replying with ack).  This reduces
> traffic when server is slow to accept() connections.
> 
> If the server socket has received the third ack during connection establishment,
> this is remembered in inet_rsk(req)->acked.  The request_sock will expire in
> around 30 seconds and will be dropped if it does not move into accept_queue.
> 
> With help from Eric Dumazet.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
> Changes from V1: Changed Reported-by tag and commit message. Added Acked-by and
> Tested-by tags.
> 
> Ignoring "WARNING: line over 80 characters" in the interest of readability.
> 
>  net/ipv4/inet_connection_sock.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index d34ce29..4e8e52e 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
>  					       &expire, &resend);
>  				req->rsk_ops->syn_ack_timeout(parent, req);
>  				if (!expire &&
> -				    (!resend ||
> -				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
> -				     inet_rsk(req)->acked)) {
> +				    (!resend || inet_rsk(req)->acked ||
> +				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
>  					unsigned long timeo;
>  
>  					if (req->retrans++ == 0)

Thanks Vijay !

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
@ 2012-10-26  8:05 Vijay Subramanian
  2012-10-26  8:03 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Vijay Subramanian @ 2012-10-26  8:05 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, ncardwell, Venkat Venkatsubra, Elliott Hughes,
	Vijay Subramanian

Elliott Hughes <enh@google.com> saw strange behavior when server socket was not
calling accept(). Client was receiving SYN-ACK back even when socket on server
side was not yet available. Eric noted server sockets kept resending SYN_ACKS
and further investigation revealed the following problem.

If server socket is slow to accept() connections, request_socks can represent
connections for which the three-way handshake is already done.  From client's
point of view, the connection is in ESTABLISHED state but on server side, socket
is not in accept_queue or ESTABLISHED state.  When the syn-ack timer expires,
because of the order in which tests are performed, server can retransmit the
synack repeatedly. Following patch prevents the server from retransmitting the
synack needlessly (and prevents client from replying with ack).  This reduces
traffic when server is slow to accept() connections.

If the server socket has received the third ack during connection establishment,
this is remembered in inet_rsk(req)->acked.  The request_sock will expire in
around 30 seconds and will be dropped if it does not move into accept_queue.

With help from Eric Dumazet.

Reported-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
---
Changes from V1: Changed Reported-by tag and commit message. Added Acked-by and
Tested-by tags.

Ignoring "WARNING: line over 80 characters" in the interest of readability.

 net/ipv4/inet_connection_sock.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d34ce29..4e8e52e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 					       &expire, &resend);
 				req->rsk_ops->syn_ack_timeout(parent, req);
 				if (!expire &&
-				    (!resend ||
-				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
-				     inet_rsk(req)->acked)) {
+				    (!resend || inet_rsk(req)->acked ||
+				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
 					unsigned long timeo;

 					if (req->retrans++ == 0)
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-26  8:05 [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS Vijay Subramanian
  2012-10-26  8:03 ` Eric Dumazet
@ 2012-10-26 21:30 ` Julian Anastasov
  2012-10-26 21:42   ` Eric Dumazet
  2012-10-27 11:57 ` Eric Dumazet
  2 siblings, 1 reply; 19+ messages in thread
From: Julian Anastasov @ 2012-10-26 21:30 UTC (permalink / raw)
  To: Vijay Subramanian
  Cc: netdev, davem, edumazet, ncardwell, Venkat Venkatsubra, Elliott Hughes


	Hello,

On Fri, 26 Oct 2012, Vijay Subramanian wrote:

> Elliott Hughes <enh@google.com> saw strange behavior when server socket was not
> calling accept(). Client was receiving SYN-ACK back even when socket on server
> side was not yet available. Eric noted server sockets kept resending SYN_ACKS
> and further investigation revealed the following problem.
> 
> If server socket is slow to accept() connections, request_socks can represent
> connections for which the three-way handshake is already done.  From client's
> point of view, the connection is in ESTABLISHED state but on server side, socket
> is not in accept_queue or ESTABLISHED state.  When the syn-ack timer expires,
> because of the order in which tests are performed, server can retransmit the
> synack repeatedly. Following patch prevents the server from retransmitting the
> synack needlessly (and prevents client from replying with ack).  This reduces
> traffic when server is slow to accept() connections.
> 
> If the server socket has received the third ack during connection establishment,
> this is remembered in inet_rsk(req)->acked.  The request_sock will expire in
> around 30 seconds and will be dropped if it does not move into accept_queue.
> 
> With help from Eric Dumazet.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
> Changes from V1: Changed Reported-by tag and commit message. Added Acked-by and
> Tested-by tags.
> 
> Ignoring "WARNING: line over 80 characters" in the interest of readability.
> 
>  net/ipv4/inet_connection_sock.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index d34ce29..4e8e52e 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
>  					       &expire, &resend);
>  				req->rsk_ops->syn_ack_timeout(parent, req);
>  				if (!expire &&
> -				    (!resend ||
> -				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
> -				     inet_rsk(req)->acked)) {
> +				    (!resend || inet_rsk(req)->acked ||

	Wait a minute, this can cause problem at least
for the TCP_DEFER_ACCEPT mode. It is supposed to timeout
in SYN_RECV state if after silence period (no retransmissions)
and some final retransmissions (until max_retries) client
still does not send data - the request should be expired
without notifying the listener.

	So, the logic in syn_ack_recalc() was tuned to resend
after the TCP_DEFER_ACCEPT period. This patch stops such
resends after TCP_DEFER_ACCEPT period. May be the change
should be in syn_ack_recalc() without hurting TCP_DEFER_ACCEPT?

	Lets analyze the default case without TCP_DEFER_ACCEPT.

	Think for protocols like SMTP where server sends
welcome message. This patch stops SYN-ACK resends, client
sends one ACK (which server drops) and enters EST state.
Client is waiting for welcome message in EST state while
server is waiting silently for ACK message to create child
socket. No progress, may but timeout error in client.

	Is the patch safe for such case? Is there a logic
that creates child socket from request if the dropped ACK
was the last message from client? It must not do it for
TCP_DEFER_ACCEPT.

> +				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
>  					unsigned long timeo;
>  
>  					if (req->retrans++ == 0)
> -- 
> 1.7.0.4
> 
> --

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-26 21:30 ` Julian Anastasov
@ 2012-10-26 21:42   ` Eric Dumazet
  2012-10-26 22:52     ` Julian Anastasov
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-10-26 21:42 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Vijay Subramanian, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes

On Sat, 2012-10-27 at 00:30 +0300, Julian Anastasov wrote:
> 	Hello,
> 
> On Fri, 26 Oct 2012, Vijay Subramanian wrote:
> 
> > Elliott Hughes <enh@google.com> saw strange behavior when server socket was not
> > calling accept(). Client was receiving SYN-ACK back even when socket on server
> > side was not yet available. Eric noted server sockets kept resending SYN_ACKS
> > and further investigation revealed the following problem.
> > 
> > If server socket is slow to accept() connections, request_socks can represent
> > connections for which the three-way handshake is already done.  From client's
> > point of view, the connection is in ESTABLISHED state but on server side, socket
> > is not in accept_queue or ESTABLISHED state.  When the syn-ack timer expires,
> > because of the order in which tests are performed, server can retransmit the
> > synack repeatedly. Following patch prevents the server from retransmitting the
> > synack needlessly (and prevents client from replying with ack).  This reduces
> > traffic when server is slow to accept() connections.
> > 
> > If the server socket has received the third ack during connection establishment,
> > this is remembered in inet_rsk(req)->acked.  The request_sock will expire in
> > around 30 seconds and will be dropped if it does not move into accept_queue.
> > 
> > With help from Eric Dumazet.
> > 
> > Reported-by: Eric Dumazet <edumazet@google.com>
> > Acked-by: Neal Cardwell <ncardwell@google.com>
> > Tested-by: Neal Cardwell <ncardwell@google.com>
> > Acked-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> > ---
> > Changes from V1: Changed Reported-by tag and commit message. Added Acked-by and
> > Tested-by tags.
> > 
> > Ignoring "WARNING: line over 80 characters" in the interest of readability.
> > 
> >  net/ipv4/inet_connection_sock.c |    5 ++---
> >  1 files changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index d34ce29..4e8e52e 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
> >  					       &expire, &resend);
> >  				req->rsk_ops->syn_ack_timeout(parent, req);
> >  				if (!expire &&
> > -				    (!resend ||
> > -				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
> > -				     inet_rsk(req)->acked)) {
> > +				    (!resend || inet_rsk(req)->acked ||
> 
> 	Wait a minute, this can cause problem at least
> for the TCP_DEFER_ACCEPT mode. It is supposed to timeout
> in SYN_RECV state if after silence period (no retransmissions)
> and some final retransmissions (until max_retries) client
> still does not send data - the request should be expired
> without notifying the listener.
> 
> 	So, the logic in syn_ack_recalc() was tuned to resend
> after the TCP_DEFER_ACCEPT period. This patch stops such
> resends after TCP_DEFER_ACCEPT period. May be the change
> should be in syn_ack_recalc() without hurting TCP_DEFER_ACCEPT?
> 
> 	Lets analyze the default case without TCP_DEFER_ACCEPT.
> 
> 	Think for protocols like SMTP where server sends
> welcome message. This patch stops SYN-ACK resends, client
> sends one ACK (which server drops) and enters EST state.
> Client is waiting for welcome message in EST state while
> server is waiting silently for ACK message to create child
> socket. No progress, may but timeout error in client.
> 
> 	Is the patch safe for such case? Is there a logic
> that creates child socket from request if the dropped ACK
> was the last message from client? It must not do it for
> TCP_DEFER_ACCEPT.


I see no impact with TCP_DEFER_ACCEPT handling.

TCP_DEFER_ACCEPT is quite different from this stuff

The 3WHS is completed, and the socket is ready.

But its not delivered to the accept() (listener) until we receive a DATA
frame (or defer timeout elapsed)

We dont resend SYNACK messages for them. We just wait the 4th packet.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-26 21:42   ` Eric Dumazet
@ 2012-10-26 22:52     ` Julian Anastasov
  2012-10-27  0:07       ` Vijay Subramanian
  0 siblings, 1 reply; 19+ messages in thread
From: Julian Anastasov @ 2012-10-26 22:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vijay Subramanian, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes


	Hello,

On Fri, 26 Oct 2012, Eric Dumazet wrote:

> On Sat, 2012-10-27 at 00:30 +0300, Julian Anastasov wrote:
> > 
> > 	Wait a minute, this can cause problem at least
> > for the TCP_DEFER_ACCEPT mode. It is supposed to timeout
> > in SYN_RECV state if after silence period (no retransmissions)
> > and some final retransmissions (until max_retries) client
> > still does not send data - the request should be expired
> > without notifying the listener.
> > 
> > 	So, the logic in syn_ack_recalc() was tuned to resend
> > after the TCP_DEFER_ACCEPT period. This patch stops such
> > resends after TCP_DEFER_ACCEPT period. May be the change
> > should be in syn_ack_recalc() without hurting TCP_DEFER_ACCEPT?
> > 
> > 	Lets analyze the default case without TCP_DEFER_ACCEPT.
> > 
> > 	Think for protocols like SMTP where server sends
> > welcome message. This patch stops SYN-ACK resends, client
> > sends one ACK (which server drops) and enters EST state.
> > Client is waiting for welcome message in EST state while
> > server is waiting silently for ACK message to create child
> > socket. No progress, may but timeout error in client.
> > 
> > 	Is the patch safe for such case? Is there a logic
> > that creates child socket from request if the dropped ACK
> > was the last message from client? It must not do it for
> > TCP_DEFER_ACCEPT.
> 
> 
> I see no impact with TCP_DEFER_ACCEPT handling.
> 
> TCP_DEFER_ACCEPT is quite different from this stuff
> 
> The 3WHS is completed, and the socket is ready.
> 
> But its not delivered to the accept() (listener) until we receive a DATA
> frame (or defer timeout elapsed)
> 
> We dont resend SYNACK messages for them. We just wait the 4th packet.

	We have 3 general cases:

1. HTTP-kind of protocol: client sends first, server without
TCP_DEFER_ACCEPT

	Server should retransmit but anyways client will
send packet (ACK) that will move request_sock into child socket.

2. HTTP-kind of protocol: client sends first, server with
TCP_DEFER_ACCEPT

	Server retransmits before 3WHS to get ACK from
client. After ACK we keep request_sock because we do not
want to wakeup listener without data. During TCP_DEFER_ACCEPT
period nothing is retransmitted because we have ACK from
client. After TCP_DEFER_ACCEPT period we start retransmissions
because the server needs such request_socks to become
child sockets after the TCP_DEFER_ACCEPT period and because
received ACK is the only way to create child socket.
Server wants to accept() them. If TCP_DEFER_ACCEPT is
above sysctl_tcp_synack_retries there are no such
retransmissions because server does not want to accept()
such request_socks without data. So, servers decide
what they want with the TCP_DEFER_ACCEPT period value.

3. SMTP-kind of protocol: server sends first,
TCP_DEFER_ACCEPT must not be used by server.

	3WHS is completed, there is no 4th packet from
client. It is the server side that needs to move request_sock
to child socket, to accept the connection and to send
welcome message. AFAIK, child socket is created only on
ACK. Or I'm missing something? Now the question is:
how request_sock is moved into child socket if ACK is
dropped on syn_recv_sock failure and we do not send SYN-ACK 
retransmissions to trigger ACK from client? Client does not
plan to send new packet in short time.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-26 22:52     ` Julian Anastasov
@ 2012-10-27  0:07       ` Vijay Subramanian
  2012-10-27  8:43         ` Julian Anastasov
  2012-10-27  8:50         ` Eric Dumazet
  0 siblings, 2 replies; 19+ messages in thread
From: Vijay Subramanian @ 2012-10-27  0:07 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes

Julian,
Thanks for the review.

>
>         We have 3 general cases:
>
> 1. HTTP-kind of protocol: client sends first, server without
> TCP_DEFER_ACCEPT
>
>         Server should retransmit but anyways client will
> send packet (ACK) that will move request_sock into child socket.

In  this case, is this retransmit from server and ack from client
necessary? I believe that when client finally sends fourth packet i.e.
data, server socket (currently request_sock) can move into accept
queue and into ESTABLISHED state. So this patch just removes
the syn-ack /ack transmissions in between.

>
> 2. HTTP-kind of protocol: client sends first, server with
> TCP_DEFER_ACCEPT
>
>         Server retransmits before 3WHS to get ACK from
> client. After ACK we keep request_sock because we do not
> want to wakeup listener without data. During TCP_DEFER_ACCEPT
> period nothing is retransmitted because we have ACK from
> client. After TCP_DEFER_ACCEPT period we start retransmissions
> because the server needs such request_socks to become
> child sockets after the TCP_DEFER_ACCEPT period and because
> received ACK is the only way to create child socket.
> Server wants to accept() them.

As I mentioned, if client is sending data first, then the data packet
from client will also cause the creation of child socket.
If server sends the synack again, if client is not ready with data, it
will just send an ack back. This is a needless synack/ack.
On the other hand if client is ready with data, it will send it and
server socket will create child socket if accept queue has space.

Even with DEFER_ACCEPT sockets, if a third ack comes in without data,
it is remembered in inet_rsk(req)->acked. So,
logic is the same. If inet_rsk(req)->acked==1, it means we have
received third ack and client is in ESTABLISHED state.
So why bother resending needless synacks?

> If TCP_DEFER_ACCEPT is
> above sysctl_tcp_synack_retries there are no such
> retransmissions because server does not want to accept()
> such request_socks without data. So, servers decide
> what they want with the TCP_DEFER_ACCEPT period value.
>
> 3. SMTP-kind of protocol: server sends first,
> TCP_DEFER_ACCEPT must not be used by server.
>         3WHS is completed, there is no 4th packet from
> client. It is the server side that needs to move request_sock
> to child socket, to accept the connection and to send
> welcome message. AFAIK, child socket is created only on
> ACK. Or I'm missing something? Now the question is:

You raise a good point. As far as I can see, it looks like the
request_sock will not become a full socket
if server is the one that has to send the data first. The problem it
seems is that we always have to wait
for the client to send something to create the full socket. Since
server side already knows that TWHS has finished,
maybe we can find a way to move a request_sock to accept_queue as
space opens up without sending synack.

> how request_sock is moved into child socket if ACK is
> dropped on syn_recv_sock failure and we do not send SYN-ACK
> retransmissions to trigger ACK from client? Client does not
> plan to send new packet in short time.

Agreed. Unless I am missing something too, for cases where server
socket sends data first, the patch will break things though for
defer-sockets
it seems ok. Maybe we need another approach?

Thanks for the feedback. I will wait for suggestions as to what to do
with the patch and have another look at the code.

Regards,
Vijay

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-27  0:07       ` Vijay Subramanian
@ 2012-10-27  8:43         ` Julian Anastasov
  2012-10-27  8:50         ` Eric Dumazet
  1 sibling, 0 replies; 19+ messages in thread
From: Julian Anastasov @ 2012-10-27  8:43 UTC (permalink / raw)
  To: Vijay Subramanian
  Cc: Eric Dumazet, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes


	Hello,

On Fri, 26 Oct 2012, Vijay Subramanian wrote:

> Julian,
> Thanks for the review.
> 
> >
> >         We have 3 general cases:
> >
> > 1. HTTP-kind of protocol: client sends first, server without
> > TCP_DEFER_ACCEPT
> >
> >         Server should retransmit but anyways client will
> > send packet (ACK) that will move request_sock into child socket.
> 
> In  this case, is this retransmit from server and ack from client
> necessary? I believe that when client finally sends fourth packet i.e.
> data, server socket (currently request_sock) can move into accept
> queue and into ESTABLISHED state. So this patch just removes
> the syn-ack /ack transmissions in between.

	I agree that we have the right to stop SYN-ACK
retransmits but only if there is a mechanism that converts
request_sock into child socket and wakeups listener when
there is space for new child. For case 1 this patch works
because it is client's job to talk first. But patch can not
differentiate case 1 from case 3 where child is not created.

> > 2. HTTP-kind of protocol: client sends first, server with
> > TCP_DEFER_ACCEPT
> >
> >         Server retransmits before 3WHS to get ACK from
> > client. After ACK we keep request_sock because we do not
> > want to wakeup listener without data. During TCP_DEFER_ACCEPT
> > period nothing is retransmitted because we have ACK from
> > client. After TCP_DEFER_ACCEPT period we start retransmissions
> > because the server needs such request_socks to become
> > child sockets after the TCP_DEFER_ACCEPT period and because
> > received ACK is the only way to create child socket.
> > Server wants to accept() them.
> 
> As I mentioned, if client is sending data first, then the data packet
> from client will also cause the creation of child socket.
> If server sends the synack again, if client is not ready with data, it
> will just send an ack back. This is a needless synack/ack.

	Indeed, here TCP_DEFER_ACCEPT can also benefit from such 
mechanism to convert requests to childs. In this case, child must
appear after the TCP_DEFER_ACCEPT period (in case no data
is received). So, this mechanism should work depending on
the TCP_DEFER_ACCEPT period. We can stop such retransmits
only when new mechanism is implemented.

> On the other hand if client is ready with data, it will send it and
> server socket will create child socket if accept queue has space.
> 
> Even with DEFER_ACCEPT sockets, if a third ack comes in without data,
> it is remembered in inet_rsk(req)->acked. So,
> logic is the same. If inet_rsk(req)->acked==1, it means we have
> received third ack and client is in ESTABLISHED state.
> So why bother resending needless synacks?

	Only to trigger ACK that will create child. We
just want the child after the TCP_DEFER_ACCEPT period has
expired. Retransmits are the current mechanism to
create child, they are not goal. The goal is to create child.

> > If TCP_DEFER_ACCEPT is
> > above sysctl_tcp_synack_retries there are no such
> > retransmissions because server does not want to accept()
> > such request_socks without data. So, servers decide
> > what they want with the TCP_DEFER_ACCEPT period value.
> >
> > 3. SMTP-kind of protocol: server sends first,
> > TCP_DEFER_ACCEPT must not be used by server.
> >         3WHS is completed, there is no 4th packet from
> > client. It is the server side that needs to move request_sock
> > to child socket, to accept the connection and to send
> > welcome message. AFAIK, child socket is created only on
> > ACK. Or I'm missing something? Now the question is:
> 
> You raise a good point. As far as I can see, it looks like the
> request_sock will not become a full socket
> if server is the one that has to send the data first. The problem it
> seems is that we always have to wait
> for the client to send something to create the full socket. Since
> server side already knows that TWHS has finished,
> maybe we can find a way to move a request_sock to accept_queue as
> space opens up without sending synack.

	I'm not sure if that is possible, may be ACK comes
with more data that is not saved in request_sock, even
payload. It is lost. From time to time we get new fields
into struct request_sock but I don't know if we have
everything that is needed to create child, I doubt it.

> > how request_sock is moved into child socket if ACK is
> > dropped on syn_recv_sock failure and we do not send SYN-ACK
> > retransmissions to trigger ACK from client? Client does not
> > plan to send new packet in short time.
> 
> Agreed. Unless I am missing something too, for cases where server
> socket sends data first, the patch will break things though for
> defer-sockets
> it seems ok. Maybe we need another approach?

	The effect of current patch is:

- case 1: saves retransmits, that is good

- case 2: server can not accept child after TCP_DEFER_ACCEPT
period, clients can hate your service. 3WHS is complete
and it is expected longer timers to apply, say 15mins,
while server uses SYN_RECV state timeouts which are
shorter. Client can get RST if data is delayed above
that timers. Without the patch, we will trigger new ACK
from client and new child can enter established state.
TCP_DEFER_ACCEPT period is just an optimization that
avoids wakeups without data for a period but server
still has the following options:

	- send SYN-ACK, on ACK create child and send FIN
	- send SYN-ACK, on ACK create child and wait 15mins for data
	- expire them as request_socks if they do not send data,
	later send RST on delayed data from client

	With the patch we do not have a way to accept childs.
	You restrict server using TCP_DEFER_ACCEPT to
	shorter timeout to get client data because we
	are left only with the option to expire request_socks,
	server can prefer small SYN-ACK retry count but
	long timeout for client data.

- case 3: no welcome, client aborts

> Thanks for the feedback. I will wait for suggestions as to what to do
> with the patch and have another look at the code.

	I'll think on this problem but my knowledge in this
area is limited. May be the TCP gurus know better what can
be done. My guess is that child creation on ACK is mandatory,
especially when ACK comes with payload.

> Regards,
> Vijay

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-27  0:07       ` Vijay Subramanian
  2012-10-27  8:43         ` Julian Anastasov
@ 2012-10-27  8:50         ` Eric Dumazet
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-10-27  8:50 UTC (permalink / raw)
  To: Vijay Subramanian
  Cc: Julian Anastasov, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes

On Fri, 2012-10-26 at 17:07 -0700, Vijay Subramanian wrote:

> 
> Thanks for the feedback. I will wait for suggestions as to what to do
> with the patch and have another look at the code.

It seems to me current code for defer accept should be fixed as well.

'Allowing' to send the last SYNACK hoping this wont be lost somewhere
seems really strange.

syn_ack_recalc(...)
{

...
        /*
         * Do not resend while waiting for data after ACK,
         * start to resend on end of deferring period to give
         * last chance for data or ACK to create established socket.
         */
        *resend = !inet_rsk(req)->acked ||
                  req->num_timeout >= rskq_defer_accept - 1;
}

DEFER_ACCEPT should really mean :

We have the socket, only defer putting it in the accept queue until :

- Client sent data.
Or
- A given amount of time has elapsed

Defer accept is typically used by protocols where client sends the first
chunk of data. 

If after XX seconds chunk of data is still not there, sending a SYNACK
wont really help : just pass the fd to application and application
probably will wait another XX seconds for the client request, then
timeout and close the socket.

So instead of sending a final SYNACK just to be able to receive an ACK
and finally put the request into listener accept queue is a waste of
bandwidth. We should put the request into listener accept queue and save
one/two messages.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-26  8:05 [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS Vijay Subramanian
  2012-10-26  8:03 ` Eric Dumazet
  2012-10-26 21:30 ` Julian Anastasov
@ 2012-10-27 11:57 ` Eric Dumazet
  2012-10-27 13:23   ` Julian Anastasov
  2 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-10-27 11:57 UTC (permalink / raw)
  To: Vijay Subramanian
  Cc: netdev, davem, edumazet, ncardwell, Venkat Venkatsubra,
	Elliott Hughes, Yuchung Cheng

On Fri, 2012-10-26 at 01:05 -0700, Vijay Subramanian wrote:
> Elliott Hughes <enh@google.com> saw strange behavior when server socket was not
> calling accept(). Client was receiving SYN-ACK back even when socket on server
> side was not yet available. Eric noted server sockets kept resending SYN_ACKS
> and further investigation revealed the following problem.
> 
> If server socket is slow to accept() connections, request_socks can represent
> connections for which the three-way handshake is already done.  From client's
> point of view, the connection is in ESTABLISHED state but on server side, socket
> is not in accept_queue or ESTABLISHED state.  When the syn-ack timer expires,
> because of the order in which tests are performed, server can retransmit the
> synack repeatedly. Following patch prevents the server from retransmitting the
> synack needlessly (and prevents client from replying with ack).  This reduces
> traffic when server is slow to accept() connections.
> 
> If the server socket has received the third ack during connection establishment,
> this is remembered in inet_rsk(req)->acked.  The request_sock will expire in
> around 30 seconds and will be dropped if it does not move into accept_queue.
> 
> With help from Eric Dumazet.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
> Changes from V1: Changed Reported-by tag and commit message. Added Acked-by and
> Tested-by tags.
> 
> Ignoring "WARNING: line over 80 characters" in the interest of readability.
> 
>  net/ipv4/inet_connection_sock.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index d34ce29..4e8e52e 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
>  					       &expire, &resend);
>  				req->rsk_ops->syn_ack_timeout(parent, req);
>  				if (!expire &&
> -				    (!resend ||
> -				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
> -				     inet_rsk(req)->acked)) {
> +				    (!resend || inet_rsk(req)->acked ||
> +				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
>  					unsigned long timeo;
>  
>  					if (req->retrans++ == 0)


Part of the complexity of this is that req->retrans is the number of
timeouts, serving as the exponential backoff base.

Unfortunately we have a side effect because number of retransmits is
wrong for defer accept.

Here is what I suggest : upstream to net-next this patch we use at
Google :

Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Oct 2 02:21:12 2012 -0700

net-tcp: better retrans tracking for defer-accept

For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

Decouple req->retrans field into two fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

(retrans was renamed to make sure we didnt miss an occurrence)

introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a SYN. Prior to this patch,
we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfuly queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
    
Then, we could more easily address this silly SYNACK syndrom.

What do you think ?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-27 11:57 ` Eric Dumazet
@ 2012-10-27 13:23   ` Julian Anastasov
  2012-10-27 13:32     ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Julian Anastasov @ 2012-10-27 13:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vijay Subramanian, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng


	Hello,

On Sat, 27 Oct 2012, Eric Dumazet wrote:

> > @@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
> >  					       &expire, &resend);
> >  				req->rsk_ops->syn_ack_timeout(parent, req);
> >  				if (!expire &&
> > -				    (!resend ||
> > -				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
> > -				     inet_rsk(req)->acked)) {
> > +				    (!resend || inet_rsk(req)->acked ||
> > +				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
> >  					unsigned long timeo;
> >  
> >  					if (req->retrans++ == 0)
> 
> 
> Part of the complexity of this is that req->retrans is the number of
> timeouts, serving as the exponential backoff base.
> 
> Unfortunately we have a side effect because number of retransmits is
> wrong for defer accept.
> 
> Here is what I suggest : upstream to net-next this patch we use at
> Google :
> 
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue Oct 2 02:21:12 2012 -0700
> 
> net-tcp: better retrans tracking for defer-accept
> 
> For passive TCP connections using TCP_DEFER_ACCEPT facility,
> we incorrectly increment req->retrans each time timeout triggers
> while no SYNACK is sent.
> 
> Decouple req->retrans field into two fields :
> 
> num_retrans : number of retransmit
> num_timeout : number of timeouts
> 
> (retrans was renamed to make sure we didnt miss an occurrence)
> 
> introduce inet_rtx_syn_ack() helper to increment num_retrans
> only if ->rtx_syn_ack() succeeded.

	This is dangerous, the first of the cases is route
failure, what if we just added reject route for some attacker?
We will get error forever. May be it is difficult to decide
which error should change the counter. IMHO, such reliability
is not needed, we can be short of memory too.

> Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
> when we re-send a SYNACK in answer to a SYN. Prior to this patch,
> we were not counting these retransmits.

	Such change looks correct. Of course, it has
side effect on current TCP_DEFER_ACCEPT calculations but
it is a TCP_DEFER_ACCEPT implementation problem.

> Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
> only if a synack packet was successfuly queued.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS
  2012-10-27 13:23   ` Julian Anastasov
@ 2012-10-27 13:32     ` Eric Dumazet
  2012-10-27 14:18       ` [PATCH net-next] tcp: better retrans tracking for defer-accept Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-10-27 13:32 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Vijay Subramanian, netdev, davem, edumazet, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng

On Sat, 2012-10-27 at 16:23 +0300, Julian Anastasov wrote:
> 	Hello,
> 
> On Sat, 27 Oct 2012, Eric Dumazet wrote:
> 

> > 
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Tue Oct 2 02:21:12 2012 -0700
> > 
> > net-tcp: better retrans tracking for defer-accept
> > 
> > For passive TCP connections using TCP_DEFER_ACCEPT facility,
> > we incorrectly increment req->retrans each time timeout triggers
> > while no SYNACK is sent.
> > 
> > Decouple req->retrans field into two fields :
> > 
> > num_retrans : number of retransmit
> > num_timeout : number of timeouts
> > 
> > (retrans was renamed to make sure we didnt miss an occurrence)
> > 
> > introduce inet_rtx_syn_ack() helper to increment num_retrans
> > only if ->rtx_syn_ack() succeeded.
> 
> 	This is dangerous, the first of the cases is route
> failure, what if we just added reject route for some attacker?
> We will get error forever. May be it is difficult to decide
> which error should change the counter. IMHO, such reliability
> is not needed, we can be short of memory too.
> 


We increase num_timeout regardless of success or failure sending a
SYNACK (can be a route failure, a memory allocation failure, a full
qdisc...)

So its not 'forever'. The decision to abort a SYN_RECV is based on
num_timeouts only, not anymore on 'number of restransmits'

num_retrans is only counting number of SYNACKS that were sent.

num_retrans <= num_timeouts

(Usually its the same, unless you have errors, or DEFER_ACCEPT
mini-sockets)


> > Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
> > when we re-send a SYNACK in answer to a SYN. Prior to this patch,
> > we were not counting these retransmits.
> 
> 	Such change looks correct. Of course, it has
> side effect on current TCP_DEFER_ACCEPT calculations but
> it is a TCP_DEFER_ACCEPT implementation problem.

Better wait to see the patch, it changes nothing yet for
TCP_DEFER_ACCEPT

It only changes accounting problems, for more precise tracking of tcp
stack behavior.

TCP_DEFER_ACCEPT sockets have this strange accounting bug saying that
some packets were retransmitted, while its not true : We only were
waiting the user request.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-27 13:32     ` Eric Dumazet
@ 2012-10-27 14:18       ` Eric Dumazet
  2012-10-27 18:27         ` Neal Cardwell
                           ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-10-27 14:18 UTC (permalink / raw)
  To: Julian Anastasov, David Miller
  Cc: Vijay Subramanian, netdev, ncardwell, Venkat Venkatsubra,
	Elliott Hughes, Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for wich
we received the ACK from client). Only the last SYNACK is
sent so that we can receive again an ACK from client, to move the
req into accept queue. We plan to change this later to avoid
the useless retransmit (and potential problem as this SYNACK could be
lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 include/net/request_sock.h      |   12 ++++++++----
 net/dccp/minisocks.c            |    3 +--
 net/ipv4/inet_connection_sock.c |   25 ++++++++++++++++++-------
 net/ipv4/inet_diag.c            |    2 +-
 net/ipv4/syncookies.c           |    2 +-
 net/ipv4/tcp_input.c            |    2 +-
 net/ipv4/tcp_ipv4.c             |   16 ++++++++++------
 net/ipv4/tcp_minisocks.c        |    8 ++++----
 net/ipv4/tcp_timer.c            |    8 ++++----
 net/ipv6/syncookies.c           |    2 +-
 net/ipv6/tcp_ipv6.c             |   11 +++++++----
 11 files changed, 56 insertions(+), 35 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 5a7b6c2..3ed0897 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -49,13 +49,16 @@ struct request_sock_ops {
 					   struct request_sock *req);
 };
 
+extern int inet_rtx_syn_ack(struct sock *parent, struct request_sock *req);
+
 /* struct request_sock - mini sock to represent a connection request
  */
 struct request_sock {
 	struct request_sock		*dl_next; /* Must be first member! */
 	u16				mss;
-	u8				retrans;
-	u8				cookie_ts; /* syncookie: encode tcpopts in timestamp */
+	u8				num_retrans; /* number of retransmits */
+	u8				cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
+	u8				num_timeout:7; /* number of timeouts */
 	/* The following two fields can be easily recomputed I think -AK */
 	u32				window_clamp; /* window clamp at creation time */
 	u32				rcv_wnd;	  /* rcv_wnd offered first time */
@@ -233,7 +236,7 @@ static inline int reqsk_queue_removed(struct request_sock_queue *queue,
 {
 	struct listen_sock *lopt = queue->listen_opt;
 
-	if (req->retrans == 0)
+	if (req->num_timeout == 0)
 		atomic_dec(&lopt->qlen_young);
 
 	return atomic_dec_return(&lopt->qlen);
@@ -271,7 +274,8 @@ static inline void reqsk_queue_hash_req(struct request_sock_queue *queue,
 	struct listen_sock *lopt = queue->listen_opt;
 
 	req->expires = jiffies + timeout;
-	req->retrans = 0;
+	req->num_retrans = 0;
+	req->num_timeout = 0;
 	req->sk = NULL;
 	req->dl_next = lopt->syn_table[hash];
 
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index ea850ce..662071b 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -174,8 +174,7 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * To protect against Request floods, increment retrans
 			 * counter (backoff, monitored by dccp_response_timer).
 			 */
-			req->retrans++;
-			req->rsk_ops->rtx_syn_ack(sk, req, NULL);
+			inet_rtx_syn_ack(sk, req);
 		}
 		/* Network Duplicate, discard packet */
 		return NULL;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index bee9bd4..4727538 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -521,21 +521,31 @@ static inline void syn_ack_recalc(struct request_sock *req, const int thresh,
 				  int *expire, int *resend)
 {
 	if (!rskq_defer_accept) {
-		*expire = req->retrans >= thresh;
+		*expire = req->num_timeout >= thresh;
 		*resend = 1;
 		return;
 	}
-	*expire = req->retrans >= thresh &&
-		  (!inet_rsk(req)->acked || req->retrans >= max_retries);
+	*expire = req->num_timeout >= thresh &&
+		  (!inet_rsk(req)->acked || req->num_timeout >= max_retries);
 	/*
 	 * Do not resend while waiting for data after ACK,
 	 * start to resend on end of deferring period to give
 	 * last chance for data or ACK to create established socket.
 	 */
 	*resend = !inet_rsk(req)->acked ||
-		  req->retrans >= rskq_defer_accept - 1;
+		  req->num_timeout >= rskq_defer_accept - 1;
 }
 
+int inet_rtx_syn_ack(struct sock *parent, struct request_sock *req)
+{
+	int err = req->rsk_ops->rtx_syn_ack(parent, req, NULL);
+
+	if (!err)
+		req->num_retrans++;
+	return err;
+}
+EXPORT_SYMBOL(inet_rtx_syn_ack);
+
 void inet_csk_reqsk_queue_prune(struct sock *parent,
 				const unsigned long interval,
 				const unsigned long timeout,
@@ -599,13 +609,14 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 				req->rsk_ops->syn_ack_timeout(parent, req);
 				if (!expire &&
 				    (!resend ||
-				     !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
+				     !inet_rtx_syn_ack(parent, req) ||
 				     inet_rsk(req)->acked)) {
 					unsigned long timeo;
 
-					if (req->retrans++ == 0)
+					if (req->num_timeout++ == 0)
 						atomic_dec(&lopt->qlen_young);
-					timeo = min((timeout << req->retrans), max_rto);
+					timeo = min(timeout << req->num_timeout,
+						    max_rto);
 					req->expires = now + timeo;
 					reqp = &req->dl_next;
 					continue;
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index b29caa7..e511a35 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -620,7 +620,7 @@ static int inet_diag_fill_req(struct sk_buff *skb, struct sock *sk,
 	r->idiag_family = sk->sk_family;
 	r->idiag_state = TCP_SYN_RECV;
 	r->idiag_timer = 1;
-	r->idiag_retrans = req->retrans;
+	r->idiag_retrans = req->num_retrans;
 
 	r->id.idiag_if = sk->sk_bound_dev_if;
 	sock_diag_save_cookie(req, r->id.idiag_cookie);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index ba48e79..b236ef0 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -340,7 +340,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	}
 
 	req->expires	= 0UL;
-	req->retrans	= 0;
+	req->num_retrans = 0;
 
 	/*
 	 * We need to lookup the route here to get at the correct
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 60cf836..e95b4e5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5991,7 +5991,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 				 */
 				if (req) {
 					tcp_synack_rtt_meas(sk, req);
-					tp->total_retrans = req->retrans;
+					tp->total_retrans = req->num_retrans;
 
 					reqsk_fastopen_remove(sk, req, false);
 				} else {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 694ea4c..20849cc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -877,10 +877,13 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 }
 
 static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req,
-			      struct request_values *rvp)
+			     struct request_values *rvp)
 {
-	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-	return tcp_v4_send_synack(sk, NULL, req, rvp, 0, false);
+	int res = tcp_v4_send_synack(sk, NULL, req, rvp, 0, false);
+
+	if (!res)
+		TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+	return res;
 }
 
 /*
@@ -1386,7 +1389,8 @@ static int tcp_v4_conn_req_fastopen(struct sock *sk,
 	struct sock *child;
 	int err;
 
-	req->retrans = 0;
+	req->num_retrans = 0;
+	req->num_timeout = 0;
 	req->sk = NULL;
 
 	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
@@ -1740,7 +1744,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	tcp_initialize_rcv_mss(newsk);
 	tcp_synack_rtt_meas(newsk, req);
-	newtp->total_retrans = req->retrans;
+	newtp->total_retrans = req->num_retrans;
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Copy over the MD5 key from the original socket */
@@ -2638,7 +2642,7 @@ static void get_openreq4(const struct sock *sk, const struct request_sock *req,
 		0, 0, /* could print option size, but that is af dependent. */
 		1,    /* timers active (only the expire timer) */
 		jiffies_delta_to_clock_t(delta),
-		req->retrans,
+		req->num_retrans,
 		from_kuid_munged(seq_user_ns(f), uid),
 		0,  /* non standard timer */
 		0, /* open_requests have no inode */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 27536ba..0404b3f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -552,7 +552,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * it can be estimated (approximately)
 			 * from another data.
 			 */
-			tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
+			tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->num_timeout);
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
 	}
@@ -581,7 +581,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		 * Note that even if there is new data in the SYN packet
 		 * they will be thrown away too.
 		 */
-		req->rsk_ops->rtx_syn_ack(sk, req, NULL);
+		inet_rtx_syn_ack(sk, req);
 		return NULL;
 	}
 
@@ -695,7 +695,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	/* Got ACK for our SYNACK, so update baseline for SYNACK RTT sample. */
 	if (tmp_opt.saw_tstamp && tmp_opt.rcv_tsecr)
 		tcp_rsk(req)->snt_synack = tmp_opt.rcv_tsecr;
-	else if (req->retrans) /* don't take RTT sample if retrans && ~TS */
+	else if (req->num_retrans) /* don't take RTT sample if retrans && ~TS */
 		tcp_rsk(req)->snt_synack = 0;
 
 	/* For Fast Open no more processing is needed (sk is the
@@ -705,7 +705,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		return sk;
 
 	/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
-	if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+	if (req->num_timeout < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
 	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
 		inet_rsk(req)->acked = 1;
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPDEFERACCEPTDROP);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index fc04711..62c69ab 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -318,7 +318,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk)
 	req = tcp_sk(sk)->fastopen_rsk;
 	req->rsk_ops->syn_ack_timeout(sk, req);
 
-	if (req->retrans >= max_retries) {
+	if (req->num_timeout >= max_retries) {
 		tcp_write_err(sk);
 		return;
 	}
@@ -327,10 +327,10 @@ static void tcp_fastopen_synack_timer(struct sock *sk)
 	 * regular retransmit because if the child socket has been accepted
 	 * it's not good to give up too easily.
 	 */
-	req->rsk_ops->rtx_syn_ack(sk, req, NULL);
-	req->retrans++;
+	inet_rtx_syn_ack(sk, req);
+	req->num_timeout++;
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-			  TCP_TIMEOUT_INIT << req->retrans, TCP_RTO_MAX);
+			  TCP_TIMEOUT_INIT << req->num_timeout, TCP_RTO_MAX);
 }
 
 /*
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 182ab9a..4016197 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -214,7 +214,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		ireq6->iif = inet6_iif(skb);
 
 	req->expires = 0UL;
-	req->retrans = 0;
+	req->num_retrans = 0;
 	ireq->ecn_ok		= ecn_ok;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bb6782e..c73d0eb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -495,9 +495,12 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req,
 			     struct request_values *rvp)
 {
 	struct flowi6 fl6;
+	int res;
 
-	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-	return tcp_v6_send_synack(sk, NULL, &fl6, req, rvp, 0);
+	res = tcp_v6_send_synack(sk, NULL, &fl6, req, rvp, 0);
+	if (!res)
+		TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+	return res;
 }
 
 static void tcp_v6_reqsk_destructor(struct request_sock *req)
@@ -1364,7 +1367,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	tcp_initialize_rcv_mss(newsk);
 	tcp_synack_rtt_meas(newsk, req);
-	newtp->total_retrans = req->retrans;
+	newtp->total_retrans = req->num_retrans;
 
 	newinet->inet_daddr = newinet->inet_saddr = LOOPBACK4_IPV6;
 	newinet->inet_rcv_saddr = LOOPBACK4_IPV6;
@@ -1866,7 +1869,7 @@ static void get_openreq6(struct seq_file *seq,
 		   0,0, /* could print option size, but that is af dependent. */
 		   1,   /* timers active (only the expire timer) */
 		   jiffies_to_clock_t(ttd),
-		   req->retrans,
+		   req->num_timeout,
 		   from_kuid_munged(seq_user_ns(seq), uid),
 		   0,  /* non standard timer */
 		   0, /* open_requests have no inode */

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-27 14:18       ` [PATCH net-next] tcp: better retrans tracking for defer-accept Eric Dumazet
@ 2012-10-27 18:27         ` Neal Cardwell
  2012-10-27 22:29         ` Julian Anastasov
  2012-11-03 18:46         ` David Miller
  2 siblings, 0 replies; 19+ messages in thread
From: Neal Cardwell @ 2012-10-27 18:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Julian Anastasov, David Miller, Vijay Subramanian, Netdev,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng

On Sat, Oct 27, 2012 at 10:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>

> @@ -2638,7 +2642,7 @@ static void get_openreq4(const struct sock *sk, const struct request_sock *req,
...
> -               req->retrans,
> +               req->num_retrans,
...
> @@ -1866,7 +1869,7 @@ static void get_openreq6(struct seq_file *seq,
...
> -                  req->retrans,
> +                  req->num_timeout,
...

There's a slight inconsistency in the changes to get_openreq4() and
get_openreq6() - the first now uses num_retrans and the second now
uses num_timeout.

Otherwise LGTM.

neal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-27 14:18       ` [PATCH net-next] tcp: better retrans tracking for defer-accept Eric Dumazet
  2012-10-27 18:27         ` Neal Cardwell
@ 2012-10-27 22:29         ` Julian Anastasov
  2012-10-28  9:15           ` Eric Dumazet
  2012-11-03 18:46         ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Julian Anastasov @ 2012-10-27 22:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Vijay Subramanian, netdev, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng


	Hello,

On Sat, 27 Oct 2012, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> For passive TCP connections using TCP_DEFER_ACCEPT facility,
> we incorrectly increment req->retrans each time timeout triggers
> while no SYNACK is sent.
> 
> SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for wich
> we received the ACK from client). Only the last SYNACK is
> sent so that we can receive again an ACK from client, to move the
> req into accept queue. We plan to change this later to avoid
> the useless retransmit (and potential problem as this SYNACK could be
> lost)
> 
> TCP_INFO later gives wrong information to user, claiming imaginary
> retransmits.
> 
> Decouple req->retrans field into two independent fields :
> 
> num_retrans : number of retransmit
> num_timeout : number of timeouts
> 
> num_timeout is the counter that is incremented at each timeout,
> regardless of actual SYNACK being sent or not, and used to
> compute the exponential timeout.
> 
> Introduce inet_rtx_syn_ack() helper to increment num_retrans
> only if ->rtx_syn_ack() succeeded.
> 
> Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
> when we re-send a SYNACK in answer to a (retransmitted) SYN.
> Prior to this patch, we were not counting these retransmits.
> 
> Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
> only if a synack packet was successfully queued.
> 
> Reported-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
> Cc: Elliott Hughes <enh@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---

> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index ba48e79..b236ef0 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -340,7 +340,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
>  	}
>  
>  	req->expires	= 0UL;
> -	req->retrans	= 0;
> +	req->num_retrans = 0;

	also req->num_timeout = 0; ?

>  
>  	/*
>  	 * We need to lookup the route here to get at the correct

> diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> index 182ab9a..4016197 100644
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -214,7 +214,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>  		ireq6->iif = inet6_iif(skb);
>  
>  	req->expires = 0UL;
> -	req->retrans = 0;
> +	req->num_retrans = 0;

	also req->num_timeout = 0; ?

>  	ireq->ecn_ok		= ecn_ok;
>  	ireq->snd_wscale	= tcp_opt.snd_wscale;
>  	ireq->sack_ok		= tcp_opt.sack_ok;

> @@ -1866,7 +1869,7 @@ static void get_openreq6(struct seq_file *seq,
>  		   0,0, /* could print option size, but that is af dependent. */
>  		   1,   /* timers active (only the expire timer) */
>  		   jiffies_to_clock_t(ttd),
> -		   req->retrans,
> +		   req->num_timeout,

	num_timeout already noted by Neal

>  		   from_kuid_munged(seq_user_ns(seq), uid),
>  		   0,  /* non standard timer */
>  		   0, /* open_requests have no inode */

	I don't see any problems with such patch, just
check if it is for correct tree, I see some atomic operations
with qlen_young in patch.

	One thing to finally decide: should we use limit for
retransmissions or for timeout, is the following better?:

	if (!rskq_defer_accept) {
		*expire = req->num_retrans >= thresh;
			       ^^^^^^^^^^^
		*resend = 1;
		return;
	}
	*expire = req->num_retrans >= thresh &&
		       ^^^^^^^^^^^
		  (!inet_rsk(req)->acked || req->num_timeout >= max_retries);

	Same in tcp_fastopen_synack_timer:

	if (req->num_retrans >= max_retries) {
		 ^^^^^^^^^^^

	By this way limit (thresh) is for retransmissions while
defer period should be time based as before.

	Documentation/networking/ip-sysctl.txt compares
the tcp_synack_retries with timeout, so I'm not sure
which variant is better to use. OTOH, include/net/tcp.h
does not specify minimum in seconds for TCP_SYNACK_RETRIES,
so what should we prefer, the count or its time form?
May be count (num_retrans) is better against SYN attacks.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-27 22:29         ` Julian Anastasov
@ 2012-10-28  9:15           ` Eric Dumazet
  2012-10-28 16:51             ` Julian Anastasov
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-10-28  9:15 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: David Miller, Vijay Subramanian, netdev, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng

On Sun, 2012-10-28 at 01:29 +0300, Julian Anastasov wrote:
> 	Hello,
> 
> On Sat, 27 Oct 2012, Eric Dumazet wrote:
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > For passive TCP connections using TCP_DEFER_ACCEPT facility,
> > we incorrectly increment req->retrans each time timeout triggers
> > while no SYNACK is sent.
> > 
> > SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for wich
> > we received the ACK from client). Only the last SYNACK is
> > sent so that we can receive again an ACK from client, to move the
> > req into accept queue. We plan to change this later to avoid
> > the useless retransmit (and potential problem as this SYNACK could be
> > lost)
> > 
> > TCP_INFO later gives wrong information to user, claiming imaginary
> > retransmits.
> > 
> > Decouple req->retrans field into two independent fields :
> > 
> > num_retrans : number of retransmit
> > num_timeout : number of timeouts
> > 
> > num_timeout is the counter that is incremented at each timeout,
> > regardless of actual SYNACK being sent or not, and used to
> > compute the exponential timeout.
> > 
> > Introduce inet_rtx_syn_ack() helper to increment num_retrans
> > only if ->rtx_syn_ack() succeeded.
> > 
> > Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
> > when we re-send a SYNACK in answer to a (retransmitted) SYN.
> > Prior to this patch, we were not counting these retransmits.
> > 
> > Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
> > only if a synack packet was successfully queued.
> > 
> > Reported-by: Yuchung Cheng <ycheng@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Julian Anastasov <ja@ssi.bg>
> > Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
> > Cc: Elliott Hughes <enh@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > ---
> 
> > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> > index ba48e79..b236ef0 100644
> > --- a/net/ipv4/syncookies.c
> > +++ b/net/ipv4/syncookies.c
> > @@ -340,7 +340,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
> >  	}
> >  
> >  	req->expires	= 0UL;
> > -	req->retrans	= 0;
> > +	req->num_retrans = 0;
> 
> 	also req->num_timeout = 0; ?

This is not needed here.

We construct a temporary req only to be able to use get_cookie_sock()
and build a full socket with :
 icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst);

By definition this req wont ever be used in SYN_RECV state, and never be
part of the timer wheel.

We init req->num_retrans to 0 because this field will be the source of

newtp->total_retrans = req->num_retrans;

in tcp_v{4|6}_syn_recv_sock()

> 
> >  
> >  	/*
> >  	 * We need to lookup the route here to get at the correct
> 
> > diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> > index 182ab9a..4016197 100644
> > --- a/net/ipv6/syncookies.c
> > +++ b/net/ipv6/syncookies.c
> > @@ -214,7 +214,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
> >  		ireq6->iif = inet6_iif(skb);
> >  
> >  	req->expires = 0UL;
> > -	req->retrans = 0;
> > +	req->num_retrans = 0;
> 
> 	also req->num_timeout = 0; ?
> 


Same thing : this is not needed

> >  	ireq->ecn_ok		= ecn_ok;
> >  	ireq->snd_wscale	= tcp_opt.snd_wscale;
> >  	ireq->sack_ok		= tcp_opt.sack_ok;
> 
> > @@ -1866,7 +1869,7 @@ static void get_openreq6(struct seq_file *seq,
> >  		   0,0, /* could print option size, but that is af dependent. */
> >  		   1,   /* timers active (only the expire timer) */
> >  		   jiffies_to_clock_t(ttd),
> > -		   req->retrans,
> > +		   req->num_timeout,
> 
> 	num_timeout already noted by Neal
> 

Yes, thanks. source patch was fine, I messed the rebase.

> >  		   from_kuid_munged(seq_user_ns(seq), uid),
> >  		   0,  /* non standard timer */
> >  		   0, /* open_requests have no inode */
> 
> 	I don't see any problems with such patch, just
> check if it is for correct tree, I see some atomic operations
> with qlen_young in patch.
> 

Arg, my net-next tree got some stuff I forgot to revert, sorry, will
cleanup.

> 	One thing to finally decide: should we use limit for
> retransmissions or for timeout, is the following better?:
> 
> 	if (!rskq_defer_accept) {
> 		*expire = req->num_retrans >= thresh;
> 			       ^^^^^^^^^^^
> 		*resend = 1;
> 		return;
> 	}

Not sure it matters and if this decision is part of this patch.

If a retransmit fails, it seems we zap the request anyway ?

inet_rtx_syn_ack() returns an error and inet_rsk(req)->acked is false ->
we remove the req from queue.

We dont remove the req only if we got a listen queue overflow in
tcp_check_req() : we set acked to 1 in this case.

listen_overflow:
	if (!sysctl_tcp_abort_on_overflow) {
		inet_rsk(req)->acked = 1;
		return NULL;
	}

Using number of timeouts seems better to me. There is no point holding a
req forever if we fail to retransmit SYNACKS.
Client probably gave up.

Thanks a lot for the review !

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-28  9:15           ` Eric Dumazet
@ 2012-10-28 16:51             ` Julian Anastasov
  2012-10-28 20:02               ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Julian Anastasov @ 2012-10-28 16:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Vijay Subramanian, netdev, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng


	Hello,

On Sun, 28 Oct 2012, Eric Dumazet wrote:

> On Sun, 2012-10-28 at 01:29 +0300, Julian Anastasov wrote:
> > 	Hello,
> > 
> > On Sat, 27 Oct 2012, Eric Dumazet wrote:
> > 
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > For passive TCP connections using TCP_DEFER_ACCEPT facility,
> > > we incorrectly increment req->retrans each time timeout triggers
> > > while no SYNACK is sent.
> > > 
> > > SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for wich
> > > we received the ACK from client). Only the last SYNACK is
> > > sent so that we can receive again an ACK from client, to move the
> > > req into accept queue. We plan to change this later to avoid
> > > the useless retransmit (and potential problem as this SYNACK could be
> > > lost)

	I want to note that we do not send only one SYN-ACK
here, we can send many SYN-ACKs after the deferring period if
tcp_synack_retries allows it.

> > 	One thing to finally decide: should we use limit for
> > retransmissions or for timeout, is the following better?:
> > 
> > 	if (!rskq_defer_accept) {
> > 		*expire = req->num_retrans >= thresh;
> > 			       ^^^^^^^^^^^
> > 		*resend = 1;
> > 		return;
> > 	}
> 
> Not sure it matters and if this decision is part of this patch.
> 
> If a retransmit fails, it seems we zap the request anyway ?
> 
> inet_rtx_syn_ack() returns an error and inet_rsk(req)->acked is false ->
> we remove the req from queue.
> 
> We dont remove the req only if we got a listen queue overflow in
> tcp_check_req() : we set acked to 1 in this case.
> 
> listen_overflow:
> 	if (!sysctl_tcp_abort_on_overflow) {
> 		inet_rsk(req)->acked = 1;
> 		return NULL;
> 	}
> 
> Using number of timeouts seems better to me. There is no point holding a
> req forever if we fail to retransmit SYNACKS.

	Yes, my above proposal has the flaw I mentioned
in previous mail (stuck forever on SYN-ACK error).

> Client probably gave up.

	In fact, my concern was for a case where client can
flood us with same SYN. My idea was if 5 SYN-ACKs were
sent in first second, request_sock to expire even when
num_timeout is changing from 0 to 1. I.e. request_sock
to expire based on SYN-ACK count, not on fixed time.

	But I'm not sure what is better here,
to expire request_sock immediately when SYN-ACK reaches
limit or to keep it 63 secs so that we can reduce our
SYN-ACK rate under such SYN attacks. And not only
under attack.

	Here is what happens if we add DROP rule for
SYN-ACKs. We can see that every SYN retransmission is
followed by 2 SYN-ACKs, here is example with loopback:

Initial SYN and SYN-ACK:
12:21:45.773023 IP 127.0.0.1.38450 > 127.0.0.1.22: Flags [S], seq 2096477888, win 32792, options [mss 16396,sackOK,TS val 7978589 ecr 0,nop,wscale 6], length 0
12:21:45.773051 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7978589 ecr 7978589,nop,wscale 6], length 0

SYN retr 1:
12:21:46.775816 IP 127.0.0.1.38450 > 127.0.0.1.22: Flags [S], seq 2096477888, win 32792, options [mss 16396,sackOK,TS val 7979592 ecr 0,nop,wscale 6], length 0
immediate SYN-ACK from tcp_check_req:
12:21:46.775843 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7979592 ecr 7978589,nop,wscale 6], length 0
SYN-ACK from inet_csk_reqsk_queue_prune timer:
12:21:46.975807 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7979792 ecr 7978589,nop,wscale 6], length 0

same for retr 2..5:
12:21:48.779809 IP 127.0.0.1.38450 > 127.0.0.1.22: Flags [S], seq 2096477888, win 32792, options [mss 16396,sackOK,TS val 7981596 ecr 0,nop,wscale 6], length 0
12:21:48.779837 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7981596 ecr 7978589,nop,wscale 6], length 0
12:21:48.975789 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7981792 ecr 7978589,nop,wscale 6], length 0

	This is a waste of bandwidth too. It is true that
client can use different TCP_TIMEOUT_INIT value and this timing
may look different if both sides use different value.
The most silly change I can think of is to add something
like this in syn_ack_recalc (not tested at all):

	/* Avoid double SYN-ACK if client is resending SYN faster:
	 * (num_timeout - num_retrans) >= 0
	 */
	*resend = !((req->num_timeout - req->num_retrans) & 0x40);

	if (!rskq_defer_accept) {
		*expire = req->num_timeout >= thresh;
		return;
	}
	*expire = req->num_timeout >= thresh &&
		  (!inet_rsk(req)->acked || req->num_timeout >= max_retries);
	/*
	 * Do not resend while waiting for data after ACK,
	 * start to resend on end of deferring period to give
	 * last chance for data or ACK to create established socket.
	 */
	if (inet_rsk(req)->acked)
		*resend = req->num_timeout >= rskq_defer_accept - 1;

	If we add some checks in tcp_check_req we can also
restrict the immediate SYN-ACKs up to tcp_synack_retries.

	The idea is:

- expire request_sock as before, based on num_timeout with
the idea to catch many SYN retransmissions and to reduce
SYN-ACK rate from 2*SYN_rate to 1*SYN_rate, up to
tcp_synack_retries SYN-ACKs

- num_retrans accounts sent SYN-ACKs, they can be sent in
response to SYN retr or from timer. If num_retrans increases
faster than num_timeout it means client uses lower
TCP_TIMEOUT_INIT value and sending SYN-ACKs from
tcp_check_req is enough because we apply tcp_synack_retries
once as a SYN-ACK limit and second time as expiration
period.

- If we get 10 SYNs in 1 second, we will send 5 SYN-ACKs
immediately (will be restricted in tcp_check_req), from
second +1 to +31 we will not send SYN-ACKs if
tcp_synack_retries is reached, we will wait for ACK and
for more SYNs to drop, silently. Finally, at +63 we expire
the request_sock. inet_csk_reqsk_queue_prune still
can reduce the expiration period (thresh value) under load.

	Of course, this is material for separate patch,
if idea is liked at all.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-28 16:51             ` Julian Anastasov
@ 2012-10-28 20:02               ` Eric Dumazet
  2012-10-29  9:21                 ` Julian Anastasov
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-10-28 20:02 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: David Miller, Vijay Subramanian, netdev, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng

On Sun, 2012-10-28 at 18:51 +0200, Julian Anastasov wrote:

> 	In fact, my concern was for a case where client can
> flood us with same SYN. My idea was if 5 SYN-ACKs were
> sent in first second, request_sock to expire even when
> num_timeout is changing from 0 to 1. I.e. request_sock
> to expire based on SYN-ACK count, not on fixed time.
> 
> 	But I'm not sure what is better here,
> to expire request_sock immediately when SYN-ACK reaches
> limit or to keep it 63 secs so that we can reduce our
> SYN-ACK rate under such SYN attacks. And not only
> under attack.
> 
> 	Here is what happens if we add DROP rule for
> SYN-ACKs. We can see that every SYN retransmission is
> followed by 2 SYN-ACKs, here is example with loopback:
> 
> Initial SYN and SYN-ACK:
> 12:21:45.773023 IP 127.0.0.1.38450 > 127.0.0.1.22: Flags [S], seq 2096477888, win 32792, options [mss 16396,sackOK,TS val 7978589 ecr 0,nop,wscale 6], length 0
> 12:21:45.773051 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7978589 ecr 7978589,nop,wscale 6], length 0
> 
> SYN retr 1:
> 12:21:46.775816 IP 127.0.0.1.38450 > 127.0.0.1.22: Flags [S], seq 2096477888, win 32792, options [mss 16396,sackOK,TS val 7979592 ecr 0,nop,wscale 6], length 0
> immediate SYN-ACK from tcp_check_req:
> 12:21:46.775843 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7979592 ecr 7978589,nop,wscale 6], length 0
> SYN-ACK from inet_csk_reqsk_queue_prune timer:
> 12:21:46.975807 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7979792 ecr 7978589,nop,wscale 6], length 0
> 
> same for retr 2..5:
> 12:21:48.779809 IP 127.0.0.1.38450 > 127.0.0.1.22: Flags [S], seq 2096477888, win 32792, options [mss 16396,sackOK,TS val 7981596 ecr 0,nop,wscale 6], length 0
> 12:21:48.779837 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7981596 ecr 7978589,nop,wscale 6], length 0
> 12:21:48.975789 IP 127.0.0.1.22 > 127.0.0.1.38450: Flags [S.], seq 1774312921, ack 2096477889, win 32768, options [mss 16396,sackOK,TS val 7981792 ecr 7978589,nop,wscale 6], length 0
> 
> 	This is a waste of bandwidth too. It is true that
> client can use different TCP_TIMEOUT_INIT value and this timing
> may look different if both sides use different value.
> The most silly change I can think of is to add something
> like this in syn_ack_recalc (not tested at all):
> 
> 	/* Avoid double SYN-ACK if client is resending SYN faster:
> 	 * (num_timeout - num_retrans) >= 0
> 	 */
> 	*resend = !((req->num_timeout - req->num_retrans) & 0x40);
> 
> 	if (!rskq_defer_accept) {
> 		*expire = req->num_timeout >= thresh;
> 		return;
> 	}
> 	*expire = req->num_timeout >= thresh &&
> 		  (!inet_rsk(req)->acked || req->num_timeout >= max_retries);
> 	/*
> 	 * Do not resend while waiting for data after ACK,
> 	 * start to resend on end of deferring period to give
> 	 * last chance for data or ACK to create established socket.
> 	 */
> 	if (inet_rsk(req)->acked)
> 		*resend = req->num_timeout >= rskq_defer_accept - 1;
> 
> 	If we add some checks in tcp_check_req we can also
> restrict the immediate SYN-ACKs up to tcp_synack_retries.
> 
> 	The idea is:
> 
> - expire request_sock as before, based on num_timeout with
> the idea to catch many SYN retransmissions and to reduce
> SYN-ACK rate from 2*SYN_rate to 1*SYN_rate, up to
> tcp_synack_retries SYN-ACKs
> 
> - num_retrans accounts sent SYN-ACKs, they can be sent in
> response to SYN retr or from timer. If num_retrans increases
> faster than num_timeout it means client uses lower
> TCP_TIMEOUT_INIT value and sending SYN-ACKs from
> tcp_check_req is enough because we apply tcp_synack_retries
> once as a SYN-ACK limit and second time as expiration
> period.
> 
> - If we get 10 SYNs in 1 second, we will send 5 SYN-ACKs
> immediately (will be restricted in tcp_check_req), from
> second +1 to +31 we will not send SYN-ACKs if
> tcp_synack_retries is reached, we will wait for ACK and
> for more SYNs to drop, silently. Finally, at +63 we expire
> the request_sock. inet_csk_reqsk_queue_prune still
> can reduce the expiration period (thresh value) under load.
> 
> 	Of course, this is material for separate patch,
> if idea is liked at all.
> 
> Regards

On a SYNFLOOD attack, we end up sending one SYNACK per SYN message
anyway ?

If we want to address a non SYNFLOOD attack, why not resetting
req->expire when we send a SYNACK to a retransmitted SYN ?

tcp_check_req()
...
	if (!inet_rtx_syn_ack(sk, req)) {
		req->expire = jiffies +
			min(TCP_TIMEOUT_INIT << req->num_timeout,
			    TCP_RTO_MAX);
	}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-28 20:02               ` Eric Dumazet
@ 2012-10-29  9:21                 ` Julian Anastasov
  0 siblings, 0 replies; 19+ messages in thread
From: Julian Anastasov @ 2012-10-29  9:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Vijay Subramanian, netdev, ncardwell,
	Venkat Venkatsubra, Elliott Hughes, Yuchung Cheng


	Hello,

On Sun, 28 Oct 2012, Eric Dumazet wrote:

> On Sun, 2012-10-28 at 18:51 +0200, Julian Anastasov wrote:
> 
> > 	The idea is:
> > 
> > - expire request_sock as before, based on num_timeout with
> > the idea to catch many SYN retransmissions and to reduce
> > SYN-ACK rate from 2*SYN_rate to 1*SYN_rate, up to
> > tcp_synack_retries SYN-ACKs
> > 
> > - num_retrans accounts sent SYN-ACKs, they can be sent in
> > response to SYN retr or from timer. If num_retrans increases
> > faster than num_timeout it means client uses lower
> > TCP_TIMEOUT_INIT value and sending SYN-ACKs from
> > tcp_check_req is enough because we apply tcp_synack_retries
> > once as a SYN-ACK limit and second time as expiration
> > period.
> > 
> > - If we get 10 SYNs in 1 second, we will send 5 SYN-ACKs
> > immediately (will be restricted in tcp_check_req), from
> > second +1 to +31 we will not send SYN-ACKs if
> > tcp_synack_retries is reached, we will wait for ACK and
> > for more SYNs to drop, silently. Finally, at +63 we expire
> > the request_sock. inet_csk_reqsk_queue_prune still
> > can reduce the expiration period (thresh value) under load.
> > 
> > 	Of course, this is material for separate patch,
> > if idea is liked at all.
> > 
> > Regards
> 
> On a SYNFLOOD attack, we end up sending one SYNACK per SYN message
> anyway ?

	I assume you are referring to want_cookie set by
tcp_syn_flood_action

> If we want to address a non SYNFLOOD attack, why not resetting
> req->expire when we send a SYNACK to a retransmitted SYN ?
> 
> tcp_check_req()
> ...
> 	if (!inet_rtx_syn_ack(sk, req)) {
> 		req->expire = jiffies +
> 			min(TCP_TIMEOUT_INIT << req->num_timeout,
> 			    TCP_RTO_MAX);
> 	}

	This is a good trick, only that it affects the
TCP_DEFER_ACCEPT timing, num_timeout will be increased
in different time.

	But your example reveals another place where
we can optimize: time_after_eq(now, req->expires)
check in inet_csk_reqsk_queue_prune() is immune to
the thresh reduction on load, we do not touch for long
periods of time entries with num_timeout above the desired
thresh. May be we have to use:

time_after_eq(now, req->expires) ||
(req->num_timeout > thresh && !inet_rsk(req)->acked)

	for immediate expire of requests before end of
current period because they already exceed the threshold.
We avoid it only for defer_accept entries with received ACK.
But such checks are not cheap if most of the time we are
not under load.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next] tcp: better retrans tracking for defer-accept
  2012-10-27 14:18       ` [PATCH net-next] tcp: better retrans tracking for defer-accept Eric Dumazet
  2012-10-27 18:27         ` Neal Cardwell
  2012-10-27 22:29         ` Julian Anastasov
@ 2012-11-03 18:46         ` David Miller
  2 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2012-11-03 18:46 UTC (permalink / raw)
  To: eric.dumazet
  Cc: ja, subramanian.vijay, netdev, ncardwell, venkat.x.venkatsubra,
	enh, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 27 Oct 2012 16:18:57 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> For passive TCP connections using TCP_DEFER_ACCEPT facility,
> we incorrectly increment req->retrans each time timeout triggers
> while no SYNACK is sent.
> 
> SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for wich
> we received the ACK from client). Only the last SYNACK is
> sent so that we can receive again an ACK from client, to move the
> req into accept queue. We plan to change this later to avoid
> the useless retransmit (and potential problem as this SYNACK could be
> lost)
> 
> TCP_INFO later gives wrong information to user, claiming imaginary
> retransmits.
> 
> Decouple req->retrans field into two independent fields :
> 
> num_retrans : number of retransmit
> num_timeout : number of timeouts
> 
> num_timeout is the counter that is incremented at each timeout,
> regardless of actual SYNACK being sent or not, and used to
> compute the exponential timeout.
> 
> Introduce inet_rtx_syn_ack() helper to increment num_retrans
> only if ->rtx_syn_ack() succeeded.
> 
> Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
> when we re-send a SYNACK in answer to a (retransmitted) SYN.
> Prior to this patch, we were not counting these retransmits.
> 
> Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
> only if a synack packet was successfully queued.
> 
> Reported-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-11-03 18:47 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-26  8:05 [PATCH net-next V2 1/1] tcp: Prevent needless syn-ack rexmt during TWHS Vijay Subramanian
2012-10-26  8:03 ` Eric Dumazet
2012-10-26 21:30 ` Julian Anastasov
2012-10-26 21:42   ` Eric Dumazet
2012-10-26 22:52     ` Julian Anastasov
2012-10-27  0:07       ` Vijay Subramanian
2012-10-27  8:43         ` Julian Anastasov
2012-10-27  8:50         ` Eric Dumazet
2012-10-27 11:57 ` Eric Dumazet
2012-10-27 13:23   ` Julian Anastasov
2012-10-27 13:32     ` Eric Dumazet
2012-10-27 14:18       ` [PATCH net-next] tcp: better retrans tracking for defer-accept Eric Dumazet
2012-10-27 18:27         ` Neal Cardwell
2012-10-27 22:29         ` Julian Anastasov
2012-10-28  9:15           ` Eric Dumazet
2012-10-28 16:51             ` Julian Anastasov
2012-10-28 20:02               ` Eric Dumazet
2012-10-29  9:21                 ` Julian Anastasov
2012-11-03 18:46         ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.