All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 net-next 0/2] tcp: more reliable window probes
@ 2015-05-06 21:26 Eric Dumazet
  2015-05-06 21:26 ` [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-05-06 21:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati,
	Eric Dumazet, Eric Dumazet

This series address a problem caused by small rto_min timers in DC,
leading to either timer storms or early flow terminations.

We also add two new SNMP counters for proper monitoring :
TCPWinProbe and TCPKeepAlive

v2: added TCPKeepAlive counter, as suggested by Yuchung & Neal

Eric Dumazet (2):
  tcp: adjust window probe timers to safer values
  tcp:  tcp: add TCPWinProbe and TCPKeepAlive SNMP counters

 include/net/tcp.h         | 27 ++++++++++++++++++++++-----
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c           |  1 +
 net/ipv4/tcp_input.c      |  2 +-
 net/ipv4/tcp_output.c     |  3 ++-
 5 files changed, 27 insertions(+), 7 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values
  2015-05-06 21:26 [PATCH v2 net-next 0/2] tcp: more reliable window probes Eric Dumazet
@ 2015-05-06 21:26 ` Eric Dumazet
  2015-05-06 22:15   ` Neal Cardwell
  2015-05-06 21:26 ` [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters Eric Dumazet
  2015-05-09 20:42 ` [PATCH v2 net-next 0/2] tcp: more reliable window probes David Miller
  2 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2015-05-06 21:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati,
	Eric Dumazet, Eric Dumazet

With the advent of small rto timers in datacenter TCP,
(ip route ... rto_min x), the following can happen :

1) Qdisc is full, transmit fails.

   TCP sets a timer based on icsk_rto to retry the transmit, without
   exponential backoff.
   With low icsk_rto, and lot of sockets, all cpus are servicing timer
   interrupts like crazy.
   Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
   and 500ms (TCP_RESOURCE_PROBE_INTERVAL)

2) Receivers can send zero windows if they don't drain their receive queue.

   TCP sends zero window probes, based on icsk_rto current value, with
   exponential backoff.
   With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
   some cases), sender can abort in less than one or two minutes !
   If receiver stops the sender, it obviously doesn't care of very tight
   rto. Probability of dropping the ACK reopening the window is not
   worth the risk.

Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
events (but not normal RTO based retransmits)

A followup patch adds a new SNMP counter, as it would have helped a lot
diagnosing this issue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     | 27 ++++++++++++++++++++++-----
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv4/tcp_output.c |  2 +-
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d204f3f9df8..7a2248a35b13 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1043,14 +1043,31 @@ static inline bool tcp_is_cwnd_limited(const struct sock *sk)
 	return tp->is_cwnd_limited;
 }
 
-static inline void tcp_check_probe_timer(struct sock *sk)
+/* Something is really bad, we could not queue an additional packet,
+ * because qdisc is full or receiver sent a 0 window.
+ * We do not want to add fuel to the fire, or abort too early,
+ * so make sure the timer we arm now is at least 200ms in the future,
+ * regardless of current icsk_rto value (as it could be ~2ms)
+ */
+static inline unsigned long tcp_probe0_base(const struct sock *sk)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
-	const struct inet_connection_sock *icsk = inet_csk(sk);
+	return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
+}
 
-	if (!tp->packets_out && !icsk->icsk_pending)
+/* Variant of inet_csk_rto_backoff() used for zero window probes */
+static inline unsigned long tcp_probe0_when(const struct sock *sk,
+					    unsigned long max_when)
+{
+	u64 when = (u64)tcp_probe0_base(sk) << inet_csk(sk)->icsk_backoff;
+
+	return (unsigned long)min_t(u64, when, max_when);
+}
+
+static inline void tcp_check_probe_timer(struct sock *sk)
+{
+	if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
-					  icsk->icsk_rto, TCP_RTO_MAX);
+					  tcp_probe0_base(sk), TCP_RTO_MAX);
 }
 
 static inline void tcp_init_wl(struct tcp_sock *tp, u32 seq)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2ca615cd0c..cf8b20ff6658 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3233,7 +3233,7 @@ static void tcp_ack_probe(struct sock *sk)
 		 * This function is not for random using!
 		 */
 	} else {
-		unsigned long when = inet_csk_rto_backoff(icsk, TCP_RTO_MAX);
+		unsigned long when = tcp_probe0_when(sk, TCP_RTO_MAX);
 
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
 					  when, TCP_RTO_MAX);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a369e8a70b2c..b76c719e1979 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3490,7 +3490,7 @@ void tcp_send_probe0(struct sock *sk)
 		probe_max = TCP_RESOURCE_PROBE_INTERVAL;
 	}
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
-				  inet_csk_rto_backoff(icsk, probe_max),
+				  tcp_probe0_when(sk, probe_max),
 				  TCP_RTO_MAX);
 }
 
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
  2015-05-06 21:26 [PATCH v2 net-next 0/2] tcp: more reliable window probes Eric Dumazet
  2015-05-06 21:26 ` [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values Eric Dumazet
@ 2015-05-06 21:26 ` Eric Dumazet
  2015-05-06 22:12   ` Neal Cardwell
                     ` (2 more replies)
  2015-05-09 20:42 ` [PATCH v2 net-next 0/2] tcp: more reliable window probes David Miller
  2 siblings, 3 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-05-06 21:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati,
	Eric Dumazet, Eric Dumazet

Diagnosing problems related to Window Probes has been hard because
we lack a counter.

TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window had not been lost.

TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h         |  2 +-
 include/uapi/linux/snmp.h |  2 ++
 net/ipv4/proc.c           |  2 ++
 net/ipv4/tcp_output.c     | 13 +++++++------
 net/ipv4/tcp_timer.c      |  2 +-
 5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a2248a35b13..b8ea12880fd9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -527,7 +527,7 @@ int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int, gfp_t);
 
 void tcp_send_probe0(struct sock *);
 void tcp_send_partial(struct sock *);
-int tcp_write_wakeup(struct sock *);
+int tcp_write_wakeup(struct sock *, int mib);
 void tcp_send_fin(struct sock *sk);
 void tcp_send_active_reset(struct sock *sk, gfp_t priority);
 int tcp_send_synack(struct sock *);
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 6a6fb747c78d..eee8968407f0 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -276,6 +276,8 @@ enum
 	LINUX_MIB_TCPACKSKIPPEDFINWAIT2,	/* TCPACKSkippedFinWait2 */
 	LINUX_MIB_TCPACKSKIPPEDTIMEWAIT,	/* TCPACKSkippedTimeWait */
 	LINUX_MIB_TCPACKSKIPPEDCHALLENGE,	/* TCPACKSkippedChallenge */
+	LINUX_MIB_TCPWINPROBE,			/* TCPWinProbe */
+	LINUX_MIB_TCPKEEPALIVE,			/* TCPKeepAlive */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index e1f3b911dd1e..da5d483e236a 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -298,6 +298,8 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPACKSkippedFinWait2", LINUX_MIB_TCPACKSKIPPEDFINWAIT2),
 	SNMP_MIB_ITEM("TCPACKSkippedTimeWait", LINUX_MIB_TCPACKSKIPPEDTIMEWAIT),
 	SNMP_MIB_ITEM("TCPACKSkippedChallenge", LINUX_MIB_TCPACKSKIPPEDCHALLENGE),
+	SNMP_MIB_ITEM("TCPWinProbe", LINUX_MIB_TCPWINPROBE),
+	SNMP_MIB_ITEM("TCPKeepAlive", LINUX_MIB_TCPKEEPALIVE),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b76c719e1979..7386d32cd670 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3382,7 +3382,7 @@ EXPORT_SYMBOL_GPL(tcp_send_ack);
  * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
  * out-of-date with SND.UNA-1 to probe window.
  */
-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+static int tcp_xmit_probe_skb(struct sock *sk, int urgent, int mib)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -3400,6 +3400,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
 	 */
 	tcp_init_nondata_skb(skb, tp->snd_una - !urgent, TCPHDR_ACK);
 	skb_mstamp_get(&skb->skb_mstamp);
+	NET_INC_STATS_BH(sock_net(sk), mib);
 	return tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC);
 }
 
@@ -3407,12 +3408,12 @@ void tcp_send_window_probe(struct sock *sk)
 {
 	if (sk->sk_state == TCP_ESTABLISHED) {
 		tcp_sk(sk)->snd_wl1 = tcp_sk(sk)->rcv_nxt - 1;
-		tcp_xmit_probe_skb(sk, 0);
+		tcp_xmit_probe_skb(sk, 0, LINUX_MIB_TCPWINPROBE);
 	}
 }
 
 /* Initiate keepalive or window probe from timer. */
-int tcp_write_wakeup(struct sock *sk)
+int tcp_write_wakeup(struct sock *sk, int mib)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -3449,8 +3450,8 @@ int tcp_write_wakeup(struct sock *sk)
 		return err;
 	} else {
 		if (between(tp->snd_up, tp->snd_una + 1, tp->snd_una + 0xFFFF))
-			tcp_xmit_probe_skb(sk, 1);
-		return tcp_xmit_probe_skb(sk, 0);
+			tcp_xmit_probe_skb(sk, 1, mib);
+		return tcp_xmit_probe_skb(sk, 0, mib);
 	}
 }
 
@@ -3464,7 +3465,7 @@ void tcp_send_probe0(struct sock *sk)
 	unsigned long probe_max;
 	int err;
 
-	err = tcp_write_wakeup(sk);
+	err = tcp_write_wakeup(sk, LINUX_MIB_TCPWINPROBE);
 
 	if (tp->packets_out || !tcp_send_head(sk)) {
 		/* Cancel probe timer, if it is not required. */
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 8c65dc147d8b..65bf670e8714 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -616,7 +616,7 @@ static void tcp_keepalive_timer (unsigned long data)
 			tcp_write_err(sk);
 			goto out;
 		}
-		if (tcp_write_wakeup(sk) <= 0) {
+		if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
 			icsk->icsk_probes_out++;
 			elapsed = keepalive_intvl_when(tp);
 		} else {
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
  2015-05-06 21:26 ` [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters Eric Dumazet
@ 2015-05-06 22:12   ` Neal Cardwell
  2015-05-06 22:13   ` Neal Cardwell
  2015-05-07 17:45   ` Nandita Dukkipati
  2 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2015-05-06 22:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Eric Dumazet

On Wed, May 6, 2015 at 5:26 PM, Eric Dumazet <edumazet@google.com> wrote:
> Diagnosing problems related to Window Probes has been hard because
> we lack a counter.
>
> TCPWinProbe counts the number of ACK packets a sender has to send
> at regular intervals to make sure a reverse ACK packet opening back
> a window had not been lost.
>
> TCPKeepAlive counts the number of ACK packets sent to keep TCP
> flows alive (SO_KEEPALIVE)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
  2015-05-06 21:26 ` [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters Eric Dumazet
  2015-05-06 22:12   ` Neal Cardwell
@ 2015-05-06 22:13   ` Neal Cardwell
  2015-05-07 17:45   ` Nandita Dukkipati
  2 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2015-05-06 22:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Eric Dumazet

On Wed, May 6, 2015 at 5:26 PM, Eric Dumazet <edumazet@google.com> wrote:
> Diagnosing problems related to Window Probes has been hard because
> we lack a counter.
>
> TCPWinProbe counts the number of ACK packets a sender has to send
> at regular intervals to make sure a reverse ACK packet opening back
> a window had not been lost.
>
> TCPKeepAlive counts the number of ACK packets sent to keep TCP
> flows alive (SO_KEEPALIVE)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values
  2015-05-06 21:26 ` [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values Eric Dumazet
@ 2015-05-06 22:15   ` Neal Cardwell
  0 siblings, 0 replies; 8+ messages in thread
From: Neal Cardwell @ 2015-05-06 22:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Eric Dumazet

On Wed, May 6, 2015 at 5:26 PM, Eric Dumazet <edumazet@google.com> wrote:
> With the advent of small rto timers in datacenter TCP,
> (ip route ... rto_min x), the following can happen :
>
> 1) Qdisc is full, transmit fails.
>
>    TCP sets a timer based on icsk_rto to retry the transmit, without
>    exponential backoff.
>    With low icsk_rto, and lot of sockets, all cpus are servicing timer
>    interrupts like crazy.
>    Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
>    and 500ms (TCP_RESOURCE_PROBE_INTERVAL)
>
> 2) Receivers can send zero windows if they don't drain their receive queue.
>
>    TCP sends zero window probes, based on icsk_rto current value, with
>    exponential backoff.
>    With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
>    some cases), sender can abort in less than one or two minutes !
>    If receiver stops the sender, it obviously doesn't care of very tight
>    rto. Probability of dropping the ACK reopening the window is not
>    worth the risk.
>
> Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
> events (but not normal RTO based retransmits)
>
> A followup patch adds a new SNMP counter, as it would have helped a lot
> diagnosing this issue.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
  2015-05-06 21:26 ` [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters Eric Dumazet
  2015-05-06 22:12   ` Neal Cardwell
  2015-05-06 22:13   ` Neal Cardwell
@ 2015-05-07 17:45   ` Nandita Dukkipati
  2 siblings, 0 replies; 8+ messages in thread
From: Nandita Dukkipati @ 2015-05-07 17:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Neal Cardwell, Yuchung Cheng, Eric Dumazet

> Diagnosing problems related to Window Probes has been hard because
> we lack a counter.
>
> TCPWinProbe counts the number of ACK packets a sender has to send
> at regular intervals to make sure a reverse ACK packet opening back
> a window had not been lost.
>
> TCPKeepAlive counts the number of ACK packets sent to keep TCP
> flows alive (SO_KEEPALIVE)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Acked-by: Nandita Dukkipati <nanditad@google.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 net-next 0/2] tcp: more reliable window probes
  2015-05-06 21:26 [PATCH v2 net-next 0/2] tcp: more reliable window probes Eric Dumazet
  2015-05-06 21:26 ` [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values Eric Dumazet
  2015-05-06 21:26 ` [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters Eric Dumazet
@ 2015-05-09 20:42 ` David Miller
  2 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2015-05-09 20:42 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, ncardwell, ycheng, nanditad, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Wed,  6 May 2015 14:26:23 -0700

> This series address a problem caused by small rto_min timers in DC,
> leading to either timer storms or early flow terminations.
> 
> We also add two new SNMP counters for proper monitoring :
> TCPWinProbe and TCPKeepAlive
> 
> v2: added TCPKeepAlive counter, as suggested by Yuchung & Neal

Series applied, thanks Eric.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-05-09 20:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-06 21:26 [PATCH v2 net-next 0/2] tcp: more reliable window probes Eric Dumazet
2015-05-06 21:26 ` [PATCH v2 net-next 1/2] tcp: adjust window probe timers to safer values Eric Dumazet
2015-05-06 22:15   ` Neal Cardwell
2015-05-06 21:26 ` [PATCH v2 net-next 2/2] tcp: add TCPWinProbe and TCPKeepAlive SNMP counters Eric Dumazet
2015-05-06 22:12   ` Neal Cardwell
2015-05-06 22:13   ` Neal Cardwell
2015-05-07 17:45   ` Nandita Dukkipati
2015-05-09 20:42 ` [PATCH v2 net-next 0/2] tcp: more reliable window probes David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.