* [PATCH net-next v2 0/5] tcp_cubic: various fixes
@ 2019-12-23 20:27 Eric Dumazet
  2019-12-23 20:27 ` [PATCH net-next v2 1/5] tcp_cubic: optimize hystart_update() Eric Dumazet
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Eric Dumazet @ 2019-12-23 20:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Martin KaFai Lau

This patch series converts tcp_cubic to usec clock resolution
for Hystart logic.

This makes Hystart more relevant for data-center flows.
Prior to this series, Hystart was either not kicking in, or was
kicking in without good reason, since the 1ms clock was too coarse.

Last patch also fixes an issue with Hystart vs TCP pacing.

v2: removed a last-minute debug chunk from last patch

Eric Dumazet (5):
  tcp_cubic: optimize hystart_update()
  tcp_cubic: remove one conditional from hystart_update()
  tcp_cubic: switch bictcp_clock() to usec resolution
  tcp_cubic: tweak Hystart detection for short RTT flows
  tcp_cubic: make Hystart aware of pacing

 net/ipv4/tcp_cubic.c | 82 +++++++++++++++++++++++++++-----------------
 1 file changed, 51 insertions(+), 31 deletions(-)

-- 
2.24.1.735.g03f4e72817-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH net-next v2 1/5] tcp_cubic: optimize hystart_update()
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
@ 2019-12-23 20:27 ` Eric Dumazet
  2019-12-26 23:23   ` Neal Cardwell
  2019-12-23 20:27 ` [PATCH net-next v2 2/5] tcp_cubic: remove one conditional from hystart_update() Eric Dumazet
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-12-23 20:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Martin KaFai Lau

We do not care which bit in ca->found is set.

We avoid accessing hystart and hystart_detect unless really needed,
possibly avoiding one cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 1b3d032a4df2a23faf8f12a9b426638fb1997fef..297936033c335ee57ba6205de50c5ef06b90da60 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -381,9 +381,6 @@ static void hystart_update(struct sock *sk, u32 delay)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct bictcp *ca = inet_csk_ca(sk);
 
-	if (ca->found & hystart_detect)
-		return;
-
 	if (hystart_detect & HYSTART_ACK_TRAIN) {
 		u32 now = bictcp_clock();
 
@@ -391,7 +388,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 		if ((s32)(now - ca->last_ack) <= hystart_ack_delta) {
 			ca->last_ack = now;
 			if ((s32)(now - ca->round_start) > ca->delay_min >> 4) {
-				ca->found |= HYSTART_ACK_TRAIN;
+				ca->found = 1;
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPHYSTARTTRAINDETECT);
 				NET_ADD_STATS(sock_net(sk),
@@ -412,7 +409,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 		} else {
 			if (ca->curr_rtt > ca->delay_min +
 			    HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
-				ca->found |= HYSTART_DELAY;
+				ca->found = 1;
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPHYSTARTDELAYDETECT);
 				NET_ADD_STATS(sock_net(sk),
@@ -450,7 +447,7 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
 		ca->delay_min = delay;
 
 	/* hystart triggers when cwnd is larger than some threshold */
-	if (hystart && tcp_in_slow_start(tp) &&
+	if (!ca->found && hystart && tcp_in_slow_start(tp) &&
 	    tp->snd_cwnd >= hystart_low_window)
 		hystart_update(sk, delay);
 }
-- 
2.24.1.735.g03f4e72817-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next v2 2/5] tcp_cubic: remove one conditional from hystart_update()
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
  2019-12-23 20:27 ` [PATCH net-next v2 1/5] tcp_cubic: optimize hystart_update() Eric Dumazet
@ 2019-12-23 20:27 ` Eric Dumazet
  2019-12-26 23:06   ` Neal Cardwell
  2019-12-23 20:27 ` [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution Eric Dumazet
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-12-23 20:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Martin KaFai Lau

If we initialize ca->curr_rtt to ~0U, we do not need to test
for a zero value in hystart_update().

We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
been processed, and thus ca->curr_rtt will have a sane value.
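
This is the usual running-minimum idiom; a tiny standalone sketch with
made-up sample values:

#include <stdio.h>

int main(void)
{
	unsigned int curr_rtt = ~0U;	/* "no sample yet" sentinel */
	unsigned int samples[] = { 500, 300, 400 };
	unsigned int i;

	for (i = 0; i < 3; i++)
		if (curr_rtt > samples[i])	/* first sample always wins */
			curr_rtt = samples[i];

	printf("min rtt = %u\n", curr_rtt);	/* prints 300 */
	return 0;
}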

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 297936033c335ee57ba6205de50c5ef06b90da60..5eff762acd7fe1510eb39fc789b488de8ca648de 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -133,7 +133,7 @@ static inline void bictcp_hystart_reset(struct sock *sk)
 
 	ca->round_start = ca->last_ack = bictcp_clock();
 	ca->end_seq = tp->snd_nxt;
-	ca->curr_rtt = 0;
+	ca->curr_rtt = ~0U;
 	ca->sample_cnt = 0;
 }
 
@@ -402,7 +402,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 	if (hystart_detect & HYSTART_DELAY) {
 		/* obtain the minimum delay of more than sampling packets */
 		if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
-			if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
+			if (ca->curr_rtt > delay)
 				ca->curr_rtt = delay;
 
 			ca->sample_cnt++;
-- 
2.24.1.735.g03f4e72817-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
  2019-12-23 20:27 ` [PATCH net-next v2 1/5] tcp_cubic: optimize hystart_update() Eric Dumazet
  2019-12-23 20:27 ` [PATCH net-next v2 2/5] tcp_cubic: remove one conditional from hystart_update() Eric Dumazet
@ 2019-12-23 20:27 ` Eric Dumazet
  2019-12-27 14:46   ` Neal Cardwell
  2019-12-27 16:08   ` Neal Cardwell
  2019-12-23 20:27 ` [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for short RTT flows Eric Dumazet
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2019-12-23 20:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Martin KaFai Lau

The current 1ms clock feeds ca->round_start, ca->delay_min and
ca->last_ack.

This is quite problematic for data-center flows, where delay_min
is way below 1 ms.

This means Hystart Train detection triggers every time the jiffies
value is updated, since the expression
"((s32)(now - ca->round_start) > ca->delay_min >> 4)"
becomes true.

This kind of random behavior can be solved by reusing the existing
usec timestamp that TCP keeps in tp->tcp_mstamp.
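
As a quick standalone illustration (example numbers, not the kernel
code), here is how the old msec<<3 units collapse for a ~100 usec RTT
and make the ack-train threshold zero:

#include <stdio.h>

int main(void)
{
	unsigned int rtt_us = 100;			/* typical data-center RTT */
	unsigned int delay = (rtt_us << 3) / 1000;	/* old unit conversion */

	if (delay == 0)					/* old clamping */
		delay = 1;

	/* The old ack-train check used (delay_min >> 4) as threshold,
	 * which is 0 here, so "(now - ca->round_start) > threshold"
	 * becomes true as soon as the 1ms clock ticks at all.
	 */
	printf("delay_min=%u threshold=%u\n", delay, delay >> 4);
	return 0;
}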

Note that a followup patch will tweak things a bit, because
during slow start, GRO aggregation on receivers naturally
increases the RTT as TSO packets gradually grow to ~64KB in size.

To recap, right after this patch CUBIC Hystart train detection
is more aggressive, since short RTT flows might exit slow start at
cwnd = 20, instead of cwnd possibly growing unbounded.

The following patch addresses this problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 5eff762acd7fe1510eb39fc789b488de8ca648de..068775b91fb5790e6e60a6490b49e7a266e4ed51 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -40,8 +40,8 @@
 
 /* Number of delay samples for detecting the increase of delay */
 #define HYSTART_MIN_SAMPLES	8
-#define HYSTART_DELAY_MIN	(4U<<3)
-#define HYSTART_DELAY_MAX	(16U<<3)
+#define HYSTART_DELAY_MIN	(4000U)	/* 4 ms */
+#define HYSTART_DELAY_MAX	(16000U)	/* 16 ms */
 #define HYSTART_DELAY_THRESH(x)	clamp(x, HYSTART_DELAY_MIN, HYSTART_DELAY_MAX)
 
 static int fast_convergence __read_mostly = 1;
@@ -53,7 +53,7 @@ static int tcp_friendliness __read_mostly = 1;
 static int hystart __read_mostly = 1;
 static int hystart_detect __read_mostly = HYSTART_ACK_TRAIN | HYSTART_DELAY;
 static int hystart_low_window __read_mostly = 16;
-static int hystart_ack_delta __read_mostly = 2;
+static int hystart_ack_delta_us __read_mostly = 2000;
 
 static u32 cube_rtt_scale __read_mostly;
 static u32 beta_scale __read_mostly;
@@ -77,8 +77,8 @@ MODULE_PARM_DESC(hystart_detect, "hybrid slow start detection mechanisms"
 		 " 1: packet-train 2: delay 3: both packet-train and delay");
 module_param(hystart_low_window, int, 0644);
 MODULE_PARM_DESC(hystart_low_window, "lower bound cwnd for hybrid slow start");
-module_param(hystart_ack_delta, int, 0644);
-MODULE_PARM_DESC(hystart_ack_delta, "spacing between ack's indicating train (msecs)");
+module_param(hystart_ack_delta_us, int, 0644);
+MODULE_PARM_DESC(hystart_ack_delta_us, "spacing between ack's indicating train (usecs)");
 
 /* BIC TCP Parameters */
 struct bictcp {
@@ -89,7 +89,7 @@ struct bictcp {
 	u32	bic_origin_point;/* origin point of bic function */
 	u32	bic_K;		/* time to origin point
 				   from the beginning of the current epoch */
-	u32	delay_min;	/* min delay (msec << 3) */
+	u32	delay_min;	/* min delay (usec) */
 	u32	epoch_start;	/* beginning of an epoch */
 	u32	ack_cnt;	/* number of acks */
 	u32	tcp_cwnd;	/* estimated tcp cwnd */
@@ -117,13 +117,9 @@ static inline void bictcp_reset(struct bictcp *ca)
 	ca->found = 0;
 }
 
-static inline u32 bictcp_clock(void)
+static inline u32 bictcp_clock_us(const struct sock *sk)
 {
-#if HZ < 1000
-	return ktime_to_ms(ktime_get_real());
-#else
-	return jiffies_to_msecs(jiffies);
-#endif
+	return tcp_sk(sk)->tcp_mstamp;
 }
 
 static inline void bictcp_hystart_reset(struct sock *sk)
@@ -131,7 +127,7 @@ static inline void bictcp_hystart_reset(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct bictcp *ca = inet_csk_ca(sk);
 
-	ca->round_start = ca->last_ack = bictcp_clock();
+	ca->round_start = ca->last_ack = bictcp_clock_us(sk);
 	ca->end_seq = tp->snd_nxt;
 	ca->curr_rtt = ~0U;
 	ca->sample_cnt = 0;
@@ -276,7 +272,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
 	 */
 
 	t = (s32)(tcp_jiffies32 - ca->epoch_start);
-	t += msecs_to_jiffies(ca->delay_min >> 3);
+	t += usecs_to_jiffies(ca->delay_min);
 	/* change the unit from HZ to bictcp_HZ */
 	t <<= BICTCP_HZ;
 	do_div(t, HZ);
@@ -382,12 +378,12 @@ static void hystart_update(struct sock *sk, u32 delay)
 	struct bictcp *ca = inet_csk_ca(sk);
 
 	if (hystart_detect & HYSTART_ACK_TRAIN) {
-		u32 now = bictcp_clock();
+		u32 now = bictcp_clock_us(sk);
 
 		/* first detection parameter - ack-train detection */
-		if ((s32)(now - ca->last_ack) <= hystart_ack_delta) {
+		if ((s32)(now - ca->last_ack) <= hystart_ack_delta_us) {
 			ca->last_ack = now;
-			if ((s32)(now - ca->round_start) > ca->delay_min >> 4) {
+			if ((s32)(now - ca->round_start) > ca->delay_min >> 1) {
 				ca->found = 1;
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPHYSTARTTRAINDETECT);
@@ -421,9 +417,6 @@ static void hystart_update(struct sock *sk, u32 delay)
 	}
 }
 
-/* Track delayed acknowledgment ratio using sliding window
- * ratio = (15*ratio + sample) / 16
- */
 static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
@@ -438,7 +431,7 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
 	if (ca->epoch_start && (s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
 		return;
 
-	delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
+	delay = sample->rtt_us;
 	if (delay == 0)
 		delay = 1;
 
-- 
2.24.1.735.g03f4e72817-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for short RTT flows
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
                   ` (2 preceding siblings ...)
  2019-12-23 20:27 ` [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution Eric Dumazet
@ 2019-12-23 20:27 ` Eric Dumazet
  2019-12-27 16:00   ` Neal Cardwell
  2019-12-23 20:27 ` [PATCH net-next v2 5/5] tcp_cubic: make Hystart aware of pacing Eric Dumazet
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-12-23 20:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Martin KaFai Lau

After switching ca->delay_min to usec resolution, we exit
slow start prematurely for very low RTT flows, setting
snd_ssthresh to 20.

The reason is that delay_min is fed with the RTT of small packet
trains. Then, as cwnd is increased, TCP sends bigger TSO packets.

LRO/GRO aggregation and/or interrupt mitigation strategies
on the receiver tend to inflate RTT samples.

Fix this by adding to delay_min the expected delay of
two TSO packets, given the current pacing rate.
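
As a rough illustration of the cushion size (a standalone sketch; the
10Gbit pacing rate is only an example):

#include <stdio.h>

int main(void)
{
	/* Two TSO packets, doubled again because sk_pacing_rate is
	 * doubled at this point: 4 * GSO_MAX_SIZE bytes.
	 */
	unsigned long long bytes = 4ULL * 65536;
	unsigned long long rate = 1250000000ULL;	/* ~10Gbit, in bytes/sec */
	unsigned long long cushion = bytes * 1000000ULL / rate;

	if (cushion > 1000)				/* cap the cushion at 1 ms */
		cushion = 1000;
	printf("cushion = %llu usec\n", cushion);	/* ~209 usec here */
	return 0;
}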

Tested:

Sender uses pfifo_fast qdisc

Before :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  11348
  11707
  11562
  11428
  11773
  11534
   9878
  11693
  10597
  10968
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       200                0.0

After :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  14877
  14517
  15797
  18466
  17376
  14833
  17558
  17933
  16039
  18059
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1670               0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 068775b91fb5790e6e60a6490b49e7a266e4ed51..0e5428ed04fe4e50627e21a53c3d17f9f2dade4d 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -436,8 +436,27 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
 		delay = 1;
 
 	/* first time call or link delay decreases */
-	if (ca->delay_min == 0 || ca->delay_min > delay)
-		ca->delay_min = delay;
+	if (ca->delay_min == 0 || ca->delay_min > delay) {
+		unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
+
+		/* Account for TSO/GRO delays.
+		 * Otherwise short RTT flows could get too small ssthresh,
+		 * since during slow start we begin with small TSO packets
+		 * and could lower ca->delay_min too much.
+		 * Ideally even with a very small RTT we would like to have
+		 * at least one TSO packet being sent and received by GRO,
+		 * and another one in qdisc layer.
+		 * We apply another 100% factor because @rate is doubled at
+		 * this point.
+		 * We cap the cushion to 1ms.
+		 */
+		if (rate)
+			delay += min_t(u64, USEC_PER_MSEC,
+				       div64_ul((u64)GSO_MAX_SIZE *
+						4 * USEC_PER_SEC, rate));
+		if (ca->delay_min == 0 || ca->delay_min > delay)
+			ca->delay_min = delay;
+	}
 
 	/* hystart triggers when cwnd is larger than some threshold */
 	if (!ca->found && hystart && tcp_in_slow_start(tp) &&
-- 
2.24.1.735.g03f4e72817-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next v2 5/5] tcp_cubic: make Hystart aware of pacing
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
                   ` (3 preceding siblings ...)
  2019-12-23 20:27 ` [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for short RTT flows Eric Dumazet
@ 2019-12-23 20:27 ` Eric Dumazet
  2019-12-27 17:06   ` Neal Cardwell
  2019-12-27 17:32 ` [PATCH net-next v2 0/5] tcp_cubic: various fixes Neal Cardwell
  2019-12-28  0:29 ` David Miller
  6 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-12-23 20:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Martin KaFai Lau

For years we disabled Hystart ACK train detection at Google
because it was fooled by TCP pacing.

ACK train detection uses a simple heuristic, detecting whether
we receive an ACK past half the RTT, to exit slow start before
hitting the bottleneck and experiencing massive drops.

But pacing by design might delay packets up to RTT/2,
so we need to tweak the Hystart logic to be aware of this
extra delay.
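
A quick numeric sketch of the resulting train thresholds (the
delay_min value is only an example):

#include <stdio.h>

int main(void)
{
	unsigned int delay_min = 200;	/* usec, example value */

	/* Non-paced flows keep the historical delay_min/2 trigger;
	 * paced flows only trigger past the full delay_min, since the
	 * pacer itself may have spread packets over up to RTT/2.
	 */
	printf("threshold without pacing: %u usec\n", delay_min >> 1);
	printf("threshold with pacing:    %u usec\n", delay_min);
	return 0;
}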

Tested:
 Added a 100 usec delay at receiver.

Before:
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
   9117
   7057
   9553
   8300
   7030
   6849
   9533
  10126
   6876
   8473
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1230               0.0

After :
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
   9845
  10103
  10866
  11096
  11936
  11487
  11773
  12188
  11066
  11894
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       6462               0.0

Disabling Hystart ACK Train detection gives similar numbers

echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  11173
  10954
  12455
  10627
  11578
  11583
  11222
  10880
  10665
  11366

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 0e5428ed04fe4e50627e21a53c3d17f9f2dade4d..d02bb283c6890e1692e714e053515c6e4981d83a 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -376,6 +376,7 @@ static void hystart_update(struct sock *sk, u32 delay)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct bictcp *ca = inet_csk_ca(sk);
+	u32 threshold;
 
 	if (hystart_detect & HYSTART_ACK_TRAIN) {
 		u32 now = bictcp_clock_us(sk);
@@ -383,7 +384,17 @@ static void hystart_update(struct sock *sk, u32 delay)
 		/* first detection parameter - ack-train detection */
 		if ((s32)(now - ca->last_ack) <= hystart_ack_delta_us) {
 			ca->last_ack = now;
-			if ((s32)(now - ca->round_start) > ca->delay_min >> 1) {
+
+			threshold = ca->delay_min;
+			/* Hystart ack train triggers if we get ack past
+			 * ca->delay_min/2.
+			 * Pacing might have delayed packets up to RTT/2
+			 * during slow start.
+			 */
+			if (sk->sk_pacing_status == SK_PACING_NONE)
+				threshold >>= 1;
+
+			if ((s32)(now - ca->round_start) > threshold) {
 				ca->found = 1;
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPHYSTARTTRAINDETECT);
-- 
2.24.1.735.g03f4e72817-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 2/5] tcp_cubic: remove one conditional from hystart_update()
  2019-12-23 20:27 ` [PATCH net-next v2 2/5] tcp_cubic: remove one conditional from hystart_update() Eric Dumazet
@ 2019-12-26 23:06   ` Neal Cardwell
  0 siblings, 0 replies; 17+ messages in thread
From: Neal Cardwell @ 2019-12-26 23:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> If we initialize ca->curr_rtt to ~0U, we do not need to test
> for zero value in hystart_update()
>
> We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
> been processed, and thus ca->curr_rtt will have a sane value.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

thanks,
neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 1/5] tcp_cubic: optimize hystart_update()
  2019-12-23 20:27 ` [PATCH net-next v2 1/5] tcp_cubic: optimize hystart_update() Eric Dumazet
@ 2019-12-26 23:23   ` Neal Cardwell
  0 siblings, 0 replies; 17+ messages in thread
From: Neal Cardwell @ 2019-12-26 23:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> We do not care which bit in ca->found is set.
>
> We avoid accessing hystart and hystart_detect unless really needed,
> possibly avoiding one cache line miss.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Thanks, good idea.

> @@ -450,7 +447,7 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
>                 ca->delay_min = delay;
>
>         /* hystart triggers when cwnd is larger than some threshold */
> -       if (hystart && tcp_in_slow_start(tp) &&
> +       if (!ca->found && hystart && tcp_in_slow_start(tp) &&
>             tp->snd_cwnd >= hystart_low_window)
>                 hystart_update(sk, delay);
>  }

For the many connections that exit the initial slow start due to loss
or ECN marks, it seems ca->found would always be 0, so such
connections will be at risk for taking a cache miss to fetch
'hystart'. WDYT about further reordering the logic to avoid that risk
for such connections, e.g.:

  -       if (hystart && tcp_in_slow_start(tp) &&
  +       if (!ca->found && tcp_in_slow_start(tp) && hystart &&

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution
  2019-12-23 20:27 ` [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution Eric Dumazet
@ 2019-12-27 14:46   ` Neal Cardwell
  2019-12-27 15:24     ` Eric Dumazet
  2019-12-27 16:08   ` Neal Cardwell
  1 sibling, 1 reply; 17+ messages in thread
From: Neal Cardwell @ 2019-12-27 14:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Current 1ms clock feeds ca->round_start, ca->delay_min,
> ca->last_ack.
>
> This is quite problematic for data-center flows, where delay_min
> is way below 1 ms.
>
> This means Hystart Train detection triggers every time jiffies value
> is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
> expression becomes true.
>
> This kind of random behavior can be solved by reusing the existing
> usec timestamp that TCP keeps in tp->tcp_mstamp
...
> @@ -438,7 +431,7 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
>         if (ca->epoch_start && (s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
>                 return;
>
> -       delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
> +       delay = sample->rtt_us;

It seems there is a bug in this patch: it changes the code to not
shift the RTT samples left by 3 bits, and adjusts the
HYSTART_ACK_TRAIN code path to expect the new behavior, but does not
change the HYSTART_DELAY code path to expect the new behavior, so the
HYSTART_DELAY code path is still shifting right by 3 bits, when it
should not... the HYSTART_DELAY remains like this at the end of the
patch series:

        if (hystart_detect & HYSTART_DELAY) {
...
                        if (ca->curr_rtt > ca->delay_min +
                            HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {

AFAICT the patch also should have:

-                            HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
+                           HYSTART_DELAY_THRESH(ca->delay_min)) {

Otherwise this patch looks great to me.

best,
neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution
  2019-12-27 14:46   ` Neal Cardwell
@ 2019-12-27 15:24     ` Eric Dumazet
  2019-12-27 16:07       ` Neal Cardwell
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-12-27 15:24 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Fri, Dec 27, 2019 at 6:46 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Current 1ms clock feeds ca->round_start, ca->delay_min,
> > ca->last_ack.
> >
> > This is quite problematic for data-center flows, where delay_min
> > is way below 1 ms.
> >
> > This means Hystart Train detection triggers every time jiffies value
> > is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
> > expression becomes true.
> >
> > This kind of random behavior can be solved by reusing the existing
> > usec timestamp that TCP keeps in tp->tcp_mstamp
> ...
> > @@ -438,7 +431,7 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
> >         if (ca->epoch_start && (s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
> >                 return;
> >
> > -       delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
> > +       delay = sample->rtt_us;
>
> It seems there is a bug in this patch: it changes the code to not
> shift the RTT samples left by 3 bits, and adjusts the
> HYSTART_ACK_TRAIN code path to expect the new behavior, but does not
> change the HYSTART_DELAY code path to expect the new behavior, so the
> HYSTART_DELAY code path is still shifting right by 3 bits, when it
> should not... the HYSTART_DELAY remains like this at the end of the
> patch series:
>
>         if (hystart_detect & HYSTART_DELAY) {
> ...
>                         if (ca->curr_rtt > ca->delay_min +
>                             HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
>
> AFAICT the patch also should have:
>
> -                            HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
> +                           HYSTART_DELAY_THRESH(ca->delay_min)) {
>


I do not think so Neal.

The HYSTART_DELAY_THRESH(ca->delay_min >> 3) thing really means we
want to apply a 12.5 % factor.

See commit 42eef7a0bb09 "tcp_cubic: refine Hystart delay threshold"
for some context.

After this patch, ca->delay_min is in usec unit, and ca->cur_rtt is
also in usec unit.

Thanks !

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for short RTT flows
  2019-12-23 20:27 ` [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for short RTT flows Eric Dumazet
@ 2019-12-27 16:00   ` Neal Cardwell
  2019-12-28 16:31     ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Neal Cardwell @ 2019-12-27 16:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> After switching ca->delay_min to usec resolution, we exit
> slow start prematurely for very low RTT flows, setting
> snd_ssthresh to 20.
>
> The reason is that delay_min is fed with RTT of small packet
> trains. Then as cwnd is increased, TCP sends bigger TSO packets.
>
> LRO/GRO aggregation and/or interrupt mitigation strategies
> on receiver tend to inflate RTT samples.
>
> Fix this by adding to delay_min the expected delay of
> two TSO packets, given current pacing rate.
...
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 068775b91fb5790e6e60a6490b49e7a266e4ed51..0e5428ed04fe4e50627e21a53c3d17f9f2dade4d 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -436,8 +436,27 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
>                 delay = 1;
>
>         /* first time call or link delay decreases */
> -       if (ca->delay_min == 0 || ca->delay_min > delay)
> -               ca->delay_min = delay;
> +       if (ca->delay_min == 0 || ca->delay_min > delay) {
> +               unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
> +
> +               /* Account for TSO/GRO delays.
> +                * Otherwise short RTT flows could get too small ssthresh,
> +                * since during slow start we begin with small TSO packets
> +                * and could lower ca->delay_min too much.
> +                * Ideally even with a very small RTT we would like to have
> +                * at least one TSO packet being sent and received by GRO,
> +                * and another one in qdisc layer.
> +                * We apply another 100% factor because @rate is doubled at
> +                * this point.

This comment mentions "@rate is doubled at this point". But AFAICT
this part of the code is executed on every ACK, not just in slow
start, so the rate is not really always doubled at this point. It
might be more clear/precise to say  "@rate can be doubled at this
point"?


> +                * We cap the cushion to 1ms.
> +                */
> +               if (rate)
> +                       delay += min_t(u64, USEC_PER_MSEC,
> +                                      div64_ul((u64)GSO_MAX_SIZE *
> +                                               4 * USEC_PER_SEC, rate));
> +               if (ca->delay_min == 0 || ca->delay_min > delay)
> +                       ca->delay_min = delay;
> +       }
>
>         /* hystart triggers when cwnd is larger than some threshold */
>         if (!ca->found && hystart && tcp_in_slow_start(tp) &&

AFAICT there may be some CPU usage impact here for high-speed
intra-datacenter flows due to an extra div64_ul() being executed on
most ACKs? It seems like if 'ca->delay_min > delay' then the code
executes the div64_ul() to calculate the cushion and add it to
'delay'. Then the  ca->delay_min is recorded with the cushion added.

But the first 'ca->delay_min > delay' comparison that determines
whether we execute the div64_ul() is comparing a ca->delay_min that
has a cushion included to the 'delay' that has no cushion included. So
if the raw 'delay' value is within the cushion of the raw min_rtt
value, then this 'ca->delay_min > delay' expression will evaluate to
true, and we'll run the div64_ul() even though there is probably not a
need to.

How big is the cushion? In back of the envelope terms, for a 10
Gbit/sec link where CUBIC has exited the initial slow start and is
using a pacing rate matching the link rate, then AFAICT the cushion
should be roughly:
  4 * 65536 * 1000000 / (1.2*10*10^9/8) ~= 175 usecs
So if that math is in the right ballpark, then any RTT sample that is
within roughly 175 usec of the true min_rtt is going to have
'ca->delay_min > delay' evaluate to true, and run the div64_ul() when
there is probably no need.

AFAICT there is also some risk with this patch that the steady-state
behavior of CUBIC becomes slightly more aggressive for moderate-rate,
low-RTT cases. In such cases this up to ~1ms of cushion to delay_min
will cause the bictcp_update() line:

 t += usecs_to_jiffies(ca->delay_min);

to calculate the target cwnd at a point in time that is ~1ms further
out than before, and thus possibly grow the cwnd faster than it would
have before.

To avoid the risk of extra divisions on many ACKs, and the risk of
more aggressive cwnd behavior in the steady state, WDYT about an
approach where the cushioned delay_min used by your revised Hystart
code (ACK train heuristic and delay heuristic) is maintained
separately from the "pure" delay_min used elsewhere in the CUBIC code,
perhaps as a new ca->delay_min_hystart field? I am wondering about an
approach something like:

/* Account for TSO/GRO delays. ...
*/
static void update_delay_min_hystart(struct sock *sk)
{
        struct bictcp *ca = inet_csk_ca(sk);
        unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
        u32 ack_delay = 0;

        if (ca->found)
                return;  /* Hystart is done */

        if (rate)
                ack_delay = min_t(u64, USEC_PER_MSEC,
                                  div64_ul((u64)GSO_MAX_SIZE *
                                           4 * USEC_PER_SEC, rate));
        ca->delay_min_hystart = ca->delay_min + ack_delay;
}

...
        if (ca->delay_min == 0 || ca->delay_min > delay) {
                ca->delay_min = delay;
                update_delay_min_hystart(sk);
        }

WDYT?

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution
  2019-12-27 15:24     ` Eric Dumazet
@ 2019-12-27 16:07       ` Neal Cardwell
  0 siblings, 0 replies; 17+ messages in thread
From: Neal Cardwell @ 2019-12-27 16:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Fri, Dec 27, 2019 at 10:24 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Dec 27, 2019 at 6:46 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > Current 1ms clock feeds ca->round_start, ca->delay_min,
> > > ca->last_ack.
> > >
> > > This is quite problematic for data-center flows, where delay_min
> > > is way below 1 ms.
> > >
> > > This means Hystart Train detection triggers every time jiffies value
> > > is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
> > > expression becomes true.
> > >
> > > This kind of random behavior can be solved by reusing the existing
> > > usec timestamp that TCP keeps in tp->tcp_mstamp
> > ...
> > > @@ -438,7 +431,7 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
> > >         if (ca->epoch_start && (s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
> > >                 return;
> > >
> > > -       delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
> > > +       delay = sample->rtt_us;
> >
> > It seems there is a bug in this patch: it changes the code to not
> > shift the RTT samples left by 3 bits, and adjusts the
> > HYSTART_ACK_TRAIN code path to expect the new behavior, but does not
> > change the HYSTART_DELAY code path to expect the new behavior, so the
> > HYSTART_DELAY code path is still shifting right by 3 bits, when it
> > should not... the HYSTART_DELAY remains like this at the end of the
> > patch series:
> >
> >         if (hystart_detect & HYSTART_DELAY) {
> > ...
> >                         if (ca->curr_rtt > ca->delay_min +
> >                             HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
> >
> > AFAICT the patch also should have:
> >
> > -                            HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
> > +                           HYSTART_DELAY_THRESH(ca->delay_min)) {
> >
>
>
> I do not think so Neal.
>
> The HYSTART_DELAY_THRESH(ca->delay_min >> 3) thing really means we
> want to apply a 12.5 % factor.
>
> See commit 42eef7a0bb09 "tcp_cubic: refine Hystart delay threshold"
> for some context.
>
> After this patch, ca->delay_min is in usec unit, and ca->cur_rtt is
> also in usec unit.

Oops, of course you are right. So sorry, I forgot this >>3 was for the
12.5% factor, and was reading the code too fast, and with too little
caffeine... :-)

Sorry for the noise!

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution
  2019-12-23 20:27 ` [PATCH net-next v2 3/5] tcp_cubic: switch bictcp_clock() to usec resolution Eric Dumazet
  2019-12-27 14:46   ` Neal Cardwell
@ 2019-12-27 16:08   ` Neal Cardwell
  1 sibling, 0 replies; 17+ messages in thread
From: Neal Cardwell @ 2019-12-27 16:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Current 1ms clock feeds ca->round_start, ca->delay_min,
> ca->last_ack.
>
> This is quite problematic for data-center flows, where delay_min
> is way below 1 ms.
>
> This means Hystart Train detection triggers every time jiffies value
> is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
> expression becomes true.
>
> This kind of random behavior can be solved by reusing the existing
> usec timestamp that TCP keeps in tp->tcp_mstamp
>
> Note that a followup patch will tweak things a bit, because
> during slow start, GRO aggregation on receivers naturally
> increases the RTT as TSO packets gradually come to ~64KB size.
>
> To recap, right after this patch CUBIC Hystart train detection
> is more aggressive, since short RTT flows might exit slow start at
> cwnd = 20, instead of being possibly unbounded.
>
> Following patch will address this problem.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks!
neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 5/5] tcp_cubic: make Hystart aware of pacing
  2019-12-23 20:27 ` [PATCH net-next v2 5/5] tcp_cubic: make Hystart aware of pacing Eric Dumazet
@ 2019-12-27 17:06   ` Neal Cardwell
  0 siblings, 0 replies; 17+ messages in thread
From: Neal Cardwell @ 2019-12-27 17:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> For years we disabled Hystart ACK train detection at Google
> because it was fooled by TCP pacing.
>
> ACK train detection uses a simple heuristic, detecting if
> we receive ACK past half the RTT, to exit slow start before
> hitting the bottleneck and experience massive drops.
>
> But pacing by design might delay packets up to RTT/2,
> so we need to tweak the Hystart logic to be aware of this
> extra delay.
>
> Tested:
>  Added a 100 usec delay at receiver.
>
> Before:
> nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
>    9117
>    7057
>    9553
>    8300
>    7030
>    6849
>    9533
>   10126
>    6876
>    8473
> TcpExtTCPHystartTrainDetect     10                 0.0
> TcpExtTCPHystartTrainCwnd       1230               0.0
>
> After :
> nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
>    9845
>   10103
>   10866
>   11096
>   11936
>   11487
>   11773
>   12188
>   11066
>   11894
> TcpExtTCPHystartTrainDetect     10                 0.0
> TcpExtTCPHystartTrainCwnd       6462               0.0
>
> Disabling Hystart ACK Train detection gives similar numbers
>
> echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
> nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
>   11173
>   10954
>   12455
>   10627
>   11578
>   11583
>   11222
>   10880
>   10665
>   11366

I guess all these tests are with the "fq" qdisc with pacing enabled?

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/tcp_cubic.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)

Thanks, Eric. The revised heuristic for paced traffic seems like a
nice improvement.

Acked-by: Neal Cardwell <ncardwell@google.com>

thanks,
neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 0/5] tcp_cubic: various fixes
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
                   ` (4 preceding siblings ...)
  2019-12-23 20:27 ` [PATCH net-next v2 5/5] tcp_cubic: make Hystart aware of pacing Eric Dumazet
@ 2019-12-27 17:32 ` Neal Cardwell
  2019-12-28  0:29 ` David Miller
  6 siblings, 0 replies; 17+ messages in thread
From: Neal Cardwell @ 2019-12-27 17:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> This patch series converts tcp_cubic to usec clock resolution
> for Hystart logic.
>
> This makes Hystart more relevant for data-center flows.
> Prior to this series, Hystart was not kicking, or was
> kicking without good reason, since the 1ms clock was too coarse.
>
> Last patch also fixes an issue with Hystart vs TCP pacing.
>
> v2: removed a last-minute debug chunk from last patch
>
> Eric Dumazet (5):
>   tcp_cubic: optimize hystart_update()
>   tcp_cubic: remove one conditional from hystart_update()
>   tcp_cubic: switch bictcp_clock() to usec resolution
>   tcp_cubic: tweak Hystart detection for short RTT flows
>   tcp_cubic: make Hystart aware of pacing
>
>  net/ipv4/tcp_cubic.c | 82 +++++++++++++++++++++++++++-----------------
>  1 file changed, 51 insertions(+), 31 deletions(-)

Thanks for this very nice patch series, Eric.

In reviewing these patches and thinking about the Hystart ACK train
heuristic, I am thinking that another behavior that could fool this
heuristic and cause a spurious/early Hystart exit of slow start would
be application-limited flights of data. In other words, just as pacing
can cause inter-ACK spacing increases that could spuriously trigger
the Hystart ACK train heuristic, AFAICT gaps between application
writes could also cause inter-ACK gaps that could spuriously trigger
the Hystart ACK train heuristic.

AFAICT to avoid such spurious exits of slow start we ought to pass in
the is_app_limited bool in the struct ack_sample, and thereby pass it
through pkts_acked(), bictcp_acked(), and hystart_update().

@@ -3233,9 +3233,12 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
  }

  if (icsk->icsk_ca_ops->pkts_acked) {
- struct ack_sample sample = { .pkts_acked = pkts_acked,
-      .rtt_us = sack->rate->rtt_us,
-      .in_flight = last_in_flight };
+  struct ack_sample sample = {
+    .pkts_acked = pkts_acked,
+    .rtt_us = sack->rate->rtt_us,
+    .in_flight = last_in_flight,
+    .is_app_limited = sack->rate->is_app_limited,
+  };

    icsk->icsk_ca_ops->pkts_acked(sk, &sample);
  }

...and then only trigger the HYSTART_ACK_TRAIN heuristic to exit slow
start if !sample->is_app_limited.
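
Concretely, the train heuristic could then be gated on that flag,
e.g. something like (rough sketch, assuming the flag is plumbed down
into hystart_update()):

-	if (hystart_detect & HYSTART_ACK_TRAIN) {
+	if ((hystart_detect & HYSTART_ACK_TRAIN) && !is_app_limited) {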

This could be a follow-on patch after this series, or an additional
patch at the end of this series.

WDYT?

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 0/5] tcp_cubic: various fixes
  2019-12-23 20:27 [PATCH net-next v2 0/5] tcp_cubic: various fixes Eric Dumazet
                   ` (5 preceding siblings ...)
  2019-12-27 17:32 ` [PATCH net-next v2 0/5] tcp_cubic: various fixes Neal Cardwell
@ 2019-12-28  0:29 ` David Miller
  6 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2019-12-28  0:29 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, soheil, ncardwell, ycheng, kafai

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 23 Dec 2019 12:27:49 -0800

> This patch series converts tcp_cubic to usec clock resolution
> for Hystart logic.
> 
> This makes Hystart more relevant for data-center flows.
> Prior to this series, Hystart was not kicking, or was
> kicking without good reason, since the 1ms clock was too coarse.
> 
> Last patch also fixes an issue with Hystart vs TCP pacing.
> 
> v2: removed a last-minute debug chunk from last patch

Series applied, please address Neal's feedback on patch #4.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for short RTT flows
  2019-12-27 16:00   ` Neal Cardwell
@ 2019-12-28 16:31     ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2019-12-28 16:31 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Martin KaFai Lau

On Fri, Dec 27, 2019 at 8:01 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > After switching ca->delay_min to usec resolution, we exit
> > slow start prematurely for very low RTT flows, setting
> > snd_ssthresh to 20.
> >
> > The reason is that delay_min is fed with RTT of small packet
> > trains. Then as cwnd is increased, TCP sends bigger TSO packets.
> >
> > LRO/GRO aggregation and/or interrupt mitigation strategies
> > on receiver tend to inflate RTT samples.
> >
> > Fix this by adding to delay_min the expected delay of
> > two TSO packets, given current pacing rate.
> ...
> > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> > index 068775b91fb5790e6e60a6490b49e7a266e4ed51..0e5428ed04fe4e50627e21a53c3d17f9f2dade4d 100644
> > --- a/net/ipv4/tcp_cubic.c
> > +++ b/net/ipv4/tcp_cubic.c
> > @@ -436,8 +436,27 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
> >                 delay = 1;
> >
> >         /* first time call or link delay decreases */
> > -       if (ca->delay_min == 0 || ca->delay_min > delay)
> > -               ca->delay_min = delay;
> > +       if (ca->delay_min == 0 || ca->delay_min > delay) {
> > +               unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
> > +
> > +               /* Account for TSO/GRO delays.
> > +                * Otherwise short RTT flows could get too small ssthresh,
> > +                * since during slow start we begin with small TSO packets
> > +                * and could lower ca->delay_min too much.
> > +                * Ideally even with a very small RTT we would like to have
> > +                * at least one TSO packet being sent and received by GRO,
> > +                * and another one in qdisc layer.
> > +                * We apply another 100% factor because @rate is doubled at
> > +                * this point.
>
> This comment mentions "@rate is doubled at this point". But AFAICT
> this part of the code is executed on every ACK, not just in slow
> start, so the rate is not really always doubled at this point. It
> might be more clear/precise to say  "@rate can be doubled at this
> point"?
>
>
> > +                * We cap the cushion to 1ms.
> > +                */
> > +               if (rate)
> > +                       delay += min_t(u64, USEC_PER_MSEC,
> > +                                      div64_ul((u64)GSO_MAX_SIZE *
> > +                                               4 * USEC_PER_SEC, rate));
> > +               if (ca->delay_min == 0 || ca->delay_min > delay)
> > +                       ca->delay_min = delay;
> > +       }
> >
> >         /* hystart triggers when cwnd is larger than some threshold */
> >         if (!ca->found && hystart && tcp_in_slow_start(tp) &&
>
> AFAICT there may be some CPU usage impact here for high-speed
> intra-datacenter flows due to an extra div64_ul() being executed on
> most ACKs? It seems like if 'ca->delay_min > delay' then the code
> executes the div64_ul() to calculate the cushion and add it to
> 'delay'. Then the  ca->delay_min is recorded with the cushion added.
>
> But the first 'ca->delay_min > delay' comparison that determines
> whether we execute the div64_ul() is comparing a ca->delay_min that
> has a cushion included to the 'delay' that has no cushion included. So
> if the raw 'delay' value is within the cushion of the raw min_rtt
> value, then this 'ca->delay_min > delay' expression will evaluate to
> true, and we'll run the div64_ul() even though there is probably not a
> need to.
>
> How big is the cushion? In back of the envelope terms, for a 10
> Gbit/sec link where CUBIC has exited the initial slow start and is
> using a pacing rate matching the link rate, then AFAICT the cushion
> should be roughly:
>   4 * 65536 * 1000000 / (1.2*10*10^9/8) ~= 175 usecs
> So if that math is in the right ballpark, then any RTT sample that is
> within roughly 175 usec of the true min_rtt is going to have
> 'ca->delay_min > delay' evaluate to true, and run the div64_ul() when
> there is probably no need.

In my tests the cushion was about ~100 usec for 10Gbit links.

>
> AFAICT there is also some risk with this patch that the steady-state
> behavior of CUBIC becomes slightly more aggressive for moderate-rate,
> low-RTT cases. In such cases this up to ~1ms of cushion to delay_min
> will cause the bictcp_update() line:
>
>  t += usecs_to_jiffies(ca->delay_min);
>
> to calculate the target cwnd at a point in time that is ~1ms further
> out than before, and thus possibly grow the cwnd faster than it would
> have before.
>
> To avoid the risk of extra divisions on many ACKs, and the risk of
> more aggressive cwnd behavior in the steady state, WDYT about an
> approach where the cushioned delay_min used by your revised Hystart
> code (ACK train heuristic and delay heuristic) is maintained
> separately from the "pure" delay_min used elsewhere in the CUBIC code,
> perhaps as a new ca->delay_min_hystart field? I am wondering about an
> approach something like:
>
> /* Account for TSO/GRO delays. ...
> */
> static void update_delay_min_hystart(struct sock *sk)
> {
>         struct bictcp *ca = inet_csk_ca(sk);
>         unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
>         u32 ack_delay = 0;
>
>         if (ca->found)
>                 return;  /* Hystart is done */
>
>         if (rate)
>                 ack_delay = min_t(u64, USEC_PER_MSEC,
>                                   div64_ul((u64)GSO_MAX_SIZE *
>                                            4 * USEC_PER_SEC, rate));
>         ca->delay_min_hystart = ca->delay_min + ack_delay;

Note that while testing this, I found that ca->delay_min_hystart was
latched while the initial sk_pacing_rate was still rather small.

This means we exit slow start with a much higher cwnd/ssthresh (Hystart
triggers too late IMO).

$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l
-4000000; done;nstat|egrep "Hystart"
   4209
  13236
   3602
  16635
  16139
   4159
   3963
   3956
   4319
   4121
TcpExtTCPHystartTrainDetect     7                  0.0
TcpExtTCPHystartTrainCwnd       8256               0.0

$ dmesg|tail -7
[  144.637752]  ack_train delay_min 23 (delay_min_hystart 348) cwnd 1704
[  146.924494]  ack_train delay_min 28 (delay_min_hystart 202) cwnd 480
[  150.338917]  ack_train delay_min 20 (delay_min_hystart 264) cwnd 1614
[  151.481355]  ack_train delay_min 23 (delay_min_hystart 172) cwnd 1026
[  152.607829]  ack_train delay_min 20 (delay_min_hystart 241) cwnd 1071
[  153.742571]  ack_train delay_min 20 (delay_min_hystart 237) cwnd 1434
[  154.890514]  ack_train delay_min 21 (delay_min_hystart 116) cwnd 927


We might have to compute the ack_delay even if ca->delay_min has not
been lowered (since sk_pacing_rate is doubling every RTT regardless of
delay_min changes)

> }
>
> ...
>         if (ca->delay_min == 0 || ca->delay_min > delay) {
>                 ca->delay_min = delay;
>                 update_delay_min_hystart(sk);
>         }
>
> WDYT?


I think these are all very good suggestions, Neal. I cooked a patch
that I will send after more tests.

Thanks for spending time on this stuff, especially at this time of the year :)

^ permalink raw reply	[flat|nested] 17+ messages in thread

