* [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
@ 2018-03-13 10:25 Ilpo Järvinen
  2018-03-13 10:25 ` [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery Ilpo Järvinen
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection. These are not fixes to just some
remote corner cases of recovery; many, if not almost all, recoveries
without SACK are impacted by one (or even more than one) of them.
The sender-side changes (1-4) are not primarily about improving
performance (throughput) but address correctness (congestion control
responses should not get incorrectly reverted) and burstiness (which
may cause additional problems later, as some of the packets in such
bursts may get dropped, again requiring loss recovery that is likely
similarly bursty).

v1 -> v2:
- Tried to improve changelogs
- Reworked FRTO undo fix location
- Removed extra parentheses from EXPR (and, while at it, reversed
  the sides of the &&)
- Pass prior_snd_una rather than flag around to avoid moving the
  tcp_packet_delayed call
- Pass tp instead of sk. sk was there only due to a subsequent
  change (which I think is net-next material) limiting the use of
  the transient state to RTO recoveries only, as it is not needed
  after a NewReno recovery, which does not do unnecessary rexmits
  like the non-SACK RTO recovery does

v2 -> v3:
- Remove unnecessarily added braces

tcp: feed correct number of pkts acked to cc modules also in recovery
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into tcp_false_fast_retrans_possible()
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized window for non-SACK flows


* [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
@ 2018-03-13 10:25 ` Ilpo Järvinen
  2018-03-27  2:07   ` Yuchung Cheng
  2018-03-13 10:25 ` [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows Ilpo Järvinen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

A miscalculation of the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that the pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO, as it may include segments that were not lost and therefore did
not need retransmission in the slow start following the RTO.
tcp_slow_start will then add the excess into cwnd, bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).
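
For illustration, a minimal userspace sketch of the accounting (not
kernel code; the numbers below are hypothetical):

/* 8 segments were outstanding at RTO; slow start has retransmitted
 * 2 of them when a cumulative ACK arrives covering 5 segments: the
 * 2 rexmits plus 3 originals that were never actually lost.
 */
#include <stdio.h>

int main(void)
{
	unsigned int cwnd = 2;
	unsigned int cumacked = 5;	/* segments covered by the ACK */
	unsigned int rexmit_acked = 2;	/* rexmits covered by the ACK */
	unsigned int newdata_acked = 0;	/* nothing beyond high_seq yet */

	/* Old behavior: tcp_slow_start() grows cwnd by the full
	 * cumulative count, counting the 3 never-lost originals too.
	 */
	printf("cwnd (old): %u\n", cwnd + cumacked);		/* 7 */
	/* Fixed behavior in CA_Loss: only rexmits + new data count. */
	printf("cwnd (new): %u\n",
	       cwnd + rexmit_acked + newdata_acked);		/* 4 */
	return 0;
}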

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..4a26c09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			/* Due to discontinuity on RTO in the artificial
+			 * sacked_out calculations, TCP must restrict
+			 * pkts_acked without SACK to rexmits and new data
+			 * segments
+			 */
+			if (icsk->icsk_ca_state == TCP_CA_Loss)
+				pkts_acked = rexmit_acked + newdata_acked;
+
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;
-- 
2.7.4


* [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
  2018-03-13 10:25 ` [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery Ilpo Järvinen
@ 2018-03-13 10:25 ` Ilpo Järvinen
  2018-03-28 22:26   ` Yuchung Cheng
  2018-03-13 10:25 ` [PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible() Ilpo Järvinen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use a stricter condition for the non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted to
prove that progress is due to original transmissions. Only then keep
FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in the
non-SACK case.
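
The resulting check is compact; a standalone sketch of the logic
(flag values mirror those in net/ipv4/tcp_input.c; the helper name
here is only for illustration):

#define FLAG_RETRANS_DATA_ACKED	0x08	/* ACK acked rexmitted data */
#define FLAG_ORIG_SACK_ACKED	0x200	/* never-rexmitted data acked */

/* Without SACK there is no TCPCB_SACKED_ACKED scoreboard, so an ACK
 * that covers any retransmitted data cannot prove that an original
 * transmission was delivered: drop the FRTO-undo evidence.
 */
static int reno_filter_orig_acked(int flag)
{
	if (flag & FLAG_RETRANS_DATA_ACKED)
		flag &= ~FLAG_ORIG_SACK_ACKED;
	return flag;
}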

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a26c09..c60745c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				pkts_acked = rexmit_acked + newdata_acked;
 
 			tcp_remove_reno_sacks(sk, pkts_acked);
+
+			/* If any of the cumulatively ACKed segments was
+			 * retransmitted, non-SACK case cannot confirm that
+			 * progress was due to original transmission due to
+			 * lack of TCPCB_SACKED_ACKED bits even if some of
+			 * the packets may have been never retransmitted.
+			 */
+			if (flag & FLAG_RETRANS_DATA_ACKED)
+				flag &= ~FLAG_ORIG_SACK_ACKED;
 		} else {
 			int delta;
 
-- 
2.7.4


* [PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()
  2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
  2018-03-13 10:25 ` [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery Ilpo Järvinen
  2018-03-13 10:25 ` [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows Ilpo Järvinen
@ 2018-03-13 10:25 ` Ilpo Järvinen
  2018-03-13 10:25 ` [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled Ilpo Järvinen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

No functional changes. This change simplifies the next change
slightly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c60745c..72ecfbb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 	}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp,
+					    const u32 snd_una)
+{
+	return tcp_is_reno(tp) && (snd_una == tp->high_seq);
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
 	return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
 	} else if (tp->rack.reo_wnd_persist) {
 		tp->rack.reo_wnd_persist--;
 	}
-	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-		/* Hold old state until something *above* high_seq
-		 * is ACKed. For Reno it is MUST to prevent false
-		 * fast retransmits (RFC2582). SACK TCP is safe. */
+	if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) {
+		/* Hold old state until something *above* high_seq is ACKed
+		 * if false fast retransmit is possible.
+		 */
 		if (!tcp_any_retrans_done(sk))
 			tp->retrans_stamp = 0;
 		return true;
-- 
2.7.4


* [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
  2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
                   ` (2 preceding siblings ...)
  2018-03-13 10:25 ` [PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible() Ilpo Järvinen
@ 2018-03-13 10:25 ` Ilpo Järvinen
  2018-03-28 22:36   ` Yuchung Cheng
  2018-03-13 10:25 ` [PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows Ilpo Järvinen
  2018-03-13 13:43 ` [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Eric Dumazet
  5 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

When a cumulative ACK lands at high_seq at the end of loss
recovery and SACK is not enabled, the sender needs to avoid
false fast retransmits (RFC6582). The avoidance mechanism is
implemented by remaining in the loss recovery CA state until
one additional cumulative ACK arrives. During the operation of
this avoidance mechanism, there is an internal transient in the
use of state variables that will always trigger a bogus undo.

When we enter this transient state in tcp_try_undo_recovery,
tcp_any_retrans_done is often (always?) false, resulting in
clearing retrans_stamp. On the next cumulative ACK,
tcp_try_undo_recovery executes again because the CA state still
remains in the same recovery state, and tcp_may_undo will always
return true because tcp_packet_delayed has this condition:
    return !tp->retrans_stamp || ...

Check in tcp_packet_delayed whether the false fast retransmit
avoidance transient is in progress to avoid bogus undos. Since
snd_una has already advanced on this ACK but the CA state still
remains unchanged (the CA state is updated slightly later than
undo is checked), prior_snd_una needs to be passed to
tcp_packet_delayed (instead of tp->snd_una). Passing prior_snd_una
around to tcp_packet_delayed makes this change look more involved
than it really is.

The additional checks done in this change only affect the non-SACK
case; the SACK case remains the same.
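
A standalone sketch of the fixed decision (simplified helper and
field names follow the patch; the timestamp branch is stubbed out):

#include <stdbool.h>
#include <stdint.h>

struct tp_stub {
	uint32_t retrans_stamp;
	uint32_t high_seq;
	bool	 is_reno;	/* stands in for tcp_is_reno(tp) */
};

static bool false_fast_retrans_possible(const struct tp_stub *tp,
					uint32_t snd_una)
{
	return tp->is_reno && snd_una == tp->high_seq;
}

static bool packet_delayed(const struct tp_stub *tp,
			   uint32_t prior_snd_una)
{
	if (!tp->retrans_stamp) {
		/* retrans_stamp == 0 is ambiguous: either nothing was
		 * retransmitted, or we are in the transient state the
		 * false fast retransmit avoidance clears it into.
		 * Only the former justifies an undo.
		 */
		return !false_fast_retrans_possible(tp, prior_snd_una);
	}
	return false;	/* tcp_tsopt_ecr_before() elided in this sketch */
}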

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 72ecfbb..270aa48 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
+				      const u32 prior_snd_una)
 {
-	return !tp->retrans_stamp ||
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	if (!tp->retrans_stamp) {
+		/* Sender will be in a transient state with cleared
+		 * retrans_stamp during false fast retransmit prevention
+		 * mechanism
+		 */
+		return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
+	}
+	return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
 	tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp,
+				const u32 prior_snd_una)
 {
-	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+	return tp->undo_marker &&
+	       (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(tp, prior_snd_una)) {
 		int mib_idx;
 
 		/* Happy end! We did not retransmit anything
@@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
+			      bool frto_undo)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (frto_undo || tcp_may_undo(tp)) {
+	if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
 		tcp_undo_cwnd_reduction(sk, true);
 
 		DBGUNDO(sk, "partial loss");
@@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
 static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
-			     int *rexmit)
+			     int *rexmit, const u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool recovered = !before(tp->snd_una, tp->high_seq);
 
 	if ((flag & FLAG_SND_UNA_ADVANCED) &&
-	    tcp_try_undo_loss(sk, false))
+	    tcp_try_undo_loss(sk, prior_snd_una, false))
 		return;
 
 	if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 		 * lost, i.e., never-retransmitted data are (s)acked.
 		 */
 		if ((flag & FLAG_ORIG_SACK_ACKED) &&
-		    tcp_try_undo_loss(sk, true))
+		    tcp_try_undo_loss(sk, prior_snd_una, true))
 			return;
 
 		if (after(tp->snd_nxt, tp->high_seq)) {
@@ -2665,7 +2675,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 
 	if (recovered) {
 		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
-		tcp_try_undo_recovery(sk);
+		tcp_try_undo_recovery(sk, prior_snd_una);
 		return;
 	}
 	if (tcp_is_reno(tp)) {
@@ -2685,7 +2695,7 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->undo_marker && tcp_packet_delayed(tp)) {
+	if (tp->undo_marker && tcp_packet_delayed(tp, prior_snd_una)) {
 		/* Plain luck! Hole if filled with delayed
 		 * packet, rather than with a retransmit. Check reordering.
 		 */
@@ -2788,7 +2798,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		case TCP_CA_Recovery:
 			if (tcp_is_reno(tp))
 				tcp_reset_reno_sack(tp);
-			if (tcp_try_undo_recovery(sk))
+			if (tcp_try_undo_recovery(sk, prior_snd_una))
 				return;
 			tcp_end_cwnd_reduction(sk);
 			break;
@@ -2815,7 +2825,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		tcp_rack_identify_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
-		tcp_process_loss(sk, flag, is_dupack, rexmit);
+		tcp_process_loss(sk, flag, is_dupack, rexmit, prior_snd_una);
 		tcp_rack_identify_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
-- 
2.7.4


* [PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
  2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
                   ` (3 preceding siblings ...)
  2018-03-13 10:25 ` [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled Ilpo Järvinen
@ 2018-03-13 10:25 ` Ilpo Järvinen
  2018-03-13 13:43 ` [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Eric Dumazet
  5 siblings, 0 replies; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

Currently, the TCP code is overly eager to increase the window on
almost every ACK. This makes those ACKs that the receiver should
send as dupACKs look like window updates, which are not considered
real dupACKs by the non-SACK sender-side code. Therefore the
sender needs to resort to RTO to recover losses, as fast
retransmit/fast recovery cannot be triggered by such masked
duplicate ACKs.

This change makes sure that when an ofo segment is received, no
change to the window is applied if we are going to send a dupACK.
Even with this change, the window updates keep being maintained,
but that occurs "behind the scenes". That is, this change does not
interfere with memory management of the flow, which could have
long-term impact on the progress of the flow; it only prevents
those updates from being seen on the wire in the short term. It is
ok to change the window for non-dupACKs, such as the first ACK
after ofo arrivals start if delayed ACKs are in use, and also
whenever the ack field advances. As the ack field typically
advances once per RTT when the first hole is retransmitted, the
window updates are not delayed entirely during long recoveries.

Even before this fix, tcp_select_window did not allow an ACK to
shrink the window for duplicate ACKs (that was previously even
called "treason", but the old charmy message is gone now). The
advertized window can only shrink when the ack field also changes,
which will not be blocked by this change as such an ACK is not a
duplicate ACK.
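
A small standalone model of the behavior (simplified types; the
real logic sits in tcp_data_queue() and tcp_select_window()):

#include <stdbool.h>
#include <stdint.h>

struct rcv_stub {
	uint32_t rcv_wnd;		/* last advertised window */
	bool	 dupack_wnd_lock;	/* set when queuing an ofo segment */
};

/* For a dupACK the advertised window is repeated verbatim so that
 * the non-SACK sender sees a real duplicate; the lock is one-shot,
 * so window bookkeeping resumes with the very next ACK.
 */
static uint32_t select_window(struct rcv_stub *r, uint32_t computed_win)
{
	if (r->dupack_wnd_lock) {
		r->dupack_wnd_lock = false;
		return r->rcv_wnd;
	}
	r->rcv_wnd = computed_win;
	return computed_win;
}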

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 ++++-
 net/ipv4/tcp_output.c | 43 +++++++++++++++++++++++++------------------
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+		unused:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		unused1	    : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 270aa48..4ff192b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4626,6 +4626,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	bool fragstolen;
 	int eaten;
 
@@ -4669,7 +4670,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 			 * gap in queue is filled.
 			 */
 			if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
-				inet_csk(sk)->icsk_ack.pingpong = 0;
+				icsk->icsk_ack.pingpong = 0;
 		}
 
 		if (tp->rx_opt.num_sacks)
@@ -4719,6 +4720,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		goto queue_and_out;
 	}
 
+	if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+		tp->dupack_wnd_lock = 1;
 	tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 old_win = tp->rcv_wnd;
-	u32 cur_win = tcp_receive_window(tp);
-	u32 new_win = __tcp_select_window(sk);
-
-	/* Never shrink the offered window */
-	if (new_win < cur_win) {
-		/* Danger Will Robinson!
-		 * Don't update rcv_wup/rcv_wnd here or else
-		 * we will not be able to advertise a zero
-		 * window in time.  --DaveM
-		 *
-		 * Relax Will Robinson.
-		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+	u32 cur_win;
+	u32 new_win;
+
+	if (tp->dupack_wnd_lock) {
+		new_win = old_win;
+		tp->dupack_wnd_lock = 0;
+	} else {
+		cur_win = tcp_receive_window(tp);
+		new_win = __tcp_select_window(sk);
+		/* Never shrink the offered window */
+		if (new_win < cur_win) {
+			/* Danger Will Robinson!
+			 * Don't update rcv_wup/rcv_wnd here or else
+			 * we will not be able to advertise a zero
+			 * window in time.  --DaveM
+			 *
+			 * Relax Will Robinson.
+			 */
+			if (new_win == 0)
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
+		tp->rcv_wnd = new_win;
+		tp->rcv_wup = tp->rcv_nxt;
 	}
-	tp->rcv_wnd = new_win;
-	tp->rcv_wup = tp->rcv_nxt;
 
 	/* Make sure we do not exceed the maximum possible
 	 * scaled window.
-- 
2.7.4


* Re: [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
  2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
                   ` (4 preceding siblings ...)
  2018-03-13 10:25 ` [PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows Ilpo Järvinen
@ 2018-03-13 13:43 ` Eric Dumazet
  2018-03-15 11:13   ` Ilpo Järvinen
  5 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2018-03-13 13:43 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Sergei Shtylyov

On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> Here is a series of fixes to issues that occur when SACK is not
> enabled for a TCP connection. These are not fixes to just some
> remote corner cases of recovery but many/almost all recoveries
> without SACK will be impacted by one (or even more than one) of
> them. The sender-side changes (1-4) are not mainly, if any, to
> improve performance (throughput) but address correctness
> (congestion control responses should not get incorrectly
> reverted) and burstiness (that may cause additional problems
> later as some of the packets in such bursts may get dropped
> needing again to resort to loss recovery that is likely
> similarly bursty).

GRO (or similar) adoption a long time ago (not only by linux) had a
serious impact on non-SACK flows.
Should we also disable GRO by default?
(my answer is no, just in case someone wonders)

I am sorry, but I am still not convinced by your patches, which give
a wrong incentive to people keeping their prehistoric stacks, which
have serious problems on wifi anyway, and/or middleboxes
"aggregating/compressing ACKs".

The last patch is particularly problematic in my opinion; you have
not provided packetdrill tests to prove there was no danger.

It seems you leave to us the task of dealing with possible issues,
having only added a bold changelog.

Since the bugs you claim to fix are at least 10 years old, and your
fixes are far from trivial, please wait for our packetdrill
regression tests to be published. This should happen in less than a
month, I believe, in time for linux-4.17.

Note that the publication of the updated packetdrill and test suite
is not an easy task, since we accumulated a lot of hacks, both in
the kernel to cope with timer slacks and in packetdrill for various
experimental or private features that are not yet in upstream
kernels.

So we are cleaning all this, and will be happy to help you if you need.

Thanks.


* Re: [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
  2018-03-13 13:43 ` [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Eric Dumazet
@ 2018-03-15 11:13   ` Ilpo Järvinen
  0 siblings, 0 replies; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-15 11:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Sergei Shtylyov

On Tue, 13 Mar 2018, Eric Dumazet wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > Here is a series of fixes to issues that occur when SACK is not
> > enabled for a TCP connection. These are not fixes to just some
> > remote corner cases of recovery but many/almost all recoveries
> > without SACK will be impacted by one (or even more than one) of
> > them. The sender-side changes (1-4) are not mainly, if any, to
> > improve performance (throughput) but address correctness
> > (congestion control responses should not get incorrectly
> > reverted) and burstiness (that may cause additional problems
> > later as some of the packets in such bursts may get dropped
> > needing again to resort to loss recovery that is likely
> > similarly bursty).
> 
> GRO (or similar) adoption a long time ago (not only by linux) had a
> serious impact on non SACK flow.
> Should we also disable GRO by default ?
> (my answer is no, just in case someone wonders)
>
> I am sorry, but I am still not convinced by your patches, trying to
> give a wrong incentive to people keeping their prehistoric stacks,
> that have serious problems on wifi anyway, and or middle boxes
> "aggregating/compressing ACKS"

So now you think that I'm against high perf enhancements (even after 
having written some bits of them)?

However, I think that in case somebody disables something that is 
enabled by default, be it SACK, GRO, timestamps, etc., the TCP he/she 
gets should still make sense (and that regardless of some middleboxes 
trying hard to break otherwise working stuff). But I guess you don't 
agree here(?) and would perhaps even want to remove the ability to 
disable them? Although obviously that wouldn't even work with 
non-SACK (unless RST or so starts to get used, but even that could 
unfortunately be worked around).

Also, I'm somewhat confused by your position here. On one hand you 
said that tests should be added to regression tests to avoid breaking 
these non-SACK cases again, implying that things should be fixed, and 
on the other hand you seem to say that non-SACK must be left broken?

> The last patch is particularly problematic in my opinion, you have not
> provided packetdrill tests to prove there was no danger.

Can that even be done? Proving absence of danger with packetdrill?
AFAIK such proofs are available only with formal verification.
 
> It seems you leave to us the task of dealing with possible issues,
> only added a bold changelog.

Heh, I tried to add details to the changelog because you explicitly 
said I should, and now you fault me for doing that :-). Also, you're 
leaving out the detail that I tried to understand the condition that 
worried you and found out that the code already disallows any 
shrinking of the advertized window for duplicate ACKs (but I guess 
there just might have been some miscommunication between us, given 
that your last reply to 5/5 v1 made no sense to me).

> Since the bugs you claim to fix are at least 10 years old, and your
> fixes are far from being trivial,
> please wait our packetdrill regression tests being published.
> This should happen in less than one month I believe, in time for linux-4.17
> 
> Note that the publication of the updated packetdrill and test suite is
> not an easy task,
> since we accumulated a lot of hacks both in kernel to cope with timer
> slacks and in packetdrill
> for various experimental or private features that are not yet in
> upstream kernels.
> 
> So we are cleaning all this, and will be happy to help you if you need.

I slightly disagree on the trivial bit here, as I think it's trivial 
to see that the changes only affect non-SACK flows (which you seem to 
say you don't care about being broken, as that gives an incentive to 
use SACK). But then I'm not too worried about waiting a month.


-- 
 i.


* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-13 10:25 ` [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery Ilpo Järvinen
@ 2018-03-27  2:07   ` Yuchung Cheng
  2018-03-27 14:23     ` Ilpo Järvinen
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-03-27  2:07 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> A miscalculation for the number of acknowledged packets occurs during
> RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> any non-retransmitted skbs. The reason is that pkts_acked value
> calculated in tcp_clean_rtx_queue is not correct for slow start after
> RTO as it may include segments that were not lost and therefore did
> not need retransmissions in the slow start following the RTO. Then
> tcp_slow_start will add the excess into cwnd bloating it and
> triggering a burst.
>
> Instead, we want to pass only the number of retransmitted segments
> that were covered by the cumulative ACK (and potentially newly sent
> data segments too if the cumulative ACK covers that far).
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..4a26c09 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (tcp_is_reno(tp)) {
> +                       /* Due to discontinuity on RTO in the artificial
> +                        * sacked_out calculations, TCP must restrict
> +                        * pkts_acked without SACK to rexmits and new data
> +                        * segments
> +                        */
> +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> +                               pkts_acked = rexmit_acked + newdata_acked;
> +
My understanding is there are two problems

1) your fix: the reordering logic in tcp_remove_reno_sacks requires
a precise cumulatively acked count, not a newly acked count?


2) current code: pkts_acked can substantially over-estimate the newly
delivered pkts in both SACK and non-SACK cases. For example, let's say
99/100 packets are already sacked, and the next ACK acks 100 pkts.
pkts_acked == 100, but really only one packet is delivered. It's wrong
to inform congestion control that 100 packets have just been delivered.
AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
the newly delivered packets.

A better fix for both SACK and non-SACK seems to be moving
ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
calibrated? This is what BBR is currently doing to avoid these pitfalls.
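
In code, the over-estimate in the 99/100 example above looks like
this (userspace sketch with hypothetical numbers):

#include <stdio.h>

int main(void)
{
	unsigned int delivered_before = 99;	/* counted when SACKed */
	unsigned int delivered_after = 100;	/* after the final ACK */
	unsigned int pkts_acked = 100;		/* cumulatively acked skbs */

	printf("pkts_acked reported to cc: %u\n", pkts_acked);	/* 100 */
	printf("actually newly delivered:  %u\n",
	       delivered_after - delivered_before);		/* 1 */
	return 0;
}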


>                         tcp_remove_reno_sacks(sk, pkts_acked);
>                 } else {
>                         int delta;
> --
> 2.7.4
>


* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-27  2:07   ` Yuchung Cheng
@ 2018-03-27 14:23     ` Ilpo Järvinen
  2018-03-27 15:06       ` Yuchung Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-27 14:23 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Mon, 26 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > A miscalculation for the number of acknowledged packets occurs during
> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > any non-retransmitted skbs. The reason is that pkts_acked value
> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > RTO as it may include segments that were not lost and therefore did
> > not need retransmissions in the slow start following the RTO. Then
> > tcp_slow_start will add the excess into cwnd bloating it and
> > triggering a burst.
> >
> > Instead, we want to pass only the number of retransmitted segments
> > that were covered by the cumulative ACK (and potentially newly sent
> > data segments too if the cumulative ACK covers that far).
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..4a26c09 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >         long seq_rtt_us = -1L;
> >         long ca_rtt_us = -1L;
> >         u32 pkts_acked = 0;
> > +       u32 rexmit_acked = 0;
> > +       u32 newdata_acked = 0;
> >         u32 last_in_flight = 0;
> >         bool rtt_update;
> >         int flag = 0;
> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
> > -                       if (sacked & TCPCB_SACKED_RETRANS)
> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
> >                                 tp->retrans_out -= acked_pcount;
> > +                               rexmit_acked += acked_pcount;
> > +                       }
> >                         flag |= FLAG_RETRANS_DATA_ACKED;
> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >                         last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 reord = start_seq;
> >                         if (!after(scb->end_seq, tp->high_seq))
> >                                 flag |= FLAG_ORIG_SACK_ACKED;
> > +                       else
> > +                               newdata_acked += acked_pcount;
> >                 }
> >
> >                 if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (tcp_is_reno(tp)) {
> > +                       /* Due to discontinuity on RTO in the artificial
> > +                        * sacked_out calculations, TCP must restrict
> > +                        * pkts_acked without SACK to rexmits and new data
> > +                        * segments
> > +                        */
> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> > +                               pkts_acked = rexmit_acked + newdata_acked;
> > +
> My understanding is there are two problems
> 
> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> precise cumulatively acked count, not newly acked count?

While I'm not entirely sure whether you intended to say that my fix 
is broken or not, I thought about this very difference a lot while 
making the fix, and I believe that this fix is needed because of the 
discontinuity at RTO (sacked_out is cleared as we set L-bits + 
lost_out). This is an artifact in the imitation of sacked_out for 
non-SACK, but at RTO we can't keep that in sync because we set L-bits 
(and have no S-bits to guide us). Thus, we can no longer "use" those 
skbs with only the L-bit for the reno_sacks logic.

In tcp_remove_reno_sacks, acked - sacked_out is used to calculate 
tp->delivered; using the plain cumulative acked count causes 
congestion control breakage later, as the call to tcp_cong_control 
will directly use the difference in tp->delivered.

This boils down to the exact definition of tp->delivered (the one 
given in the header is not detailed enough). I guess you might have a 
better idea what it exactly is, since one of you added it? There are 
subtle things in the definition that can make it entirely unsuitable 
for cc decisions. Should those segments that we (possibly) already 
counted into tp->delivered during a (potentially preceding) 
CA_Recovery be added to it for a _second time_ or not? This fix 
avoids such double counting (the price is that we might 
underestimate). For SACK+ACK losses, the similar question is: should 
those segments that we missed counting into tp->delivered during the 
preceding CA_Recovery (due to losing enough SACKs) be added into 
tp->delivered now during RTO recovery or not (I'm not proposing we 
fix this unless we want to fix both issues at the same time here, as 
its impact with SACK is not that significant)? Is tp->delivered 
supposed to under- or overestimate (in the cases where we're not sure 
what/when something happened)? ...If it's overestimating under any 
circumstances (for the current ACK), we cannot base our cc decision 
on it.

tcp_check_reno_reordering might like to have the cumulatively acked 
count, but due to the aforementioned discontinuity we cannot 
accurately provide in CA_Loss what tcp_limit_reno_sacked expects 
anyway (and the other CA states are unaffected by this fix). While we 
could call tcp_check_reno_reordering directly from 
tcp_clean_rtx_queue, it wouldn't remove the fundamental discontinuity 
problem that we have for what tcp_limit_reno_sacked assumes about 
sacked_out. It might actually be that tcp_limit_reno_sacked is just 
never going to work after we zero tp->sacked_out (this would actually 
be problem #3), or at least it ends up underestimating the reordering 
degree by the amount we cleared from sacked_out at RTO?

> 2) current code: pkts_acked can substantially over-estimate the newly
> delivered pkts in both SACK and non-SACK cases. For example, let's say
> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
> pkts_acked == 100 but really only one packet is delivered. It's wrong
> to inform congestion control that 100 packets have just delivered.
> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
> the newly delivered packets.
>
> A better fix for both SACK and non-SACK, seems to be moving
> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
> calibrated? this is what BBR is currently doing to avoid these pitfalls.

Unfortunately that would not fix the problematic calculation of 
tp->delivered.

But you might be right that there's a problem #2 (it's hard to notice 
for real, though, as most of the cc modules don't seem to use it for 
anything). However, this is for pkts_acked, so are you sure what the 
cc modules actually expect: the shrinkage in the # of outstanding 
packets or the number of newly delivered packets?

Looking (only) quickly into the modules (that use it other than in 
CA_Open):
- Somewhat suspicious delayed ACK detection logic in cdg
- HTCP might want the shrinkage in the # of outstanding packets (but 
I'm not entirely sure) for throughput measurement
- I guess TCP Illinois is broken (it assumes newly delivered 
packets), but it would not need to use it at all, as it just proxies 
the value in a local variable into tcp_illinois_cong_avoid, which has 
the correct acked count readily available.

There are separate cong_avoid and cong_control callbacks that are 
more obviously oriented toward doing cc, whereas I'd think the 
pkts_acked callback is more oriented toward RTT-sample related 
operations. Therefore, I'm not entirely sure this is what is wanted 
for pkts_acked, especially given the HTCP example.


-- 
 i.


* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-27 14:23     ` Ilpo Järvinen
@ 2018-03-27 15:06       ` Yuchung Cheng
  2018-03-28 12:45         ` Ilpo Järvinen
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-03-27 15:06 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> >
>> > A miscalculation for the number of acknowledged packets occurs during
>> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> > any non-retransmitted skbs. The reason is that pkts_acked value
>> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> > RTO as it may include segments that were not lost and therefore did
>> > not need retransmissions in the slow start following the RTO. Then
>> > tcp_slow_start will add the excess into cwnd bloating it and
>> > triggering a burst.
>> >
>> > Instead, we want to pass only the number of retransmitted segments
>> > that were covered by the cumulative ACK (and potentially newly sent
>> > data segments too if the cumulative ACK covers that far).
>> >
>> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>> > ---
>> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>> >  1 file changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> > index 9a1b3c1..4a26c09 100644
>> > --- a/net/ipv4/tcp_input.c
>> > +++ b/net/ipv4/tcp_input.c
>> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >         long seq_rtt_us = -1L;
>> >         long ca_rtt_us = -1L;
>> >         u32 pkts_acked = 0;
>> > +       u32 rexmit_acked = 0;
>> > +       u32 newdata_acked = 0;
>> >         u32 last_in_flight = 0;
>> >         bool rtt_update;
>> >         int flag = 0;
>> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
>> > -                       if (sacked & TCPCB_SACKED_RETRANS)
>> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
>> >                                 tp->retrans_out -= acked_pcount;
>> > +                               rexmit_acked += acked_pcount;
>> > +                       }
>> >                         flag |= FLAG_RETRANS_DATA_ACKED;
>> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>> >                         last_ackt = skb->skb_mstamp;
>> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                                 reord = start_seq;
>> >                         if (!after(scb->end_seq, tp->high_seq))
>> >                                 flag |= FLAG_ORIG_SACK_ACKED;
>> > +                       else
>> > +                               newdata_acked += acked_pcount;
>> >                 }
>> >
>> >                 if (sacked & TCPCB_SACKED_ACKED) {
>> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (tcp_is_reno(tp)) {
>> > +                       /* Due to discontinuity on RTO in the artificial
>> > +                        * sacked_out calculations, TCP must restrict
>> > +                        * pkts_acked without SACK to rexmits and new data
>> > +                        * segments
>> > +                        */
>> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
>> > +                               pkts_acked = rexmit_acked + newdata_acked;
>> > +
>> My understanding is there are two problems
>>
>> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
>> precise cumulatively acked count, not newly acked count?
>
> While I'm not entirely sure if you intented to say that my fix is broken
> or not, I thought this very difference alot while making the fix and I
> believe that this fix is needed because of the discontinuity at RTO
> (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>
> In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> tp->delivered, using plain cumulative acked causes congestion control
> breakage later as call to tcp_cong_control will directly use the
> difference in tp->delivered.
>
> This boils down the exact definition of tp->delivered (the one given in
> the header is not detailed enough). I guess you might have better idea
> what it exactly is since one of you has added it? There are subtle things
> in the defination that can make it entirely unsuitable for cc decisions.
> Should those segments that we (possibly) already counted into
> tp->delivered during (potentially preceeding) CA_Recovery be added to it
> for _second time_ or not? This fix avoids such double counting (the
Where is the double counting, assuming normal DUPACK behavior?

In the non-sack case:

1. upon receiving a DUPACK, we assume one packet has been delivered by
incrementing tp->delivered in tcp_add_reno_sack()
2. upon receiving a partial ACK or an ACK that acks recovery point
(high_seq), tp->delivered is incremented by (cumulatively acked -
#dupacks) in tcp_remove_reno_sacks()

therefore tp->delivered is tracking the # of packets delivered
(sacked, acked, DUPACK'd) with the most information it could have
inferred.

From congestion control's perspective, it cares about the delivery
information (e.g. how much), not the sequences (what or how).

I am pointing out that

1) your fix may fix one corner CC packet accounting issue in the
non-SACK CA_Loss case
2) but it does not fix the other (major) CC packet accounting issue in
the SACK case
3) it breaks the dynamic dupthresh / reordering in the non-SACK case,
as tcp_check_reno_reordering strictly requires the cumulative ack count.

Therefore I prefer
a) using tp->delivered to fix (1)(2)
b) perhaps improving tp->delivered SACK emulation code in the non-SACK case


> price is that we might underestimate). For SACK+ACK losses, the similar
> question is: Should those segments that we missed counting into
> tp->delivered during preceeding CA_Recovery (due to losing enough SACKs)
> be added into tp->delivered now during RTO recovery or not (I'm not
> proposing we fix this unless we want to fix the both issues at the same
> time here as its impact with SACK is not that significant)? Is
> tp->delivered supposed to under- or overestimate (in the cases we're not
> sure what/when something happened)? ...If it's overestimating under any
> circumstances (for the current ACK), we cannot base our cc decision on it.
>
> tcp_check_reno_reordering might like to have the cumulatively acked count
> but due to the forementioned discontinuity we anyway cannot accurately
> provide in CA_Loss what tcp_limit_reno_sacked expects (and the
> other CA states are unaffected by this fix). While we could call
> tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it wouldn't
> remove the fundamental discontinuity problem that we have for what
> tcp_limit_reno_sacked assumes about sacked_out. It might actually be so
> that tcp_limit_reno_sacked is just never going to work after we zero
> tp->sacked_out (this would actually be the problem #3) or at least ends
> up underestimating the reordering degree by the amount we cleared from
> sacked_out at RTO?
>
>> 2) current code: pkts_acked can substantially over-estimate the newly
>> delivered pkts in both SACK and non-SACK cases. For example, let's say
>> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
>> pkts_acked == 100 but really only one packet is delivered. It's wrong
>> to inform congestion control that 100 packets have just delivered.
>> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
>> the newly delivered packets.
>>
>> A better fix for both SACK and non-SACK, seems to be moving
>> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
>> calibrated? this is what BBR is currently doing to avoid these pitfalls.
>
> Unfortunately that would not fix the problematic calculation of
> tp->delivered.
>
> But you might be right that there's a problem #2 (it's hard to notice
> for real though as most of the cc modules don't seem to use it for
> anything). However, this is for pkts_acked so are you sure what the
> cc modules exactly expect: the shrinkage in # of outstanding packets or
> the number of newly delivered packets?
>
> Looking (only) quickly into the modules (that use it other than in
> CA_Open):
> - Somewhat suspicious delayed ACK detection logic in cdg
> - HTCP might want shrinkage in # of outstanding packets (but I'm not
> entirely sure) for throughput measurement
> - I guess TCP illinois is broken (assumes it's newly delivered packets)
> but it would not need to use it at all as it just proxies the value in
> a local variable into tcp_illinois_cong_avoid that has the correct acked
> readily available.
>
> There are separate cong_avoid and cong_control callbacks that are more
> obviously oriented for doing cc where as I'd think pkts_acked callback
> seems more oriented to for RTT-sample related operations. Therefore, I'm
> not entirely sure I this is what is wanted for pkts_acked, especially
> given the HTCP example.
>
>
> --
>  i.


* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-27 15:06       ` Yuchung Cheng
@ 2018-03-28 12:45         ` Ilpo Järvinen
  2018-03-28 14:14           ` Yuchung Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-03-28 12:45 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Tue, 27 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> <ilpo.jarvinen@helsinki.fi> wrote:
> >> >
> >> > A miscalculation for the number of acknowledged packets occurs during
> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> > RTO as it may include segments that were not lost and therefore did
> >> > not need retransmissions in the slow start following the RTO. Then
> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> > triggering a burst.
> >> >
> >> > Instead, we want to pass only the number of retransmitted segments
> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> > data segments too if the cumulative ACK covers that far).
> >> >
> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> >> > ---
> >>
> >> My understanding is there are two problems
> >>
> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> >> precise cumulatively acked count, not newly acked count?
> >
> > While I'm not entirely sure if you intented to say that my fix is broken
> > or not, I thought this very difference alot while making the fix and I
> > believe that this fix is needed because of the discontinuity at RTO
> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> >
> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > tp->delivered, using plain cumulative acked causes congestion control
> > breakage later as call to tcp_cong_control will directly use the
> > difference in tp->delivered.
> >
> > This boils down the exact definition of tp->delivered (the one given in
> > the header is not detailed enough). I guess you might have better idea
> > what it exactly is since one of you has added it? There are subtle things
> > in the defination that can make it entirely unsuitable for cc decisions.
> > Should those segments that we (possibly) already counted into
> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> > for _second time_ or not? This fix avoids such double counting (the
> Where is the double counting, assuming normal DUPACK behavior?
> 
> In the non-sack case:
> 
> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> incrementing tp->delivered in tcp_add_reno_sack()

1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity 
I've tried to point out quite many times already)...

> 2. upon receiving a partial ACK or an ACK that acks recovery point
> (high_seq), tp->delivered is incremented by (cumulatively acked -
> #dupacks) in tcp_remove_reno_sacks()

...and this won't happen correctly anymore after RTO (since non-SACK 
won't keep #dupacks due to the discontinuity). Thus we end up adding
cumulatively acked - 0 to tp->delivered on those ACKs.

> therefore tp->delivered is tracking the # of packets delivered
> (sacked, acked, DUPACK'd) with the most information it could have
> inferred.

Since you didn't answer any of my questions about tp->delivered 
directly, let me rephrase them as this example (non-SACK, of course):

4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for 
three segments. How much should tp->delivered be incremented? 2 or 3?

...I think 2 is the right answer.
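
In code, the two candidate answers come out as (userspace sketch of
the example above):

#include <stdio.h>

int main(void)
{
	unsigned int sacked_out = 0;	/* cleared at RTO */
	unsigned int acked = 3;		/* cumulatively acked segments */
	unsigned int rexmit_acked = 2;	/* both rexmits covered */
	unsigned int newdata_acked = 0;

	/* tcp_remove_reno_sacks() today: max(acked - sacked_out, 1) */
	unsigned int current = acked - sacked_out > 1 ?
			       acked - sacked_out : 1;
	printf("current accounting: %u\n", current);		   /* 3 */
	/* restricted to rexmits + new data, as in patch 1/5 */
	printf("restricted: %u\n", rexmit_acked + newdata_acked);  /* 2 */
	return 0;
}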

> From congestion control's perspective, it cares about the delivery
> information (e.g. how much), not the sequences (what or how).

I guess you must have missed my point. I'm talking about "how much" 
the whole time. It's just about when we can account for it (and when not).

> I am pointing out that
> 
> 1) your fix may fix one corner CC packet accounting issue in the
> non-SACK and CA_Loss
> 2) but it does not fix the other (major) CC packet accounting issue in
> the SACK case

You say major but it's major if and only if one is using one of the 
affected cc modules which were not that many and IMHO not very mainstream 
anyway (check my list from the previous email, not that I'd mind if they'd 
get fixed too). The rest of the cc modules do not seem to use the incorrect
value even if some of them have the pkts_acked callback.

Other CC packet accounting (regardless of SACK) is based on tp->delivered 
and tp->delivered is NOT similarly miscounted with SACK because the logic 
depends on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK 
loss).
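(For comparison, the SACK side of tcp_clean_rtx_queue() looks roughly
like this, which is why a segment cannot be credited twice there:

	if (sacked & TCPCB_SACKED_ACKED)
		tp->sacked_out -= acked_pcount;	/* already counted at SACK time */
	else if (tcp_is_sack(tp))
		tp->delivered += acked_pcount;	/* first and only credit */

i.e., only skbs without the S-bit bump tp->delivered on a cumulative ACK.)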

> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> as tcp_check_reno_reordering requires strictly cum. ack.

I think that the current non-SACK reordering detection logic is not that 
sound after RTO (but I have yet to prove this). ...There seem to be some 
failure modes which is why I even thought of disabling the whole thing
for non-SACK RTO recoveries and as it now seems you do care more about 
non-SACK than you initially claimed, I might even have motivation to fix 
more of those corners rather than the worst bugs only ;-). ...But I'll make 
the tp->delivered fix only in this patch to avoid any change to this part of 
the code.

> Therefore I prefer
> a) using tp->delivered to fix (1)(2)
> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case

Somehow I get the impression that you might assume/say here that 
tp->delivered is all correct for non-SACK? ...It isn't without a fix.
Therefore a) won't help any for non-SACK. tp->delivered for non-SACK is 
currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect 
pkts_acked is fed to it. ...Thus, b) is an intermediate step in the 
miscalculation chain I'm fixing with this change. b) also resolves (1) 
without additional changes to the logic!

I've included below the updated change that only fixes the tp->delivered 
calculation (not tested besides compiling). ...But I think it's worse than 
my previous version because tcp_remove_reno_sacks then assumes 
nonsensical L|S skbs (but there seem to be other, worse limitations in 
sacked_out imitation after RTO because we have all skbs marked with the 
L-bit so it's not that big a deal for me as most of that imitation code 
is a no-op anyway after RTO).


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..e11748d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 
 	if (acked > 0) {
 		/* One ACK acked hole. The rest eat duplicate ACKs. */
-		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
 		if (acked - 1 >= tp->sacked_out)
 			tp->sacked_out = 0;
 		else
@@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 	tcp_verify_left_out(tp);
 }
 
+static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
+{
+	if (acked > 0)
+		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
+}
+
 static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
 {
 	tp->sacked_out = 0;
@@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
+						      pkts_acked :
+						      rexmit_acked + newdata_acked);
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-28 12:45         ` Ilpo Järvinen
@ 2018-03-28 14:14           ` Yuchung Cheng
  2018-03-28 15:04             ` Yuchung Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-03-28 14:14 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Tue, 27 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> >> <ilpo.jarvinen@helsinki.fi> wrote:
>> >> >
>> >> > A miscalculation for the number of acknowledged packets occurs during
>> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> >> > any non-retransmitted skbs. The reason is that pkts_acked value
>> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> >> > RTO as it may include segments that were not lost and therefore did
>> >> > not need retransmissions in the slow start following the RTO. Then
>> >> > tcp_slow_start will add the excess into cwnd bloating it and
>> >> > triggering a burst.
>> >> >
>> >> > Instead, we want to pass only the number of retransmitted segments
>> >> > that were covered by the cumulative ACK (and potentially newly sent
>> >> > data segments too if the cumulative ACK covers that far).
>> >> >
>> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>> >> > ---
>> >>
>> >> My understanding is there are two problems
>> >>
>> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
>> >> precise cumulatively acked count, not newly acked count?
>> >
>> > While I'm not entirely sure if you intended to say that my fix is broken
>> > or not, I thought about this very difference a lot while making the fix and I
>> > believe that this fix is needed because of the discontinuity at RTO
>> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
>> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
>> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
>> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>> >
>> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
>> > tp->delivered, using plain cumulative acked causes congestion control
>> > breakage later as call to tcp_cong_control will directly use the
>> > difference in tp->delivered.
>> >
>> > This boils down to the exact definition of tp->delivered (the one given in
>> > the header is not detailed enough). I guess you might have a better idea
>> > what it exactly is since one of you has added it? There are subtle things
>> > in the definition that can make it entirely unsuitable for cc decisions.
>> > Should those segments that we (possibly) already counted into
>> > tp->delivered during (potentially preceding) CA_Recovery be added to it
>> > for a _second time_ or not? This fix avoids such double counting (the
>> Where is the double counting, assuming normal DUPACK behavior?
>>
>> In the non-sack case:
>>
>> 1. upon receiving a DUPACK, we assume one packet has been delivered by
>> incrementing tp->delivered in tcp_add_reno_sack()
>
> 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> I've tried to point out quite many times already)...
>
>> 2. upon receiving a partial ACK or an ACK that acks recovery point
>> (high_seq), tp->delivered is incremented by (cumulatively acked -
>> #dupacks) in tcp_remove_reno_sacks()
>
> ...and this won't happen correctly anymore after RTO (since non-SACK
> won't keep #dupacks due to the discontinuity). Thus we end up adding
> cumulatively acked - 0 to tp->delivered on those ACKs.
>
>> therefore tp->delivered is tracking the # of packets delivered
>> (sacked, acked, DUPACK'd) with the most information it could have
>> inferred.
>
> Since you didn't answer any of my questions about tp->delivered directly,
> let me rephrase them to this example (non-SACK, of course):
>
> 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> three segments. How much should tp->delivered be incremented? 2 or 3?
>
> ...I think 2 is the right answer.
>
>> From congestion control's perspective, it cares about the delivery
>> information (e.g. how much), not the sequences (what or how).
>
> I guess you must have missed my point. I'm talking about "how much"
> the whole time. It's just about when we can account for it (and when not).
>
>> I am pointing out that
>>
>> 1) your fix may fix one corner CC packet accounting issue in the
>> non-SACK and CA_Loss
>> 2) but it does not fix the other (major) CC packet accounting issue in
>> the SACK case
>
> You say major but it's major if and only if one is using one of the
> affected cc modules which were not that many and IMHO not very mainstream
> anyway (check my list from the previous email, not that I'd mind if they'd
> get fixed too). The rest of the cc modules do not seem to use the incorrect
> value even if some of them have the pkts_acked callback.
>
> Other CC packet accounting (regardless of SACK) is based on tp->delivered
> and tp->delivered is NOT similarly miscounted with SACK because the logic
> depends on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> loss).
>
>> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
>> as tcp_check_reno_reordering requires strictly cum. ack.
>
> I think that the current non-SACK reordering detection logic is not that
> sound after RTO (but I have yet to prove this). ...There seem to be some
> failure modes which is why I even thought of disabling the whole thing
> for non-SACK RTO recoveries and as it now seems you do care more about
> non-SACK than you initially claimed, I might even have motivation to fix
> more of those corners rather than the worst bugs only ;-). ...But I'll make
> the tp->delivered fix only in this patch to avoid any change to this part of
> the code.
>
>> Therefore I prefer
>> a) using tp->delivered to fix (1)(2)
>> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
>
> Somehow I get the impression that you might assume/say here that
[resending in plaintext]
That's the wrong impression. Perhaps it's worth reiterating what I agree
and disagree with:

1. [agree] there's an accounting issue in non-SACK, as you discovered,
which causes CC misbehavior

2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
    => that field is not used by common C.C. (you said so too)

3. [disagree] adjusting pkts_acked may not affect reordering
accounting in non-sack


For cubic or reno, the main source is the "acked_sacked" passed into
tcp_cong_avoid(). That variable is derived from tp->delivered.
Therefore we need to fix that to address the problem in (1).
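(Roughly, the data flow is the following sketch, not the exact mainline
code:

	u32 prior_delivered = tp->delivered;
	/* ... ACK processing, including tcp_remove_reno_sacks() ... */
	u32 acked_sacked = tp->delivered - prior_delivered;
	tcp_cong_avoid(sk, ack, acked_sacked);

so any overcount in tp->delivered directly inflates cwnd growth.)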

I have yet to read your code. Will read later today.

> tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> Therefore a) won't help any for non-SACK. tp->delivered for non-SACK is
> currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect
> pkts_acked is fed to it. ...Thus, b) is an intermediate step in the
> miscalculation chain I'm fixing with this change. b) also resolves (1)
> without additional changes to the logic!
>
> I've included below the updated change that only fixes the tp->delivered
> calculation (not tested besides compiling). ...But I think it's worse than
> my previous version because tcp_remove_reno_sacks then assumes
> nonsensical L|S skbs (but there seem to be other, worse limitations in
> sacked_out imitation after RTO because we have all skbs marked with the
> L-bit so it's not that big a deal for me as most of that imitation code
> is a no-op anyway after RTO).
>
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..e11748d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>
>         if (acked > 0) {
>                 /* One ACK acked hole. The rest eat duplicate ACKs. */
> -               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
>                 if (acked - 1 >= tp->sacked_out)
>                         tp->sacked_out = 0;
>                 else
> @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>         tcp_verify_left_out(tp);
>  }
>
> +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> +{
> +       if (acked > 0)
> +               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> +}
> +
>  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
>  {
>         tp->sacked_out = 0;
> @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (tcp_is_reno(tp)) {
> +                       tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> +                                                     pkts_acked :
> +                                                     rexmit_acked + newdata_acked);
>                         tcp_remove_reno_sacks(sk, pkts_acked);
>                 } else {
>                         int delta;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-28 14:14           ` Yuchung Cheng
@ 2018-03-28 15:04             ` Yuchung Cheng
  2018-04-04 10:42               ` Ilpo Järvinen
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-03-28 15:04 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ycheng@google.com> wrote:
>
> On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> >> <ilpo.jarvinen@helsinki.fi> wrote:
> >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >> >
> >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> >> <ilpo.jarvinen@helsinki.fi> wrote:
> >> >> >
> >> >> > A miscalculation for the number of acknowledged packets occurs during
> >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> >> > RTO as it may include segments that were not lost and therefore did
> >> >> > not need retransmissions in the slow start following the RTO. Then
> >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> >> > triggering a burst.
> >> >> >
> >> >> > Instead, we want to pass only the number of retransmitted segments
> >> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> >> > data segments too if the cumulative ACK covers that far).
> >> >> >
> >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> >> >> > ---
> >> >>
> >> >> My understanding is there are two problems
> >> >>
> >> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
> >> >> precise cumulatively acked count, not newly acked count?
> >> >
> >> > While I'm not entirely sure if you intended to say that my fix is broken
> >> > or not, I thought about this very difference a lot while making the fix and I
> >> > believe that this fix is needed because of the discontinuity at RTO
> >> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> >> >
> >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> >> > tp->delivered, using plain cumulative acked causes congestion control
> >> > breakage later as call to tcp_cong_control will directly use the
> >> > difference in tp->delivered.
> >> >
> >> > This boils down to the exact definition of tp->delivered (the one given in
> >> > the header is not detailed enough). I guess you might have a better idea
> >> > what it exactly is since one of you has added it? There are subtle things
> >> > in the definition that can make it entirely unsuitable for cc decisions.
> >> > Should those segments that we (possibly) already counted into
> >> > tp->delivered during (potentially preceding) CA_Recovery be added to it
> >> > for a _second time_ or not? This fix avoids such double counting (the
> >> Where is the double counting, assuming normal DUPACK behavior?
> >>
> >> In the non-sack case:
> >>
> >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> >> incrementing tp->delivered in tcp_add_reno_sack()
> >
> > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > I've tried to point out quite many times already)...
> >
> >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> >> #dupacks) in tcp_remove_reno_sacks()
> >
> > ...and this won't happen correctly anymore after RTO (since non-SACK
> > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > cumulatively acked - 0 to tp->delivered on those ACKs.
> >
> >> therefore tp->delivered is tracking the # of packets delivered
> >> (sacked, acked, DUPACK'd) with the most information it could have
> >> inferred.
> >
> > Since you didn't answer any of my questions about tp->delivered directly,
> > let me rephrase them to this example (non-SACK, of course):
> >
> > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> > three segments. How much should tp->delivered be incremented? 2 or 3?
> >
> > ...I think 2 is the right answer.
> >
> >> From congestion control's perspective, it cares about the delivery
> >> information (e.g. how much), not the sequences (what or how).
> >
> > I guess you must have missed my point. I'm talking about "how much"
> > the whole time. It's just about when we can account for it (and when not).
> >
> >> I am pointing out that
> >>
> >> 1) your fix may fix one corner CC packet accounting issue in the
> >> non-SACK and CA_Loss
> >> 2) but it does not fix the other (major) CC packet accounting issue in
> >> the SACK case
> >
> > You say major but it's major if and only if one is using one of the
> > affected cc modules which were not that many and IMHO not very mainstream
> > anyway (check my list from the previous email, not that I'd mind if they'd
> > get fixed too). The rest of the cc modules do not seem to use the incorrect
> > value even if some of them have the pkts_acked callback.
> >
> > Other CC packet accounting (regardless of SACK) is based on tp->delivered
> > and tp->delivered is NOT similarly miscounted with SACK because the logic
> > depends on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> > loss).
> >
> >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> >> as tcp_check_reno_reordering requires strictly cum. ack.
> >
> > I think that the current non-SACK reordering detection logic is not that
> > sound after RTO (but I have yet to prove this). ...There seem to be some
> > failure modes which is why I even thought of disabling the whole thing
> > for non-SACK RTO recoveries and as it now seems you do care more about
> > non-SACK than you initially claimed, I might even have motivation to fix
> > more of those corners rather than the worst bugs only ;-). ...But I'll make
> > the tp->delivered fix only in this patch to avoid any change to this part of
> > the code.
> >
> >> Therefore I prefer
> >> a) using tp->delivered to fix (1)(2)
> >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
> >
> > Somehow I get the impression that you might assume/say here that
> [resending in plaintext]
> That's the wrong impression. Perhaps it's worth reiterating what I agree
> and disagree with:
>
> 1. [agree] there's an accounting issue in non-SACK, as you discovered,
> which causes CC misbehavior
>
> 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
>     => that field is not used by common C.C. (you said so too)
>
> 3. [disagree] adjusting pkts_acked may not affect reordering
> accounting in non-sack
>
>
> For cubic or reno, the main source is the "acked_sacked" passed into
> tcp_cong_avoid(). That variable is derived from tp->delivered.
> Therefore we need to fix that to address the problem in (1).
>
> I have yet to read your code. Will read later today.
Your patch looks good. Some questions / comments:

1. Why does this apply only to CA_Loss and not also to CA_Recovery? This
may mitigate the GRO issue I noted in the other thread (see the sketch
after this list).

2. minor style nit: can we adjust tp->delivered directly in
tcp_clean_rtx_queue()? I am personally against calling "non-sack" reno
(e.g. tcp_*reno*()) because it's really confusing w/ Reno congestion
control when SACK is used. One day I'd like to rename all these *reno*
to _nonsack_.
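For item 1, the check in the patch would presumably become something like
this sketch (purely illustrative, not tested):

	bool in_recovery = icsk->icsk_ca_state == TCP_CA_Loss ||
			   icsk->icsk_ca_state == TCP_CA_Recovery;

	tcp_update_reno_delivered(tp, in_recovery ?
				      rexmit_acked + newdata_acked :
				      pkts_acked);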

Thanks

>
> > tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> > Therefore a) won't help any for non-SACK. tp->delivered for non-SACK is
> > currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect
> > pkts_acked is fed to it. ...Thus, b) is an intermediate step in the
> > miscalculation chain I'm fixing with this change. b) also resolves (1)
> > without additional changes to the logic!
> >
> > I've included below the updated change that only fixes the tp->delivered
> > calculation (not tested besides compiling). ...But I think it's worse than
> > my previous version because tcp_remove_reno_sacks then assumes
> > nonsensical L|S skbs (but there seem to be other, worse limitations in
> > sacked_out imitation after RTO because we have all skbs marked with the
> > L-bit so it's not that big a deal for me as most of that imitation code
> > is a no-op anyway after RTO).
> >
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..e11748d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >
> >         if (acked > 0) {
> >                 /* One ACK acked hole. The rest eat duplicate ACKs. */
> > -               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> >                 if (acked - 1 >= tp->sacked_out)
> >                         tp->sacked_out = 0;
> >                 else
> > @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >         tcp_verify_left_out(tp);
> >  }
> >
> > +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> > +{
> > +       if (acked > 0)
> > +               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> > +}
> > +
> >  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
> >  {
> >         tp->sacked_out = 0;
> > @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >         long seq_rtt_us = -1L;
> >         long ca_rtt_us = -1L;
> >         u32 pkts_acked = 0;
> > +       u32 rexmit_acked = 0;
> > +       u32 newdata_acked = 0;
> >         u32 last_in_flight = 0;
> >         bool rtt_update;
> >         int flag = 0;
> > @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
> > -                       if (sacked & TCPCB_SACKED_RETRANS)
> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
> >                                 tp->retrans_out -= acked_pcount;
> > +                               rexmit_acked += acked_pcount;
> > +                       }
> >                         flag |= FLAG_RETRANS_DATA_ACKED;
> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >                         last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 reord = start_seq;
> >                         if (!after(scb->end_seq, tp->high_seq))
> >                                 flag |= FLAG_ORIG_SACK_ACKED;
> > +                       else
> > +                               newdata_acked += acked_pcount;
> >                 }
> >
> >                 if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (tcp_is_reno(tp)) {
> > +                       tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> > +                                                     pkts_acked :
> > +                                                     rexmit_acked + newdata_acked);
> >                         tcp_remove_reno_sacks(sk, pkts_acked);
> >                 } else {
> >                         int delta;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-03-13 10:25 ` [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows Ilpo Järvinen
@ 2018-03-28 22:26   ` Yuchung Cheng
  2018-04-04 10:35     ` Ilpo Järvinen
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-03-28 22:26 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> If SACK is not enabled and the first cumulative ACK after the RTO
> retransmission covers more than the retransmitted skb, a spurious
> FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> The reason is that any non-retransmitted segment acknowledged will
> set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> no indication that it would have been delivered for real (the
> scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> case so the check for that bit won't help like it does with SACK).
> Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> in tcp_process_loss.
>
> We need to use a stricter condition for the non-SACK case and check
> that none of the cumulatively ACKed segments were retransmitted
> to prove that progress is due to original transmissions. Only then
> keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> non-SACK case.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  net/ipv4/tcp_input.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 4a26c09..c60745c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 pkts_acked = rexmit_acked + newdata_acked;
>
>                         tcp_remove_reno_sacks(sk, pkts_acked);
> +
> +                       /* If any of the cumulatively ACKed segments was
> +                        * retransmitted, non-SACK case cannot confirm that
> +                        * progress was due to original transmission due to
> +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> +                        * the packets may have been never retransmitted.
> +                        */
> +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> +                               flag &= ~FLAG_ORIG_SACK_ACKED;

How about keeping your excellent comment but moving the fix into the F-RTO
code directly so it's clearer? This way the flag's meaning remains clear:
it indicates that some never-retransmitted data was acked/sacked.

// pseudo code for illustration

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8d480542aa07..f7f3357de618 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2629,8 +2629,15 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
        if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
                /* Step 3.b. A timeout is spurious if not all data are
                 * lost, i.e., never-retransmitted data are (s)acked.
+                *
+                * If any of the cumulatively ACKed segments was
+                * retransmitted, non-SACK case cannot confirm that
+                * progress was due to original transmission due to
+                * lack of TCPCB_SACKED_ACKED bits even if some of
+                * the packets may have been never retransmitted.
                 */
                if ((flag & FLAG_ORIG_SACK_ACKED) &&
+                   (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
                    tcp_try_undo_loss(sk, true))
                        return;

>                 } else {
>                         int delta;
>
> --
> 2.7.4
>

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
  2018-03-13 10:25 ` [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled Ilpo Järvinen
@ 2018-03-28 22:36   ` Yuchung Cheng
  2018-04-04 10:23     ` Ilpo Järvinen
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-03-28 22:36 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> When a cumulative ACK lands on high_seq at the end of loss
> recovery and SACK is not enabled, the sender needs to avoid
> false fast retransmits (RFC6582). The avoidance mechanism is
> implemented by remaining in the loss recovery CA state until
> one additional cumulative ACK arrives. During the operation of
> this avoidance mechanism, there is an internal transient in the
> use of state variables which will always trigger a bogus undo.
Do we have to make undo in non-sack perfect? Can we consider a much
simpler but imperfect fix:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8d480542aa07..95225d9de0af 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2356,6 +2356,7 @@ static bool tcp_try_undo_recovery(struct sock *sk)
                 * fast retransmits (RFC2582). SACK TCP is safe. */
                if (!tcp_any_retrans_done(sk))
                        tp->retrans_stamp = 0;
+               tp->undo_marker = 0;
                return true;
        }



>
> When we enter this transient state in tcp_try_undo_recovery,
> tcp_any_retrans_done is often (always?) false resulting in
> clearing retrans_stamp. On the next cumulative ACK,
> tcp_try_undo_recovery again executes because CA state still
> remains in the same recovery state and tcp_may_undo will always
> return true because tcp_packet_delayed has this condition:
>     return !tp->retrans_stamp || ...
>
> Check if the false fast retransmit transient avoidance is in
> progress in tcp_packet_delayed to avoid bogus undos. Since snd_una
> has advanced already on this ACK but CA state still remains
> unchanged (CA state is updated slightly later than undo is
> checked), prior_snd_una needs to be passed to tcp_packet_delayed
> (instead of tp->snd_una). Passing prior_snd_una around to
> the tcp_packet_delayed makes this change look more involved than
> it really is.
>
> The additional checks done in this change only affect non-SACK
> case, the SACK case remains the same.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  net/ipv4/tcp_input.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 72ecfbb..270aa48 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
>  /* Nothing was retransmitted or returned timestamp is less
>   * than timestamp of the first retransmission.
>   */
> -static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
> +static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
> +                                     const u32 prior_snd_una)
>  {
> -       return !tp->retrans_stamp ||
> -              tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
> +       if (!tp->retrans_stamp) {
> +               /* Sender will be in a transient state with cleared
> +                * retrans_stamp during false fast retransmit prevention
> +                * mechanism
> +                */
> +               return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
> +       }
> +       return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
>  }
>
>  /* Undo procedures. */
> @@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
>         tp->rack.advanced = 1; /* Force RACK to re-exam losses */
>  }
>
> -static inline bool tcp_may_undo(const struct tcp_sock *tp)
> +static inline bool tcp_may_undo(const struct tcp_sock *tp,
> +                               const u32 prior_snd_una)
>  {
> -       return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
> +       return tp->undo_marker &&
> +              (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
>  }
>
>  /* People celebrate: "We love our President!" */
> -static bool tcp_try_undo_recovery(struct sock *sk)
> +static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>
> -       if (tcp_may_undo(tp)) {
> +       if (tcp_may_undo(tp, prior_snd_una)) {
>                 int mib_idx;
>
>                 /* Happy end! We did not retransmit anything
> @@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
>  }
>
>  /* Undo during loss recovery after partial ACK or using F-RTO. */
> -static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
> +static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
> +                             bool frto_undo)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>
> -       if (frto_undo || tcp_may_undo(tp)) {
> +       if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
>                 tcp_undo_cwnd_reduction(sk, true);
>
>                 DBGUNDO(sk, "partial loss");
> @@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
>   * recovered or spurious. Otherwise retransmits more on partial ACKs.
>   */
>  static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
> -                            int *rexmit)
> +                            int *rexmit, const u32 prior_snd_una)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         bool recovered = !before(tp->snd_una, tp->high_seq);
>
>         if ((flag & FLAG_SND_UNA_ADVANCED) &&
> -           tcp_try_undo_loss(sk, false))
> +           tcp_try_undo_loss(sk, prior_snd_una, false))
>                 return;
>
>         if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
> @@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
>                  * lost, i.e., never-retransmitted data are (s)acked.
>                  */
>                 if ((flag & FLAG_ORIG_SACK_ACKED) &&
> -                   tcp_try_undo_loss(sk, true))
> +                   tcp_try_undo_loss(sk, prior_snd_una, true))
>                         return;
>
>                 if (after(tp->snd_nxt, tp->high_seq)) {
> @@ -2665,7 +2675,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
>
>         if (recovered) {
>                 /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
> -               tcp_try_undo_recovery(sk);
> +               tcp_try_undo_recovery(sk, prior_snd_una);
>                 return;
>         }
>         if (tcp_is_reno(tp)) {
> @@ -2685,7 +2695,7 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>
> -       if (tp->undo_marker && tcp_packet_delayed(tp)) {
> +       if (tp->undo_marker && tcp_packet_delayed(tp, prior_snd_una)) {
>                 /* Plain luck! Hole if filled with delayed
>                  * packet, rather than with a retransmit. Check reordering.
>                  */
> @@ -2788,7 +2798,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
>                 case TCP_CA_Recovery:
>                         if (tcp_is_reno(tp))
>                                 tcp_reset_reno_sack(tp);
> -                       if (tcp_try_undo_recovery(sk))
> +                       if (tcp_try_undo_recovery(sk, prior_snd_una))
>                                 return;
>                         tcp_end_cwnd_reduction(sk);
>                         break;
> @@ -2815,7 +2825,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
>                 tcp_rack_identify_loss(sk, ack_flag);
>                 break;
>         case TCP_CA_Loss:
> -               tcp_process_loss(sk, flag, is_dupack, rexmit);
> +               tcp_process_loss(sk, flag, is_dupack, rexmit, prior_snd_una);
>                 tcp_rack_identify_loss(sk, ack_flag);
>                 if (!(icsk->icsk_ca_state == TCP_CA_Open ||
>                       (*ack_flag & FLAG_LOST_RETRANS)))
> --
> 2.7.4
>

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
  2018-03-28 22:36   ` Yuchung Cheng
@ 2018-04-04 10:23     ` Ilpo Järvinen
  0 siblings, 0 replies; 25+ messages in thread
From: Ilpo Järvinen @ 2018-04-04 10:23 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > When a cumulative ACK lands on high_seq at the end of loss
> > recovery and SACK is not enabled, the sender needs to avoid
> > false fast retransmits (RFC6582). The avoidance mechanism is
> > implemented by remaining in the loss recovery CA state until
> > one additional cumulative ACK arrives. During the operation of
> > this avoidance mechanism, there is an internal transient in the
> > use of state variables which will always trigger a bogus undo.
>
> Do we have to make undo in non-sack perfect? Can we consider a much
> simpler but imperfect fix:
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8d480542aa07..95225d9de0af 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2356,6 +2356,7 @@ static bool tcp_try_undo_recovery(struct sock *sk)
>                  * fast retransmits (RFC2582). SACK TCP is safe. */
>                 if (!tcp_any_retrans_done(sk))
>                         tp->retrans_stamp = 0;
> +               tp->undo_marker = 0;
>                 return true;
>         }

Yes, that's of course possible and would work around the issue too. 
In fact, I initially did that kind of fix for myself (I put it into a 
block with tp->retrans_stamp = 0 though). But then I realized that it is 
not that complicated to make the fix locally in tcp_packet_delayed()
(except for the annoyance of passing all the necessary state parameters
through the deep static call-chain, but that should pose no big challenge 
for the compiler to handle, I guess).
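(That earlier variant was roughly the following, shown only to
illustrate where it would have lived inside tcp_try_undo_recovery:

	if (!tcp_any_retrans_done(sk)) {
		tp->retrans_stamp = 0;
		tp->undo_marker = 0;
	}
)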

BTW, do you know under what circumstances tcp_any_retrans_done(sk) 
would return non-zero here (snd_una == high_seq, so those rexmits 
would need to be above high_seq)?

-- 
 i.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-03-28 22:26   ` Yuchung Cheng
@ 2018-04-04 10:35     ` Ilpo Järvinen
  2018-04-04 16:33       ` Neal Cardwell
  0 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-04-04 10:35 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > If SACK is not enabled and the first cumulative ACK after the RTO
> > retransmission covers more than the retransmitted skb, a spurious
> > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > The reason is that any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case so the check for that bit won't help like it does with SACK).
> > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > in tcp_process_loss.
> >
> > We need to use a stricter condition for the non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted
> > to prove that progress is due to original transmissions. Only then
> > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > non-SACK case.
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 4a26c09..c60745c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 pkts_acked = rexmit_acked + newdata_acked;
> >
> >                         tcp_remove_reno_sacks(sk, pkts_acked);
> > +
> > +                       /* If any of the cumulatively ACKed segments was
> > +                        * retransmitted, non-SACK case cannot confirm that
> > +                        * progress was due to original transmission due to
> > +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> > +                        * the packets may have been never retransmitted.
> > +                        */
> > +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> > +                               flag &= ~FLAG_ORIG_SACK_ACKED;
> 
> How about keeping your excellent comment but moving the fix into the F-RTO
> code directly so it's clearer? This way the flag's meaning remains clear:
> it indicates that some never-retransmitted data was acked/sacked.
> 
> // pseudo code for illustration
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8d480542aa07..f7f3357de618 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2629,8 +2629,15 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
>         if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
>                 /* Step 3.b. A timeout is spurious if not all data are
>                  * lost, i.e., never-retransmitted data are (s)acked.
> +                *
> +                * If any of the cumulatively ACKed segments was
> +                * retransmitted, non-SACK case cannot confirm that
> +                * progress was due to original transmission due to
> +                * lack of TCPCB_SACKED_ACKED bits even if some of
> +                * the packets may have been never retransmitted.
>                  */
>                 if ((flag & FLAG_ORIG_SACK_ACKED) &&
> +                   (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
>                     tcp_try_undo_loss(sk, true))
>                         return;

Of course I could put it back there but I really like the new place more 
(which was a result of your suggestion to place the code elsewhere).
IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we 
weren't successful in proving (there in tcp_clean_rtx_queue) that progress 
was due to original transmissions and thus I would not want to falsely 
indicate it with that flag. And there's the non-SACK related block already 
there anyway, so it keeps the non-SACK "pollution" out of the SACK code paths.

(In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to 
FLAG_ORIG_PROGRESS; the latter is more descriptive of the condition 
we're after regardless of SACK and less ambiguous in the non-SACK case.)

-- 
 i.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-03-28 15:04             ` Yuchung Cheng
@ 2018-04-04 10:42               ` Ilpo Järvinen
  2018-04-04 17:17                 ` Neal Cardwell
  0 siblings, 1 reply; 25+ messages in thread
From: Ilpo Järvinen @ 2018-04-04 10:42 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ycheng@google.com> wrote:
> >
> > On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> > <ilpo.jarvinen@helsinki.fi> wrote:
> > > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> > >
> > >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> > >> <ilpo.jarvinen@helsinki.fi> wrote:
> > >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> > >> >
> > >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > >> >> <ilpo.jarvinen@helsinki.fi> wrote:
> > >> >> >
> > >> >> > A miscalculation for the number of acknowledged packets occurs during
> > >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> > >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > >> >> > RTO as it may include segments that were not lost and therefore did
> > >> >> > not need retransmissions in the slow start following the RTO. Then
> > >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> > >> >> > triggering a burst.
> > >> >> >
> > >> >> > Instead, we want to pass only the number of retransmitted segments
> > >> >> > that were covered by the cumulative ACK (and potentially newly sent
> > >> >> > data segments too if the cumulative ACK covers that far).
> > >> >> >
> > >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > >> >> > ---
> > >> >>
> > >> >> My understanding is there are two problems
> > >> >>
> > >> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
> > >> >> precise cumulatively acked count, not newly acked count?
> > >> >
> > >> > While I'm not entirely sure if you intended to say that my fix is broken
> > >> > or not, I thought about this very difference a lot while making the fix and I
> > >> > believe that this fix is needed because of the discontinuity at RTO
> > >> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> > >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> > >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> > >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> > >> >
> > >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > >> > tp->delivered, using plain cumulative acked causes congestion control
> > >> > breakage later as call to tcp_cong_control will directly use the
> > >> > difference in tp->delivered.
> > >> >
> > >> > This boils down to the exact definition of tp->delivered (the one given in
> > >> > the header is not detailed enough). I guess you might have a better idea
> > >> > what it exactly is since one of you has added it? There are subtle things
> > >> > in the definition that can make it entirely unsuitable for cc decisions.
> > >> > Should those segments that we (possibly) already counted into
> > >> > tp->delivered during (potentially preceding) CA_Recovery be added to it
> > >> > for a _second time_ or not? This fix avoids such double counting (the
> > >> Where is the double counting, assuming normal DUPACK behavior?
> > >>
> > >> In the non-sack case:
> > >>
> > >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> > >> incrementing tp->delivered in tcp_add_reno_sack()
> > >
> > > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > > I've tried to point out quite many times already)...
> > >
> > >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> > >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> > >> #dupacks) in tcp_remove_reno_sacks()
> > >
> > > ...and this won't happen correctly anymore after RTO (since non-SACK
> > > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > > cumulatively acked - 0 to tp->delivered on those ACKs.
> > >
> > >> therefore tp->delivered is tracking the # of packets delivered
> > >> (sacked, acked, DUPACK'd) with the most information it could have
> > >> inferred.
> > >
> > > Since you didn't answer any of my questions about tp->delivered directly,
> > > let me rephrase them to this example (non-SACK, of course):
> > >
> > > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > > Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> > > three segments. How much should tp->delivered be incremented? 2 or 3?
> > >
> > > ...I think 2 is the right answer.
> > >
> > >> From congestion control's perspective, it cares about the delivery
> > >> information (e.g. how much), not the sequences (what or how).
> > >
> > > I guess you must have missed my point. I'm talking about "how much"
> > > the whole time. It's just about when we can account for it (and when not).
> > >
> > >> I am pointing out that
> > >>
> > >> 1) your fix may fix one corner CC packet accounting issue in the
> > >> non-SACK and CA_Loss
> > >> 2) but it does not fix the other (major) CC packet accounting issue in
> > >> the SACK case
> > >
> > > You say major but it's major if and only if one is using one of the
> > > affected cc modules which were not that many and IMHO not very mainstream
> > > anyway (check my list from the previous email, not that I'd mind if they'd
> > > get fixed too). The rest of the cc modules do not seem to use the incorrect
> > > value even if some of them have the pkts_acked callback.
> > >
> > > Other CC packet accounting (regardless of SACK) is based on tp->delivered
> > > and tp->delivered is NOT similarly miscounted with SACK because the logic
> > > depends on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> > > loss).
> > >
> > >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> > >> as tcp_check_reno_reordering requires strictly cum. ack.
> > >
> > > I think that the current non-SACK reordering detection logic is not that
> > > sound after RTO (but I have yet to prove this). ...There seem to be some
> > > failure modes which is why I even thought of disabling the whole thing
> > > for non-SACK RTO recoveries and as it now seems you do care more about
> > > non-SACK than you initially claimed, I might even have motivation to fix
> > > more of those corners rather than the worst bugs only ;-). ...But I'll make
> > > the tp->delivered fix only in this patch to avoid any change to this part of
> > > the code.
> > >
> > >> Therefore I prefer
> > >> a) using tp->delivered to fix (1)(2)
> > >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
> > >
> > > Somehow I get the impression that you might assume/say here that
> > [resending in plaintext]
> > That's the wrong impression. Perhaps it's worth reiterating what I agree
> > and disagree with:
> >
> > 1. [agree] there's an accounting issue in non-SACK, as you discovered,
> > which causes CC misbehavior
> >
> > 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
> >     => that field is not used by common C.C. (you said so too)
> >
> > 3. [disagree] adjusting pkts_acked may not affect reordering
> > accounting in non-sack
> >
> >
> >
> > For cubic or reno, the main source is the "acked_sacked" passed into
> > tcp_cong_avoid(). That variable is derived from tp->delivered.
> > Therefore we need to fix that to address the problem in (1).
> >
> > I have yet to read your code. Will read later today.
> Your patch looks good. Some questions / comments:

Just to be sure that we understand each other the right way around this 
time: I guess this resolved both of your "disagree" points above?

> 1. Why does this apply only to CA_Loss and not also to CA_Recovery? This
> may mitigate the GRO issue I noted in the other thread.

Hmm, that's indeed a good idea. I'll give it some more thought but my 
initial impression is that it should work.

I initially thought that the GRO issues just had to do with missing xmit 
opportunities and was confused by the use of "mitigate" here but then 
realized you also meant (or perhaps mainly?) that GRO causes similar 
bursts (once the too-small sacked_out runs out).

> 2. minor style nit: can we adjust tp->delivered directly in
> tcp_clean_rtx_queue()? I am personally against calling "non-sack" reno
> (e.g. tcp_*reno*()) because it's really confusing w/ Reno congestion
> control when SACK is used. One day I'd like to rename all these *reno*
> to _nonsack_.

That's what I actually did first but ended up putting it into its own 
function because of line lengths (not a particularly good reason).

...So yes, I can put it into tcp_clean_rtx_queue() in the next version.
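For concreteness, the inlined variant would presumably look something
like this (a sketch only, reusing the counters from the earlier patch):

	if (tcp_is_reno(tp)) {
		int delivered = (icsk->icsk_ca_state != TCP_CA_Loss) ?
				pkts_acked : rexmit_acked + newdata_acked;

		/* One ACK acked hole. The rest eat duplicate ACKs. */
		if (delivered > 0)
			tp->delivered += max_t(int, delivered - tp->sacked_out, 1);
		tcp_remove_reno_sacks(sk, pkts_acked);
	}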

I'll also try to keep that "reno" thing in mind.


-- 
 i.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-04-04 10:35     ` Ilpo Järvinen
@ 2018-04-04 16:33       ` Neal Cardwell
  2018-04-04 17:12         ` Yuchung Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Neal Cardwell @ 2018-04-04 16:33 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Yuchung Cheng, Netdev, Eric Dumazet, sergei.shtylyov

On Wed, Apr 4, 2018 at 6:35 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
wrote:

> On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > <ilpo.jarvinen@helsinki.fi> wrote:
> > >
> > > If SACK is not enabled and the first cumulative ACK after the RTO
> > > retransmission covers more than the retransmitted skb, a spurious
> > > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > > The reason is that any non-retransmitted segment acknowledged will
> > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > > no indication that it would have been delivered for real (the
> > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > > case so the check for that bit won't help like it does with SACK).
> > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > > in tcp_process_loss.
> > >
> > > We need to use a stricter condition for the non-SACK case and check
> > > that none of the cumulatively ACKed segments were retransmitted
> > > to prove that progress is due to original transmissions. Only then
> > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > > the non-SACK case.
> > >
> > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > > ---
> > >  net/ipv4/tcp_input.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 4a26c09..c60745c 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > >                                 pkts_acked = rexmit_acked + newdata_acked;
> > >
> > >                         tcp_remove_reno_sacks(sk, pkts_acked);
> > > +
> > > +                       /* If any of the cumulatively ACKed segments was
> > > +                        * retransmitted, non-SACK case cannot confirm that
> > > +                        * progress was due to original transmission due to
> > > +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> > > +                        * the packets may have been never retransmitted.
> > > +                        */
> > > +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> > > +                               flag &= ~FLAG_ORIG_SACK_ACKED;

FWIW I'd vote for this version.

> Of course I could put it back there but I really like the new place more
> (which was a result of your suggestion to place the code elsewhere).
> IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we
> weren't successful in proving (there in tcp_clean_rtx_queue) that progress
> was due to original transmission, and thus I would not want to falsely
> indicate it with that flag. And there's the non-SACK related block already
> there anyway, so it keeps the non-SACK "pollution" off from the SACK code
> paths.

I think that's a compelling argument. In particular, in terms of long-term
maintenance it seems risky to allow an unsound/buggy FLAG_ORIG_SACK_ACKED
bit to be returned by tcp_clean_rtx_queue(). If we return an
incorrect/incomplete FLAG_ORIG_SACK_ACKED bit then I worry that one day we
will forget that for non-SACK flows that bit is incorrect/incomplete, and
we will add code using that bit but forget to check (tcp_is_sack(tp) ||
!FLAG_RETRANS_DATA_ACKED).
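
That is, every future call site would need a guard along these lines
(hypothetical sketch; do_foo() is just a placeholder):

	if ((flag & FLAG_ORIG_SACK_ACKED) &&
	    (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)))
		do_foo();	/* only now safe for non-SACK flows too */

Clearing the bit once at the source in tcp_clean_rtx_queue() avoids that
trap entirely.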

> (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to
> FLAG_ORIG_PROGRESS; the latter is more descriptive of the condition
> we're after regardless of SACK and less ambiguous in the non-SACK case.)

I'm neutral on this. Not sure if the extra clarity is worth the code churn.

cheers,
neal

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-04-04 16:33       ` Neal Cardwell
@ 2018-04-04 17:12         ` Yuchung Cheng
  2018-04-04 17:22           ` Neal Cardwell
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-04-04 17:12 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, Sergei Shtylyov

On Wed, Apr 4, 2018 at 9:33 AM, Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Apr 4, 2018 at 6:35 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> wrote:
>
> > On Wed, 28 Mar 2018, Yuchung Cheng wrote:
>
> > > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > > <ilpo.jarvinen@helsinki.fi> wrote:
> > > >
> > > > If SACK is not enabled and the first cumulative ACK after the RTO
> > > > retransmission covers more than the retransmitted skb, a spurious
> > > > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > > > The reason is that any non-retransmitted segment acknowledged will
> > > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > > > no indication that it would have been delivered for real (the
> > > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > > > case so the check for that bit won't help like it does with SACK).
> > > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > > > in tcp_process_loss.
> > > >
> > > > We need to use a stricter condition for the non-SACK case and check
> > > > that none of the cumulatively ACKed segments were retransmitted
> > > > to prove that progress is due to original transmissions. Only then
> > > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > > > the non-SACK case.
> > > >
> > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > > > ---
> > > >  net/ipv4/tcp_input.c | 9 +++++++++
> > > >  1 file changed, 9 insertions(+)
> > > >
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index 4a26c09..c60745c 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > > >                                 pkts_acked = rexmit_acked + newdata_acked;
> > > >
> > > >                         tcp_remove_reno_sacks(sk, pkts_acked);
> > > > +
> > > > +                       /* If any of the cumulatively ACKed segments was
> > > > +                        * retransmitted, non-SACK case cannot confirm that
> > > > +                        * progress was due to original transmission due to
> > > > +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> > > > +                        * the packets may have been never retransmitted.
> > > > +                        */
> > > > +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> > > > +                               flag &= ~FLAG_ORIG_SACK_ACKED;
>
> FWIW I'd vote for this version.
>
> > Of course I could put it back there but I really like the new place more
> > (which was a result of your suggestion to place the code elsewhere).
> > IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we
> > weren't successful in proving (there in tcp_clean_rtx_queue) that progress
> > was due to original transmission, and thus I would not want to falsely
> > indicate it with that flag. And there's the non-SACK related block already
> > there anyway, so it keeps the non-SACK "pollution" off from the SACK code
> > paths.
>
> I think that's a compelling argument. In particular, in terms of long-term
> maintenance it seems risky to allow an unsound/buggy FLAG_ORIG_SACK_ACKED
> bit to be returned by tcp_clean_rtx_queue(). If we return an
> incorrect/incomplete FLAG_ORIG_SACK_ACKED bit then I worry that one day we
> will forget that for non-SACK flows that bit is incorrect/incomplete, and
> we will add code using that bit but forget to check (tcp_is_sack(tp) ||
> !FLAG_RETRANS_DATA_ACKED).
Agreed. That's a good point. And I would much prefer to rename that
to FLAG_ORIG_PROGRESS (w/ updated comment).

so I think we're in agreement to use existing patch w/ the new name
FLAG_ORIG_PROGRESS
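
i.e. roughly this (a sketch only; the flag's current value and comment
are quoted from memory of tcp_input.c, so double-check before applying):

-#define FLAG_ORIG_SACK_ACKED	0x200 /* Never retransmitted data are (s)acked	*/
+#define FLAG_ORIG_PROGRESS	0x200 /* Cum. ACK/SACK covered never-rexmitted data */

plus a mechanical s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ over its users.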

>
> > (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to
> > FLAG_ORIG_PROGRESS; the latter is more descriptive of the condition
> > we're after regardless of SACK and less ambiguous in the non-SACK case.)
>
> I'm neutral on this. Not sure if the extra clarity is worth the code churn.
>
> cheers,
> neal

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
  2018-04-04 10:42               ` Ilpo Järvinen
@ 2018-04-04 17:17                 ` Neal Cardwell
  0 siblings, 0 replies; 25+ messages in thread
From: Neal Cardwell @ 2018-04-04 17:17 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Yuchung Cheng, Netdev, Eric Dumazet, sergei.shtylyov

On Wed, Apr 4, 2018 at 6:42 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
wrote:

> On Wed, 28 Mar 2018, Yuchung Cheng wrote:
...
> > Your patch looks good. Some questions / comments:

The patch looks good to me as well (I read the Mar 28 version). Thanks for
this fix, Ilpo.

> Just to be sure that we understand each other the correct way around this
> time: I take it this resolved both of your "disagree" points above?

> > 1. Why only apply to CA_Loss and not also CA_Recovery? this may
> > mitigate GRO issue I noted in other thread.

> Hmm, that's indeed a good idea. I'll give it some more thought but my
> initial impression is that it should work.

Great. I would agree with Yuchung's suggestion to apply this fix to
CA_Recovery as well.
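
Something along these lines, perhaps (a hypothetical sketch only; the
exact v3 hunk isn't quoted in this subthread, and rexmit_acked /
newdata_acked are the names from the patch):

	if (!tcp_is_sack(tp) &&
	    inet_csk(sk)->icsk_ca_state >= TCP_CA_Recovery) {
		/* covers both CA_Recovery and CA_Loss; a
		 * CA_Loss-only version would use == TCP_CA_Loss
		 */
		pkts_acked = rexmit_acked + newdata_acked;
	}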

> > 2. minor style nit: can we adjust tp->delivered directly in
> > tcp_clean_rtx_queue()? I am personally against calling "non-sack" reno
> > (e.g. tcp_*reno*()) because it's really confusing w/ Reno congestion
> > control when SACK is used. One day I'd like to rename all these *reno*
> > to _nonsack_.

> That's what I actually did first but ended up putting it into its own
> function because of line lengths (not a particularly good reason).

> ...So yes, I can put it into tcp_clean_rtx_queue() in the next version.

> I'll also try to keep that "reno" thing in my mind.

Either approach here sounds fine to me. Though personally I find it more
readable to keep all the non-SACK helpers together and consistently named,
including this one (even if the name is confusing, at least it is
consistent... :-).
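
(For reference, the family in question is tcp_add_reno_sack(),
tcp_remove_reno_sacks(), tcp_check_reno_reordering() and friends; at
least a grep for "reno" finds them all as a unit today.)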

neal

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-04-04 17:12         ` Yuchung Cheng
@ 2018-04-04 17:22           ` Neal Cardwell
  2018-04-04 17:40             ` Yuchung Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Neal Cardwell @ 2018-04-04 17:22 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, sergei.shtylyov

On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ycheng@google.com> wrote:
> Agreed. That's a good point. And I would much prefer to rename that
> to FLAG_ORIG_PROGRESS (w/ updated comment).

> so I think we're in agreement to use existing patch w/ the new name
> FLAG_ORIG_PROGRESS

Yes, SGTM.

I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
branch?

neal

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-04-04 17:22           ` Neal Cardwell
@ 2018-04-04 17:40             ` Yuchung Cheng
  2018-04-04 17:44               ` Neal Cardwell
  0 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2018-04-04 17:40 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, Sergei Shtylyov

On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ycheng@google.com> wrote:
>> Agreed. That's a good point. And I would much prefer to rename that
>> to FLAG_ORIG_PROGRESS (w/ updated comment).
>
>> so I think we're in agreement to use existing patch w/ the new name
>> FLAG_ORIG_PROGRESS
>
> Yes, SGTM.
>
> I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
> the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
> branch?
huh? why not one patch ... this is getting close to patch-split paralysis.

>
> neal

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
  2018-04-04 17:40             ` Yuchung Cheng
@ 2018-04-04 17:44               ` Neal Cardwell
  0 siblings, 0 replies; 25+ messages in thread
From: Neal Cardwell @ 2018-04-04 17:44 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Ilpo Järvinen, Netdev, Eric Dumazet, sergei.shtylyov

On Wed, Apr 4, 2018 at 1:41 PM Yuchung Cheng <ycheng@google.com> wrote:

> On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell <ncardwell@google.com> wrote:
> > On Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng <ycheng@google.com> wrote:
> >> Agreed. That's a good point. And I would much prefer to rename that
> >> to FLAG_ORIG_PROGRESS (w/ updated comment).
> >
> >> so I think we're in agreement to use existing patch w/ the new name
> >> FLAG_ORIG_PROGRESS
> >
> > Yes, SGTM.
> >
> > I guess this "prevent bogus FRTO undos" patch would go to "net" branch and
> > the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next"
> > branch?
> huh? why not one patch ... this is getting close to patch-split paralysis.

The flag rename seemed like a cosmetic issue that was not needed for the
fix. Smelled like net-next to me. But I don't feel strongly. However you
guys want to package it is fine with me. :-)

neal

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-04-04 17:44 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-13 10:25 [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Ilpo Järvinen
2018-03-13 10:25 ` [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery Ilpo Järvinen
2018-03-27  2:07   ` Yuchung Cheng
2018-03-27 14:23     ` Ilpo Järvinen
2018-03-27 15:06       ` Yuchung Cheng
2018-03-28 12:45         ` Ilpo Järvinen
2018-03-28 14:14           ` Yuchung Cheng
2018-03-28 15:04             ` Yuchung Cheng
2018-04-04 10:42               ` Ilpo Järvinen
2018-04-04 17:17                 ` Neal Cardwell
2018-03-13 10:25 ` [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows Ilpo Järvinen
2018-03-28 22:26   ` Yuchung Cheng
2018-04-04 10:35     ` Ilpo Järvinen
2018-04-04 16:33       ` Neal Cardwell
2018-04-04 17:12         ` Yuchung Cheng
2018-04-04 17:22           ` Neal Cardwell
2018-04-04 17:40             ` Yuchung Cheng
2018-04-04 17:44               ` Neal Cardwell
2018-03-13 10:25 ` [PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible() Ilpo Järvinen
2018-03-13 10:25 ` [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled Ilpo Järvinen
2018-03-28 22:36   ` Yuchung Cheng
2018-04-04 10:23     ` Ilpo Järvinen
2018-03-13 10:25 ` [PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows Ilpo Järvinen
2018-03-13 13:43 ` [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP Eric Dumazet
2018-03-15 11:13   ` Ilpo Järvinen
