netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 00/28]: Accurate ECN for TCP
@ 2020-03-18  9:43 Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 01/28] tcp: add tp to prepare for AccECN code Ilpo Järvinen
                   ` (28 more replies)
  0 siblings, 29 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

Hi all,

Here's the full Accurate ECN implementation mostly based on
  https://tools.ietf.org/html/draft-ietf-tcpm-accurate-ecn-11

Comments would be highly appreciated. The GSO/TSO maze of bits
in particular is something I'm somewhat unsure if I got it
right (for a feature that has a software fallback).

There is an extensive set of packetdrill unit tests for most of
the functionality (I'll send separately to packetdrill).

Please note that this submission is not yet intented to be
included to net-next because some small changes seem still
possible to the spec.

 Documentation/networking/ip-sysctl.txt |  12 +-
 drivers/net/tun.c                      |   3 +-
 include/linux/netdev_features.h        |   3 +
 include/linux/skbuff.h                 |   2 +
 include/linux/tcp.h                    |  19 ++
 include/net/tcp.h                      | 221 ++++++++++---
 include/uapi/linux/tcp.h               |   9 +-
 net/ethtool/common.c                   |   1 +
 net/ipv4/bpf_tcp_ca.c                  |   2 +-
 net/ipv4/syncookies.c                  |  12 +
 net/ipv4/tcp.c                         |  10 +-
 net/ipv4/tcp_dctcp.c                   |   2 +-
 net/ipv4/tcp_dctcp.h                   |   2 +-
 net/ipv4/tcp_input.c                   | 558 ++++++++++++++++++++++++++++-----
 net/ipv4/tcp_ipv4.c                    |   8 +-
 net/ipv4/tcp_minisocks.c               |  84 ++++-
 net/ipv4/tcp_offload.c                 |  11 +-
 net/ipv4/tcp_output.c                  | 298 +++++++++++++++---
 net/ipv4/tcp_timer.c                   |   4 +-
 net/ipv6/syncookies.c                  |   1 +
 net/ipv6/tcp_ipv6.c                    |   4 +-
 net/netfilter/nf_log_common.c          |   4 +-

--
 i.   

ps. My apologies if you got a duplicate copy of them. It seems that
answering "no" to git send-email asking "Send this email?" might
still have sent something out.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC PATCH 01/28] tcp: add tp to prepare for AccECN code
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 02/28] tcp: fast path functions later Ilpo Järvinen
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

No functional changes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_input.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6b6b57000dad..3525ec8edd54 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -254,8 +254,10 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 
 static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
+
 	if (tcp_hdr(skb)->cwr) {
-		tcp_sk(sk)->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
 
 		/* If the sender is telling us it has entered CWR, then its
 		 * cwnd may be very low (even just 1 packet), so we should ACK
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 02/28] tcp: fast path functions later
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 01/28] tcp: add tp to prepare for AccECN code Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 03/28] tcp: move tcp_in_ack_event later Ilpo Järvinen
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

No functional changes

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/net/tcp.h | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 07f947cc80e6..b97af0ff118f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -673,29 +673,6 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
 	return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
 }
 
-static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
-{
-	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
-			       ntohl(TCP_FLAG_ACK) |
-			       snd_wnd);
-}
-
-static inline void tcp_fast_path_on(struct tcp_sock *tp)
-{
-	__tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
-}
-
-static inline void tcp_fast_path_check(struct sock *sk)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-
-	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
-	    tp->rcv_wnd &&
-	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
-	    !tp->urg_data)
-		tcp_fast_path_on(tp);
-}
-
 /* Compute the actual rto_min value */
 static inline u32 tcp_rto_min(struct sock *sk)
 {
@@ -1510,6 +1487,29 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
 	return true;
 }
 
+static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
+{
+	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+			       ntohl(TCP_FLAG_ACK) |
+			       snd_wnd);
+}
+
+static inline void tcp_fast_path_on(struct tcp_sock *tp)
+{
+	__tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
+}
+
+static inline void tcp_fast_path_check(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
+	    tp->rcv_wnd &&
+	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+	    !tp->urg_data)
+		tcp_fast_path_on(tp);
+}
+
 bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
 			  int mib_idx, u32 *last_oow_ack_time);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 03/28] tcp: move tcp_in_ack_event later
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 01/28] tcp: add tp to prepare for AccECN code Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 02/28] tcp: fast path functions later Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 04/28] tcp: create FLAG_TS_PROGRESS Ilpo Järvinen
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Accurate ECN's heuristics does not know if there is going
to be ACE field based CE counter increase or not until after
rtx queue has been processed. Only then the number of ACKed
bytes/pkts is available.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_input.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3525ec8edd54..860938e0f1b6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3539,12 +3539,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
 	}
 }
 
-static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
+static inline void tcp_in_ack_event(struct sock *sk, int flag)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 
-	if (icsk->icsk_ca_ops->in_ack_event)
-		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+	if (icsk->icsk_ca_ops->in_ack_event) {
+		u32 ack_ev_flags = 0;
+
+		if (flag & FLAG_WIN_UPDATE)
+			ack_ev_flags |= CA_ACK_WIN_UPDATE;
+		if (flag & FLAG_SLOWPATH) {
+			ack_ev_flags = CA_ACK_SLOWPATH;
+			if (flag & FLAG_ECE)
+				ack_ev_flags |= CA_ACK_ECE;
+		}
+
+		icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
+	}
 }
 
 /* Congestion control has updated the cwnd already. So if we're in
@@ -3657,12 +3668,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_snd_una_update(tp, ack);
 		flag |= FLAG_WIN_UPDATE;
 
-		tcp_in_ack_event(sk, CA_ACK_WIN_UPDATE);
-
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPACKS);
 	} else {
-		u32 ack_ev_flags = CA_ACK_SLOWPATH;
-
 		if (ack_seq != TCP_SKB_CB(skb)->end_seq)
 			flag |= FLAG_DATA;
 		else
@@ -3674,15 +3681,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 							&sack_state);
 
-		if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
+		if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb)))
 			flag |= FLAG_ECE;
-			ack_ev_flags |= CA_ACK_ECE;
-		}
-
-		if (flag & FLAG_WIN_UPDATE)
-			ack_ev_flags |= CA_ACK_WIN_UPDATE;
-
-		tcp_in_ack_event(sk, ack_ev_flags);
 	}
 
 	/* We passed data and got it acked, remove any soft error
@@ -3699,6 +3699,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	tcp_rack_update_reo_wnd(sk, &rs);
 
+	tcp_in_ack_event(sk, flag);
+
 	if (tp->tlp_high_seq)
 		tcp_process_tlp_ack(sk, ack, flag);
 	/* If needed, reset TLP/RTO timer; RACK may later override this. */
@@ -3728,6 +3730,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	return 1;
 
 no_queue:
+	tcp_in_ack_event(sk, flag);
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK) {
 		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 04/28] tcp: create FLAG_TS_PROGRESS
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (2 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 03/28] tcp: move tcp_in_ack_event later Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 05/28] tcp: extend TCP flags to allow AE bit/ACE field Ilpo Järvinen
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Whenever timestamp advances, it declares progress which
can be used the other parts of the stack to decide that
the ACK is the most recent one seen so far.

AccECN will use this flag when deciding whether to update
CEP from the ACE field or not.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_input.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 860938e0f1b6..7c444541cefd 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -100,6 +100,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_UPDATE_TS_RECENT	0x4000 /* tcp_replace_ts_recent() */
 #define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_send_challenge_ack()	*/
 #define FLAG_ACK_MAYBE_DELAYED	0x10000 /* Likely a delayed ACK */
+#define FLAG_TS_PROGRESS	0x20000 /* Positive timestamp delta */
 
 #define FLAG_ACKED		(FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP		(FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -3492,8 +3493,16 @@ static void tcp_store_ts_recent(struct tcp_sock *tp)
 	tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
 }
 
-static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+static int __tcp_replace_ts_recent(struct tcp_sock *tp, s32 tstamp_delta)
 {
+	tcp_store_ts_recent(tp);
+	return tstamp_delta > 0 ? FLAG_TS_PROGRESS : 0;
+}
+
+static int tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+{
+	s32 delta;
+
 	if (tp->rx_opt.saw_tstamp && !after(seq, tp->rcv_wup)) {
 		/* PAWS bug workaround wrt. ACK frames, the PAWS discard
 		 * extra check below makes sure this can only happen
@@ -3502,9 +3511,13 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
 		 * Not only, also it occurs for expired timestamps.
 		 */
 
-		if (tcp_paws_check(&tp->rx_opt, 0))
-			tcp_store_ts_recent(tp);
+		if (tcp_paws_check(&tp->rx_opt, 0)) {
+			delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
+			return __tcp_replace_ts_recent(tp, delta);
+		}
 	}
+
+	return 0;
 }
 
 /* This routine deals with acks during a TLP episode.
@@ -3656,7 +3669,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	 * is in window.
 	 */
 	if (flag & FLAG_UPDATE_TS_RECENT)
-		tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
+		flag |= tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
 
 	if ((flag & (FLAG_SLOWPATH | FLAG_SND_UNA_ADVANCED)) ==
 	    FLAG_SND_UNA_ADVANCED) {
@@ -5608,6 +5621,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
 	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
 		int tcp_header_len = tp->tcp_header_len;
+		int flag = 0;
+		s32 tstamp_delta = 0;
 
 		/* Timestamp header prediction: tcp_header_len
 		 * is automatically equal to th->doff*4 due to pred_flags
@@ -5620,8 +5635,9 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			if (!tcp_parse_aligned_timestamp(tp, th))
 				goto slow_path;
 
+			tstamp_delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
 			/* If PAWS failed, check it more carefully in slow path */
-			if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
+			if (tstamp_delta < 0)
 				goto slow_path;
 
 			/* DO NOT update ts_recent here, if checksum fails
@@ -5641,12 +5657,12 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 				if (tcp_header_len ==
 				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
 				    tp->rcv_nxt == tp->rcv_wup)
-					tcp_store_ts_recent(tp);
+					flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
 
 				/* We know that such packets are checksummed
 				 * on entry.
 				 */
-				tcp_ack(sk, skb, 0);
+				tcp_ack(sk, skb, flag);
 				__kfree_skb(skb);
 				tcp_data_snd_check(sk);
 				/* When receiving pure ack in fast path, update
@@ -5676,7 +5692,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			if (tcp_header_len ==
 			    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
 			    tp->rcv_nxt == tp->rcv_wup)
-				tcp_store_ts_recent(tp);
+				flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
 
 			tcp_rcv_rtt_measure_ts(sk, skb);
 
@@ -5690,7 +5706,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 
 			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
 				/* Well, only one small jumplet in fast path... */
-				tcp_ack(sk, skb, FLAG_DATA);
+				tcp_ack(sk, skb, flag | FLAG_DATA);
 				tcp_data_snd_check(sk);
 				if (!inet_csk_ack_scheduled(sk))
 					goto no_ack;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 05/28] tcp: extend TCP flags to allow AE bit/ACE field
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (3 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 04/28] tcp: create FLAG_TS_PROGRESS Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 06/28] tcp: reorganize SYN ECN code Ilpo Järvinen
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

With AccECN, there's one additional TCP flag to be used (AE)
and ACE field that overloads the definition of AE, CWR, and
ECE flags. As tcp_flags was previously only 1 byte, the
byte-order stuff needs to be added to it's handling.

Reorganize the bitfields to not enlarge the structure.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/net/tcp.h             | 13 ++++++++-----
 include/uapi/linux/tcp.h      |  9 ++++++---
 net/ipv4/tcp_input.c          |  3 ++-
 net/ipv4/tcp_ipv4.c           |  3 ++-
 net/ipv4/tcp_output.c         |  4 ++--
 net/ipv6/tcp_ipv6.c           |  3 ++-
 net/netfilter/nf_log_common.c |  4 +++-
 7 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b97af0ff118f..6b6a1b8b3c6e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -791,7 +791,10 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
 #define TCPHDR_URG 0x20
 #define TCPHDR_ECE 0x40
 #define TCPHDR_CWR 0x80
+#define TCPHDR_AE 0x100
+#define TCPHDR_FLAGS_MASK 0x1ff
 
+#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
 #define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
 
 /* This is what the send packet queuing engine uses to pass
@@ -816,7 +819,11 @@ struct tcp_skb_cb {
 			u16	tcp_gso_size;
 		};
 	};
-	__u8		tcp_flags;	/* TCP header flags. (tcp[13])	*/
+	__u16		tcp_flags:9,	/* TCP header flags. (tcp[12-13])	*/
+			unused:4,
+			txstamp_ack:1,	/* Record TX timestamp for ack? */
+			eor:1,		/* Is skb MSG_EOR marked? */
+			has_rxtstamp:1;	/* SKB has a RX timestamp	*/
 
 	__u8		sacked;		/* State flags for SACK.	*/
 #define TCPCB_SACKED_ACKED	0x01	/* SKB ACK'd by a SACK block	*/
@@ -829,10 +836,6 @@ struct tcp_skb_cb {
 				TCPCB_REPAIRED)
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
-	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
-			eor:1,		/* Is skb MSG_EOR marked? */
-			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
-			unused:5;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct {
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 1a7fc856e237..7a4bb72d7074 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -28,7 +28,8 @@ struct tcphdr {
 	__be32	seq;
 	__be32	ack_seq;
 #if defined(__LITTLE_ENDIAN_BITFIELD)
-	__u16	res1:4,
+	__u16	ae:1,
+		res1:3,
 		doff:4,
 		fin:1,
 		syn:1,
@@ -40,7 +41,8 @@ struct tcphdr {
 		cwr:1;
 #elif defined(__BIG_ENDIAN_BITFIELD)
 	__u16	doff:4,
-		res1:4,
+		res1:3,
+		ae:1,
 		cwr:1,
 		ece:1,
 		urg:1,
@@ -70,6 +72,7 @@ union tcp_word_hdr {
 #define tcp_flag_word(tp) ( ((union tcp_word_hdr *)(tp))->words [3]) 
 
 enum { 
+	TCP_FLAG_AE = __constant_cpu_to_be32(0x01000000),
 	TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000),
 	TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000),
 	TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000),
@@ -78,7 +81,7 @@ enum {
 	TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000),
 	TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000),
 	TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000),
-	TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000),
+	TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0E000000),
 	TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000)
 }; 
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7c444541cefd..e49a6f7ad5ce 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6473,7 +6473,8 @@ static void tcp_ecn_create_request(struct request_sock *req,
 	ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK);
 	ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
 
-	if (((!ect || th->res1) && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
+	if (((!ect || th->res1 || th->ae) && ecn_ok) ||
+	    tcp_ca_needs_ecn(listen_sk) ||
 	    (ecn_ok_dst & DST_FEATURE_ECN_CA) ||
 	    tcp_bpf_ca_needs_ecn((struct sock *)req))
 		inet_rsk(req)->ecn_ok = 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 52acf0bc2ee5..f18a2fbf2761 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1850,7 +1850,8 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
 				    skb->len - th->doff * 4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+	TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
+				     TCPHDR_FLAGS_MASK;
 	TCP_SKB_CB(skb)->tcp_tw_isn = 0;
 	TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
 	TCP_SKB_CB(skb)->sacked	 = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 306e25d743e8..dc225e616f98 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -389,7 +389,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 /* Constructs common control bits of non-data skb. If SYN/FIN is present,
  * auto increment end seqno.
  */
-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u16 flags)
 {
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
@@ -1167,7 +1167,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	th->seq			= htonl(tcb->seq);
 	th->ack_seq		= htonl(rcv_nxt);
 	*(((__be16 *)th) + 6)	= htons(((tcp_header_size >> 2) << 12) |
-					tcb->tcp_flags);
+					(tcb->tcp_flags & TCPHDR_FLAGS_MASK));
 
 	th->check		= 0;
 	th->urg_ptr		= 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index eaf09e6b7844..73032cd261ea 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1528,7 +1528,8 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
 				    skb->len - th->doff*4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+	TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
+				     TCPHDR_FLAGS_MASK;
 	TCP_SKB_CB(skb)->tcp_tw_isn = 0;
 	TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
 	TCP_SKB_CB(skb)->sacked = 0;
diff --git a/net/netfilter/nf_log_common.c b/net/netfilter/nf_log_common.c
index ae5628ddbe6d..1457ead432eb 100644
--- a/net/netfilter/nf_log_common.c
+++ b/net/netfilter/nf_log_common.c
@@ -84,7 +84,9 @@ int nf_log_dump_tcp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 	/* Max length: 9 "RES=0x3C " */
 	nf_log_buf_add(m, "RES=0x%02x ", (u_int8_t)(ntohl(tcp_flag_word(th) &
 					    TCP_RESERVED_BITS) >> 22));
-	/* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */
+	/* Max length: 35 "AE CWR ECE URG ACK PSH RST SYN FIN " */
+	if (th->ae)
+		nf_log_buf_add(m, "AE ");
 	if (th->cwr)
 		nf_log_buf_add(m, "CWR ");
 	if (th->ece)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 06/28] tcp: reorganize SYN ECN code
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (4 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 05/28] tcp: extend TCP flags to allow AE bit/ACE field Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 07/28] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() Ilpo Järvinen
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

No functional changes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc225e616f98..116be30c1b2c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -336,10 +336,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 	tp->ecn_flags = 0;
 
 	if (use_ecn) {
-		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tp->ecn_flags = TCP_ECN_OK;
 		if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
 			INET_ECN_xmit(sk);
+
+		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
+		tp->ecn_flags = TCP_ECN_OK;
 	}
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 07/28] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (5 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 06/28] tcp: reorganize SYN ECN code Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 08/28] tcp: helpers for ECN mode handling Ilpo Järvinen
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is
called only for data segments, not for ACKs (with AccECN,
also ACKs may get ECN bits).

The extra "layer" in tcp_ecn_check_ce() function just
checks for ECN being enabled, that can be moved into
tcp_ecn_field_check rather than having the __ variant.

No functional changes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_input.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e49a6f7ad5ce..2f0a9a2ee5c1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -273,10 +273,13 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
 	tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
 }
 
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+		return;
+
 	switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
 	case INET_ECN_NOT_ECT:
 		/* Funny extension: if ECT is not set on a segment,
@@ -305,12 +308,6 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
-{
-	if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
-		__tcp_ecn_check_ce(sk, skb);
-}
-
 static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
 {
 	if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
@@ -717,7 +714,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 	}
 	icsk->icsk_ack.lrcvtime = now;
 
-	tcp_ecn_check_ce(sk, skb);
+	tcp_data_ecn_check(sk, skb);
 
 	if (skb->len >= 128)
 		tcp_grow_window(sk, skb);
@@ -4578,7 +4575,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	u32 seq, end_seq;
 	bool fragstolen;
 
-	tcp_ecn_check_ce(sk, skb);
+	tcp_data_ecn_check(sk, skb);
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 08/28] tcp: helpers for ECN mode handling
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (6 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 07/28] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 09/28] gso: AccECN support Ilpo Järvinen
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Create helpers for TCP ECN modes. No functional changes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/net/tcp.h        | 44 ++++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_dctcp.c     |  2 +-
 net/ipv4/tcp_input.c     | 14 ++++++-------
 net/ipv4/tcp_minisocks.c |  4 +++-
 net/ipv4/tcp_output.c    |  6 +++---
 6 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6b6a1b8b3c6e..f4ac4c029215 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -363,10 +363,46 @@ static inline void tcp_dec_quickack_mode(struct sock *sk,
 	}
 }
 
-#define	TCP_ECN_OK		1
-#define	TCP_ECN_QUEUE_CWR	2
-#define	TCP_ECN_DEMAND_CWR	4
-#define	TCP_ECN_SEEN		8
+#define	TCP_ECN_MODE_RFC3168	0x1
+#define	TCP_ECN_QUEUE_CWR	0x2
+#define	TCP_ECN_DEMAND_CWR	0x4
+#define	TCP_ECN_SEEN		0x8
+#define TCP_ECN_MODE_ACCECN	0x10
+
+#define TCP_ECN_DISABLED	0
+#define TCP_ECN_MODE_PENDING	(TCP_ECN_MODE_RFC3168|TCP_ECN_MODE_ACCECN)
+#define TCP_ECN_MODE_ANY	(TCP_ECN_MODE_RFC3168|TCP_ECN_MODE_ACCECN)
+
+static inline bool tcp_ecn_mode_any(const struct tcp_sock *tp)
+{
+	return tp->ecn_flags & TCP_ECN_MODE_ANY;
+}
+
+static inline bool tcp_ecn_mode_rfc3168(const struct tcp_sock *tp)
+{
+	return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_RFC3168;
+}
+
+static inline bool tcp_ecn_mode_accecn(const struct tcp_sock *tp)
+{
+	return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_ACCECN;
+}
+
+static inline bool tcp_ecn_disabled(const struct tcp_sock *tp)
+{
+	return !tcp_ecn_mode_any(tp);
+}
+
+static inline bool tcp_ecn_mode_pending(const struct tcp_sock *tp)
+{
+	return (tp->ecn_flags & TCP_ECN_MODE_PENDING) == TCP_ECN_MODE_PENDING;
+}
+
+static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
+{
+	tp->ecn_flags &= ~TCP_ECN_MODE_ANY;
+	tp->ecn_flags |= mode;
+}
 
 enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 48aa457a9516..fbf365dd51e4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3254,7 +3254,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 		info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale;
 	}
 
-	if (tp->ecn_flags & TCP_ECN_OK)
+	if (tcp_ecn_mode_any(tp))
 		info->tcpi_options |= TCPI_OPT_ECN;
 	if (tp->ecn_flags & TCP_ECN_SEEN)
 		info->tcpi_options |= TCPI_OPT_ECN_SEEN;
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 79f705450c16..8cf81e942675 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -76,7 +76,7 @@ static void dctcp_init(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 
-	if ((tp->ecn_flags & TCP_ECN_OK) ||
+	if (tcp_ecn_mode_any(tp) ||
 	    (sk->sk_state == TCP_LISTEN ||
 	     sk->sk_state == TCP_CLOSE)) {
 		struct dctcp *ca = inet_csk_ca(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2f0a9a2ee5c1..59078fa2240d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -249,7 +249,7 @@ static bool tcp_in_quickack_mode(struct sock *sk)
 
 static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 {
-	if (tp->ecn_flags & TCP_ECN_OK)
+	if (tcp_ecn_mode_rfc3168(tp))
 		tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
 }
 
@@ -277,7 +277,7 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+	if (tcp_ecn_disabled(tp))
 		return;
 
 	switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
@@ -310,19 +310,19 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 
 static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
 {
-	if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
-		tp->ecn_flags &= ~TCP_ECN_OK;
+	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
 
 static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
 {
-	if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || !th->cwr))
-		tp->ecn_flags &= ~TCP_ECN_OK;
+	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
+		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
 
 static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
 {
-	if (th->ece && !th->syn && (tp->ecn_flags & TCP_ECN_OK))
+	if (th->ece && !th->syn && tcp_ecn_mode_rfc3168(tp))
 		return true;
 	return false;
 }
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c8274371c3d0..3b5a137e416c 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -400,7 +400,9 @@ EXPORT_SYMBOL(tcp_openreq_init_rwin);
 static void tcp_ecn_openreq_child(struct tcp_sock *tp,
 				  const struct request_sock *req)
 {
-	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
+	tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+			     TCP_ECN_MODE_RFC3168 :
+			     TCP_ECN_DISABLED);
 }
 
 void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 116be30c1b2c..71a96983987d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -311,7 +311,7 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 	const struct tcp_sock *tp = tcp_sk(sk);
 
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
-	if (!(tp->ecn_flags & TCP_ECN_OK))
+	if (tcp_ecn_disabled(tp))
 		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
 	else if (tcp_ca_needs_ecn(sk) ||
 		 tcp_bpf_ca_needs_ecn(sk))
@@ -340,7 +340,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 			INET_ECN_xmit(sk);
 
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tp->ecn_flags = TCP_ECN_OK;
+		tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
 	}
 }
 
@@ -368,7 +368,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->ecn_flags & TCP_ECN_OK) {
+	if (tcp_ecn_mode_rfc3168(tp)) {
 		/* Not-retransmitted data segment: set ECT and inject CWR. */
 		if (skb->len != tcp_header_len &&
 		    !before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 09/28] gso: AccECN support
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (7 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 08/28] tcp: helpers for ECN mode handling Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-19  3:44   ` Eric Dumazet
  2020-03-18  9:43 ` [RFC PATCH 10/28] gro: prevent ACE field corruption & better AccECN handling Ilpo Järvinen
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Handling the CWR flag differs between RFC 3168 ECN and AccECN.
Take it into account in GSO by not clearing the CWR bit.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 drivers/net/tun.c               | 3 ++-
 include/linux/netdev_features.h | 3 +++
 include/linux/skbuff.h          | 2 ++
 net/ethtool/common.c            | 1 +
 net/ipv4/tcp_offload.c          | 6 +++++-
 5 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 228fe449dc6d..d376a7cb0d63 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2788,7 +2788,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
 				   TUN_USER_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
-				   NETIF_F_HW_VLAN_STAG_TX;
+				   NETIF_F_HW_VLAN_STAG_TX |
+				   NETIF_F_GSO_ACCECN;
 		dev->features = dev->hw_features | NETIF_F_LLTX;
 		dev->vlan_features = dev->features &
 				     ~(NETIF_F_HW_VLAN_CTAG_TX |
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 34d050bb1ae6..c7065b468d21 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -83,6 +83,7 @@ enum {
 	NETIF_F_HW_TLS_RECORD_BIT,	/* Offload TLS record */
 	NETIF_F_GRO_FRAGLIST_BIT,	/* Fraglist GRO */
 
+	NETIF_F_GSO_ACCECN_BIT,		/* ... TCP AccECN support */
 	/*
 	 * Add your fresh new feature above and remember to update
 	 * netdev_features_strings[] in net/core/ethtool.c and maybe
@@ -124,6 +125,7 @@ enum {
 #define NETIF_F_SG		__NETIF_F(SG)
 #define NETIF_F_TSO6		__NETIF_F(TSO6)
 #define NETIF_F_TSO_ECN		__NETIF_F(TSO_ECN)
+#define NETIF_F_GSO_ACCECN	__NETIF_F(GSO_ACCECN)
 #define NETIF_F_TSO		__NETIF_F(TSO)
 #define NETIF_F_VLAN_CHALLENGED	__NETIF_F(VLAN_CHALLENGED)
 #define NETIF_F_RXFCS		__NETIF_F(RXFCS)
@@ -205,6 +207,7 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
 
 /* List of features with software fallbacks. */
 #define NETIF_F_GSO_SOFTWARE	(NETIF_F_ALL_TSO | \
+				 NETIF_F_GSO_ACCECN | \
 				 NETIF_F_GSO_SCTP)
 
 /*
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 21749b2cdc9b..fdd73dc126a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -594,6 +594,8 @@ enum {
 	SKB_GSO_UDP_L4 = 1 << 17,
 
 	SKB_GSO_FRAGLIST = 1 << 18,
+
+	SKB_GSO_TCP_ACCECN = 1 << 19,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 7b6969af5ae7..26241b5d62a4 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -27,6 +27,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_TSO_BIT] =              "tx-tcp-segmentation",
 	[NETIF_F_GSO_ROBUST_BIT] =       "tx-gso-robust",
 	[NETIF_F_TSO_ECN_BIT] =          "tx-tcp-ecn-segmentation",
+	[NETIF_F_GSO_ACCECN_BIT] =       "tx-tcp-accecn-segmentation",
 	[NETIF_F_TSO_MANGLEID_BIT] =	 "tx-tcp-mangleid-segmentation",
 	[NETIF_F_TSO6_BIT] =             "tx-tcp6-segmentation",
 	[NETIF_F_FSO_BIT] =              "tx-fcoe-segmentation",
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index e09147ac9a99..7a81cf438010 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -65,6 +65,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	struct sk_buff *gso_skb = skb;
 	__sum16 newcheck;
 	bool ooo_okay, copy_destructor;
+	bool ecn_cwr_mask;
 
 	th = tcp_hdr(skb);
 	thlen = th->doff * 4;
@@ -121,6 +122,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
 					       (__force u32)delta));
 
+	ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
+
 	while (skb->next) {
 		th->fin = th->psh = 0;
 		th->check = newcheck;
@@ -140,7 +143,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 		th = tcp_hdr(skb);
 
 		th->seq = htonl(seq);
-		th->cwr = 0;
+
+		th->cwr &= ecn_cwr_mask;
 	}
 
 	/* Following permits TCP Small Queues to work well with GSO :
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 10/28] gro: prevent ACE field corruption & better AccECN handling
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (8 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 09/28] gso: AccECN support Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 11/28] tcp: AccECN support to tcp_add_backlog Ilpo Järvinen
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

There are important differences in how the CWR field behaves
in RFC3168 and AccECN.

Thus, it is better to never let anything to receive
a mixed-CWR skb. Set the Accurate ECN GSO flag to
avoid corrupting CWR bits somewhere.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_offload.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 7a81cf438010..58ce382c793e 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -242,7 +242,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	flush = NAPI_GRO_CB(p)->flush;
 	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
-		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
 	for (i = sizeof(*th); i < thlen; i += 4)
 		flush |= *(u32 *)((u8 *)th + i) ^
@@ -300,7 +300,7 @@ int tcp_gro_complete(struct sk_buff *skb)
 	skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
 
 	if (th->cwr)
-		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
 
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 11/28] tcp: AccECN support to tcp_add_backlog
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (9 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 10/28] gro: prevent ACE field corruption & better AccECN handling Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 12/28] tcp: AccECN core Ilpo Järvinen
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

AE flag needs to preserved for AccECN.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_ipv4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f18a2fbf2761..dab0c1b85e95 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1757,7 +1757,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 	    !((TCP_SKB_CB(tail)->tcp_flags &
 	      TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_ACK) ||
 	    ((TCP_SKB_CB(tail)->tcp_flags ^
-	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
+	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)) ||
 #ifdef CONFIG_TLS_DEVICE
 	    tail->decrypted != skb->decrypted ||
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 12/28] tcp: AccECN core
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (10 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 11/28] tcp: AccECN support to tcp_add_backlog Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 13/28] tcp: Pass flags to tcp_send_ack Ilpo Järvinen
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Accurate ECN (without negotiation and AccECN Option).

Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h   |   2 +
 include/net/tcp.h     |  29 ++++++++++++
 net/ipv4/tcp.c        |   1 +
 net/ipv4/tcp_input.c  | 108 +++++++++++++++++++++++++++++++++++-------
 net/ipv4/tcp_output.c |  22 ++++++++-
 5 files changed, 144 insertions(+), 18 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3dc964010fef..9bdf67dd0d1d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -315,6 +315,8 @@ struct tcp_sock {
 	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 	u32	delivered;	/* Total data packets delivered incl. rexmits */
 	u32	delivered_ce;	/* Like the above but only ECE marked packets */
+	u32	received_ce;	/* Like the above but for received CE marked packets */
+	u32	received_ce_tx; /* Like the above but max transmitted value */
 	u32	lost;		/* Total data packets lost incl. rexmits */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
 	u64	first_tx_mstamp;  /* start of window send phase */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f4ac4c029215..ee938066fd8c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -404,6 +404,16 @@ static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
 	tp->ecn_flags |= mode;
 }
 
+static inline u8 tcp_accecn_ace(const struct tcphdr *th)
+{
+	return (th->ae << 2) | (th->cwr << 1) | th->ece;
+}
+
+static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp)
+{
+	return tp->received_ce - tp->received_ce_tx;
+}
+
 enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
 	TCP_TW_RST = 1,
@@ -833,6 +843,20 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
 #define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
 #define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
 
+#define TCP_ACCECN_CEP_ACE_MASK 0x7
+#define TCP_ACCECN_ACE_MAX_DELTA 6
+
+/* To avoid/detect middlebox interference, not all counters start at 0.
+ * See draft-ietf-tcpm-accurate-ecn for the latest values.
+ */
+#define TCP_ACCECN_CEP_INIT_OFFSET 5
+
+static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
+{
+	tp->received_ce = 0;
+	tp->received_ce_tx = 0;
+}
+
 /* This is what the send packet queuing engine uses to pass
  * TCP per-packet control information to the transmission code.
  * We also store the host-order sequence numbers in here too.
@@ -1528,7 +1552,12 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
 
 static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
 {
+	u32 ace = tcp_ecn_mode_accecn(tp) ?
+		  ((tp->delivered_ce + TCP_ACCECN_CEP_INIT_OFFSET) &
+		   TCP_ACCECN_CEP_ACE_MASK) : 0;
+
 	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+			       (ace << 22) |
 			       ntohl(TCP_FLAG_ACK) |
 			       snd_wnd);
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fbf365dd51e4..2ee1e4794c7d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2624,6 +2624,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->window_clamp = 0;
 	tp->delivered = 0;
 	tp->delivered_ce = 0;
+	tcp_accecn_init_counters(tp);
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
 	tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 59078fa2240d..65bbfadbee67 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -249,6 +249,7 @@ static bool tcp_in_quickack_mode(struct sock *sk)
 
 static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 {
+	/* Do not set CWR if in AccECN mode! */
 	if (tcp_ecn_mode_rfc3168(tp))
 		tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
 }
@@ -257,7 +258,7 @@ static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_hdr(skb)->cwr) {
+	if (tcp_ecn_mode_rfc3168(tp) && tcp_hdr(skb)->cwr) {
 		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
 
 		/* If the sender is telling us it has entered CWR, then its
@@ -293,17 +294,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 		if (tcp_ca_needs_ecn(sk))
 			tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
 
-		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
+		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) &&
+		    tcp_ecn_mode_rfc3168(tp)) {
 			/* Better not delay acks, sender can have a very low cwnd */
 			tcp_enter_quickack_mode(sk, 2);
 			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 		}
-		tp->ecn_flags |= TCP_ECN_SEEN;
 		break;
 	default:
 		if (tcp_ca_needs_ecn(sk))
 			tcp_ca_event(sk, CA_EVENT_ECN_NO_CE);
-		tp->ecn_flags |= TCP_ECN_SEEN;
 		break;
 	}
 }
@@ -320,11 +320,43 @@ static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
 		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
 
-static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
+static u32 tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
 {
 	if (th->ece && !th->syn && tcp_ecn_mode_rfc3168(tp))
-		return true;
-	return false;
+		return 1;
+	return 0;
+}
+
+/* Returns the ECN CE delta */
+static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
+			      u32 delivered_pkts, int flag)
+{
+	u32 delta, safe_delta;
+	u32 corrected_ace;
+
+	/* Reordered ACK? (...or uncertain due to lack of data to send and ts) */
+	if (!(flag & (FLAG_FORWARD_PROGRESS|FLAG_TS_PROGRESS)))
+		return 0;
+
+	if (!(flag & FLAG_SLOWPATH)) {
+		/* AccECN counter might overflow on large ACKs */
+		if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
+			return 0;
+	}
+
+	/* ACE field is not available during handshake */
+	if (flag & FLAG_SYN_ACKED)
+		return 0;
+
+	corrected_ace = tcp_accecn_ace(tcp_hdr(skb)) - TCP_ACCECN_CEP_INIT_OFFSET;
+	delta = (corrected_ace - tp->delivered_ce) & TCP_ACCECN_CEP_ACE_MASK;
+	if (delivered_pkts < TCP_ACCECN_CEP_ACE_MASK)
+		return delta;
+
+	safe_delta = delivered_pkts -
+		     ((delivered_pkts - delta) & TCP_ACCECN_CEP_ACE_MASK);
+
+	return safe_delta;
 }
 
 /* Buffer size and advertised window tuning.
@@ -3590,7 +3622,8 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit)
 }
 
 /* Returns the number of packets newly acked or sacked by the current ACK */
-static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
+static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered,
+			       u32 ecn_count)
 {
 	const struct net *net = sock_net(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3598,10 +3631,18 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
 
 	delivered = tp->delivered - prior_delivered;
 	NET_ADD_STATS(net, LINUX_MIB_TCPDELIVERED, delivered);
-	if (flag & FLAG_ECE) {
-		tp->delivered_ce += delivered;
-		NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, delivered);
+
+	if (ecn_count) {
+		if (tcp_ecn_mode_rfc3168(tp))
+			ecn_count = delivered;
+
+		tp->delivered_ce += ecn_count;
+		if (tcp_ecn_mode_accecn(tp) &&
+		    tcp_accecn_ace_deficit(tp) >= TCP_ACCECN_ACE_MAX_DELTA)
+			inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+		NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, ecn_count);
 	}
+
 	return delivered;
 }
 
@@ -3621,6 +3662,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 delivered = tp->delivered;
 	u32 lost = tp->lost;
 	int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
+	u32 ecn_count = 0; /* Did we receive ECE/an AccECN ACE update? */
 	u32 prior_fack;
 
 	sack_state.first_sackt = 0;
@@ -3690,8 +3732,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		if (TCP_SKB_CB(skb)->sacked)
 			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 							&sack_state);
-
-		if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb)))
+		ecn_count = tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb));
+		if (ecn_count > 0)
 			flag |= FLAG_ECE;
 	}
 
@@ -3709,6 +3751,13 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	tcp_rack_update_reo_wnd(sk, &rs);
 
+	if (tcp_ecn_mode_accecn(tp)) {
+		ecn_count = tcp_accecn_process(tp, skb,
+					       tp->delivered - delivered, flag);
+		if (ecn_count > 0)
+			flag |= FLAG_ECE;
+	}
+
 	tcp_in_ack_event(sk, flag);
 
 	if (tp->tlp_high_seq)
@@ -3731,7 +3780,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
 		sk_dst_confirm(sk);
 
-	delivered = tcp_newly_delivered(sk, delivered, flag);
+	delivered = tcp_newly_delivered(sk, delivered, ecn_count);
+
 	lost = tp->lost - lost;			/* freshly marked lost */
 	rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
 	tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
@@ -3740,12 +3790,18 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	return 1;
 
 no_queue:
+	if (tcp_ecn_mode_accecn(tp)) {
+		ecn_count = tcp_accecn_process(tp, skb,
+					       tp->delivered - delivered, flag);
+		if (ecn_count > 0)
+			flag |= FLAG_ECE;
+	}
 	tcp_in_ack_event(sk, flag);
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK) {
 		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
-		tcp_newly_delivered(sk, delivered, flag);
+		tcp_newly_delivered(sk, delivered, ecn_count);
 	}
 	/* If this ack opens up a zero window, clear backoff.  It was
 	 * being used to time the probes, and is probably far higher than
@@ -3766,7 +3822,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 						&sack_state);
 		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
 				      &rexmit);
-		tcp_newly_delivered(sk, delivered, flag);
+		tcp_newly_delivered(sk, delivered, ecn_count);
 		tcp_xmit_recovery(sk, rexmit);
 	}
 
@@ -5425,6 +5481,21 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
 	}
 }
 
+/* Updates Accurate ECN received counters from the received IP ECN field */
+static void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+	u8 is_ce = INET_ECN_is_ce(ecnfield);
+
+	if (!INET_ECN_is_not_ect(ecnfield)) {
+		tp->ecn_flags |= TCP_ECN_SEEN;
+
+		/* ACE counter tracks *all* segments including pure acks */
+		tp->received_ce += is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+	}
+}
+
 /* Accept RST for rcv_nxt - 1 after a FIN.
  * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
  * FIN is sent followed by a RST packet. The RST is sent with the same
@@ -5656,6 +5727,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 				    tp->rcv_nxt == tp->rcv_wup)
 					flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
 
+				tcp_ecn_received_counters(sk, skb);
+
 				/* We know that such packets are checksummed
 				 * on entry.
 				 */
@@ -5697,6 +5770,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 
 			/* Bulk data transfer: receiver */
 			__skb_pull(skb, tcp_header_len);
+			tcp_ecn_received_counters(sk, skb);
 			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
 			tcp_event_data_recv(sk, skb);
@@ -5733,6 +5807,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 		return;
 
 step5:
+	tcp_ecn_received_counters(sk, skb);
+
 	if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0)
 		goto discard;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 71a96983987d..a1414d1a8ef1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -360,6 +360,18 @@ tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
 		th->ece = 1;
 }
 
+static void tcp_accecn_set_ace(struct tcphdr *th, struct tcp_sock *tp)
+{
+	u32 wire_ace;
+
+	tp->received_ce_tx += min_t(u32, tcp_accecn_ace_deficit(tp),
+				    TCP_ACCECN_ACE_MAX_DELTA);
+	wire_ace = tp->received_ce_tx + TCP_ACCECN_CEP_INIT_OFFSET;
+	th->ece = !!(wire_ace & 0x1);
+	th->cwr = !!(wire_ace & 0x2);
+	th->ae = !!(wire_ace & 0x4);
+}
+
 /* Set up ECN state for a packet on a ESTABLISHED socket that is about to
  * be sent.
  */
@@ -368,11 +380,17 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_ecn_mode_rfc3168(tp)) {
+	if (!tcp_ecn_mode_any(tp))
+		return;
+
+	INET_ECN_xmit(sk);
+	if (tcp_ecn_mode_accecn(tp)) {
+		tcp_accecn_set_ace(th, tp);
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
+	} else {
 		/* Not-retransmitted data segment: set ECT and inject CWR. */
 		if (skb->len != tcp_header_len &&
 		    !before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
-			INET_ECN_xmit(sk);
 			if (tp->ecn_flags & TCP_ECN_QUEUE_CWR) {
 				tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
 				th->cwr = 1;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 13/28] tcp: Pass flags to tcp_send_ack
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (11 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 12/28] tcp: AccECN core Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 14/28] tcp: AccECN negotiation Ilpo Järvinen
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Accurate ECN reflector needs to send custom flags to handle
IP-ECN field reflector.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/net/tcp.h     |  4 ++--
 net/ipv4/bpf_tcp_ca.c |  2 +-
 net/ipv4/tcp.c        |  2 +-
 net/ipv4/tcp_dctcp.h  |  2 +-
 net/ipv4/tcp_input.c  | 14 +++++++-------
 net/ipv4/tcp_output.c | 10 +++++-----
 net/ipv4/tcp_timer.c  |  4 ++--
 7 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ee938066fd8c..ddeb11c01faa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -640,8 +640,8 @@ void tcp_send_fin(struct sock *sk);
 void tcp_send_active_reset(struct sock *sk, gfp_t priority);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
-void tcp_send_ack(struct sock *sk);
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags);
+void tcp_send_ack(struct sock *sk, u16 flags);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 574972bc7299..55b78183fbd9 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -146,7 +146,7 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
 BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
 {
 	/* bpf_tcp_ca prog cannot have NULL tp */
-	__tcp_send_ack((struct sock *)tp, rcv_nxt);
+	__tcp_send_ack((struct sock *)tp, rcv_nxt, 0);
 	return 0;
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2ee1e4794c7d..edc03a1bf704 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1572,7 +1572,7 @@ static void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		}
 	}
 	if (time_to_ack)
-		tcp_send_ack(sk);
+		tcp_send_ack(sk, 0);
 }
 
 static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
index d69a77cbd0c7..4b0259111d81 100644
--- a/net/ipv4/tcp_dctcp.h
+++ b/net/ipv4/tcp_dctcp.h
@@ -28,7 +28,7 @@ static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
 		 */
 		if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
 			dctcp_ece_ack_cwr(sk, *ce_state);
-			__tcp_send_ack(sk, *prior_rcv_nxt);
+			__tcp_send_ack(sk, *prior_rcv_nxt, 0);
 		}
 		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 65bbfadbee67..dbe70a114b1d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3512,7 +3512,7 @@ static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb)
 	if (count > 0) {
 		WRITE_ONCE(challenge_count, count - 1);
 		NET_INC_STATS(net, LINUX_MIB_TCPCHALLENGEACK);
-		tcp_send_ack(sk);
+		tcp_send_ack(sk, 0);
 	}
 }
 
@@ -4255,12 +4255,12 @@ void tcp_fin(struct sock *sk)
 		 * happens, we must ack the received FIN and
 		 * enter the CLOSING state.
 		 */
-		tcp_send_ack(sk);
+		tcp_send_ack(sk, 0);
 		tcp_set_state(sk, TCP_CLOSING);
 		break;
 	case TCP_FIN_WAIT2:
 		/* Received a FIN -- send ACK and enter TIME_WAIT. */
-		tcp_send_ack(sk);
+		tcp_send_ack(sk, 0);
 		tcp_time_wait(sk, TCP_TIME_WAIT, 0);
 		break;
 	default:
@@ -4367,7 +4367,7 @@ static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
 		}
 	}
 
-	tcp_send_ack(sk);
+	tcp_send_ack(sk, 0);
 }
 
 /* These routines update the SACK block as out-of-order packets arrive or
@@ -4427,7 +4427,7 @@ static void tcp_sack_new_ofo_skb(struct sock *sk, u32 seq, u32 end_seq)
 	 */
 	if (this_sack >= TCP_NUM_SACKS) {
 		if (tp->compressed_ack > TCP_FASTRETRANS_THRESH)
-			tcp_send_ack(sk);
+			tcp_send_ack(sk, 0);
 		this_sack--;
 		tp->rx_opt.num_sacks--;
 		sp--;
@@ -5331,7 +5331,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* Protocol state mandates a one-time immediate ACK */
 	    inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) {
 send_now:
-		tcp_send_ack(sk);
+		tcp_send_ack(sk, 0);
 		return;
 	}
 
@@ -6126,7 +6126,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 			tcp_drop(sk, skb);
 			return 0;
 		} else {
-			tcp_send_ack(sk);
+			tcp_send_ack(sk, 0);
 		}
 		return -1;
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a1414d1a8ef1..c8d0a7baf2d4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3741,7 +3741,7 @@ void tcp_send_delayed_ack(struct sock *sk)
 		 */
 		if (icsk->icsk_ack.blocked ||
 		    time_before_eq(icsk->icsk_ack.timeout, jiffies + (ato >> 2))) {
-			tcp_send_ack(sk);
+			tcp_send_ack(sk, 0);
 			return;
 		}
 
@@ -3754,7 +3754,7 @@ void tcp_send_delayed_ack(struct sock *sk)
 }
 
 /* This routine sends an ack and also updates the window. */
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags)
 {
 	struct sk_buff *buff;
 
@@ -3778,7 +3778,7 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
 
 	/* Reserve space for headers and prepare control bits. */
 	skb_reserve(buff, MAX_TCP_HEADER);
-	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);
+	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK | flags);
 
 	/* We do not want pure acks influencing TCP Small Queues or fq/pacing
 	 * too much.
@@ -3791,9 +3791,9 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
 }
 EXPORT_SYMBOL_GPL(__tcp_send_ack);
 
-void tcp_send_ack(struct sock *sk)
+void tcp_send_ack(struct sock *sk, u16 flags)
 {
-	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
+	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, flags);
 }
 
 /* This routine sends a packet with an out of date sequence
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c3f26dcd6704..f37289216d37 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -302,7 +302,7 @@ void tcp_delack_timer_handler(struct sock *sk)
 			icsk->icsk_ack.ato      = TCP_ATO_MIN;
 		}
 		tcp_mstamp_refresh(tcp_sk(sk));
-		tcp_send_ack(sk);
+		tcp_send_ack(sk, 0);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKS);
 	}
 
@@ -754,7 +754,7 @@ static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer)
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk)) {
 		if (tp->compressed_ack > TCP_FASTRETRANS_THRESH)
-			tcp_send_ack(sk);
+			tcp_send_ack(sk, 0);
 	} else {
 		if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED,
 				      &sk->sk_tsq_flags))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 14/28] tcp: AccECN negotiation
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (12 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 13/28] tcp: Pass flags to tcp_send_ack Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 15/28] tcp: add AccECN rx byte counters Ilpo Järvinen
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h      |   6 ++
 include/net/tcp.h        |  40 ++++++++++++-
 net/ipv4/syncookies.c    |  12 ++++
 net/ipv4/tcp.c           |   1 +
 net/ipv4/tcp_input.c     | 126 +++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |  51 ++++++++++++++--
 net/ipv4/tcp_output.c    |  68 ++++++++++++++++-----
 net/ipv6/syncookies.c    |   1 +
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 272 insertions(+), 37 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9bdf67dd0d1d..a6b1d150cb05 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -149,6 +149,9 @@ struct tcp_request_sock {
 	u64				snt_synack; /* first SYNACK sent time */
 	bool				tfo_listener;
 	bool				is_mptcp;
+	u8				accecn_ok  : 1,
+					syn_ect_snt: 2,
+					syn_ect_rcv: 2;
 	u32				txhash;
 	u32				rcv_isn;
 	u32				snt_isn;
@@ -246,6 +249,9 @@ struct tcp_sock {
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
 	u8	compressed_ack;
+	u8	syn_ect_snt:2,	/* AccECN ECT memory, only */
+		syn_ect_rcv:2,	/* ... needed durign 3WHS + first seqno */
+		ecn_fail:1;	/* ECN reflector detected path mangling */
 	u32	chrono_start;	/* Start time in jiffies of a TCP chrono */
 	u32	chrono_stat[3];	/* Time in jiffies for chrono_stat stats */
 	u8	chrono_type:2,	/* current chronograph type */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ddeb11c01faa..e1dd06bbb472 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,11 +409,28 @@ static inline u8 tcp_accecn_ace(const struct tcphdr *th)
 	return (th->ae << 2) | (th->cwr << 1) | th->ece;
 }
 
+/* Infer the ECT value our SYN arrived with from the echoed ACE field */
+static inline int tcp_accecn_extract_syn_ect(u8 ace)
+{
+	if (ace & 0x1)
+		return INET_ECN_ECT_1;
+	if (!(ace & 0x2))
+		return INET_ECN_ECT_0;
+	if (ace & 0x4)
+		return INET_ECN_CE;
+	return INET_ECN_NOT_ECT;
+}
+
 static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp)
 {
 	return tp->received_ce - tp->received_ce_tx;
 }
 
+bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
+void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
+			  u8 syn_ect_snt);
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb);
+
 enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
 	TCP_TW_RST = 1,
@@ -604,6 +621,7 @@ bool cookie_timestamp_decode(const struct net *net,
 			     struct tcp_options_received *opt);
 bool cookie_ecn_ok(const struct tcp_options_received *opt,
 		   const struct net *net, const struct dst_entry *dst);
+bool cookie_accecn_ok(const struct tcphdr *th);
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
@@ -842,6 +860,7 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
 
 #define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
 #define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
+#define TCPHDR_SYNACK_ACCECN (TCPHDR_SYN | TCPHDR_ACK | TCPHDR_CWR)
 
 #define TCP_ACCECN_CEP_ACE_MASK 0x7
 #define TCP_ACCECN_ACE_MAX_DELTA 6
@@ -926,6 +945,15 @@ struct tcp_skb_cb {
 
 #define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))
 
+static inline u16 tcp_accecn_reflector_flags(u8 ect)
+{
+	u32 flags = ect + 2;
+
+	if (ect == 3)
+		flags++;
+	return flags * TCPHDR_ECE;
+}
+
 static inline void bpf_compute_data_end_sk_skb(struct sk_buff *skb)
 {
 	TCP_SKB_CB(skb)->bpf.data_end = skb->data + skb_headlen(skb);
@@ -1059,7 +1087,10 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
 #define TCP_CONG_NEEDS_ECN	0x2
-#define TCP_CONG_MASK	(TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
+/* Require successfully negotiated AccECN capability */
+#define TCP_CONG_NEEDS_ACCECN	0x4
+#define TCP_CONG_MASK	(TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN | \
+			 TCP_CONG_NEEDS_ACCECN)
 
 union tcp_cc_info;
 
@@ -1173,6 +1204,13 @@ static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 	return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ECN;
 }
 
+static inline bool tcp_ca_needs_accecn(const struct sock *sk)
+{
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ACCECN;
+}
+
 static inline void tcp_set_ca_state(struct sock *sk, const u8 ca_state)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 9a4f6b16c9bc..1d1ded175042 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -276,6 +276,17 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
 }
 EXPORT_SYMBOL(cookie_ecn_ok);
 
+/* §4.1 [...] a server can determine that it negotiated AccECN as
+ * [...] if the ACK contains an ACE field with the value 0b010 to
+ * 0b111 (decimal 2 to 7).
+ */
+/* We don't need to check for > 7 as ACE is on 3 bits */
+bool cookie_accecn_ok(const struct tcphdr *th)
+{
+	return tcp_accecn_ace(th) > 0x1;
+}
+EXPORT_SYMBOL(cookie_accecn_ok);
+
 /* On input, sk is a listener.
  * Output is listener if incoming packet would not create a child
  *           NULL if memory could not be allocated.
@@ -398,6 +409,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 
 	ireq->rcv_wscale  = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
+	treq->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index edc03a1bf704..624dff543301 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2624,6 +2624,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->window_clamp = 0;
 	tp->delivered = 0;
 	tp->delivered_ce = 0;
+	tp->ecn_fail = 0;
 	tcp_accecn_init_counters(tp);
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dbe70a114b1d..bf307be4c659 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -308,14 +308,89 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
+/* §3.1.2 If a TCP server that implements AccECN receives a SYN with the three
+ * TCP header flags (AE, CWR and ECE) set to any combination other than 000,
+ * 011 or 111, it MUST negotiate the use of AccECN as if they had been set to
+ * 111.
+ */
+static inline bool tcp_accecn_syn_requested(const struct tcphdr *th)
+{
+	u8 ace = tcp_accecn_ace(th);
+
+	return ace && ace != 0x3;
+}
+
+/* Check ECN field transition to detect invalid transitions */
+static bool tcp_ect_transition_valid(u8 snt, u8 rcv)
+{
+	if (rcv == snt)
+		return true;
+
+	/* Non-ECT altered to something or something became non-ECT */
+	if ((snt == INET_ECN_NOT_ECT) || (rcv == INET_ECN_NOT_ECT))
+		return false;
+	/* CE -> ECT(0/1)? */
+	if (snt == INET_ECN_CE)
+		return false;
+	return true;
+}
+
+bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u8 ect = tcp_accecn_extract_syn_ect(ace);
+
+	if (!sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
+		return true;
+
+	if (!tcp_ect_transition_valid(sent_ect, ect)) {
+		tp->ecn_fail = 1;
+		return false;
+	}
+
+	return true;
+}
+
+/* See Table 2 of the AccECN draft */
+static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
+			       u8 ip_dsfield)
 {
-	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+	struct tcp_sock *tp = tcp_sk(sk);
+	u8 ace = tcp_accecn_ace(th);
+
+	switch (ace) {
+	case 0x0:
+	case 0x7:
 		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
+		break;
+	case 0x1:
+	case 0x5:
+		if (tcp_ecn_mode_pending(tp))
+			/* Downgrade from AccECN, or requested initially */
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		break;
+	default:
+		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+		tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
+		if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
+		    INET_ECN_is_ce(ip_dsfield))
+			tp->received_ce++;
+		break;
+	}
 }
 
-static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
+static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
+			    const struct sk_buff *skb)
 {
+	if (tcp_ecn_mode_pending(tp)) {
+		if (!tcp_accecn_syn_requested(th)) {
+			/* Downgrade to classic ECN feedback */
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		} else {
+			tp->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+		}
+	}
 	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
 		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
@@ -3484,7 +3559,8 @@ bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
 }
 
 /* RFC 5961 7 [ACK Throttling] */
-static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb)
+static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb,
+				   bool accecn_reflector)
 {
 	/* unprotected vars, we dont care of overwrites */
 	static u32 challenge_timestamp;
@@ -3512,7 +3588,8 @@ static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb)
 	if (count > 0) {
 		WRITE_ONCE(challenge_count, count - 1);
 		NET_INC_STATS(net, LINUX_MIB_TCPCHALLENGEACK);
-		tcp_send_ack(sk, 0);
+		tcp_send_ack(sk, !accecn_reflector ? 0 :
+				 tcp_accecn_reflector_flags(tp->syn_ect_rcv));
 	}
 }
 
@@ -3678,7 +3755,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
 		if (before(ack, prior_snd_una - tp->max_window)) {
 			if (!(flag & FLAG_NO_CHALLENGE_ACK))
-				tcp_send_challenge_ack(sk, skb);
+				tcp_send_challenge_ack(sk, skb, false);
 			return -1;
 		}
 		goto old_ack;
@@ -5482,7 +5559,7 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
 }
 
 /* Updates Accurate ECN received counters from the received IP ECN field */
-static void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
@@ -5521,6 +5598,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool rst_seq_match = false;
+	bool send_accecn_reflector = false;
 
 	/* RFC1323: H1. Apply PAWS check first. */
 	if (tcp_fast_parse_options(sock_net(sk), skb, th, tp) &&
@@ -5598,7 +5676,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 			if (tp->syn_fastopen && !tp->data_segs_in &&
 			    sk->sk_state == TCP_ESTABLISHED)
 				tcp_fastopen_active_disable(sk);
-			tcp_send_challenge_ack(sk, skb);
+			tcp_send_challenge_ack(sk, skb, false);
 		}
 		goto discard;
 	}
@@ -5609,11 +5687,14 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	 * RFC 5961 4.2 : Send a challenge ack
 	 */
 	if (th->syn) {
+		if (tcp_ecn_mode_accecn(tp)) {
+			send_accecn_reflector = true;
+		}
 syn_challenge:
 		if (syn_inerr)
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
-		tcp_send_challenge_ack(sk, skb);
+		tcp_send_challenge_ack(sk, skb, send_accecn_reflector);
 		goto discard;
 	}
 
@@ -6049,7 +6130,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 *    state to ESTABLISHED..."
 		 */
 
-		tcp_ecn_rcv_synack(tp, th);
+		if (tcp_ecn_mode_any(tp))
+			tcp_ecn_rcv_synack(sk, th, TCP_SKB_CB(skb)->ip_dsfield);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 		tcp_try_undo_spurious_syn(sk);
@@ -6126,7 +6208,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 			tcp_drop(sk, skb);
 			return 0;
 		} else {
-			tcp_send_ack(sk, 0);
+			tcp_send_ack(sk, !tcp_ecn_mode_accecn(tp) ? 0 :
+					 tcp_accecn_reflector_flags(tp->syn_ect_rcv));
 		}
 		return -1;
 	}
@@ -6175,7 +6258,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		tp->snd_wl1    = TCP_SKB_CB(skb)->seq;
 		tp->max_window = tp->snd_wnd;
 
-		tcp_ecn_rcv_syn(tp, th);
+		tcp_ecn_rcv_syn(tp, th, skb);
 
 		tcp_mtup_init(sk);
 		tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
@@ -6334,7 +6417,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 	if (!acceptable) {
 		if (sk->sk_state == TCP_SYN_RECV)
 			return 1;	/* send one RST */
-		tcp_send_challenge_ack(sk, skb);
+		tcp_send_challenge_ack(sk, skb, false);
 		goto discard;
 	}
 	switch (sk->sk_state) {
@@ -6376,6 +6459,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		tp->lsndtime = tcp_jiffies32;
 
 		tcp_initialize_rcv_mss(sk);
+		if (tcp_ecn_mode_accecn(tp))
+			tcp_accecn_third_ack(sk, skb, tp->syn_ect_snt);
 		tcp_fast_path_on(tp);
 		break;
 
@@ -6539,6 +6624,18 @@ static void tcp_ecn_create_request(struct request_sock *req,
 	bool ect, ecn_ok;
 	u32 ecn_ok_dst;
 
+	if (tcp_accecn_syn_requested(th) &&
+	    (net->ipv4.sysctl_tcp_ecn || tcp_ca_needs_accecn(listen_sk))) {
+		inet_rsk(req)->ecn_ok = 1;
+		if ((net->ipv4.sysctl_tcp_ecn >= 2) ||
+		    tcp_ca_needs_accecn(listen_sk)) {
+			tcp_rsk(req)->accecn_ok = 1;
+			tcp_rsk(req)->syn_ect_rcv =
+				TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+		}
+		return;
+	}
+
 	if (!th_ecn)
 		return;
 
@@ -6565,6 +6662,9 @@ static void tcp_openreq_init(struct request_sock *req,
 	tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 	tcp_rsk(req)->snt_synack = 0;
 	tcp_rsk(req)->last_oow_ack_time = 0;
+	tcp_rsk(req)->accecn_ok = 0;
+	tcp_rsk(req)->syn_ect_rcv = 0;
+	tcp_rsk(req)->syn_ect_snt = 0;
 	req->mss = rx_opt->mss_clamp;
 	req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
 	ireq->tstamp_ok = rx_opt->tstamp_ok;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index dab0c1b85e95..d3b3e4d011b1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -958,7 +958,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
-	const struct inet_request_sock *ireq = inet_rsk(req);
+	struct inet_request_sock *ireq = inet_rsk(req);
 	struct flowi4 fl4;
 	int err = -1;
 	struct sk_buff *skb;
@@ -970,6 +970,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
 
 	if (skb) {
+		tcp_rsk(req)->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
 		__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
 
 		rcu_read_lock();
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 3b5a137e416c..57b7cf4658fc 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -397,12 +397,51 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 }
 EXPORT_SYMBOL(tcp_openreq_init_rwin);
 
-static void tcp_ecn_openreq_child(struct tcp_sock *tp,
-				  const struct request_sock *req)
+void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
+			  u8 syn_ect_snt)
 {
-	tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
-			     TCP_ECN_MODE_RFC3168 :
-			     TCP_ECN_DISABLED);
+	struct tcp_sock *tp = tcp_sk(sk);
+	u8 ace = tcp_accecn_ace(tcp_hdr(skb));
+
+	switch (ace) {
+	case 0x0:
+		tp->ecn_fail = 1;
+		break;
+	case 0x7:
+	case 0x5:
+	case 0x1:
+		/* Unused but legal values */
+		break;
+	default:
+		/* Validation only applies to first non-data packet */
+		if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
+		    !TCP_SKB_CB(skb)->sacked &&
+		    tcp_accecn_validate_syn_feedback(sk, ace, syn_ect_snt)) {
+			if ((tcp_accecn_extract_syn_ect(ace) == INET_ECN_CE) &&
+			    !tp->delivered_ce)
+				tp->delivered_ce++;
+		}
+		break;
+	}
+}
+
+static void tcp_ecn_openreq_child(struct sock *sk,
+				  const struct request_sock *req,
+				  const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	const struct tcp_request_sock *treq = tcp_rsk(req);
+
+	if (treq->accecn_ok) {
+		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+		tp->syn_ect_snt = treq->syn_ect_snt;
+		tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+		tcp_ecn_received_counters(sk, skb);
+	} else {
+		tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+				     TCP_ECN_MODE_RFC3168 :
+				     TCP_ECN_DISABLED);
+	}
 }
 
 void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
@@ -546,7 +585,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
 		newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
 	newtp->rx_opt.mss_clamp = req->mss;
-	tcp_ecn_openreq_child(newtp, req);
+	tcp_ecn_openreq_child(newsk, req, skb);
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c8d0a7baf2d4..adc22d0d75fd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -308,7 +308,7 @@ static u16 tcp_select_window(struct sock *sk)
 /* Packet ECN state for a SYN-ACK */
 static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
 	if (tcp_ecn_disabled(tp))
@@ -316,6 +316,13 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 	else if (tcp_ca_needs_ecn(sk) ||
 		 tcp_bpf_ca_needs_ecn(sk))
 		INET_ECN_xmit(sk);
+
+	if (tp->ecn_flags & TCP_ECN_MODE_ACCECN) {
+		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
+		TCP_SKB_CB(skb)->tcp_flags |=
+			tcp_accecn_reflector_flags(tp->syn_ect_rcv);
+		tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
+	}
 }
 
 /* Packet ECN state for a SYN.  */
@@ -323,8 +330,10 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
+	bool use_accecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 3 ||
+		tcp_ca_needs_accecn(sk);
 	bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
-		tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
+		tcp_ca_needs_ecn(sk) || bpf_needs_ecn || use_accecn;
 
 	if (!use_ecn) {
 		const struct dst_entry *dst = __sk_dst_get(sk);
@@ -340,36 +349,59 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 			INET_ECN_xmit(sk);
 
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		if (use_accecn) {
+			TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_AE;
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_PENDING);
+			tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
+		} else {
+			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+		}
 	}
 }
 
 static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
 {
-	if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
+	if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback) {
 		/* tp->ecn_flags are cleared at a later point in time when
 		 * SYN ACK is ultimatively being received.
 		 */
-		TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
+		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
+	}
+}
+
+static void tcp_accecn_echo_syn_ect(struct tcphdr *th, u8 ect)
+{
+	th->ae = !!(ect & 2);
+	th->cwr = ect != INET_ECN_ECT_0;
+	th->ece = ect == INET_ECN_ECT_1;
 }
 
 static void
 tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
 {
-	if (inet_rsk(req)->ecn_ok)
+	if (tcp_rsk(req)->accecn_ok)
+		tcp_accecn_echo_syn_ect(th, tcp_rsk(req)->syn_ect_rcv);
+	else if (inet_rsk(req)->ecn_ok)
 		th->ece = 1;
 }
 
-static void tcp_accecn_set_ace(struct tcphdr *th, struct tcp_sock *tp)
+static void tcp_accecn_set_ace(struct tcp_sock *tp, struct sk_buff *skb,
+			       struct tcphdr *th)
 {
 	u32 wire_ace;
 
-	tp->received_ce_tx += min_t(u32, tcp_accecn_ace_deficit(tp),
-				    TCP_ACCECN_ACE_MAX_DELTA);
-	wire_ace = tp->received_ce_tx + TCP_ACCECN_CEP_INIT_OFFSET;
-	th->ece = !!(wire_ace & 0x1);
-	th->cwr = !!(wire_ace & 0x2);
-	th->ae = !!(wire_ace & 0x4);
+	/* The final packet of the 3WHS or anything like it must reflect
+	 * the SYN/ACK ECT instead of putting CEP into ACE field, such
+	 * case show up in tcp_flags.
+	 */
+	if (likely(!(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACE))) {
+		tp->received_ce_tx += min_t(u32, tcp_accecn_ace_deficit(tp),
+					    TCP_ACCECN_ACE_MAX_DELTA);
+		wire_ace = tp->received_ce_tx + TCP_ACCECN_CEP_INIT_OFFSET;
+		th->ece = !!(wire_ace & 0x1);
+		th->cwr = !!(wire_ace & 0x2);
+		th->ae = !!(wire_ace & 0x4);
+	}
 }
 
 /* Set up ECN state for a packet on a ESTABLISHED socket that is about to
@@ -383,9 +415,10 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 	if (!tcp_ecn_mode_any(tp))
 		return;
 
-	INET_ECN_xmit(sk);
+	if (!tp->ecn_fail)
+		INET_ECN_xmit(sk);
 	if (tcp_ecn_mode_accecn(tp)) {
-		tcp_accecn_set_ace(th, tp);
+		tcp_accecn_set_ace(tp, skb, th);
 		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
 	} else {
 		/* Not-retransmitted data segment: set ECT and inject CWR. */
@@ -3034,7 +3067,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 			tcp_retrans_try_collapse(sk, skb, cur_mss);
 	}
 
-	/* RFC3168, section 6.1.1.1. ECN fallback */
+	/* RFC3168, section 6.1.1.1. ECN fallback
+	 * As AccECN uses the same SYN flags (+ AE), this check covers both
+	 * cases.
+	 */
 	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
 		tcp_ecn_clear_syn(sk, skb);
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 13235a012388..6e859d8fa489 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -251,6 +251,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	ireq->rcv_wscale = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
+	tcp_rsk(req)->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff);
 out:
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 73032cd261ea..ecbab28c98cf 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -507,6 +507,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
 
 	if (skb) {
+		tcp_rsk(req)->syn_ect_snt = np->tclass & INET_ECN_MASK;
 		__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,
 				    &ireq->ir_v6_rmt_addr);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 15/28] tcp: add AccECN rx byte counters
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (13 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 14/28] tcp: AccECN negotiation Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-19  3:36   ` Eric Dumazet
  2020-03-18  9:43 ` [RFC PATCH 16/28] tcp: allow embedding leftover into option padding Ilpo Järvinen
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h      |  1 +
 include/net/tcp.h        | 18 +++++++++++++++++-
 net/ipv4/tcp_input.c     | 13 +++++++++----
 net/ipv4/tcp_minisocks.c |  3 ++-
 4 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a6b1d150cb05..6b81d7eb0117 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -323,6 +323,7 @@ struct tcp_sock {
 	u32	delivered_ce;	/* Like the above but only ECE marked packets */
 	u32	received_ce;	/* Like the above but for received CE marked packets */
 	u32	received_ce_tx; /* Like the above but max transmitted value */
+	u32	received_ecn_bytes[3];
 	u32	lost;		/* Total data packets lost incl. rexmits */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
 	u64	first_tx_mstamp;  /* start of window send phase */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e1dd06bbb472..5824447b1fc5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -429,7 +429,8 @@ static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp)
 bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
 void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
 			  u8 syn_ect_snt);
-void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb);
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
+			       u32 payload_len);
 
 enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
@@ -869,11 +870,26 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
  * See draft-ietf-tcpm-accurate-ecn for the latest values.
  */
 #define TCP_ACCECN_CEP_INIT_OFFSET 5
+#define TCP_ACCECN_E1B_INIT_OFFSET 0
+#define TCP_ACCECN_E0B_INIT_OFFSET 1
+#define TCP_ACCECN_CEB_INIT_OFFSET 0
+
+static inline void __tcp_accecn_init_bytes_counters(int *counter_array)
+{
+	BUILD_BUG_ON(INET_ECN_ECT_1 != 0x1);
+	BUILD_BUG_ON(INET_ECN_ECT_0 != 0x2);
+	BUILD_BUG_ON(INET_ECN_CE != 0x3);
+
+	counter_array[INET_ECN_ECT_1 - 1] = 0;
+	counter_array[INET_ECN_ECT_0 - 1] = 0;
+	counter_array[INET_ECN_CE - 1] = 0;
+}
 
 static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
 {
 	tp->received_ce = 0;
 	tp->received_ce_tx = 0;
+	__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
 }
 
 /* This is what the send packet queuing engine uses to pass
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bf307be4c659..3109e3199906 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5559,7 +5559,8 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
 }
 
 /* Updates Accurate ECN received counters from the received IP ECN field */
-void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
+			       u32 payload_len)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
@@ -5570,6 +5571,10 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
 
 		/* ACE counter tracks *all* segments including pure acks */
 		tp->received_ce += is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+		if (payload_len > 0) {
+			tp->received_ecn_bytes[ecnfield - 1] += payload_len;
+		}
 	}
 }
 
@@ -5808,7 +5813,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 				    tp->rcv_nxt == tp->rcv_wup)
 					flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
 
-				tcp_ecn_received_counters(sk, skb);
+				tcp_ecn_received_counters(sk, skb, 0);
 
 				/* We know that such packets are checksummed
 				 * on entry.
@@ -5851,7 +5856,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 
 			/* Bulk data transfer: receiver */
 			__skb_pull(skb, tcp_header_len);
-			tcp_ecn_received_counters(sk, skb);
+			tcp_ecn_received_counters(sk, skb, len - tcp_header_len);
 			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
 			tcp_event_data_recv(sk, skb);
@@ -5888,7 +5893,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 		return;
 
 step5:
-	tcp_ecn_received_counters(sk, skb);
+	tcp_ecn_received_counters(sk, skb, len - th->doff * 4);
 
 	if (tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT) < 0)
 		goto discard;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b7cf4658fc..668edd00e377 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -433,10 +433,11 @@ static void tcp_ecn_openreq_child(struct sock *sk,
 	const struct tcp_request_sock *treq = tcp_rsk(req);
 
 	if (treq->accecn_ok) {
+		const struct tcphdr *th = (const struct tcphdr *)skb->data;
 		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
 		tp->syn_ect_snt = treq->syn_ect_snt;
 		tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
-		tcp_ecn_received_counters(sk, skb);
+		tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
 	} else {
 		tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
 				     TCP_ECN_MODE_RFC3168 :
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 16/28] tcp: allow embedding leftover into option padding
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (14 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 15/28] tcp: add AccECN rx byte counters Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 17/28] tcp: AccECN needs to know delivered bytes Ilpo Järvinen
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

There is some waste space in the option usage due to padding
of 32-bit fields. AccECN option can take advantage of those
few bytes as its tail is often consuming just a few odd bytes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_output.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index adc22d0d75fd..9ff6d14363df 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -504,6 +504,8 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
 #endif
 }
 
+#define NOP_LEFTOVER	((TCPOPT_NOP << 8) | TCPOPT_NOP)
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -521,6 +523,8 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			      struct tcp_out_options *opts)
 {
 	u16 options = opts->options;	/* mungable copy */
+	u16 leftover_bytes = NOP_LEFTOVER;	/* replace next NOPs if avail */
+	int leftover_size = 2;
 
 	if (unlikely(OPTION_MD5 & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
@@ -554,17 +558,22 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	}
 
 	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
+		*ptr++ = htonl((leftover_bytes << 16) |
 			       (TCPOPT_SACK_PERM << 8) |
 			       TCPOLEN_SACK_PERM);
+		leftover_bytes = NOP_LEFTOVER;
 	}
 
 	if (unlikely(OPTION_WSCALE & options)) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
+		u8 highbyte = TCPOPT_NOP;
+
+		if (unlikely(leftover_size == 1))
+			highbyte = leftover_bytes >> 8;
+		*ptr++ = htonl((highbyte << 24) |
 			       (TCPOPT_WINDOW << 16) |
 			       (TCPOLEN_WINDOW << 8) |
 			       opts->ws);
+		leftover_bytes = NOP_LEFTOVER;
 	}
 
 	if (unlikely(opts->num_sack_blocks)) {
@@ -572,8 +581,7 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			tp->duplicate_sack : tp->selective_acks;
 		int this_sack;
 
-		*ptr++ = htonl((TCPOPT_NOP  << 24) |
-			       (TCPOPT_NOP  << 16) |
+		*ptr++ = htonl((leftover_bytes << 16) |
 			       (TCPOPT_SACK <<  8) |
 			       (TCPOLEN_SACK_BASE + (opts->num_sack_blocks *
 						     TCPOLEN_SACK_PERBLOCK)));
@@ -585,6 +593,10 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		}
 
 		tp->rx_opt.dsack = 0;
+	} else if (unlikely(leftover_bytes != NOP_LEFTOVER)) {
+		*ptr++ = htonl((leftover_bytes << 16) |
+			       (TCPOPT_NOP << 8) |
+			       TCPOPT_NOP);
 	}
 
 	if (unlikely(OPTION_FAST_OPEN_COOKIE & options)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 17/28] tcp: AccECN needs to know delivered bytes
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (15 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 16/28] tcp: allow embedding leftover into option padding Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 18/28] tcp: don't early return when sack doesn't fit Ilpo Järvinen
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

AccECN byte counter estimation requires delivered bytes
which can be calculated while processing SACK blocks and
cumulative ACK.

Non-SACK calculation is quite annoying, inaccurate, and
likely bogus.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_input.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3109e3199906..0a63f8a49057 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1235,6 +1235,7 @@ static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
 
 struct tcp_sacktag_state {
 	u32	reord;
+	u32	delivered_bytes;
 	/* Timestamps for earliest and latest never-retransmitted segment
 	 * that was SACKed. RTO needs the earliest RTT to stay conservative,
 	 * but congestion control should still get an accurate delay signal.
@@ -1306,7 +1307,7 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
 static u8 tcp_sacktag_one(struct sock *sk,
 			  struct tcp_sacktag_state *state, u8 sacked,
 			  u32 start_seq, u32 end_seq,
-			  int dup_sack, int pcount,
+			  int dup_sack, int pcount, u32 plen,
 			  u64 xmit_time)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -1365,6 +1366,7 @@ static u8 tcp_sacktag_one(struct sock *sk,
 		state->flag |= FLAG_DATA_SACKED;
 		tp->sacked_out += pcount;
 		tp->delivered += pcount;  /* Out-of-order packets delivered */
+		state->delivered_bytes += plen;
 
 		/* Lost marker hint past SACKed? Tweak RFC3517 cnt */
 		if (tp->lost_skb_hint &&
@@ -1406,7 +1408,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
 	 * tcp_highest_sack_seq() when skb is highest_sack.
 	 */
 	tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
-			start_seq, end_seq, dup_sack, pcount,
+			start_seq, end_seq, dup_sack, pcount, skb->len,
 			tcp_skb_timestamp_us(skb));
 	tcp_rate_skb_delivered(sk, skb, state->rate);
 
@@ -1696,6 +1698,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 						TCP_SKB_CB(skb)->end_seq,
 						dup_sack,
 						tcp_skb_pcount(skb),
+						skb->len,
 						tcp_skb_timestamp_us(skb));
 			tcp_rate_skb_delivered(sk, skb, state->rate);
 			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
@@ -3239,6 +3242,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 			tp->sacked_out -= acked_pcount;
 		} else if (tcp_is_sack(tp)) {
 			tp->delivered += acked_pcount;
+			sack->delivered_bytes += skb->len;
 			if (!tcp_skb_spurious_retrans(tp, skb))
 				tcp_rack_advance(tp, sacked, scb->end_seq,
 						 tcp_skb_timestamp_us(skb));
@@ -3325,6 +3329,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 			 */
 			if (flag & FLAG_RETRANS_DATA_ACKED)
 				flag &= ~FLAG_ORIG_SACK_ACKED;
+
+			sack->delivered_bytes = (skb ?
+						 TCP_SKB_CB(skb)->seq :
+						 tp->snd_una) - prior_snd_una;
 		} else {
 			int delta;
 
@@ -3742,6 +3750,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 ecn_count = 0; /* Did we receive ECE/an AccECN ACE update? */
 	u32 prior_fack;
 
+	sack_state.delivered_bytes = 0;
 	sack_state.first_sackt = 0;
 	sack_state.rate = &rs;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 18/28] tcp: don't early return when sack doesn't fit
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (16 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 17/28] tcp: AccECN needs to know delivered bytes Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 19/28] tcp: AccECN option Ilpo Järvinen
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

AccECN code will be placed after this fragment so no early
returns please.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_output.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9ff6d14363df..084ebd2e725f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -875,17 +875,16 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
 		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-		if (unlikely(remaining < TCPOLEN_SACK_BASE_ALIGNED +
-					 TCPOLEN_SACK_PERBLOCK))
-			return size;
-
-		opts->num_sack_blocks =
-			min_t(unsigned int, eff_sacks,
-			      (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-			      TCPOLEN_SACK_PERBLOCK);
-
-		size += TCPOLEN_SACK_BASE_ALIGNED +
-			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+		if (likely(remaining >= TCPOLEN_SACK_BASE_ALIGNED +
+					 TCPOLEN_SACK_PERBLOCK)) {
+			opts->num_sack_blocks =
+				min_t(unsigned int, eff_sacks,
+				      (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+				      TCPOLEN_SACK_PERBLOCK);
+
+			size += TCPOLEN_SACK_BASE_ALIGNED +
+				opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+		}
 	}
 
 	return size;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 19/28] tcp: AccECN option
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (17 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 18/28] tcp: don't early return when sack doesn't fit Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 20/28] tcp: AccECN option send control Ilpo Järvinen
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

AccECN option tx & rx side without option send control
related features that are added in a later change.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h   |   4 ++
 include/net/tcp.h     |  16 ++++++
 net/ipv4/tcp_input.c  | 126 +++++++++++++++++++++++++++++++++++++++---
 net/ipv4/tcp_output.c | 112 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 246 insertions(+), 12 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6b81d7eb0117..fd232bb7fae9 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -114,6 +114,7 @@ struct tcp_options_received {
 		snd_wscale : 4,	/* Window scaling received from sender	*/
 		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
 	u8	num_sacks;	/* Number of SACK blocks		*/
+	s8	accecn;		/* AccECN index in header, -1=no option	*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 #if IS_ENABLED(CONFIG_MPTCP)
@@ -321,9 +322,12 @@ struct tcp_sock {
 	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 	u32	delivered;	/* Total data packets delivered incl. rexmits */
 	u32	delivered_ce;	/* Like the above but only ECE marked packets */
+	u32	delivered_ecn_bytes[3];
 	u32	received_ce;	/* Like the above but for received CE marked packets */
 	u32	received_ce_tx; /* Like the above but max transmitted value */
 	u32	received_ecn_bytes[3];
+	u8	accecn_minlen:2,/* Minimum length of AccECN option sent */
+		estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
 	u32	lost;		/* Total data packets lost incl. rexmits */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
 	u64	first_tx_mstamp;  /* start of window send phase */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5824447b1fc5..54471c2dedd5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -189,6 +189,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* Magic number to be after the option value for sharing TCP
  * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
  */
+#define TCPOPT_ACCECN_MAGIC	0xACCE
 #define TCPOPT_FASTOPEN_MAGIC	0xF989
 #define TCPOPT_SMC_MAGIC	0xE2D4C3D9
 
@@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_FASTOPEN_BASE  2
 #define TCPOLEN_EXP_FASTOPEN_BASE  4
 #define TCPOLEN_EXP_SMC_BASE   6
+#define TCPOLEN_EXP_ACCECN_BASE 4
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -215,6 +217,13 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_MD5SIG_ALIGNED		20
 #define TCPOLEN_MSS_ALIGNED		4
 #define TCPOLEN_EXP_SMC_BASE_ALIGNED	8
+#define TCPOLEN_ACCECN_PERCOUNTER	3
+
+/* Maximum number of byte counters in AccECN option + size */
+#define TCP_ACCECN_NUMCOUNTERS		3
+#define TCP_ACCECN_MAXSIZE		(TCPOLEN_EXP_ACCECN_BASE + \
+					 TCPOLEN_ACCECN_PERCOUNTER * \
+					 TCP_ACCECN_NUMCOUNTERS)
 
 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF		1	/* Nagle's algo is disabled */
@@ -363,6 +372,10 @@ static inline void tcp_dec_quickack_mode(struct sock *sk,
 	}
 }
 
+/* sysctl_tcp_ecn value */
+#define TCP_ECN_ENABLE_MASK	0x3
+#define TCP_ACCECN_NO_OPT	0x100
+
 #define	TCP_ECN_MODE_RFC3168	0x1
 #define	TCP_ECN_QUEUE_CWR	0x2
 #define	TCP_ECN_DEMAND_CWR	0x4
@@ -890,6 +903,9 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
 	tp->received_ce = 0;
 	tp->received_ce_tx = 0;
 	__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
+	__tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
+	tp->accecn_minlen = 0;
+	tp->estimate_ecnfield = 0;
 }
 
 /* This is what the send packet queuing engine uses to pass
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0a63f8a49057..d34b50f73652 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -402,9 +402,92 @@ static u32 tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *
 	return 0;
 }
 
+/* Handles AccECN option ECT and CE 24-bit byte counters update into
+ * the u32 value in tcp_sock. As we're processing TCP options, it is
+ * safe to access from - 1.
+ */
+static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 init_offset)
+{
+	u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & 0xFFFFFFU;
+	u32 delta = (truncated - *cnt) & 0xFFFFFFU;
+	/* If delta has the highest bit set (24th bit) indicating negative,
+	 * sign extend to correct an estimation error in the ecn_bytes
+	 */
+	delta = delta & 0x800000 ? delta | 0xFF000000 : delta;
+	*cnt += delta;
+	return (s32)delta;
+}
+
+static u8 accecn_opt_ecnfield[3] = {
+	INET_ECN_ECT_0, INET_ECN_CE, INET_ECN_ECT_1,
+};
+
+/* Returns true if the byte counters can be used */
+static bool tcp_accecn_process_option(struct tcp_sock *tp,
+				      const struct sk_buff *skb,
+				      u32 delivered_bytes)
+{
+	unsigned char *ptr;
+	unsigned int optlen;
+	int i;
+	bool ambiguous_ecn_bytes_incr = false;
+	bool first_changed = false;
+	bool res;
+
+	if (tp->rx_opt.accecn < 0) {
+		if (tp->estimate_ecnfield) {
+			tp->delivered_ecn_bytes[tp->estimate_ecnfield - 1] +=
+				delivered_bytes;
+			return true;
+		}
+		return false;
+	}
+
+	ptr = skb_transport_header(skb) + tp->rx_opt.accecn;
+	optlen = ptr[1];
+	if (ptr[0] == TCPOPT_EXP) {
+		optlen -= 2;
+		ptr += 2;
+	}
+	ptr += 2;
+
+	res = !!tp->estimate_ecnfield;
+	for (i = 0; i < 3; i++) {
+		if (optlen >= TCPOLEN_ACCECN_PERCOUNTER) {
+			u8 ecnfield = accecn_opt_ecnfield[i];
+			u32 init_offset = i ? 0 : TCP_ACCECN_E0B_INIT_OFFSET;
+			s32 delta;
+
+			delta = tcp_update_ecn_bytes(&(tp->delivered_ecn_bytes[ecnfield - 1]),
+						     ptr, init_offset);
+			if (delta) {
+				if (delta < 0) {
+					res = false;
+					ambiguous_ecn_bytes_incr = true;
+				}
+				if (ecnfield != tp->estimate_ecnfield) {
+					if (!first_changed) {
+						tp->estimate_ecnfield = ecnfield;
+						first_changed = true;
+					} else {
+						res = false;
+						ambiguous_ecn_bytes_incr = true;
+					}
+				}
+			}
+
+			optlen -= TCPOLEN_ACCECN_PERCOUNTER;
+		}
+	}
+	if (ambiguous_ecn_bytes_incr)
+		tp->estimate_ecnfield = 0;
+
+	return res;
+}
+
 /* Returns the ECN CE delta */
 static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
-			      u32 delivered_pkts, int flag)
+			      u32 delivered_pkts, u32 delivered_bytes, int flag)
 {
 	u32 delta, safe_delta;
 	u32 corrected_ace;
@@ -413,6 +496,8 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
 	if (!(flag & (FLAG_FORWARD_PROGRESS|FLAG_TS_PROGRESS)))
 		return 0;
 
+	tcp_accecn_process_option(tp, skb, delivered_bytes);
+
 	if (!(flag & FLAG_SLOWPATH)) {
 		/* AccECN counter might overflow on large ACKs */
 		if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
@@ -3839,7 +3924,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (tcp_ecn_mode_accecn(tp)) {
 		ecn_count = tcp_accecn_process(tp, skb,
-					       tp->delivered - delivered, flag);
+					       tp->delivered - delivered,
+					       sack_state.delivered_bytes, flag);
 		if (ecn_count > 0)
 			flag |= FLAG_ECE;
 	}
@@ -3878,7 +3964,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 no_queue:
 	if (tcp_ecn_mode_accecn(tp)) {
 		ecn_count = tcp_accecn_process(tp, skb,
-					       tp->delivered - delivered, flag);
+					       tp->delivered - delivered,
+					       sack_state.delivered_bytes, flag);
 		if (ecn_count > 0)
 			flag |= FLAG_ECE;
 	}
@@ -4005,6 +4092,7 @@ void tcp_parse_options(const struct net *net,
 
 	ptr = (const unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
+	opt_rx->accecn = -1;
 
 	while (length > 0) {
 		int opcode = *ptr++;
@@ -4094,12 +4182,16 @@ void tcp_parse_options(const struct net *net,
 				break;
 
 			case TCPOPT_EXP:
+				if (opsize >= TCPOLEN_EXP_ACCECN_BASE &&
+				    get_unaligned_be16(ptr) ==
+				    TCPOPT_ACCECN_MAGIC)
+					opt_rx->accecn = (ptr - 2) - (unsigned char *)th;
 				/* Fast Open option shares code 254 using a
 				 * 16 bits magic number.
 				 */
-				if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE &&
-				    get_unaligned_be16(ptr) ==
-				    TCPOPT_FASTOPEN_MAGIC)
+				else if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE &&
+					 get_unaligned_be16(ptr) ==
+					 TCPOPT_FASTOPEN_MAGIC)
 					tcp_parse_fastopen_option(opsize -
 						TCPOLEN_EXP_FASTOPEN_BASE,
 						ptr + 2, th->syn, foc, true);
@@ -5567,6 +5659,19 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
 	}
 }
 
+/* Maps ECT/CE bits to minimum length of AccECN option */
+static inline unsigned int tcp_ecn_field_to_accecn_len(u8 ecnfield)
+{
+	unsigned int opt;
+
+	opt = (ecnfield - 2) & INET_ECN_MASK;
+	/* Shift+XOR for 11 -> 10 */
+	opt = (opt ^ (opt >> 1)) + 1;
+
+	return opt;
+}
+
+
 /* Updates Accurate ECN received counters from the received IP ECN field */
 void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
 			       u32 payload_len)
@@ -5582,7 +5687,9 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
 		tp->received_ce += is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs);
 
 		if (payload_len > 0) {
+			u8 minlen = tcp_ecn_field_to_accecn_len(ecnfield);
 			tp->received_ecn_bytes[ecnfield - 1] += payload_len;
+			tp->accecn_minlen = max_t(u8, tp->accecn_minlen, minlen);
 		}
 	}
 }
@@ -6639,9 +6746,10 @@ static void tcp_ecn_create_request(struct request_sock *req,
 	u32 ecn_ok_dst;
 
 	if (tcp_accecn_syn_requested(th) &&
-	    (net->ipv4.sysctl_tcp_ecn || tcp_ca_needs_accecn(listen_sk))) {
+	    ((net->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) ||
+	     tcp_ca_needs_accecn(listen_sk))) {
 		inet_rsk(req)->ecn_ok = 1;
-		if ((net->ipv4.sysctl_tcp_ecn >= 2) ||
+		if (((net->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) >= 2) ||
 		    tcp_ca_needs_accecn(listen_sk)) {
 			tcp_rsk(req)->accecn_ok = 1;
 			tcp_rsk(req)->syn_ect_rcv =
@@ -6655,7 +6763,7 @@ static void tcp_ecn_create_request(struct request_sock *req,
 
 	ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
 	ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK);
-	ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
+	ecn_ok = (net->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) || ecn_ok_dst;
 
 	if (((!ect || th->res1 || th->ae) && ecn_ok) ||
 	    tcp_ca_needs_ecn(listen_sk) ||
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 084ebd2e725f..7bce1a73ac8f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -330,9 +330,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
-	bool use_accecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 3 ||
+	bool use_accecn =
+		(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) == 3 ||
 		tcp_ca_needs_accecn(sk);
-	bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
+	bool use_ecn =
+		(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ECN_ENABLE_MASK) == 1 ||
 		tcp_ca_needs_ecn(sk) || bpf_needs_ecn || use_accecn;
 
 	if (!use_ecn) {
@@ -468,6 +470,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_FAST_OPEN_COOKIE	(1 << 8)
 #define OPTION_SMC		(1 << 9)
 #define OPTION_MPTCP		(1 << 10)
+#define OPTION_ACCECN		(1 << 11)
 
 static void smc_options_write(__be32 *ptr, u16 *options)
 {
@@ -488,12 +491,14 @@ struct tcp_out_options {
 	u16 options;		/* bit field of OPTION_* */
 	u16 mss;		/* 0 to disable */
 	u8 ws;			/* window scale, 0 to disable */
-	u8 num_sack_blocks;	/* number of SACK blocks to include */
+	u8 num_sack_blocks:3,	/* number of SACK blocks to include */
+	   num_ecn_bytes:2;	/* number of AccECN fields needed */
 	u8 hash_size;		/* bytes in hash_location */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
 	struct mptcp_out_options mptcp;
+	u32 *ecn_bytes;		/* AccECN ECT/CE byte counters */
 };
 
 static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
@@ -557,6 +562,33 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		*ptr++ = htonl(opts->tsecr);
 	}
 
+	if (unlikely(OPTION_ACCECN & options)) {
+		u32 e0b = opts->ecn_bytes[INET_ECN_ECT_0 - 1] + TCP_ACCECN_E0B_INIT_OFFSET;
+		u32 ceb = opts->ecn_bytes[INET_ECN_CE - 1] + TCP_ACCECN_CEB_INIT_OFFSET;
+		u32 e1b = opts->ecn_bytes[INET_ECN_ECT_1 - 1] + TCP_ACCECN_E1B_INIT_OFFSET;
+		u8 len = TCPOLEN_EXP_ACCECN_BASE +
+			 opts->num_ecn_bytes * TCPOLEN_ACCECN_PERCOUNTER;
+
+		*ptr++ = htonl((TCPOPT_EXP << 24) | (len << 16) |
+			       TCPOPT_ACCECN_MAGIC);
+		if (opts->num_ecn_bytes > 0) {
+			*ptr++ = htonl((e0b << 8) |
+				       (opts->num_ecn_bytes > 1 ?
+					(ceb >> 16) & 0xff :
+					TCPOPT_NOP));
+			if (opts->num_ecn_bytes == 2) {
+				leftover_bytes = (ceb >> 8) & 0xffff;
+			} else {
+				*ptr++ = htonl((ceb << 16) |
+					       ((e1b >> 8) & 0xffff));
+				leftover_bytes = ((e1b & 0xff) << 8) |
+						TCPOPT_NOP;
+				leftover_size = 1;
+			}
+		}
+		if (tp != NULL)
+			tp->accecn_minlen = 0;
+	}
 	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 		*ptr++ = htonl((leftover_bytes << 16) |
 			       (TCPOPT_SACK_PERM << 8) |
@@ -677,6 +709,53 @@ static void mptcp_set_option_cond(const struct request_sock *req,
 	}
 }
 
+/* Initial values for AccECN option, ordered is based on ECN field bits
+ * similar to received_ecn_bytes. Used for SYN/ACK AccECN option.
+ */
+u32 synack_ecn_bytes[3] = { 0, 0, 0 };
+
+static u32 tcp_synack_options_combine_saving(struct tcp_out_options *opts)
+{
+	/* How much there's room for combining with the alignment padding? */
+	if ((opts->options & (OPTION_SACK_ADVERTISE|OPTION_TS)) ==
+	    OPTION_SACK_ADVERTISE)
+		return 2;
+	else if (opts->options & OPTION_WSCALE)
+		return 1;
+	return 0;
+}
+
+/* AccECN can take here and there take advantage of NOPs for alignment of
+ * other options
+ */
+static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
+				  int remaining, int max_combine_saving)
+{
+	int size = TCP_ACCECN_MAXSIZE;
+
+	opts->num_ecn_bytes = TCP_ACCECN_NUMCOUNTERS;
+
+	while (opts->num_ecn_bytes >= required) {
+		int leftover_size = size & 0x3;
+		/* Pad to dword if cannot combine */
+		if (leftover_size > max_combine_saving)
+			leftover_size = -((4 - leftover_size) & 0x3);
+
+		if (remaining >= size - leftover_size) {
+			size -= leftover_size;
+			break;
+		}
+
+		opts->num_ecn_bytes--;
+		size -= TCPOLEN_ACCECN_PERCOUNTER;
+	}
+	if (opts->num_ecn_bytes < required)
+		return 0;
+
+	opts->options |= OPTION_ACCECN;
+	return size;
+}
+
 /* Compute TCP options for SYN packets. This is not the final
  * network wire format yet.
  */
@@ -755,6 +834,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	/* Simultaneous open SYN/ACK needs AccECN option but not SYN */
+	if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) &&
+		     tcp_ecn_mode_accecn(tp) &&
+		     !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) &&
+		     (remaining >= TCPOLEN_EXP_ACCECN_BASE))) {
+		opts->ecn_bytes = synack_ecn_bytes;
+		remaining -= tcp_options_fit_accecn(opts, 0, remaining,
+						    tcp_synack_options_combine_saving(opts));
+	}
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -767,6 +856,7 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 				       struct tcp_fastopen_cookie *foc)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
+	struct tcp_request_sock *treq = tcp_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -820,6 +910,14 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
+	if (treq->accecn_ok &&
+	    !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) &&
+	    (remaining >= TCPOLEN_EXP_ACCECN_BASE)) {
+		opts->ecn_bytes = synack_ecn_bytes;
+		remaining -= tcp_options_fit_accecn(opts, 0, remaining,
+						    tcp_synack_options_combine_saving(opts));
+	}
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -887,6 +985,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		}
 	}
 
+	if (tcp_ecn_mode_accecn(tp) &&
+	    !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT)) {
+		opts->ecn_bytes = tp->received_ecn_bytes;
+		size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
+					       MAX_TCP_OPTION_SPACE - size,
+					       opts->num_sack_blocks > 0 ? 2 : 0);
+	}
+
 	return size;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 20/28] tcp: AccECN option send control
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (18 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 19/28] tcp: AccECN option Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 21/28] tcp: AccECN option beacon Ilpo Järvinen
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- First data (only an approximation to avoid extra state)
- "Change-triggered ACK" + the ACK following it. The
  2nd ACK is necessary to unambiguously indicate which
  of the ECN byte counters in increasing. The first
  ACK has two counters increasing due to the ecnfield
  edge.
- ACKs with CE to allow CEP delta calculations to take
  advantage of the option
- Force option to be sent every at least once per 2^22
  bytes. The check is done using bytes_received to avoid
  adding another var into tcp_sock and may demand option
  even if one was recently sent but that is not a big
  deal.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h      |  2 ++
 include/net/tcp.h        |  1 +
 net/ipv4/tcp.c           |  1 +
 net/ipv4/tcp_input.c     | 27 +++++++++++++++++++++++++++
 net/ipv4/tcp_minisocks.c |  2 ++
 net/ipv4/tcp_output.c    |  7 +++++--
 6 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fd232bb7fae9..b3cf33af3eb0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -327,6 +327,8 @@ struct tcp_sock {
 	u32	received_ce_tx; /* Like the above but max transmitted value */
 	u32	received_ecn_bytes[3];
 	u8	accecn_minlen:2,/* Minimum length of AccECN option sent */
+		prev_ecnfield:2,/* ECN bits from the previous segment */
+		accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
 		estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
 	u32	lost;		/* Total data packets lost incl. rexmits */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 54471c2dedd5..4367e21b4521 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -905,6 +905,7 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
 	__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
 	__tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
 	tp->accecn_minlen = 0;
+	tp->accecn_opt_demand = 0;
 	tp->estimate_ecnfield = 0;
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 624dff543301..a966cfc0214e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2626,6 +2626,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->delivered_ce = 0;
 	tp->ecn_fail = 0;
 	tcp_accecn_init_counters(tp);
+	tp->prev_ecnfield = 0;
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
 	tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d34b50f73652..504309a73de2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -372,6 +372,7 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
 	default:
 		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
 		tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
+		tp->accecn_opt_demand = 2;
 		if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
 		    INET_ECN_is_ce(ip_dsfield))
 			tp->received_ce++;
@@ -388,6 +389,7 @@ static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
 			tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
 		} else {
 			tp->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+			tp->prev_ecnfield = tp->syn_ect_rcv;
 			tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
 		}
 	}
@@ -5679,6 +5681,7 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
 	u8 is_ce = INET_ECN_is_ce(ecnfield);
+	u8 ecn_edge = tp->prev_ecnfield != ecnfield;
 
 	if (!INET_ECN_is_not_ect(ecnfield)) {
 		tp->ecn_flags |= TCP_ECN_SEEN;
@@ -5688,8 +5691,31 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
 
 		if (payload_len > 0) {
 			u8 minlen = tcp_ecn_field_to_accecn_len(ecnfield);
+			u32 oldbytes = tp->received_ecn_bytes[ecnfield - 1];
+
 			tp->received_ecn_bytes[ecnfield - 1] += payload_len;
 			tp->accecn_minlen = max_t(u8, tp->accecn_minlen, minlen);
+
+			/* Demand AccECN option at least every 2^22 bytes to
+			 * avoid overflowing the ECN byte counters.
+			 */
+			if ((tp->received_ecn_bytes[ecnfield - 1] ^ oldbytes) &
+			    ~((1 << 22) - 1))
+				tp->accecn_opt_demand = max_t(u8, 1,
+							      tp->accecn_opt_demand);
+		}
+	}
+
+	if (ecn_edge || is_ce) {
+		tp->prev_ecnfield = ecnfield;
+		/* Demand Accurate ECN change-triggered ACKs. Two ACK are
+		 * demanded to indicate unambiguously the ecnfield value
+		 * in the latter ACK.
+		 */
+		if (tcp_ecn_mode_accecn(tp)) {
+			if (ecn_edge)
+				inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+			tp->accecn_opt_demand = 2;
 		}
 	}
 }
@@ -5810,6 +5836,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	if (th->syn) {
 		if (tcp_ecn_mode_accecn(tp)) {
 			send_accecn_reflector = true;
+			tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand);
 		}
 syn_challenge:
 		if (syn_inerr)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 668edd00e377..2e532758a34a 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -437,6 +437,8 @@ static void tcp_ecn_openreq_child(struct sock *sk,
 		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
 		tp->syn_ect_snt = treq->syn_ect_snt;
 		tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+		tp->prev_ecnfield = treq->syn_ect_rcv;
+		tp->accecn_opt_demand = 1;
 		tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
 	} else {
 		tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7bce1a73ac8f..118d5c73bcb9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -586,8 +586,11 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 				leftover_size = 1;
 			}
 		}
-		if (tp != NULL)
+		if (tp != NULL) {
 			tp->accecn_minlen = 0;
+			if (tp->accecn_opt_demand)
+				tp->accecn_opt_demand--;
+		}
 	}
 	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 		*ptr++ = htonl((leftover_bytes << 16) |
@@ -985,7 +988,7 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		}
 	}
 
-	if (tcp_ecn_mode_accecn(tp) &&
+	if (tcp_ecn_mode_accecn(tp) && tp->accecn_opt_demand &&
 	    !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT)) {
 		opts->ecn_bytes = tp->received_ecn_bytes;
 		size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 21/28] tcp: AccECN option beacon
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (19 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 20/28] tcp: AccECN option send control Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 22/28] tcp: AccECN option order bit & failure handling Ilpo Järvinen
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

AccECN requires option to be sent a few times per RTT even
if nothing in the ECN state requires it.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h   |  1 +
 include/net/tcp.h     |  1 +
 net/ipv4/tcp.c        |  1 +
 net/ipv4/tcp_output.c | 16 +++++++++++-----
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b3cf33af3eb0..c381aea5c764 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -330,6 +330,7 @@ struct tcp_sock {
 		prev_ecnfield:2,/* ECN bits from the previous segment */
 		accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
 		estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
+	u64	accecn_opt_tstamp;	/* Last AccECN option sent timestamp */
 	u32	lost;		/* Total data packets lost incl. rexmits */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
 	u64	first_tx_mstamp;  /* start of window send phase */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4367e21b4521..52567d8fca33 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -224,6 +224,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCP_ACCECN_MAXSIZE		(TCPOLEN_EXP_ACCECN_BASE + \
 					 TCPOLEN_ACCECN_PERCOUNTER * \
 					 TCP_ACCECN_NUMCOUNTERS)
+#define TCP_ACCECN_BEACON_FREQ_SHIFT	2 /* Send option at least 2^2 times per RTT */
 
 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF		1	/* Nagle's algo is disabled */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a966cfc0214e..cfbdc1468342 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2627,6 +2627,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->ecn_fail = 0;
 	tcp_accecn_init_counters(tp);
 	tp->prev_ecnfield = 0;
+	tp->accecn_opt_tstamp = 0;
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
 	tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 118d5c73bcb9..f070128b69e6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -588,6 +588,7 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		}
 		if (tp != NULL) {
 			tp->accecn_minlen = 0;
+			tp->accecn_opt_tstamp = tp->tcp_mstamp;
 			if (tp->accecn_opt_demand)
 				tp->accecn_opt_demand--;
 		}
@@ -988,12 +989,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 		}
 	}
 
-	if (tcp_ecn_mode_accecn(tp) && tp->accecn_opt_demand &&
+	if (tcp_ecn_mode_accecn(tp) &&
 	    !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT)) {
-		opts->ecn_bytes = tp->received_ecn_bytes;
-		size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
-					       MAX_TCP_OPTION_SPACE - size,
-					       opts->num_sack_blocks > 0 ? 2 : 0);
+		if (tp->accecn_opt_demand ||
+		    (tcp_stamp_us_delta(tp->tcp_mstamp, tp->accecn_opt_tstamp) >=
+		     (tp->srtt_us >> (3 + TCP_ACCECN_BEACON_FREQ_SHIFT)))) {
+			opts->ecn_bytes = tp->received_ecn_bytes;
+			size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
+						       MAX_TCP_OPTION_SPACE - size,
+						       opts->num_sack_blocks > 0 ?
+						       2 : 0);
+		}
 	}
 
 	return size;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 22/28] tcp: AccECN option order bit & failure handling
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (20 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 21/28] tcp: AccECN option beacon Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 23/28] tcp: AccECN option ceb/cep heuristic Ilpo Järvinen
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

AccECN option has two possible field orders. Collect the
order bit from first AccECN option that has enough length
to contain it.

AccECN option may fail in various way, handle these:
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
	- If an option arrives later, re-enabled
- If option is zeroed, disable AccECN option processing

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h      |  2 ++
 include/net/tcp.h        | 10 +++++++++
 net/ipv4/tcp.c           |  1 +
 net/ipv4/tcp_input.c     | 46 ++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp_minisocks.c | 32 ++++++++++++++++++++++++++++
 net/ipv4/tcp_output.c    |  4 +++-
 6 files changed, 88 insertions(+), 7 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c381aea5c764..64db51e5d45e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -151,6 +151,7 @@ struct tcp_request_sock {
 	bool				tfo_listener;
 	bool				is_mptcp;
 	u8				accecn_ok  : 1,
+					saw_accecn_opt : 3,
 					syn_ect_snt: 2,
 					syn_ect_rcv: 2;
 	u32				txhash;
@@ -252,6 +253,7 @@ struct tcp_sock {
 	u8	compressed_ack;
 	u8	syn_ect_snt:2,	/* AccECN ECT memory, only */
 		syn_ect_rcv:2,	/* ... needed durign 3WHS + first seqno */
+		saw_accecn_opt:3,    /* A valid AccECN option was seen */
 		ecn_fail:1;	/* ECN reflector detected path mangling */
 	u32	chrono_start;	/* Start time in jiffies of a TCP chrono */
 	u32	chrono_stat[3];	/* Time in jiffies for chrono_stat stats */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 52567d8fca33..a29109fa2ce2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -226,6 +226,14 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 					 TCP_ACCECN_NUMCOUNTERS)
 #define TCP_ACCECN_BEACON_FREQ_SHIFT	2 /* Send option at least 2^2 times per RTT */
 
+/* tp->saw_accecn_opt states, empty seen & orderbit are overloaded */
+#define TCP_ACCECN_OPT_EMPTY_SEEN	0x1
+#define TCP_ACCECN_OPT_ORDERBIT		0x1
+#define TCP_ACCECN_OPT_COUNTER_SEEN	0x2
+#define TCP_ACCECN_OPT_SEEN		(TCP_ACCECN_OPT_COUNTER_SEEN | \
+					 TCP_ACCECN_OPT_EMPTY_SEEN)
+#define TCP_ACCECN_OPT_FAIL		0x4
+
 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF		1	/* Nagle's algo is disabled */
 #define TCP_NAGLE_CORK		2	/* Socket is corked	    */
@@ -443,6 +451,7 @@ static inline u32 tcp_accecn_ace_deficit(const struct tcp_sock *tp)
 bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
 void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
 			  u8 syn_ect_snt);
+u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset);
 void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
 			       u32 payload_len);
 
@@ -885,6 +894,7 @@ static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
  */
 #define TCP_ACCECN_CEP_INIT_OFFSET 5
 #define TCP_ACCECN_E1B_INIT_OFFSET 0
+#define TCP_ACCECN_E1B_FIRST_INIT_OFFSET 0x800001
 #define TCP_ACCECN_E0B_INIT_OFFSET 1
 #define TCP_ACCECN_CEB_INIT_OFFSET 0
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cfbdc1468342..09f73f81e6fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2624,6 +2624,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->window_clamp = 0;
 	tp->delivered = 0;
 	tp->delivered_ce = 0;
+	tp->saw_accecn_opt = 0;
 	tp->ecn_fail = 0;
 	tcp_accecn_init_counters(tp);
 	tp->prev_ecnfield = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 504309a73de2..826dfd5bf114 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -352,7 +352,8 @@ bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect)
 }
 
 /* See Table 2 of the AccECN draft */
-static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
+static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
+			       const struct tcphdr *th,
 			       u8 ip_dsfield)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -372,7 +373,12 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
 	default:
 		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
 		tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
-		tp->accecn_opt_demand = 2;
+		if (tp->rx_opt.accecn >= 0 &&
+		    tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+			tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+								    tp->rx_opt.accecn);
+			tp->accecn_opt_demand = 2;
+		}
 		if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
 		    INET_ECN_is_ce(ip_dsfield))
 			tp->received_ce++;
@@ -436,7 +442,19 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp,
 	bool first_changed = false;
 	bool res;
 
+	if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL)
+		return false;
+
 	if (tp->rx_opt.accecn < 0) {
+		if (!tp->saw_accecn_opt) {
+			/* Too late to enable after this point due to
+			 * potential counter wraps
+			 */
+			if (tp->bytes_sent >= (1 << 23) - 1)
+				tp->saw_accecn_opt = TCP_ACCECN_OPT_FAIL;
+			return false;
+		}
+
 		if (tp->estimate_ecnfield) {
 			tp->delivered_ecn_bytes[tp->estimate_ecnfield - 1] +=
 				delivered_bytes;
@@ -453,11 +471,20 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp,
 	}
 	ptr += 2;
 
+	if (tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN)
+		tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+							    tp->rx_opt.accecn);
+
 	res = !!tp->estimate_ecnfield;
 	for (i = 0; i < 3; i++) {
 		if (optlen >= TCPOLEN_ACCECN_PERCOUNTER) {
-			u8 ecnfield = accecn_opt_ecnfield[i];
-			u32 init_offset = i ? 0 : TCP_ACCECN_E0B_INIT_OFFSET;
+			u8 orderbit = tp->saw_accecn_opt & TCP_ACCECN_OPT_ORDERBIT;
+			int idx = orderbit ? i : 2 - i;
+			u8 ecnfield = accecn_opt_ecnfield[idx];
+			u32 init_offset = i ? 0 :
+					      !orderbit ?
+					      TCP_ACCECN_E0B_INIT_OFFSET :
+					      TCP_ACCECN_E1B_FIRST_INIT_OFFSET;
 			s32 delta;
 
 			delta = tcp_update_ecn_bytes(&(tp->delivered_ecn_bytes[ecnfield - 1]),
@@ -4188,6 +4215,7 @@ void tcp_parse_options(const struct net *net,
 				    get_unaligned_be16(ptr) ==
 				    TCPOPT_ACCECN_MAGIC)
 					opt_rx->accecn = (ptr - 2) - (unsigned char *)th;
+
 				/* Fast Open option shares code 254 using a
 				 * 16 bits magic number.
 				 */
@@ -5836,7 +5864,12 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	if (th->syn) {
 		if (tcp_ecn_mode_accecn(tp)) {
 			send_accecn_reflector = true;
-			tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand);
+			if (tp->rx_opt.accecn >= 0 &&
+			    tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+				tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+									    tp->rx_opt.accecn);
+				tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand);
+			}
 		}
 syn_challenge:
 		if (syn_inerr)
@@ -6279,7 +6312,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 */
 
 		if (tcp_ecn_mode_any(tp))
-			tcp_ecn_rcv_synack(sk, th, TCP_SKB_CB(skb)->ip_dsfield);
+			tcp_ecn_rcv_synack(sk, skb, th, TCP_SKB_CB(skb)->ip_dsfield);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 		tcp_try_undo_spurious_syn(sk);
@@ -6812,6 +6845,7 @@ static void tcp_openreq_init(struct request_sock *req,
 	tcp_rsk(req)->snt_synack = 0;
 	tcp_rsk(req)->last_oow_ack_time = 0;
 	tcp_rsk(req)->accecn_ok = 0;
+	tcp_rsk(req)->saw_accecn_opt = 0;
 	tcp_rsk(req)->syn_ect_rcv = 0;
 	tcp_rsk(req)->syn_ect_snt = 0;
 	req->mss = rx_opt->mss_clamp;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 2e532758a34a..eda3d0c3af32 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -97,6 +97,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	bool paws_reject = false;
 
 	tmp_opt.saw_tstamp = 0;
+	tmp_opt.accecn = -1;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
 		tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL);
 
@@ -437,6 +438,7 @@ static void tcp_ecn_openreq_child(struct sock *sk,
 		tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
 		tp->syn_ect_snt = treq->syn_ect_snt;
 		tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+		tp->saw_accecn_opt = treq->saw_accecn_opt;
 		tp->prev_ecnfield = treq->syn_ect_rcv;
 		tp->accecn_opt_demand = 1;
 		tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
@@ -491,6 +493,32 @@ static void smc_check_reset_syn_req(struct tcp_sock *oldtp,
 #endif
 }
 
+u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset)
+{
+	unsigned char *ptr = skb_transport_header(skb) + opt_offset;
+	unsigned int optlen = ptr[1];
+
+	if (ptr[0] == TCPOPT_EXP) {
+		optlen -= 2;
+		ptr += 2;
+	}
+	ptr += 2;
+
+	if (optlen >= TCPOLEN_ACCECN_PERCOUNTER) {
+		u32 first_field = get_unaligned_be32(ptr - 1) & 0xFFFFFFU;
+		u8 orderbit = first_field >> 23;
+		/* Detect option zeroing. Check the first byte counter value,
+		 * if present, it must be != 0.
+		 */
+		if (!first_field)
+			return TCP_ACCECN_OPT_FAIL;
+
+		return TCP_ACCECN_OPT_COUNTER_SEEN + orderbit;
+	}
+
+	return TCP_ACCECN_OPT_EMPTY_SEEN;
+}
+
 /* This is not only more efficient than what we used to do, it eliminates
  * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
  *
@@ -793,6 +821,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	if (!(flg & TCP_FLAG_ACK))
 		return NULL;
 
+	if (tcp_rsk(req)->accecn_ok && tmp_opt.accecn >= 0 &&
+	    tcp_rsk(req)->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN)
+		tcp_rsk(req)->saw_accecn_opt = tcp_accecn_option_init(skb, tmp_opt.accecn);
+
 	/* For Fast Open no more processing is needed (sk is the
 	 * child socket).
 	 */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f070128b69e6..4cc590a47f43 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -841,6 +841,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	/* Simultaneous open SYN/ACK needs AccECN option but not SYN */
 	if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) &&
 		     tcp_ecn_mode_accecn(tp) &&
+		     inet_csk(sk)->icsk_retransmits < 2 &&
 		     !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) &&
 		     (remaining >= TCPOLEN_EXP_ACCECN_BASE))) {
 		opts->ecn_bytes = synack_ecn_bytes;
@@ -914,7 +915,7 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
-	if (treq->accecn_ok &&
+	if (treq->accecn_ok && req->num_timeout < 1 &&
 	    !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT) &&
 	    (remaining >= TCPOLEN_EXP_ACCECN_BASE)) {
 		opts->ecn_bytes = synack_ecn_bytes;
@@ -990,6 +991,7 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 	}
 
 	if (tcp_ecn_mode_accecn(tp) &&
+	    (tp->saw_accecn_opt & TCP_ACCECN_OPT_SEEN) &&
 	    !(sock_net(sk)->ipv4.sysctl_tcp_ecn & TCP_ACCECN_NO_OPT)) {
 		if (tp->accecn_opt_demand ||
 		    (tcp_stamp_us_delta(tp->tcp_mstamp, tp->accecn_opt_tstamp) >=
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 23/28] tcp: AccECN option ceb/cep heuristic
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (21 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 22/28] tcp: AccECN option order bit & failure handling Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK Ilpo Järvinen
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

The heuristic algorithm from draft-09 Appendix A.2.2.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/net/tcp.h    |  1 +
 net/ipv4/tcp_input.c | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a29109fa2ce2..54a640fed673 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -225,6 +225,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 					 TCPOLEN_ACCECN_PERCOUNTER * \
 					 TCP_ACCECN_NUMCOUNTERS)
 #define TCP_ACCECN_BEACON_FREQ_SHIFT	2 /* Send option at least 2^2 times per RTT */
+#define TCP_ACCECN_SAFETY_SHIFT	1	/* SAFETY_FACTOR in accecn draft */
 
 /* tp->saw_accecn_opt states, empty seen & orderbit are overloaded */
 #define TCP_ACCECN_OPT_EMPTY_SEEN	0x1
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 826dfd5bf114..6bc9995202c8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -518,14 +518,16 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp,
 static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
 			      u32 delivered_pkts, u32 delivered_bytes, int flag)
 {
-	u32 delta, safe_delta;
+	u32 delta, safe_delta, d_ceb;
 	u32 corrected_ace;
+	u32 old_ceb = tp->delivered_ecn_bytes[INET_ECN_CE - 1];
+	bool opt_deltas_valid;
 
 	/* Reordered ACK? (...or uncertain due to lack of data to send and ts) */
 	if (!(flag & (FLAG_FORWARD_PROGRESS|FLAG_TS_PROGRESS)))
 		return 0;
 
-	tcp_accecn_process_option(tp, skb, delivered_bytes);
+	opt_deltas_valid = tcp_accecn_process_option(tp, skb, delivered_bytes);
 
 	if (!(flag & FLAG_SLOWPATH)) {
 		/* AccECN counter might overflow on large ACKs */
@@ -545,6 +547,16 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
 	safe_delta = delivered_pkts -
 		     ((delivered_pkts - delta) & TCP_ACCECN_CEP_ACE_MASK);
 
+	if (opt_deltas_valid) {
+		d_ceb = tp->delivered_ecn_bytes[INET_ECN_CE - 1] - old_ceb;
+		if (!d_ceb)
+			return delta;
+		if (d_ceb > delta * tp->mss_cache)
+			return safe_delta;
+		if (d_ceb < safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT)
+			return delta;
+	}
+
 	return safe_delta;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (22 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 23/28] tcp: AccECN option ceb/cep heuristic Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-19  3:29   ` Eric Dumazet
  2020-03-18  9:43 ` [RFC PATCH 25/28] tcp: try to avoid safer when ACKs are thinned Ilpo Järvinen
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

As SACK blocks tend to eat all option space when there are
many holes, it is useful to compromise on sending many SACK
blocks in every ACK and try to fit AccECN option there
by reduction the number of SACK blocks. But never go below
two SACK blocks because of AccECN option.

As AccECN option is often not put to every ACK, the space
hijack is usually only temporary.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_output.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4cc590a47f43..0aec2c57a9cc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -756,6 +756,21 @@ static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
 	if (opts->num_ecn_bytes < required)
 		return 0;
 
+	if (opts->num_ecn_bytes < required) {
+		if (opts->num_sack_blocks > 2) {
+			/* Try to fit the option by removing one SACK block */
+			opts->num_sack_blocks--;
+			size = tcp_options_fit_accecn(opts, required,
+						      remaining + TCPOLEN_SACK_PERBLOCK,
+						      max_combine_saving);
+			if (opts->options & OPTION_ACCECN)
+				return size - TCPOLEN_SACK_PERBLOCK;
+
+			opts->num_sack_blocks++;
+		}
+		return 0;
+	}
+
 	opts->options |= OPTION_ACCECN;
 	return size;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 25/28] tcp: try to avoid safer when ACKs are thinned
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (23 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 26/28] tcp: to prevent runaway AccECN cep/ACE deficit, limit GSO size Ilpo Järvinen
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Add newly acked pkts EWMA. When ACK thinning occurs, select
between safer and unsafe cep delta based on it.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 include/linux/tcp.h  |  1 +
 net/ipv4/tcp.c       |  1 +
 net/ipv4/tcp_input.c | 20 +++++++++++++++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 64db51e5d45e..22be7cf2e084 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -332,6 +332,7 @@ struct tcp_sock {
 		prev_ecnfield:2,/* ECN bits from the previous segment */
 		accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
 		estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
+	u16	pkts_acked_ewma;/* EWMA of packets acked for AccECN cep heuristic */
 	u64	accecn_opt_tstamp;	/* Last AccECN option sent timestamp */
 	u32	lost;		/* Total data packets lost incl. rexmits */
 	u32	app_limited;	/* limited until "delivered" reaches this val */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 09f73f81e6fa..4a22c19fa6d5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2629,6 +2629,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tcp_accecn_init_counters(tp);
 	tp->prev_ecnfield = 0;
 	tp->accecn_opt_tstamp = 0;
+	tp->pkts_acked_ewma = 0;
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
 	tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6bc9995202c8..f5476e6b1479 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -514,6 +514,10 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp,
 	return res;
 }
 
+#define PKTS_ACKED_WEIGHT	6
+#define PKTS_ACKED_PREC		6
+#define ACK_COMP_THRESH		4
+
 /* Returns the ECN CE delta */
 static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
 			      u32 delivered_pkts, u32 delivered_bytes, int flag)
@@ -529,6 +533,19 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
 
 	opt_deltas_valid = tcp_accecn_process_option(tp, skb, delivered_bytes);
 
+	if (delivered_pkts) {
+		if (!tp->pkts_acked_ewma) {
+			tp->pkts_acked_ewma = delivered_pkts << PKTS_ACKED_PREC;
+		} else {
+			u32 ewma = tp->pkts_acked_ewma;
+
+			ewma = (((ewma << PKTS_ACKED_WEIGHT) - ewma) +
+				(delivered_pkts << PKTS_ACKED_PREC)) >>
+			       PKTS_ACKED_WEIGHT;
+			tp->pkts_acked_ewma = min_t(u32, ewma, 0xFFFFU);
+		}
+	}
+
 	if (!(flag & FLAG_SLOWPATH)) {
 		/* AccECN counter might overflow on large ACKs */
 		if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
@@ -555,7 +572,8 @@ static u32 tcp_accecn_process(struct tcp_sock *tp, const struct sk_buff *skb,
 			return safe_delta;
 		if (d_ceb < safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT)
 			return delta;
-	}
+	} else if (tp->pkts_acked_ewma > (ACK_COMP_THRESH << PKTS_ACKED_PREC))
+		return delta;
 
 	return safe_delta;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 26/28] tcp: to prevent runaway AccECN cep/ACE deficit, limit GSO size
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (24 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 25/28] tcp: try to avoid safer when ACKs are thinned Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 27/28] gro: flushing when CWR is set negatively affects AccECN Ilpo Järvinen
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

It could occur that GSO sends segments in so large blocks
that ACE deficit keeps growing because ACE field can only
update in each super skb.

Put some limit into sending large super skbs in case the ACE
deficit is there and could go on indefinitely. Once the bool
becomes false, it's no longer necessary to recheck it during
further sending.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_output.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0aec2c57a9cc..4de6510532f2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2124,6 +2124,23 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
 	return !after(end_seq, tcp_wnd_end(tp));
 }
 
+/* Runaway ACE deficit possible? */
+static bool tcp_accecn_deficit_runaway_test(const struct tcp_sock *tp,
+					    int cwnd_quota)
+{
+	return (tcp_accecn_ace_deficit(tp) >= 2 * TCP_ACCECN_ACE_MAX_DELTA) &&
+	       (cwnd_quota > TCP_ACCECN_ACE_MAX_DELTA - 1);
+}
+
+static u32 tcp_accecn_gso_limit(struct tcp_sock *tp,
+				const struct sk_buff *skb, int cwnd_quota)
+{
+	if (unlikely(tcp_accecn_deficit_runaway_test(tp, cwnd_quota)))
+		return TCP_ACCECN_ACE_MAX_DELTA - 1;
+
+	return 0;
+}
+
 /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
  * which is put after SKB on the list.  It is very much like
  * tcp_fragment() except that it may make several kinds of assumptions
@@ -2623,6 +2640,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false, is_rwnd_limited = false;
+	/* AccECN limit will be lifted below if not needed */
+	bool accecn_gso_limit = tcp_ecn_mode_accecn(tp);
 	u32 max_segs;
 
 	sent_pkts = 0;
@@ -2676,7 +2695,16 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 						      nonagle : TCP_NAGLE_PUSH))))
 				break;
 		} else {
-			if (!push_one &&
+			if (accecn_gso_limit) {
+				u32 limit = tcp_accecn_gso_limit(tp, skb,
+								 cwnd_quota);
+				if (limit > 0)
+					cwnd_quota = limit;
+				else
+					accecn_gso_limit = false;
+			}
+
+			if (!push_one && !accecn_gso_limit &&
 			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
 						 &is_rwnd_limited, max_segs))
 				break;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 27/28] gro: flushing when CWR is set negatively affects AccECN
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (25 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 26/28] tcp: to prevent runaway AccECN cep/ACE deficit, limit GSO size Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-18  9:43 ` [RFC PATCH 28/28] tcp: AccECN sysctl documentation Ilpo Järvinen
  2020-03-18 23:29 ` [RFC PATCH 00/28]: Accurate ECN for TCP David Miller
  28 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

As AccECN may keep CWR bit asserted due to different
interpretation of the bit, flushing with GRO because of
CWR may effectively disable GRO until AccECN counter
field changes such that CWR-bit becomes 0.

There is no harm done from not immediately forwarding the
CWR'ed segment with RFC3168 ECN.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_offload.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 58ce382c793e..555c9be84f10 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -240,7 +240,6 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 found:
 	/* Include the IP ID check below from the inner most IP hdr */
 	flush = NAPI_GRO_CB(p)->flush;
-	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (26 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 27/28] gro: flushing when CWR is set negatively affects AccECN Ilpo Järvinen
@ 2020-03-18  9:43 ` Ilpo Järvinen
  2020-03-19 23:02   ` Dave Taht
  2020-03-18 23:29 ` [RFC PATCH 00/28]: Accurate ECN for TCP David Miller
  28 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:43 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 Documentation/networking/ip-sysctl.txt | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5f53faff4e25..ecca6e1d6bea 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -301,15 +301,21 @@ tcp_ecn - INTEGER
 		0 Disable ECN.  Neither initiate nor accept ECN.
 		1 Enable ECN when requested by incoming connections and
 		  also request ECN on outgoing connection attempts.
-		2 Enable ECN when requested by incoming connections
+		2 Enable ECN or AccECN when requested by incoming connections
 		  but do not request ECN on outgoing connections.
+		3 Enable AccECN when requested by incoming connections and
+		  also request AccECN on outgoing connection attempts.
+	    0x102 Enable AccECN in optionless mode for incoming connections.
+	    0x103 Enable AccECN in optionless mode for incoming and outgoing
+		  connections.
 	Default: 2
 
 tcp_ecn_fallback - BOOLEAN
 	If the kernel detects that ECN connection misbehaves, enable fall
 	back to non-ECN. Currently, this knob implements the fallback
-	from RFC3168, section 6.1.1.1., but we reserve that in future,
-	additional detection mechanisms could be implemented under this
+	from RFC3168, section 6.1.1.1., as well as the ECT codepoint mangling
+	detection during the Accurate ECN handshake, but we reserve that in
+	future, additional detection mechanisms could be implemented under this
 	knob. The value	is not used, if tcp_ecn or per route (or congestion
 	control) ECN settings are disabled.
 	Default: 1 (fallback enabled)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/28]: Accurate ECN for TCP
  2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
                   ` (27 preceding siblings ...)
  2020-03-18  9:43 ` [RFC PATCH 28/28] tcp: AccECN sysctl documentation Ilpo Järvinen
@ 2020-03-18 23:29 ` David Miller
  2020-03-19 20:25   ` Ilpo Järvinen
  28 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2020-03-18 23:29 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, ycheng, ncardwell, eric.dumazet, olivier.tilmans

From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Date: Wed, 18 Mar 2020 11:43:04 +0200

> Comments would be highly appreciated.

Two coding style comments which you should audit your entire submission
for:

1) Please order local variables in reverse christmas tree ordering (longest
   to shortest long)

2) Please do not use the inline keyword in foo.c files, let the compiler
   decide.

Thank you.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK
  2020-03-18  9:43 ` [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK Ilpo Järvinen
@ 2020-03-19  3:29   ` Eric Dumazet
  2020-03-19 20:33     ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2020-03-19  3:29 UTC (permalink / raw)
  To: Ilpo Järvinen, netdev
  Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans



On 3/18/20 2:43 AM, Ilpo Järvinen wrote:
> From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> 
> As SACK blocks tend to eat all option space when there are
> many holes, it is useful to compromise on sending many SACK
> blocks in every ACK and try to fit AccECN option there
> by reduction the number of SACK blocks. But never go below
> two SACK blocks because of AccECN option.
> 
> As AccECN option is often not put to every ACK, the space
> hijack is usually only temporary.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> ---
>  net/ipv4/tcp_output.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4cc590a47f43..0aec2c57a9cc 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -756,6 +756,21 @@ static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
>  	if (opts->num_ecn_bytes < required)
>  		return 0;

Have you tested this patch ?

(You forgot to remove the prior 2 lines)

>  
> +	if (opts->num_ecn_bytes < required) {
> +		if (opts->num_sack_blocks > 2) {
> +			/* Try to fit the option by removing one SACK block */
> +			opts->num_sack_blocks--;
> +			size = tcp_options_fit_accecn(opts, required,
> +						      remaining + TCPOLEN_SACK_PERBLOCK,
> +						      max_combine_saving);
> +			if (opts->options & OPTION_ACCECN)
> +				return size - TCPOLEN_SACK_PERBLOCK;
> +
> +			opts->num_sack_blocks++;
> +		}
> +		return 0;
> +	}
> +
>  	opts->options |= OPTION_ACCECN;
>  	return size;
>  }
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 15/28] tcp: add AccECN rx byte counters
  2020-03-18  9:43 ` [RFC PATCH 15/28] tcp: add AccECN rx byte counters Ilpo Järvinen
@ 2020-03-19  3:36   ` Eric Dumazet
  0 siblings, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2020-03-19  3:36 UTC (permalink / raw)
  To: Ilpo Järvinen, netdev
  Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans



On 3/18/20 2:43 AM, Ilpo Järvinen wrote:
> From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> ---
>  include/linux/tcp.h      |  1 +
>  include/net/tcp.h        | 18 +++++++++++++++++-
>  net/ipv4/tcp_input.c     | 13 +++++++++----
>  net/ipv4/tcp_minisocks.c |  3 ++-
>  4 files changed, 29 insertions(+), 6 deletions(-)
>


Why do we need these counters ?

Please add sensible changelogs to _all_ the patches.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 09/28] gso: AccECN support
  2020-03-18  9:43 ` [RFC PATCH 09/28] gso: AccECN support Ilpo Järvinen
@ 2020-03-19  3:44   ` Eric Dumazet
  2020-03-19 22:36     ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2020-03-19  3:44 UTC (permalink / raw)
  To: Ilpo Järvinen, netdev
  Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans



On 3/18/20 2:43 AM, Ilpo Järvinen wrote:
> From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> 
> Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> Take it into account in GSO by not clearing the CWR bit.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
>

Does it means TCP segmentation offload is disabled on all the NIC
around ?

Why tun driver is changed and not others ?

I believe you need to give much more details in changelog in general,
because many changes are not that obvious.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/28]: Accurate ECN for TCP
  2020-03-18 23:29 ` [RFC PATCH 00/28]: Accurate ECN for TCP David Miller
@ 2020-03-19 20:25   ` Ilpo Järvinen
  2020-03-20  4:38     ` David Miller
  0 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-19 20:25 UTC (permalink / raw)
  To: David Miller
  Cc: Netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

[-- Attachment #1: Type: text/plain, Size: 735 bytes --]

On Wed, 18 Mar 2020, David Miller wrote:

> From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Date: Wed, 18 Mar 2020 11:43:04 +0200
> 
> > Comments would be highly appreciated.
> 
> Two coding style comments which you should audit your entire submission
> for:
> 
> 1) Please order local variables in reverse christmas tree ordering (longest
>    to shortest long)

Does this apply also to the usual struct tcp_sock *tp = tcp_sk(sk); line
or can it be put first if there are some dependencies on it?

> 2) Please do not use the inline keyword in foo.c files, let the compiler
>    decide.

Thanks. I'll do those (I certainly removed some other bits I moved from
header to .c but missed a few it seems). 

-- 
 i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK
  2020-03-19  3:29   ` Eric Dumazet
@ 2020-03-19 20:33     ` Ilpo Järvinen
  0 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-19 20:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netdev, Yuchung Cheng, Neal Cardwell, Olivier Tilmans

[-- Attachment #1: Type: text/plain, Size: 1639 bytes --]

On Wed, 18 Mar 2020, Eric Dumazet wrote:

> 
> 
> On 3/18/20 2:43 AM, Ilpo Järvinen wrote:
> > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > 
> > As SACK blocks tend to eat all option space when there are
> > many holes, it is useful to compromise on sending many SACK
> > blocks in every ACK and try to fit AccECN option there
> > by reduction the number of SACK blocks. But never go below
> > two SACK blocks because of AccECN option.
> > 
> > As AccECN option is often not put to every ACK, the space
> > hijack is usually only temporary.
> > 
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > ---
> >  net/ipv4/tcp_output.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 4cc590a47f43..0aec2c57a9cc 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -756,6 +756,21 @@ static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
> >  	if (opts->num_ecn_bytes < required)
> >  		return 0;
> 
> Have you tested this patch ?
> 
> (You forgot to remove the prior 2 lines)
> 
> >  
> > +	if (opts->num_ecn_bytes < required) {

Yes and no. There was no unit test for this particular condition but
I added a few now (with and w/o timestamps). I also managed to find and 
fix a byte-order related bug related to non-fullsized option while making 
those tests.

(I didn't actually forget to remove it. I managed to add the problem 
during a botched conflict merge when I reorganized some of the code.)

Thanks for taking a look.

-- 
 i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 09/28] gso: AccECN support
  2020-03-19  3:44   ` Eric Dumazet
@ 2020-03-19 22:36     ` Ilpo Järvinen
  0 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-19 22:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netdev, Yuchung Cheng, Neal Cardwell, Olivier Tilmans

[-- Attachment #1: Type: text/plain, Size: 2672 bytes --]

On Wed, 18 Mar 2020, Eric Dumazet wrote:
> On 3/18/20 2:43 AM, Ilpo Järvinen wrote:
> > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > 
> > Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> > Take it into account in GSO by not clearing the CWR bit.
> > 
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> >
> 
> Does it means TCP segmentation offload is disabled on all the NIC
> around ?

On general level, yes. HW TSO should be disabled with AccECN (or when CWR 
flag is set for an incoming packet). The reason is how CWR flag is handled 
by RFC 3168 ECN-aware TSO. It splits the segment such that CWR is cleared 
starting from the 2nd segment which is incompatible how AccECN handles the 
CWR flag. With AccECN, CWR flag (or more accurately, the ACE field that 
also includes ECE & AE flags) changes only when new packet(s) with CE mark 
arrives so the flag should not be changed within a super-skb. The new
feature flag is necessary to prevent such TSO engines happily clearing 
CWRs (if the CWR handling feature cannot be turned off).

If NIC is completely unaware of RFC3168 ECN (doesn't support 
NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag
despite supporting also NETIF_F_TSO_ECN, TSO could be safely used with 
AccECN on such NIC. I've little expertise on TSO HW so I don't know if 
some/what NICs can do it. Nonetheless, there's nothing fundamental 
preventing TSO being enabled with AccECN by NICs (if either of those 
conditions is met).

In the cases, where TSO would fail to keep its hands off CWR flag, it
should fallback to GSO which has always on AccECN support (similar to 
always on ECN support). I think that the current GSO changes are capable 
of handling AccECN skbs. For the other parts of the idea above I'm not 
so sure. There is this NETIF_F_GSO_SOFTWARE with comment that seems to 
indicate it's doing what I want but I'm not fully sure if adding a flag 
there is enough to achieve the desired effect?

On the rx side, supporting both RFC3168 and AccECN style CWR handling
is slightly more complicated (and possibly not worth the effort given
CWRs should be relatively rare with RFC3168-style ECN).

> Why tun driver is changed and not others ?

I think I didn't really understand why tun.c plays with the TSO_ECN flag 
until now (after finding a related comment from tap.c) and so that change 
now doesn't make much sense for me now any more. So I'll just remove that 
part.

> I believe you need to give much more details in changelog in general,
> because many changes are not that obvious.

I'll try to.

Thanks.

-- 
 i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-18  9:43 ` [RFC PATCH 28/28] tcp: AccECN sysctl documentation Ilpo Järvinen
@ 2020-03-19 23:02   ` Dave Taht
  2020-03-20 22:40     ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Taht @ 2020-03-19 23:02 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Linux Kernel Network Developers, Yuchung Cheng, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
>
> From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> ---
>  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 5f53faff4e25..ecca6e1d6bea 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
>                 0 Disable ECN.  Neither initiate nor accept ECN.
>                 1 Enable ECN when requested by incoming connections and
>                   also request ECN on outgoing connection attempts.
> -               2 Enable ECN when requested by incoming connections
> +               2 Enable ECN or AccECN when requested by incoming connections
>                   but do not request ECN on outgoing connections.

Changing existing user-behavior for this default seems to be overly
optimistic. Useful for testing, but...

> +               3 Enable AccECN when requested by incoming connections and
> +                 also request AccECN on outgoing connection attempts.
> +           0x102 Enable AccECN in optionless mode for incoming connections.
> +           0x103 Enable AccECN in optionless mode for incoming and outgoing
> +                 connections.

In terms of the logic bits here, it might make more sense

0: disable ecn
1: enable std ecn on in or out
2: enable std ecn when requested on in (the default)
3: essentially unused
4: enable accecn when requested on in
5: enable std ecn and accecn on in or out
6: enable accecn and ecn on in but not out

Do we have any data on how often the tcp ns bit is a source of
firewalling problems yet?

0x102 strikes me as a bit more magical than required and I don't know
what optionless means in this context.

>         Default: 2
>
>  tcp_ecn_fallback - BOOLEAN
>         If the kernel detects that ECN connection misbehaves, enable fall
>         back to non-ECN. Currently, this knob implements the fallback
> -       from RFC3168, section 6.1.1.1., but we reserve that in future,
> -       additional detection mechanisms could be implemented under this
> +       from RFC3168, section 6.1.1.1., as well as the ECT codepoint mangling
> +       detection during the Accurate ECN handshake, but we reserve that in
> +       future, additional detection mechanisms could be implemented under this
>         knob. The value is not used, if tcp_ecn or per route (or congestion
>         control) ECN settings are disabled.
>         Default: 1 (fallback enabled)
> --
> 2.20.1
>


-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/28]: Accurate ECN for TCP
  2020-03-19 20:25   ` Ilpo Järvinen
@ 2020-03-20  4:38     ` David Miller
  0 siblings, 0 replies; 45+ messages in thread
From: David Miller @ 2020-03-20  4:38 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, ycheng, ncardwell, eric.dumazet, olivier.tilmans

From: "Ilpo Järvinen" <ilpo.jarvinen@cs.helsinki.fi>
Date: Thu, 19 Mar 2020 22:25:23 +0200 (EET)

> On Wed, 18 Mar 2020, David Miller wrote:
> 
>> From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>> Date: Wed, 18 Mar 2020 11:43:04 +0200
>> 
>> > Comments would be highly appreciated.
>> 
>> Two coding style comments which you should audit your entire submission
>> for:
>> 
>> 1) Please order local variables in reverse christmas tree ordering (longest
>>    to shortest long)
> 
> Does this apply also to the usual struct tcp_sock *tp = tcp_sk(sk); line
> or can it be put first if there are some dependencies on it?

Yes, please.  Put the assignment into the code lines if necessary.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-19 23:02   ` Dave Taht
@ 2020-03-20 22:40     ` Ilpo Järvinen
  2020-03-20 23:22       ` Yuchung Cheng
  0 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-20 22:40 UTC (permalink / raw)
  To: Dave Taht
  Cc: Linux Kernel Network Developers, Yuchung Cheng, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

[-- Attachment #1: Type: text/plain, Size: 4608 bytes --]

On Thu, 19 Mar 2020, Dave Taht wrote:

> On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > ---
> >  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > index 5f53faff4e25..ecca6e1d6bea 100644
> > --- a/Documentation/networking/ip-sysctl.txt
> > +++ b/Documentation/networking/ip-sysctl.txt
> > @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
> >                 0 Disable ECN.  Neither initiate nor accept ECN.
> >                 1 Enable ECN when requested by incoming connections and
> >                   also request ECN on outgoing connection attempts.
> > -               2 Enable ECN when requested by incoming connections
> > +               2 Enable ECN or AccECN when requested by incoming connections
> >                   but do not request ECN on outgoing connections.
> 
> Changing existing user-behavior for this default seems to be overly
> optimistic. Useful for testing, but...

I disagree.

The kernel default on ECN is/has been "do nothing" like forever. Yet,
passively allowing ECN on servers is a low risk operation because nothing
will change before client actively asks for it. However, it was obvious 
that the servers didn't do that. The servers could have set tcp_ecn to 1 
(before 2 was there) which is low risk for _servers_ (unlike for clients) 
but only very very few did. I don't believe servers would now 
intentionally pick 2 when they clearly didn't pick 1 earlier either.

Adding 2 is/was an attempt to side-step the need for both ends to make 
conscious decision by setting the sysctl (which servers didn't want to 
do). That is, 2 gives decision on what to do into the hands of the client 
side which was the true intent of 2 (in case you don't know, I made that 
change).

Allowing the client side to make the decision alone has proven successful 
approach. We now have significant passive RFC3168 ECN server deployment. 
It is wide-spread enough that Apple found it useful enough for their 
client side, experimented with it and worked to fix the issues where they 
discovered something in the network was incompatible with ECN. I don't 
believe it would have happened without leaving the decision into the hands 
of the clients.

Similarly, passively allowing the client to decide to use AccECN is
low risk thing. ...As with RFC 3168 ECN, "do nothing" applies also for 
Accurate ECN here unless the client asks for it.

> > +               3 Enable AccECN when requested by incoming connections and
> > +                 also request AccECN on outgoing connection attempts.
> > +           0x102 Enable AccECN in optionless mode for incoming connections.
> > +           0x103 Enable AccECN in optionless mode for incoming and outgoing
> > +                 connections.
> 
> In terms of the logic bits here, it might make more sense
> 
> 0: disable ecn
> 1: enable std ecn on in or out
> 2: enable std ecn when requested on in (the default)
> 3: essentially unused
> 4: enable accecn when requested on in
> 5: enable std ecn and accecn on in or out
> 6: enable accecn and ecn on in but not out

If "full control" is the way to go, I think it should be made using flags 
instead, along these lines:

1: Enable RFC 3168 ECN in+out
2: Enable RFC 3168 ECN in (default on)
4: Enable Accurate ECN in (default on)
8: Enable Accurate ECN in+out

Note that I intentionally reversed the in and in/out order for 4&8 
(something that couldn't be done with 1&2 to preserve meaning of 1).

I think it's a bit complicated though but if this is what most people 
want, I can of course change it to flags.

> Do we have any data on how often the tcp ns bit is a source of
> firewalling problems yet?
> 
> 0x102 strikes me as a bit more magical than required

To me it compares to some fast open cookie things that are similarly using 
higher order bits in flag like manner.

> and I don't know
> what optionless means in this context.

Do you mean that "optionless" is not a good word to use here? (I'm not
a native speaker but I can imagine it might sound like "futureless"?)
I meant that AccECN operates then w/o sending any AccECN option (rx side 
still processes the options if the peer chooses to send them despite not 
getting any back).

Thanks.

-- 
 i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-20 22:40     ` Ilpo Järvinen
@ 2020-03-20 23:22       ` Yuchung Cheng
  2020-03-23 13:34         ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: Yuchung Cheng @ 2020-03-20 23:22 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Dave Taht, Linux Kernel Network Developers, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

On Fri, Mar 20, 2020 at 3:40 PM Ilpo Järvinen
<ilpo.jarvinen@cs.helsinki.fi> wrote:
>
> On Thu, 19 Mar 2020, Dave Taht wrote:
>
> > On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > >
> > > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > >
> > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > ---
> > >  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
> > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > > index 5f53faff4e25..ecca6e1d6bea 100644
> > > --- a/Documentation/networking/ip-sysctl.txt
> > > +++ b/Documentation/networking/ip-sysctl.txt
> > > @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
> > >                 0 Disable ECN.  Neither initiate nor accept ECN.
> > >                 1 Enable ECN when requested by incoming connections and
> > >                   also request ECN on outgoing connection attempts.
> > > -               2 Enable ECN when requested by incoming connections
> > > +               2 Enable ECN or AccECN when requested by incoming connections
> > >                   but do not request ECN on outgoing connections.
> >
> > Changing existing user-behavior for this default seems to be overly
> > optimistic. Useful for testing, but...
>
> I disagree.
>
> The kernel default on ECN is/has been "do nothing" like forever. Yet,
> passively allowing ECN on servers is a low risk operation because nothing
> will change before client actively asks for it. However, it was obvious
> that the servers didn't do that. The servers could have set tcp_ecn to 1
> (before 2 was there) which is low risk for _servers_ (unlike for clients)
> but only very very few did. I don't believe servers would now
> intentionally pick 2 when they clearly didn't pick 1 earlier either.
>
> Adding 2 is/was an attempt to side-step the need for both ends to make
> conscious decision by setting the sysctl (which servers didn't want to
> do). That is, 2 gives decision on what to do into the hands of the client
> side which was the true intent of 2 (in case you don't know, I made that
> change).
What can a server configure to process only RFC3168 ECN if it prefers to?

>
> Allowing the client side to make the decision alone has proven successful
> approach. We now have significant passive RFC3168 ECN server deployment.
> It is wide-spread enough that Apple found it useful enough for their
> client side, experimented with it and worked to fix the issues where they
> discovered something in the network was incompatible with ECN. I don't
> believe it would have happened without leaving the decision into the hands
> of the clients.
>
> Similarly, passively allowing the client to decide to use AccECN is
> low risk thing. ...As with RFC 3168 ECN, "do nothing" applies also for
> Accurate ECN here unless the client asks for it.
>
> > > +               3 Enable AccECN when requested by incoming connections and
> > > +                 also request AccECN on outgoing connection attempts.
> > > +           0x102 Enable AccECN in optionless mode for incoming connections.
> > > +           0x103 Enable AccECN in optionless mode for incoming and outgoing
> > > +                 connections.
> >
> > In terms of the logic bits here, it might make more sense
> >
> > 0: disable ecn
> > 1: enable std ecn on in or out
> > 2: enable std ecn when requested on in (the default)
> > 3: essentially unused
> > 4: enable accecn when requested on in
> > 5: enable std ecn and accecn on in or out
> > 6: enable accecn and ecn on in but not out
>
> If "full control" is the way to go, I think it should be made using flags
> instead, along these lines:
>
> 1: Enable RFC 3168 ECN in+out
> 2: Enable RFC 3168 ECN in (default on)
> 4: Enable Accurate ECN in (default on)
> 8: Enable Accurate ECN in+out
>
> Note that I intentionally reversed the in and in/out order for 4&8
> (something that couldn't be done with 1&2 to preserve meaning of 1).
>
> I think it's a bit complicated though but if this is what most people
> want, I can of course change it to flags.
>
> > Do we have any data on how often the tcp ns bit is a source of
> > firewalling problems yet?
> >
> > 0x102 strikes me as a bit more magical than required
>
> To me it compares to some fast open cookie things that are similarly using
> higher order bits in flag like manner.
>
> > and I don't know
> > what optionless means in this context.
>
> Do you mean that "optionless" is not a good word to use here? (I'm not
> a native speaker but I can imagine it might sound like "futureless"?)
> I meant that AccECN operates then w/o sending any AccECN option (rx side
> still processes the options if the peer chooses to send them despite not
> getting any back).
>
> Thanks.
>
> --
>  i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-20 23:22       ` Yuchung Cheng
@ 2020-03-23 13:34         ` Ilpo Järvinen
  2020-03-23 19:03           ` Yuchung Cheng
  0 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-23 13:34 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Dave Taht, Linux Kernel Network Developers, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

[-- Attachment #1: Type: text/plain, Size: 3068 bytes --]

On Fri, 20 Mar 2020, Yuchung Cheng wrote:

> On Fri, Mar 20, 2020 at 3:40 PM Ilpo Järvinen
> <ilpo.jarvinen@cs.helsinki.fi> wrote:
> >
> > On Thu, 19 Mar 2020, Dave Taht wrote:
> >
> > > On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > > >
> > > > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > >
> > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > ---
> > > >  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
> > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > > > index 5f53faff4e25..ecca6e1d6bea 100644
> > > > --- a/Documentation/networking/ip-sysctl.txt
> > > > +++ b/Documentation/networking/ip-sysctl.txt
> > > > @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
> > > >                 0 Disable ECN.  Neither initiate nor accept ECN.
> > > >                 1 Enable ECN when requested by incoming connections and
> > > >                   also request ECN on outgoing connection attempts.
> > > > -               2 Enable ECN when requested by incoming connections
> > > > +               2 Enable ECN or AccECN when requested by incoming connections
> > > >                   but do not request ECN on outgoing connections.
> > >
> > > Changing existing user-behavior for this default seems to be overly
> > > optimistic. Useful for testing, but...
> >
> > I disagree.
> >
> > The kernel default on ECN is/has been "do nothing" like forever. Yet,
> > passively allowing ECN on servers is a low risk operation because nothing
> > will change before client actively asks for it. However, it was obvious
> > that the servers didn't do that. The servers could have set tcp_ecn to 1
> > (before 2 was there) which is low risk for _servers_ (unlike for clients)
> > but only very very few did. I don't believe servers would now
> > intentionally pick 2 when they clearly didn't pick 1 earlier either.
> >
> > Adding 2 is/was an attempt to side-step the need for both ends to make
> > conscious decision by setting the sysctl (which servers didn't want to
> > do). That is, 2 gives decision on what to do into the hands of the client
> > side which was the true intent of 2 (in case you don't know, I made that
> > change).
> What can a server configure to process only RFC3168 ECN if it prefers to?

That's why I suggested the flag-based approach?

> > If "full control" is the way to go, I think it should be made using flags
> > instead, along these lines:
> >
> > 1: Enable RFC 3168 ECN in+out
> > 2: Enable RFC 3168 ECN in (default on)
> > 4: Enable Accurate ECN in (default on)
> > 8: Enable Accurate ECN in+out
> >
> > Note that I intentionally reversed the in and in/out order for 4&8
> > (something that couldn't be done with 1&2 to preserve meaning of 1).

It should address any except "out" but no "in" (the meaning of 1 cannot 
be changed I think). But out w/o in doesn't sound very useful.

--
 i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-23 13:34         ` Ilpo Järvinen
@ 2020-03-23 19:03           ` Yuchung Cheng
  2020-03-24 12:50             ` Ilpo Järvinen
  0 siblings, 1 reply; 45+ messages in thread
From: Yuchung Cheng @ 2020-03-23 19:03 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Dave Taht, Linux Kernel Network Developers, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

On Mon, Mar 23, 2020 at 6:34 AM Ilpo Järvinen
<ilpo.jarvinen@cs.helsinki.fi> wrote:
>
> On Fri, 20 Mar 2020, Yuchung Cheng wrote:
>
> > On Fri, Mar 20, 2020 at 3:40 PM Ilpo Järvinen
> > <ilpo.jarvinen@cs.helsinki.fi> wrote:
> > >
> > > On Thu, 19 Mar 2020, Dave Taht wrote:
> > >
> > > > On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > > > >
> > > > > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > >
> > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > > ---
> > > > >  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
> > > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > > > > index 5f53faff4e25..ecca6e1d6bea 100644
> > > > > --- a/Documentation/networking/ip-sysctl.txt
> > > > > +++ b/Documentation/networking/ip-sysctl.txt
> > > > > @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
> > > > >                 0 Disable ECN.  Neither initiate nor accept ECN.
> > > > >                 1 Enable ECN when requested by incoming connections and
> > > > >                   also request ECN on outgoing connection attempts.
> > > > > -               2 Enable ECN when requested by incoming connections
> > > > > +               2 Enable ECN or AccECN when requested by incoming connections
> > > > >                   but do not request ECN on outgoing connections.
> > > >
> > > > Changing existing user-behavior for this default seems to be overly
> > > > optimistic. Useful for testing, but...
> > >
> > > I disagree.
> > >
> > > The kernel default on ECN is/has been "do nothing" like forever. Yet,
> > > passively allowing ECN on servers is a low risk operation because nothing
> > > will change before client actively asks for it. However, it was obvious
> > > that the servers didn't do that. The servers could have set tcp_ecn to 1
> > > (before 2 was there) which is low risk for _servers_ (unlike for clients)
> > > but only very very few did. I don't believe servers would now
> > > intentionally pick 2 when they clearly didn't pick 1 earlier either.
> > >
> > > Adding 2 is/was an attempt to side-step the need for both ends to make
> > > conscious decision by setting the sysctl (which servers didn't want to
> > > do). That is, 2 gives decision on what to do into the hands of the client
> > > side which was the true intent of 2 (in case you don't know, I made that
> > > change).
> > What can a server configure to process only RFC3168 ECN if it prefers to?
>
> That's why I suggested the flag-based approach?

That's assuming an admin that has control of sysctls can also change
individual applications (easily). In reality it often is not the case.
The default sysctl choices in this patch seem risky to me.



>
> > > If "full control" is the way to go, I think it should be made using flags
> > > instead, along these lines:
> > >
> > > 1: Enable RFC 3168 ECN in+out
> > > 2: Enable RFC 3168 ECN in (default on)
> > > 4: Enable Accurate ECN in (default on)
> > > 8: Enable Accurate ECN in+out
> > >
> > > Note that I intentionally reversed the in and in/out order for 4&8
> > > (something that couldn't be done with 1&2 to preserve meaning of 1).
>
> It should address any except "out" but no "in" (the meaning of 1 cannot
> be changed I think). But out w/o in doesn't sound very useful.
>
> --
>  i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-23 19:03           ` Yuchung Cheng
@ 2020-03-24 12:50             ` Ilpo Järvinen
  2020-03-24 17:05               ` Yuchung Cheng
  0 siblings, 1 reply; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-24 12:50 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Dave Taht, Linux Kernel Network Developers, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

[-- Attachment #1: Type: text/plain, Size: 3761 bytes --]

On Mon, 23 Mar 2020, Yuchung Cheng wrote:

> On Mon, Mar 23, 2020 at 6:34 AM Ilpo Järvinen
> <ilpo.jarvinen@cs.helsinki.fi> wrote:
> >
> > On Fri, 20 Mar 2020, Yuchung Cheng wrote:
> >
> > > On Fri, Mar 20, 2020 at 3:40 PM Ilpo Järvinen
> > > <ilpo.jarvinen@cs.helsinki.fi> wrote:
> > > >
> > > > On Thu, 19 Mar 2020, Dave Taht wrote:
> > > >
> > > > > On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > > > > >
> > > > > > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > > >
> > > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > > > ---
> > > > > >  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
> > > > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > > > > > index 5f53faff4e25..ecca6e1d6bea 100644
> > > > > > --- a/Documentation/networking/ip-sysctl.txt
> > > > > > +++ b/Documentation/networking/ip-sysctl.txt
> > > > > > @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
> > > > > >                 0 Disable ECN.  Neither initiate nor accept ECN.
> > > > > >                 1 Enable ECN when requested by incoming connections and
> > > > > >                   also request ECN on outgoing connection attempts.
> > > > > > -               2 Enable ECN when requested by incoming connections
> > > > > > +               2 Enable ECN or AccECN when requested by incoming connections
> > > > > >                   but do not request ECN on outgoing connections.
> > > > >
> > > > > Changing existing user-behavior for this default seems to be overly
> > > > > optimistic. Useful for testing, but...
> > > >
> > > > I disagree.
> > > >
> > > > The kernel default on ECN is/has been "do nothing" like forever. Yet,
> > > > passively allowing ECN on servers is a low risk operation because nothing
> > > > will change before client actively asks for it. However, it was obvious
> > > > that the servers didn't do that. The servers could have set tcp_ecn to 1
> > > > (before 2 was there) which is low risk for _servers_ (unlike for clients)
> > > > but only very very few did. I don't believe servers would now
> > > > intentionally pick 2 when they clearly didn't pick 1 earlier either.
> > > >
> > > > Adding 2 is/was an attempt to side-step the need for both ends to make
> > > > conscious decision by setting the sysctl (which servers didn't want to
> > > > do). That is, 2 gives decision on what to do into the hands of the client
> > > > side which was the true intent of 2 (in case you don't know, I made that
> > > > change).
> > > What can a server configure to process only RFC3168 ECN if it prefers to?
> >
> > That's why I suggested the flag-based approach?
> 
> That's assuming an admin that has control of sysctls can also change
> individual applications (easily). In reality it often is not the case.
> The default sysctl choices in this patch seem risky to me.
> 
> > > > If "full control" is the way to go, I think it should be made using flags
> > > > instead, along these lines:
> > > >
> > > > 1: Enable RFC 3168 ECN in+out
> > > > 2: Enable RFC 3168 ECN in (default on)
> > > > 4: Enable Accurate ECN in (default on)
> > > > 8: Enable Accurate ECN in+out
> > > >
> > > > Note that I intentionally reversed the in and in/out order for 4&8
> > > > (something that couldn't be done with 1&2 to preserve meaning of 1).
> >
> > It should address any except "out" but no "in" (the meaning of 1 cannot
> > be changed I think). But out w/o in doesn't sound very useful.

So you mean you'd want to have control that is finer-grained than what the
sysctls offer?


-- 
 i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 28/28] tcp: AccECN sysctl documentation
  2020-03-24 12:50             ` Ilpo Järvinen
@ 2020-03-24 17:05               ` Yuchung Cheng
  0 siblings, 0 replies; 45+ messages in thread
From: Yuchung Cheng @ 2020-03-24 17:05 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Dave Taht, Linux Kernel Network Developers, Neal Cardwell,
	Eric Dumazet, Olivier Tilmans

On Tue, Mar 24, 2020 at 5:50 AM Ilpo Järvinen
<ilpo.jarvinen@cs.helsinki.fi> wrote:
>
> On Mon, 23 Mar 2020, Yuchung Cheng wrote:
>
> > On Mon, Mar 23, 2020 at 6:34 AM Ilpo Järvinen
> > <ilpo.jarvinen@cs.helsinki.fi> wrote:
> > >
> > > On Fri, 20 Mar 2020, Yuchung Cheng wrote:
> > >
> > > > On Fri, Mar 20, 2020 at 3:40 PM Ilpo Järvinen
> > > > <ilpo.jarvinen@cs.helsinki.fi> wrote:
> > > > >
> > > > > On Thu, 19 Mar 2020, Dave Taht wrote:
> > > > >
> > > > > > On Wed, Mar 18, 2020 at 2:44 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> > > > > > >
> > > > > > > From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > > > >
> > > > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
> > > > > > > ---
> > > > > > >  Documentation/networking/ip-sysctl.txt | 12 +++++++++---
> > > > > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > > > > > > index 5f53faff4e25..ecca6e1d6bea 100644
> > > > > > > --- a/Documentation/networking/ip-sysctl.txt
> > > > > > > +++ b/Documentation/networking/ip-sysctl.txt
> > > > > > > @@ -301,15 +301,21 @@ tcp_ecn - INTEGER
> > > > > > >                 0 Disable ECN.  Neither initiate nor accept ECN.
> > > > > > >                 1 Enable ECN when requested by incoming connections and
> > > > > > >                   also request ECN on outgoing connection attempts.
> > > > > > > -               2 Enable ECN when requested by incoming connections
> > > > > > > +               2 Enable ECN or AccECN when requested by incoming connections
> > > > > > >                   but do not request ECN on outgoing connections.
> > > > > >
> > > > > > Changing existing user-behavior for this default seems to be overly
> > > > > > optimistic. Useful for testing, but...
> > > > >
> > > > > I disagree.
> > > > >
> > > > > The kernel default on ECN is/has been "do nothing" like forever. Yet,
> > > > > passively allowing ECN on servers is a low risk operation because nothing
> > > > > will change before client actively asks for it. However, it was obvious
> > > > > that the servers didn't do that. The servers could have set tcp_ecn to 1
> > > > > (before 2 was there) which is low risk for _servers_ (unlike for clients)
> > > > > but only very very few did. I don't believe servers would now
> > > > > intentionally pick 2 when they clearly didn't pick 1 earlier either.
> > > > >
> > > > > Adding 2 is/was an attempt to side-step the need for both ends to make
> > > > > conscious decision by setting the sysctl (which servers didn't want to
> > > > > do). That is, 2 gives decision on what to do into the hands of the client
> > > > > side which was the true intent of 2 (in case you don't know, I made that
> > > > > change).
> > > > What can a server configure to process only RFC3168 ECN if it prefers to?
> > >
> > > That's why I suggested the flag-based approach?
> >
> > That's assuming an admin that has control of sysctls can also change
> > individual applications (easily). In reality it often is not the case.
> > The default sysctl choices in this patch seem risky to me.
> >
> > > > > If "full control" is the way to go, I think it should be made using flags
> > > > > instead, along these lines:
> > > > >
> > > > > 1: Enable RFC 3168 ECN in+out
> > > > > 2: Enable RFC 3168 ECN in (default on)
> > > > > 4: Enable Accurate ECN in (default on)
> > > > > 8: Enable Accurate ECN in+out
> > > > >
> > > > > Note that I intentionally reversed the in and in/out order for 4&8
> > > > > (something that couldn't be done with 1&2 to preserve meaning of 1).
> > >
> > > It should address any except "out" but no "in" (the meaning of 1 cannot
> > > be changed I think). But out w/o in doesn't sound very useful.
>
> So you mean you'd want to have control that is finer-grained than what the
> sysctls offer?
I recommend having separate sysctl values for AccECN so that servers
configured to use existing values do not behave differently (on ECN)
after kernel upgrade, similar to what Dave Taht suggested.

>
>
> --
>  i.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC PATCH 06/28] tcp: reorganize SYN ECN code
  2020-03-18  9:37 [RFC PATCH 02/28] tcp: fast path functions later Ilpo Järvinen
@ 2020-03-18  9:37 ` Ilpo Järvinen
  0 siblings, 0 replies; 45+ messages in thread
From: Ilpo Järvinen @ 2020-03-18  9:37 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Olivier Tilmans

From: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>

No functional changes.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@cs.helsinki.fi>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc225e616f98..116be30c1b2c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -336,10 +336,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 	tp->ecn_flags = 0;
 
 	if (use_ecn) {
-		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tp->ecn_flags = TCP_ECN_OK;
 		if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
 			INET_ECN_xmit(sk);
+
+		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
+		tp->ecn_flags = TCP_ECN_OK;
 	}
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2020-03-24 17:05 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-18  9:43 [RFC PATCH 00/28]: Accurate ECN for TCP Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 01/28] tcp: add tp to prepare for AccECN code Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 02/28] tcp: fast path functions later Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 03/28] tcp: move tcp_in_ack_event later Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 04/28] tcp: create FLAG_TS_PROGRESS Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 05/28] tcp: extend TCP flags to allow AE bit/ACE field Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 06/28] tcp: reorganize SYN ECN code Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 07/28] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 08/28] tcp: helpers for ECN mode handling Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 09/28] gso: AccECN support Ilpo Järvinen
2020-03-19  3:44   ` Eric Dumazet
2020-03-19 22:36     ` Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 10/28] gro: prevent ACE field corruption & better AccECN handling Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 11/28] tcp: AccECN support to tcp_add_backlog Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 12/28] tcp: AccECN core Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 13/28] tcp: Pass flags to tcp_send_ack Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 14/28] tcp: AccECN negotiation Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 15/28] tcp: add AccECN rx byte counters Ilpo Järvinen
2020-03-19  3:36   ` Eric Dumazet
2020-03-18  9:43 ` [RFC PATCH 16/28] tcp: allow embedding leftover into option padding Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 17/28] tcp: AccECN needs to know delivered bytes Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 18/28] tcp: don't early return when sack doesn't fit Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 19/28] tcp: AccECN option Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 20/28] tcp: AccECN option send control Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 21/28] tcp: AccECN option beacon Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 22/28] tcp: AccECN option order bit & failure handling Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 23/28] tcp: AccECN option ceb/cep heuristic Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 24/28] tcp: try to fit AccECN option with SACK Ilpo Järvinen
2020-03-19  3:29   ` Eric Dumazet
2020-03-19 20:33     ` Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 25/28] tcp: try to avoid safer when ACKs are thinned Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 26/28] tcp: to prevent runaway AccECN cep/ACE deficit, limit GSO size Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 27/28] gro: flushing when CWR is set negatively affects AccECN Ilpo Järvinen
2020-03-18  9:43 ` [RFC PATCH 28/28] tcp: AccECN sysctl documentation Ilpo Järvinen
2020-03-19 23:02   ` Dave Taht
2020-03-20 22:40     ` Ilpo Järvinen
2020-03-20 23:22       ` Yuchung Cheng
2020-03-23 13:34         ` Ilpo Järvinen
2020-03-23 19:03           ` Yuchung Cheng
2020-03-24 12:50             ` Ilpo Järvinen
2020-03-24 17:05               ` Yuchung Cheng
2020-03-18 23:29 ` [RFC PATCH 00/28]: Accurate ECN for TCP David Miller
2020-03-19 20:25   ` Ilpo Järvinen
2020-03-20  4:38     ` David Miller
  -- strict thread matches above, loose matches on Subject: below --
2020-03-18  9:37 [RFC PATCH 02/28] tcp: fast path functions later Ilpo Järvinen
2020-03-18  9:37 ` [RFC PATCH 06/28] tcp: reorganize SYN ECN code Ilpo Järvinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).