* [RFC PATCH net-next 0/7] NIC driver Rx ring ECN
@ 2023-01-11 14:34 jgh
  2023-01-11 14:34 ` [RFC PATCH 1/7] net: NIC driver Rx ring ECN: skbuff and tcp support jgh
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Routers and switches provide for mitigation of buffer overrun
by marking IP packets as "Congestion Experienced" [RFC 3168].
Participating transport protocols can use these marks to throttle
their send rates.

This patchset extends coverage to the receiving NIC/driver
buffers.

We use an out-of-band mechanism, marking the sk_buff rather
than the packet, to avoid need for DPI.

Participating NIC drivers are modified to add the marks.
Participating transport protocols are modified to notice
marks and combine with IP-level protocol marks.
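
As a rough illustration of the combining step, here is a minimal
user-space sketch. It is not the kernel code: struct rx_meta and
effective_ecn() are hypothetical names standing in for the sk_buff
flag and for the check done in the TCP input path. The codepoint
values are those of RFC 3168.

```c
#include <stdbool.h>
#include <stdint.h>

/* ECN codepoints from RFC 3168 (low two bits of the IP DS field). */
enum {
	ECN_NOT_ECT = 0,
	ECN_ECT_1   = 1,
	ECN_ECT_0   = 2,
	ECN_CE      = 3,
	ECN_MASK    = 3,
};

/* Hypothetical stand-in for the per-packet metadata: the DS field
 * copied from the IP header, plus the out-of-band driver mark.
 */
struct rx_meta {
	uint8_t ip_dsfield;
	bool congestion_experienced;
};

/* A driver mark is treated exactly like an in-band CE mark;
 * otherwise the codepoint carried in the IP header is used.
 */
static inline uint8_t effective_ecn(const struct rx_meta *m)
{
	if (m->congestion_experienced)
		return ECN_CE;
	return m->ip_dsfield & ECN_MASK;
}
```

The packet itself is never rewritten, so no header parsing is needed
in the driver to apply the mark.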

Stats counters are incremented in ipv4 and ipv6 input processing,
with results:

 $ nstat -sz *Congest*
 #kernel
 Ip6InCongestionPkts             0                  0.0
 IpExtInCongestionPkts           148454             0.0
 $ 


Both NIC drivers and transports can be incrementally upgraded
to take advantage of the feature.  Three example drivers are
modified in this patchset.


Jeremy Harris (7):
  net: NIC driver Rx ring ECN: skbuff and tcp support
  net: NIC driver Rx ring ECN: stats counter
  drivers: net: xgene: NIC driver Rx ring ECN
  drivers: net: bnx2x: NIC driver Rx ring ECN
  drivers: net: bnx2x: NIC driver Rx ring ECN
  drivers: net: bnx2: NIC driver Rx ring ECN
  drivers: net: bnx2: NIC driver Rx ring ECN

 drivers/net/ethernet/apm/xgene/xgene_enet_main.c |  8 ++++++--
 drivers/net/ethernet/broadcom/bnx2.c             | 11 +++++++++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c  | 12 +++++++++---
 include/linux/skbuff.h                           |  2 ++
 include/uapi/linux/snmp.h                        |  1 +
 net/core/skbuff.c                                |  1 +
 net/ipv4/ip_input.c                              |  4 ++++
 net/ipv4/proc.c                                  |  1 +
 net/ipv4/tcp_input.c                             |  8 +++++++-
 net/ipv6/ip6_input.c                             |  5 +++++
 net/ipv6/proc.c                                  |  1 +
 11 files changed, 46 insertions(+), 8 deletions(-)


base-commit: 12c1604ae1a39bef87ac099f106594b4cb433b75
-- 
2.39.0


* [RFC PATCH 1/7] net: NIC driver Rx ring ECN: skbuff and tcp support
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 14:34 ` [RFC PATCH 2/7] net: NIC driver Rx ring ECN: stats counter jgh
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

This is the infrastructure and the primary information-consumer
for the facility.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 include/linux/skbuff.h | 2 ++
 net/core/skbuff.c      | 1 +
 net/ipv4/tcp_input.c   | 8 +++++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c8492401a10..8bdf0049e1a3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -804,6 +804,7 @@ typedef unsigned char *sk_buff_data_t;
  *		the packet minus one that have been verified as
  *		CHECKSUM_UNNECESSARY (max 3)
  *	@scm_io_uring: SKB holds io_uring registered files
+ *	@congestion_experienced: data-source or channel for SKB has congestion
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
@@ -983,6 +984,7 @@ struct sk_buff {
 	__u8			slow_gro:1;
 	__u8			csum_not_inet:1;
 	__u8			scm_io_uring:1;
+	__u8			congestion_experienced:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3a10387f9434..37940dd3dbe9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5548,6 +5548,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	to->truesize += delta;
 	to->len += len;
 	to->data_len += len;
+	to->congestion_experienced |= from->congestion_experienced;
 
 	*delta_truesize = delta;
 	return true;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc072d2cfcd8..217374eefe41 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -338,8 +338,14 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
 static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	__u8 ecn;
 
-	switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
+	if (unlikely(skb->congestion_experienced))
+		ecn = INET_ECN_CE;
+	else
+		ecn = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+
+	switch (ecn) {
 	case INET_ECN_NOT_ECT:
 		/* Funny extension: if ECT is not set on a segment,
 		 * and we already seen ECT on a previous segment,
-- 
2.39.0


* [RFC PATCH 2/7] net: NIC driver Rx ring ECN: stats counter
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
  2023-01-11 14:34 ` [RFC PATCH 1/7] net: NIC driver Rx ring ECN: skbuff and tcp support jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 14:34 ` [RFC PATCH 3/7] drivers: net: xgene: NIC driver Rx ring ECN jgh
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Counters for IPv4 and IPv6 input packets for which skbs were
marked by NIC drivers, modelled on those for IP packet markings.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 include/uapi/linux/snmp.h | 1 +
 net/ipv4/ip_input.c       | 4 ++++
 net/ipv4/proc.c           | 1 +
 net/ipv6/ip6_input.c      | 5 +++++
 net/ipv6/proc.c           | 1 +
 5 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 6600cb0164c2..f8f763b6af28 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -57,6 +57,7 @@ enum
 	IPSTATS_MIB_ECT0PKTS,			/* InECT0Pkts */
 	IPSTATS_MIB_CEPKTS,			/* InCEPkts */
 	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
+	IPSTATS_MIB_INCONGPKTS,			/* InCongestionPkts */
 	__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index e880ce77322a..fd0ec860a3f6 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -503,6 +503,10 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 		       IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
 		       max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
+	if (unlikely(skb->congestion_experienced))
+		__IP_ADD_STATS(net, IPSTATS_MIB_INCONGPKTS,
+			       max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
+
 	if (!pskb_may_pull(skb, iph->ihl*4))
 		goto inhdr_error;
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index f88daace9de3..fb6e82e86070 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -117,6 +117,7 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_ITEM("InECT0Pkts", IPSTATS_MIB_ECT0PKTS),
 	SNMP_MIB_ITEM("InCEPkts", IPSTATS_MIB_CEPKTS),
 	SNMP_MIB_ITEM("ReasmOverlaps", IPSTATS_MIB_REASM_OVERLAPS),
+	SNMP_MIB_ITEM("InCongestionPkts", IPSTATS_MIB_INCONGPKTS),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index e1ebf5e42ebe..f43d8af20831 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -203,6 +203,11 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
 			IPSTATS_MIB_NOECTPKTS +
 				(ipv6_get_dsfield(hdr) & INET_ECN_MASK),
 			max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
+
+	if (unlikely(skb->congestion_experienced))
+		__IP6_ADD_STATS(net, idev, IPSTATS_MIB_INCONGPKTS,
+				max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
+
 	/*
 	 * RFC4291 2.5.3
 	 * The loopback address must not be used as the source address in IPv6
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index d6306aa46bb1..635ab1f99900 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -84,6 +84,7 @@ static const struct snmp_mib snmp6_ipstats_list[] = {
 	SNMP_MIB_ITEM("Ip6InECT1Pkts", IPSTATS_MIB_ECT1PKTS),
 	SNMP_MIB_ITEM("Ip6InECT0Pkts", IPSTATS_MIB_ECT0PKTS),
 	SNMP_MIB_ITEM("Ip6InCEPkts", IPSTATS_MIB_CEPKTS),
+	SNMP_MIB_ITEM("Ip6InCongestionPkts", IPSTATS_MIB_INCONGPKTS),
 	SNMP_MIB_SENTINEL
 };
 
-- 
2.39.0


* [RFC PATCH 3/7] drivers: net: xgene: NIC driver Rx ring ECN
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
  2023-01-11 14:34 ` [RFC PATCH 1/7] net: NIC driver Rx ring ECN: skbuff and tcp support jgh
  2023-01-11 14:34 ` [RFC PATCH 2/7] net: NIC driver Rx ring ECN: stats counter jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 14:34 ` [RFC PATCH 4/7] drivers: net: bnx2x: " jgh
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Sample NIC driver support.
This is the preferred model, usable where the driver has explicit
knowledge of the receive ring size and fill count.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 390671640388..4f48f5a8ea8b 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -667,7 +667,8 @@ static bool xgene_enet_errata_10GE_8(struct sk_buff *skb, u32 len, u8 status)
 
 static int xgene_enet_rx_frame(struct xgene_enet_desc_ring *rx_ring,
 			       struct xgene_enet_raw_desc *raw_desc,
-			       struct xgene_enet_raw_desc *exp_desc)
+			       struct xgene_enet_raw_desc *exp_desc,
+			       bool congestion)
 {
 	struct xgene_enet_desc_ring *buf_pool, *page_pool;
 	u32 datalen, frag_size, skb_index;
@@ -757,6 +758,7 @@ static int xgene_enet_rx_frame(struct xgene_enet_desc_ring *rx_ring,
 
 	rx_ring->rx_packets++;
 	rx_ring->rx_bytes += datalen;
+	skb->congestion_experienced = congestion;
 	napi_gro_receive(&rx_ring->napi, skb);
 
 out:
@@ -814,7 +816,9 @@ static int xgene_enet_process_ring(struct xgene_enet_desc_ring *ring,
 			desc_count++;
 		}
 		if (is_rx_desc(raw_desc)) {
-			ret = xgene_enet_rx_frame(ring, raw_desc, exp_desc);
+			/* We are congested when the ring is 7/8'ths full
+			 */
+			ret = xgene_enet_rx_frame(ring, raw_desc, exp_desc, count > slots * 7 / 8);
 		} else {
 			ret = xgene_enet_tx_completion(ring, raw_desc);
 			is_completion = true;
-- 
2.39.0


* [RFC PATCH 4/7] drivers: net: bnx2x: NIC driver Rx ring ECN
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
                   ` (2 preceding siblings ...)
  2023-01-11 14:34 ` [RFC PATCH 3/7] drivers: net: xgene: NIC driver Rx ring ECN jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 14:34 ` [RFC PATCH 5/7] " jgh
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Reformat local variables as reverse-christmas-tree.
No functional change.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 16c490692f42..145e338487b6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -880,12 +880,12 @@ void bnx2x_csum_validate(struct sk_buff *skb, union eth_rx_cqe *cqe,
 
 static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 {
-	struct bnx2x *bp = fp->bp;
 	u16 bd_cons, bd_prod, bd_prod_fw, comp_ring_cons;
+	struct eth_fast_path_rx_cqe *cqe_fp;
 	u16 sw_comp_cons, sw_comp_prod;
-	int rx_pkt = 0;
+	struct bnx2x *bp = fp->bp;
 	union eth_rx_cqe *cqe;
-	struct eth_fast_path_rx_cqe *cqe_fp;
+	int rx_pkt = 0;
 
 #ifdef BNX2X_STOP_ON_ERROR
 	if (unlikely(bp->panic))
-- 
2.39.0


* [RFC PATCH 5/7] drivers: net: bnx2x: NIC driver Rx ring ECN
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
                   ` (3 preceding siblings ...)
  2023-01-11 14:34 ` [RFC PATCH 4/7] drivers: net: bnx2x: " jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 14:34 ` [RFC PATCH 6/7] drivers: net: bnx2: " jgh
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Sample NIC driver support.
This is a less-preferred model, which will throttle based on the NAPI
budget rather than the receive ring fill level.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 145e338487b6..62fff8f3499b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -881,6 +881,7 @@ void bnx2x_csum_validate(struct sk_buff *skb, union eth_rx_cqe *cqe,
 static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 {
 	u16 bd_cons, bd_prod, bd_prod_fw, comp_ring_cons;
+	int congestion_level = budget * 7 / 8;
 	struct eth_fast_path_rx_cqe *cqe_fp;
 	u16 sw_comp_cons, sw_comp_prod;
 	struct bnx2x *bp = fp->bp;
@@ -1089,6 +1090,11 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
 					       le16_to_cpu(cqe_fp->vlan_tag));
 
+		/* We are congested if the napi budget is approached
+		 */
+		if (unlikely(rx_pkt > congestion_level))
+			skb->congestion_experienced = true;
+
 		napi_gro_receive(&fp->napi, skb);
 next_rx:
 		rx_buf->data = NULL;
-- 
2.39.0


* [RFC PATCH 6/7] drivers: net: bnx2: NIC driver Rx ring ECN
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
                   ` (4 preceding siblings ...)
  2023-01-11 14:34 ` [RFC PATCH 5/7] " jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 14:34 ` [RFC PATCH 7/7] " jgh
  2023-01-11 18:46 ` [RFC PATCH net-next 0/7] " Jakub Kicinski
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Reformat local variables as reverse-christmas-tree.
No functional change.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 drivers/net/ethernet/broadcom/bnx2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 9f473854b0f4..f55ac9c7b6fd 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -3141,10 +3141,10 @@ bnx2_get_hw_rx_cons(struct bnx2_napi *bnapi)
 static int
 bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 {
-	struct bnx2_rx_ring_info *rxr = &bnapi->rx_ring;
 	u16 hw_cons, sw_cons, sw_ring_cons, sw_prod, sw_ring_prod;
-	struct l2_fhdr *rx_hdr;
+	struct bnx2_rx_ring_info *rxr = &bnapi->rx_ring;
 	int rx_pkt = 0, pg_ring_used = 0;
+	struct l2_fhdr *rx_hdr;
 
 	if (budget <= 0)
 		return rx_pkt;
-- 
2.39.0


* [RFC PATCH 7/7] drivers: net: bnx2: NIC driver Rx ring ECN
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
                   ` (5 preceding siblings ...)
  2023-01-11 14:34 ` [RFC PATCH 6/7] drivers: net: bnx2: " jgh
@ 2023-01-11 14:34 ` jgh
  2023-01-11 18:46 ` [RFC PATCH net-next 0/7] " Jakub Kicinski
  7 siblings, 0 replies; 13+ messages in thread
From: jgh @ 2023-01-11 14:34 UTC (permalink / raw)
  To: netdev; +Cc: Jeremy Harris

From: Jeremy Harris <jgh@redhat.com>

Sample NIC driver support.
This is a less-preferred model, which will throttle based on the NAPI
budget rather than the receive ring fill level.

Signed-off-by: Jeremy Harris <jgh@redhat.com>
---
 drivers/net/ethernet/broadcom/bnx2.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index f55ac9c7b6fd..c7d867114234 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -3143,6 +3143,7 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons, sw_prod, sw_ring_prod;
 	struct bnx2_rx_ring_info *rxr = &bnapi->rx_ring;
+	int congestion_level = budget * 7 / 8;
 	int rx_pkt = 0, pg_ring_used = 0;
 	struct l2_fhdr *rx_hdr;
 
@@ -3273,6 +3274,12 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 				     PKT_HASH_TYPE_L3);
 
 		skb_record_rx_queue(skb, bnapi - &bp->bnx2_napi[0]);
+
+		/* We are congested if the budget is approached
+		 */
+		if (unlikely(rx_pkt > congestion_level))
+			skb->congestion_experienced = true;
+
 		napi_gro_receive(&bnapi->napi, skb);
 		rx_pkt++;
 
-- 
2.39.0


* Re: [RFC PATCH net-next 0/7] NIC driver Rx ring ECN
  2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
                   ` (6 preceding siblings ...)
  2023-01-11 14:34 ` [RFC PATCH 7/7] " jgh
@ 2023-01-11 18:46 ` Jakub Kicinski
  2023-01-12 14:06   ` Jeremy Harris
  7 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2023-01-11 18:46 UTC (permalink / raw)
  To: jgh; +Cc: netdev

On Wed, 11 Jan 2023 14:34:20 +0000 jgh@redhat.com wrote:
> Stats counters are incremented in ipv4 and ipv6 input processing,
> with results:
> 
>  $ nstat -sz *Congest*
>  #kernel
>  Ip6InCongestionPkts             0                  0.0
>  IpExtInCongestionPkts           148454             0.0
>  $ 

Do you have any reason to believe that it actually helps anything?
NAPI with typical budget of 64 is easily exhausted (you just need 
two TSO frames arriving at once with 1500 MTU).
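
The arithmetic behind that can be sketched as below. The 64 KiB
super-frame size and the 1448-byte MSS (1500 MTU less IPv4 and TCP
headers with timestamps) are assumptions for illustration, not figures
from this thread, and wire_packets() is a hypothetical helper.

```c
/* Number of MTU-sized packets one TSO super-frame becomes on the
 * wire: the payload length divided by the MSS, rounded up.
 */
static inline unsigned int wire_packets(unsigned int tso_bytes,
					unsigned int mss)
{
	return (tso_bytes + mss - 1) / mss;
}
```

Under those assumptions one frame is 46 packets, so two frames
arriving together are 92 packets, which is already more than a
budget of 64.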

Host level congestion is better detected using time / latency signals.
Timestamp the packet at the NIC and compare the Rx time to current time
when processing by the driver. 
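
A minimal sketch of such a check, assuming the NIC timestamp has
already been converted to the same clock as the host's "now" value.
host_delay_exceeded() and its target_ns threshold are hypothetical,
not an existing kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare the time the NIC saw the packet with the time the driver
 * got round to processing it. All values are in nanoseconds.
 */
static inline bool host_delay_exceeded(uint64_t hw_rx_ns, uint64_t now_ns,
				       uint64_t target_ns)
{
	/* A timestamp in the future means the clocks disagree;
	 * treat that as "no signal" rather than as congestion.
	 */
	if (now_ns < hw_rx_ns)
		return false;
	return now_ns - hw_rx_ns > target_ns;
}
```

The signal is the time a packet sat in the host queue, not how many
packets were handled in one poll.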

Google search "Google Swift congestion control".

* Re: [RFC PATCH net-next 0/7] NIC driver Rx ring ECN
  2023-01-11 18:46 ` [RFC PATCH net-next 0/7] " Jakub Kicinski
@ 2023-01-12 14:06   ` Jeremy Harris
  2023-01-13  0:09     ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Jeremy Harris @ 2023-01-12 14:06 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev

On 11/01/2023 18:46, Jakub Kicinski wrote:
> Do you have any reason to believe that it actually helps anything?

I've not measured actual drop-rates, no.

> NAPI with typical budget of 64 is easily exhausted (you just need
> two TSO frames arriving at once with 1500 MTU).

I see typical systems with 300, not 64 - but it's a valid point.
It's not the right measurement to try to control.
Perhaps I should work harder to locate the ring size within
the bnx2 and bnx2x drivers.

If I managed that (it being already the case for the xgene example)
would your opinions change?

> Host level congestion is better detected using time / latency signals.
> Timestamp the packet at the NIC and compare the Rx time to current time
> when processing by the driver.
> 
> Google search "Google Swift congestion control".

Nice, but
- requires we wait for timestamping-NICs
- does not address Rx drops due to Rx ring-buffer overflow

-- 
Cheers,
   Jeremy


* Re: [RFC PATCH net-next 0/7] NIC driver Rx ring ECN
  2023-01-12 14:06   ` Jeremy Harris
@ 2023-01-13  0:09     ` Jakub Kicinski
  2023-01-19 17:05       ` Jeremy Harris
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2023-01-13  0:09 UTC (permalink / raw)
  To: Jeremy Harris; +Cc: netdev

On Thu, 12 Jan 2023 14:06:50 +0000 Jeremy Harris wrote:
> On 11/01/2023 18:46, Jakub Kicinski wrote:
> > Do you have any reason to believe that it actually helps anything?  
> 
> I've not measured actual drop-rates, no.
> 
> > NAPI with typical budget of 64 is easily exhausted (you just need
> > two TSO frames arriving at once with 1500 MTU).  
> 
> I see typical systems with 300, not 64

Say more? I thought you were going by NAPI budget which should be 64
in bnx2x.

> - but it's a valid point.
> It's not the right measurement to try to control.
> Perhaps I should work harder to locate the ring size within
> the bnx2 and bnx2x drivers.

Perhaps the older devices give you some extra information here.
Normally on the Rx path you don't know how long the queue is,
you just check whether the next descriptor has been filled or not.
"Looking ahead" may be costly because you're accessing the same 
memory as the device.
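
To make that concrete, a simplified sketch of a typical Rx poll loop.
The descriptor layout, the DESC_DONE bit and poll_ring() are
hypothetical and not taken from any particular driver.

```c
#include <stdint.h>

#define DESC_DONE 0x1u	/* hypothetical "device has filled this" bit */

struct rx_desc {
	uint32_t status;
};

/* Consume descriptors until the budget runs out or the next one has
 * not been filled yet. Only the next descriptor is ever examined, so
 * the loop never learns how many more are waiting behind it.
 */
static unsigned int poll_ring(struct rx_desc *ring, unsigned int size,
			      unsigned int *next, unsigned int budget)
{
	unsigned int done = 0;

	while (done < budget && (ring[*next].status & DESC_DONE)) {
		ring[*next].status = 0;	/* hand the slot back to the device */
		*next = (*next + 1) % size;
		done++;
	}
	return done;
}
```

Finding the queue length would mean scanning further descriptors,
each of which is memory the device may be writing at the same time.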

> If I managed that (it being already the case for the xgene example)
> would your opinions change?

It may be cool if we can retrofit some second-order signal into 
the time-based machinery. The problem is that we don't actually 
have any time-based machinery upstream, yet :(
And designing interfaces for a decade-old HW seems shortsighted.

> > Host level congestion is better detected using time / latency signals.
> > Timestamp the packet at the NIC and compare the Rx time to current time
> > when processing by the driver.
> > 
> > Google search "Google Swift congestion control".  
> 
> Nice, but
> - requires we wait for timestamping-NICs

Grep for HWTSTAMP_FILTER_ALL, there's HW out there.

> - does not address Rx drops due to Rx ring-buffer overflow

It's a stronger signal than "continuous run of packets".
You can have a standing queue of 2 packets, and keep processing 
for ever. There's no congestion, or overload. You'd see that 
timestamps are recent.

I experimented last year with implementing CoDel on the input queues,
worked pretty well (scroll down ~half way):

https://developers.facebook.com/blog/post/2022/04/25/investigating-tcp-self-throttling-triggered-overload/

* Re: [RFC PATCH net-next 0/7] NIC driver Rx ring ECN
  2023-01-13  0:09     ` Jakub Kicinski
@ 2023-01-19 17:05       ` Jeremy Harris
  2023-01-19 17:20         ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Jeremy Harris @ 2023-01-19 17:05 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, jgh



On 13/01/2023 00:09, Jakub Kicinski wrote:
> It may be cool if we can retrofit some second-order signal into
> the time-based machinery. The problem is that we don't actually
> have any time-based machinery upstream, yet :(
> And designing interfaces for a decade-old HW seems shortsighted.
> 
>>> Host level congestion is better detected using time / latency signals.
>>> Timestamp the packet at the NIC and compare the Rx time to current time
>>> when processing by the driver.
>>>
>>> Google search "Google Swift congestion control".

> Grep for HWTSTAMP_FILTER_ALL, there's HW out there.

OK.

>> - does not address Rx drops due to Rx ring-buffer overflow
> 
> It's a stronger signal than "continuous run of packets".
> You can have a standing queue of 2 packets, and keep processing
> for ever. There's no congestion, or overload. You'd see that
> timestamps are recent.

Agreed.  That's why marking at a proportion of ring-fill approaching
100% was my "preferred" implementation.  But if the current situation
with NIC API design makes that commonly impractical, I guess it's
a dead duck.
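
For reference, the fill-level test amounts to something like the
sketch below. It assumes the driver can cheaply obtain a count of
filled slots, which is the impractical part; ring_congested() is a
hypothetical helper using the same 7/8 threshold as the xgene patch.

```c
#include <stdbool.h>

/* Mark only when more than 7/8 of the ring slots are occupied. */
static inline bool ring_congested(unsigned int filled, unsigned int slots)
{
	return filled > slots * 7 / 8;
}
```

With a 256-slot ring, a standing queue of 2 packets never triggers
this, whereas 225 or more occupied slots do.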

> I experimented last year with implementing CoDel on the input queues,
> worked pretty well (scroll down ~half way):
> 
> https://developers.facebook.com/blog/post/2022/04/25/investigating-tcp-self-throttling-triggered-overload/

That looks nice.  Are there any plans to get that upstream?
-- 
Cheers,
   Jeremy


* Re: [RFC PATCH net-next 0/7] NIC driver Rx ring ECN
  2023-01-19 17:05       ` Jeremy Harris
@ 2023-01-19 17:20         ` Jakub Kicinski
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Kicinski @ 2023-01-19 17:20 UTC (permalink / raw)
  To: Jeremy Harris; +Cc: netdev, jgh

On Thu, 19 Jan 2023 17:05:55 +0000 Jeremy Harris wrote:
> > I experimented last year with implementing CoDel on the input queues,
> > worked pretty well (scroll down ~half way):
> > 
> > https://developers.facebook.com/blog/post/2022/04/25/investigating-tcp-self-throttling-triggered-overload/  
> 
> That looks nice.  Are there any plans to get that upstream?

The use of XDP is more of a temporary hack. I hope BBRv2 or some other
congestion control protocol accounting for host-level congestion will
soon come out of Google or Meta :(

Thread overview: 13+ messages
2023-01-11 14:34 [RFC PATCH net-next 0/7] NIC driver Rx ring ECN jgh
2023-01-11 14:34 ` [RFC PATCH 1/7] net: NIC driver Rx ring ECN: skbuff and tcp support jgh
2023-01-11 14:34 ` [RFC PATCH 2/7] net: NIC driver Rx ring ECN: stats counter jgh
2023-01-11 14:34 ` [RFC PATCH 3/7] drivers: net: xgene: NIC driver Rx ring ECN jgh
2023-01-11 14:34 ` [RFC PATCH 4/7] drivers: net: bnx2x: " jgh
2023-01-11 14:34 ` [RFC PATCH 5/7] " jgh
2023-01-11 14:34 ` [RFC PATCH 6/7] drivers: net: bnx2: " jgh
2023-01-11 14:34 ` [RFC PATCH 7/7] " jgh
2023-01-11 18:46 ` [RFC PATCH net-next 0/7] " Jakub Kicinski
2023-01-12 14:06   ` Jeremy Harris
2023-01-13  0:09     ` Jakub Kicinski
2023-01-19 17:05       ` Jeremy Harris
2023-01-19 17:20         ` Jakub Kicinski
