netdev.vger.kernel.org archive mirror
* [PATCH net-next 00/15] tcp: BIG TCP implementation
@ 2022-02-03  1:51 Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute Eric Dumazet
                   ` (14 more replies)
  0 siblings, 15 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

This series implements BIG TCP as presented in netdev 0x15:

https://netdevconf.info/0x15/session.html?BIG-TCP

The standard TSO/GRO packet size limit is 64KB.

With BIG TCP, we allow bigger TSO/GRO packet sizes for IPv6 traffic.

Note that this feature is not enabled by default, because it might
break some eBPF programs that assume the TCP header immediately follows the IPv6 header.

Reducing the number of packets traversing the networking stack usually improves
performance, as shown in this experiment using a 100Gbit NIC and 4K MTU.

'Standard' performance with current (74KB) limits.
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
77           138          183          8542.19    
79           143          178          8215.28    
70           117          164          9543.39    
80           144          176          8183.71    
78           126          155          9108.47    
80           146          184          8115.19    
71           113          165          9510.96    
74           113          164          9518.74    
79           137          178          8575.04    
73           111          171          9561.73    

Now enable BIG TCP on both hosts.

ip link set dev eth0 gro_ipv6_max_size 185000 gso_ipv6_max_size 185000
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
57           83           117          13871.38   
64           118          155          11432.94   
65           116          148          11507.62   
60           105          136          12645.15   
60           103          135          12760.34   
60           102          134          12832.64   
62           109          132          10877.68   
58           82           115          14052.93   
57           83           124          14212.58   
57           82           119          14196.01   

We see an increase of transactions per second, and lower latencies as well.
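As a rough cross-check of that claim, the THROUGHPUT column of the two runs above can be averaged (a quick sketch; the numbers are simply copied from the netperf output shown):

```python
# Transactions/sec column from the ten netperf runs, before and
# after enabling BIG TCP (copied from the output above).
before = [8542.19, 8215.28, 9543.39, 8183.71, 9108.47,
          8115.19, 9510.96, 9518.74, 8575.04, 9561.73]
after = [13871.38, 11432.94, 11507.62, 12645.15, 12760.34,
         12832.64, 10877.68, 14052.93, 14212.58, 14196.01]

mean_before = sum(before) / len(before)  # ~8887 transactions/sec
mean_after = sum(after) / len(after)     # ~12839 transactions/sec
gain = mean_after / mean_before - 1
print(f"{mean_before:.0f} -> {mean_after:.0f} (+{gain:.0%})")  # +44% on average
```

Roughly a 44% increase in average transactions per second on this particular run.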

Coco Li (3):
  ipv6: Add hop-by-hop header to jumbograms in ip6_output
  ipvlan: enable BIG TCP Packets
  mlx5: support BIG TCP packets

Eric Dumazet (11):
  net: add netdev->tso_ipv6_max_size attribute
  ipv6: add dev->gso_ipv6_max_size
  tcp_cubic: make hystart_ack_delay() aware of BIG TCP
  ipv6: add struct hop_jumbo_hdr definition
  ipv6/gso: remove temporary HBH/jumbo header
  ipv6/gro: insert temporary HBH/jumbo header
  net: increase MAX_SKB_FRAGS
  net: loopback: enable BIG TCP packets
  bonding: update dev->tso_ipv6_max_size
  macvlan: enable BIG TCP Packets
  mlx4: support BIG TCP packets

Coco Li (1):
  ipv6: add GRO_IPV6_MAX_SIZE

 drivers/net/bonding/bond_main.c               |  3 +
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +
 drivers/net/ethernet/mellanox/mlx4/en_tx.c    | 47 ++++++++---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 81 +++++++++++++++----
 drivers/net/ipvlan/ipvlan_main.c              |  1 +
 drivers/net/loopback.c                        |  2 +
 drivers/net/macvlan.c                         |  1 +
 include/linux/ipv6.h                          |  1 +
 include/linux/netdevice.h                     | 32 ++++++++
 include/linux/skbuff.h                        | 14 +---
 include/net/ipv6.h                            | 42 ++++++++++
 include/uapi/linux/if_link.h                  |  3 +
 net/core/dev.c                                |  4 +
 net/core/gro.c                                | 20 ++++-
 net/core/rtnetlink.c                          | 33 ++++++++
 net/core/skbuff.c                             | 21 ++++-
 net/core/sock.c                               |  6 ++
 net/ipv4/tcp_cubic.c                          |  4 +-
 net/ipv6/ip6_offload.c                        | 32 +++++++-
 net/ipv6/ip6_output.c                         | 22 ++++-
 tools/include/uapi/linux/if_link.h            |  3 +
 22 files changed, 329 insertions(+), 47 deletions(-)

-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03 16:34   ` Jakub Kicinski
  2022-02-03  1:51 ` [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size Eric Dumazet
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Some NICs (or virtual devices) are LSOv2 compatible.

BIG TCP plans to use the large LSOv2 feature for IPv6.

A new netlink attribute, IFLA_TSO_IPV6_MAX_SIZE, is defined.

Drivers should use netif_set_tso_ipv6_max_size() to advertise their limit.

Unchanged drivers will not allow big TSO packets to be sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h          | 10 ++++++++++
 include/uapi/linux/if_link.h       |  1 +
 net/core/dev.c                     |  2 ++
 net/core/rtnetlink.c               |  3 +++
 tools/include/uapi/linux/if_link.h |  1 +
 5 files changed, 17 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e490b84732d1654bf067b30f2bb0b0825f88dea9..b1f68df2b37bc4b623f61cc2c6f0c02ba2afbe02 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1948,6 +1948,7 @@ enum netdev_ml_priv_type {
  *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
  *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
  *	@watchdog_dev_tracker:	refcount tracker used by watchdog.
+ *	@tso_ipv6_max_size:	Maximum size of IPv6 TSO packets (driver/NIC limit)
  *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
@@ -2282,6 +2283,7 @@ struct net_device {
 	u8 dev_addr_shadow[MAX_ADDR_LEN];
 	netdevice_tracker	linkwatch_dev_tracker;
 	netdevice_tracker	watchdog_dev_tracker;
+	unsigned int		tso_ipv6_max_size;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -4818,6 +4820,14 @@ static inline void netif_set_gro_max_size(struct net_device *dev,
 	WRITE_ONCE(dev->gro_max_size, size);
 }
 
+/* Used by drivers to give their hardware/firmware limit for LSOv2 packets */
+static inline void netif_set_tso_ipv6_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	dev->tso_ipv6_max_size = size;
+}
+
+
 static inline void skb_gso_error_unwind(struct sk_buff *skb, __be16 protocol,
 					int pulled_hlen, u16 mac_offset,
 					int mac_len)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 6218f93f5c1a92b5765bc19dfb9d7583c3b9369b..79b9d399cd297a1f79dca5ce89762800c38ed4a8 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -348,6 +348,7 @@ enum {
 	IFLA_PARENT_DEV_NAME,
 	IFLA_PARENT_DEV_BUS_NAME,
 	IFLA_GRO_MAX_SIZE,
+	IFLA_TSO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f65f9bcf88a6d73e2c9ff741d33c18..b6ca3c348d41a097baf210f2a5d966b71308c69b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10188,6 +10188,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->gso_max_size = GSO_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
 	dev->gro_max_size = GRO_MAX_SIZE;
+	dev->tso_ipv6_max_size = GSO_MAX_SIZE;
+
 	dev->upper_level = 1;
 	dev->lower_level = 1;
 #ifdef CONFIG_LOCKDEP
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e476403231f00053e1a261f31a8760325c75c941..4cefa07195ba3b67e7b724194b5d729d395ba466 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1027,6 +1027,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SEGS */
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_TSO_IPV6_MAX_SIZE */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
@@ -1730,6 +1731,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GSO_MAX_SEGS, dev->gso_max_segs) ||
 	    nla_put_u32(skb, IFLA_GSO_MAX_SIZE, dev->gso_max_size) ||
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
+	    nla_put_u32(skb, IFLA_TSO_IPV6_MAX_SIZE, dev->tso_ipv6_max_size) ||
 #ifdef CONFIG_RPS
 	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 #endif
@@ -1883,6 +1885,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NEW_IFINDEX]	= NLA_POLICY_MIN(NLA_S32, 1),
 	[IFLA_PARENT_DEV_NAME]	= { .type = NLA_NUL_STRING },
 	[IFLA_GRO_MAX_SIZE]	= { .type = NLA_U32 },
+	[IFLA_TSO_IPV6_MAX_SIZE]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 6218f93f5c1a92b5765bc19dfb9d7583c3b9369b..79b9d399cd297a1f79dca5ce89762800c38ed4a8 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -348,6 +348,7 @@ enum {
 	IFLA_PARENT_DEV_NAME,
 	IFLA_PARENT_DEV_BUS_NAME,
 	IFLA_GRO_MAX_SIZE,
+	IFLA_TSO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
-- 
2.35.0.rc2.247.g8bbb082509-goog
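For reference, IFLA_TSO_IPV6_MAX_SIZE is carried as a plain u32 rtnetlink attribute. A minimal userspace sketch of the on-the-wire encoding follows (the numeric type value below is a made-up placeholder for illustration, not the real enum slot):

```python
import struct

# Placeholder attribute type; the real value is whichever slot the
# enum in if_link.h ends up assigning (assumption for illustration).
IFLA_TSO_IPV6_MAX_SIZE = 58

def build_u32_nlattr(nla_type: int, value: int) -> bytes:
    """Encode one rtnetlink u32 attribute: a 4-byte nlattr header
    (u16 length, u16 type, host byte order) followed by the value."""
    nla_len = 4 + 4  # header + u32 payload, already 4-byte aligned
    return struct.pack("=HHI", nla_len, nla_type, value)

attr = build_u32_nlattr(IFLA_TSO_IPV6_MAX_SIZE, 185000)
print(len(attr), struct.unpack("=HHI", attr))  # 8 (8, 58, 185000)
```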



* [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  8:57   ` Paolo Abeni
  2022-02-03  1:51 ` [PATCH net-next 03/15] tcp_cubic: make hystart_ack_delay() aware of BIG TCP Eric Dumazet
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

This enables the TCP stack to build TSO packets bigger than
64KB if the driver is LSOv2 compatible.

This patch introduces a new variable, gso_ipv6_max_size,
that is modifiable through ip link:

ip link set dev eth0 gso_ipv6_max_size 185000

User input is capped by the driver limit.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h          | 12 ++++++++++++
 include/uapi/linux/if_link.h       |  1 +
 net/core/dev.c                     |  1 +
 net/core/rtnetlink.c               | 15 +++++++++++++++
 net/core/sock.c                    |  6 ++++++
 tools/include/uapi/linux/if_link.h |  1 +
 6 files changed, 36 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b1f68df2b37bc4b623f61cc2c6f0c02ba2afbe02..2a563869ba44f7d48095d36b1395e3fbd8cfff87 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1949,6 +1949,7 @@ enum netdev_ml_priv_type {
  *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
  *	@watchdog_dev_tracker:	refcount tracker used by watchdog.
  *	@tso_ipv6_max_size:	Maximum size of IPv6 TSO packets (driver/NIC limit)
+ *	@gso_ipv6_max_size:	Maximum size of IPv6 GSO packets (user/admin limit)
  *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
@@ -2284,6 +2285,7 @@ struct net_device {
 	netdevice_tracker	linkwatch_dev_tracker;
 	netdevice_tracker	watchdog_dev_tracker;
 	unsigned int		tso_ipv6_max_size;
+	unsigned int		gso_ipv6_max_size;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -4804,6 +4806,10 @@ static inline void netif_set_gso_max_size(struct net_device *dev,
 {
 	/* dev->gso_max_size is read locklessly from sk_setup_caps() */
 	WRITE_ONCE(dev->gso_max_size, size);
+
+	/* legacy drivers want to lower gso_max_size, regardless of family. */
+	size = min(size, dev->gso_ipv6_max_size);
+	WRITE_ONCE(dev->gso_ipv6_max_size, size);
 }
 
 static inline void netif_set_gso_max_segs(struct net_device *dev,
@@ -4827,6 +4833,12 @@ static inline void netif_set_tso_ipv6_max_size(struct net_device *dev,
 	dev->tso_ipv6_max_size = size;
 }
 
+static inline void netif_set_gso_ipv6_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	size = min(size, dev->tso_ipv6_max_size);
+	WRITE_ONCE(dev->gso_ipv6_max_size, size);
+}
 
 static inline void skb_gso_error_unwind(struct sk_buff *skb, __be16 protocol,
 					int pulled_hlen, u16 mac_offset,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 79b9d399cd297a1f79dca5ce89762800c38ed4a8..024b3bd0467e1360917001dba6bcfd1f30391894 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -349,6 +349,7 @@ enum {
 	IFLA_PARENT_DEV_BUS_NAME,
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_IPV6_MAX_SIZE,
+	IFLA_GSO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index b6ca3c348d41a097baf210f2a5d966b71308c69b..53c947e6fdb7c47e6cc92fd4e38b71e9b90d921c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10189,6 +10189,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->gso_max_segs = GSO_MAX_SEGS;
 	dev->gro_max_size = GRO_MAX_SIZE;
 	dev->tso_ipv6_max_size = GSO_MAX_SIZE;
+	dev->gso_ipv6_max_size = GSO_MAX_SIZE;
 
 	dev->upper_level = 1;
 	dev->lower_level = 1;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 4cefa07195ba3b67e7b724194b5d729d395ba466..0a0b26261f6d9e4e40bf9cfbda31a29c1f2e3aaa 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1028,6 +1028,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_IPV6_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GSO_IPV6_MAX_SIZE */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
@@ -1732,6 +1733,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GSO_MAX_SIZE, dev->gso_max_size) ||
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_IPV6_MAX_SIZE, dev->tso_ipv6_max_size) ||
+	    nla_put_u32(skb, IFLA_GSO_IPV6_MAX_SIZE, dev->gso_ipv6_max_size) ||
 #ifdef CONFIG_RPS
 	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 #endif
@@ -1886,6 +1888,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PARENT_DEV_NAME]	= { .type = NLA_NUL_STRING },
 	[IFLA_GRO_MAX_SIZE]	= { .type = NLA_U32 },
 	[IFLA_TSO_IPV6_MAX_SIZE]	= { .type = NLA_U32 },
+	[IFLA_GSO_IPV6_MAX_SIZE]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2772,6 +2775,15 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_GSO_IPV6_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]);
+
+		if (dev->gso_ipv6_max_size ^ max_size) {
+			netif_set_gso_ipv6_max_size(dev, max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
 	if (tb[IFLA_GSO_MAX_SEGS]) {
 		u32 max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
 
@@ -3247,6 +3259,9 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
 		netif_set_gso_max_segs(dev, nla_get_u32(tb[IFLA_GSO_MAX_SEGS]));
 	if (tb[IFLA_GRO_MAX_SIZE])
 		netif_set_gro_max_size(dev, nla_get_u32(tb[IFLA_GRO_MAX_SIZE]));
+	if (tb[IFLA_GSO_IPV6_MAX_SIZE])
+		netif_set_gso_ipv6_max_size(dev,
+			nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]));
 
 	return dev;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 09d31a7dc68f88af42f75f3f445818fe273b04fb..aec1e156548ea0818f025fd8f448f5e353f79a3b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2279,6 +2279,12 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
 			sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
+#if IS_ENABLED(CONFIG_IPV6)
+			if (sk->sk_family == AF_INET6 &&
+			    sk_is_tcp(sk) &&
+			    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
+				sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_ipv6_max_size);
+#endif
 			sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
 			max_segs = max_t(u32, READ_ONCE(dst->dev->gso_max_segs), 1);
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 79b9d399cd297a1f79dca5ce89762800c38ed4a8..024b3bd0467e1360917001dba6bcfd1f30391894 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -349,6 +349,7 @@ enum {
 	IFLA_PARENT_DEV_BUS_NAME,
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_IPV6_MAX_SIZE,
+	IFLA_GSO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
-- 
2.35.0.rc2.247.g8bbb082509-goog
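The limit selection added to sk_setup_caps() above can be modeled in a few lines (a sketch only; the MAX_TCP_HEADER value here is a rough placeholder, the real value is computed from kernel configuration):

```python
GSO_MAX_SIZE = 65536
MAX_TCP_HEADER = 320  # placeholder; derived from config in the kernel

def sk_gso_limit(is_ipv6, is_tcp, v4mapped, gso_max_size, gso_ipv6_max_size):
    """Mimic sk_setup_caps(): IPv6 TCP sockets that are not v4-mapped
    pick up the per-device IPv6 limit; everything else keeps
    gso_max_size.  Both are then reduced by MAX_TCP_HEADER + 1."""
    size = gso_max_size
    if is_ipv6 and is_tcp and not v4mapped:
        size = gso_ipv6_max_size
    return size - (MAX_TCP_HEADER + 1)

# Native IPv6 TCP socket on a device configured for 185000 bytes:
print(sk_gso_limit(True, True, False, GSO_MAX_SIZE, 185000))   # 184679
# A v4-mapped socket stays at the legacy limit:
print(sk_gso_limit(True, True, True, GSO_MAX_SIZE, 185000))    # 65215
```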



* [PATCH net-next 03/15] tcp_cubic: make hystart_ack_delay() aware of BIG TCP
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 04/15] ipv6: add struct hop_jumbo_hdr definition Eric Dumazet
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

hystart_ack_delay() assumed that a TSO packet
would not be bigger than GSO_MAX_SIZE.

This will no longer be true.

We should use sk->sk_gso_max_size instead.

This reduces the chances of spurious Hystart ACK train detections.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index 24d562dd62254d6e50dd08236f8967400d81e1ea..dfc9dc951b7404776b2246c38273fbadf03c39fd 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -372,7 +372,7 @@ static void cubictcp_state(struct sock *sk, u8 new_state)
  * We apply another 100% factor because @rate is doubled at this point.
  * We cap the cushion to 1ms.
  */
-static u32 hystart_ack_delay(struct sock *sk)
+static u32 hystart_ack_delay(const struct sock *sk)
 {
 	unsigned long rate;
 
@@ -380,7 +380,7 @@ static u32 hystart_ack_delay(struct sock *sk)
 	if (!rate)
 		return 0;
 	return min_t(u64, USEC_PER_MSEC,
-		     div64_ul((u64)GSO_MAX_SIZE * 4 * USEC_PER_SEC, rate));
+		     div64_ul((u64)sk->sk_gso_max_size * 4 * USEC_PER_SEC, rate));
 }
 
 static void hystart_update(struct sock *sk, u32 delay)
-- 
2.35.0.rc2.247.g8bbb082509-goog
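The effect of the change is easy to see numerically; a small model of hystart_ack_delay() (pacing rate in bytes per second, result in microseconds):

```python
USEC_PER_MSEC = 1000
USEC_PER_SEC = 1_000_000

def hystart_ack_delay(gso_max_size: int, pacing_rate: int) -> int:
    """Model of the patched hystart_ack_delay(): the ACK train cushion
    is the time needed to send four max-size TSO packets at the
    current pacing rate, capped at 1 ms (all values in usec)."""
    if not pacing_rate:
        return 0
    return min(USEC_PER_MSEC, gso_max_size * 4 * USEC_PER_SEC // pacing_rate)

# At 10 GB/s pacing, four 64KB packets take ~26 usec ...
print(hystart_ack_delay(65536, 10_000_000_000))   # 26
# ... while four 185000-byte BIG TCP packets need a ~74 usec cushion.
print(hystart_ack_delay(185000, 10_000_000_000))  # 74
```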



* [PATCH net-next 04/15] ipv6: add struct hop_jumbo_hdr definition
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (2 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 03/15] tcp_cubic: make hystart_ack_delay() aware of BIG TCP Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The following patches will need to add and remove local IPv6 jumbogram
options to enable BIG TCP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ipv6.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 082f30256f59fad18b78746b6650aee840932eba..ea2a4351b654f8bc96503aae2b9adcd478e1f8b2 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -151,6 +151,17 @@ struct frag_hdr {
 	__be32	identification;
 };
 
+/*
+ * Jumbo payload option, as described in RFC 2675, section 2.
+ */
+struct hop_jumbo_hdr {
+	u8	nexthdr;
+	u8	hdrlen;
+	u8	tlv_type;	/* IPV6_TLV_JUMBO, 0xC2 */
+	u8	tlv_len;	/* 4 */
+	__be32	jumbo_payload_len;
+};
+
 #define	IP6_MF		0x0001
 #define	IP6_OFFSET	0xFFF8
 
-- 
2.35.0.rc2.247.g8bbb082509-goog
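The struct above maps to the 8-byte Hop-by-Hop jumbo option from RFC 2675. A quick sketch packing one (constants taken from the RFC: option type 0xC2, option data length 4):

```python
import struct

NEXTHDR_TCP = 6
IPV6_TLV_JUMBO = 0xC2

def pack_hop_jumbo(nexthdr: int, payload_len: int) -> bytes:
    """Pack a hop_jumbo_hdr: next header, hdrlen=0 (the whole option
    fits in 8 bytes), the jumbo TLV (type 0xC2, data length 4), and
    the 32-bit jumbo payload length in network byte order."""
    return struct.pack("!BBBBI", nexthdr, 0, IPV6_TLV_JUMBO, 4, payload_len)

hdr = pack_hop_jumbo(NEXTHDR_TCP, 100000)
print(len(hdr), hdr.hex())  # 8 0600c204000186a0
```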



* [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (3 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 04/15] ipv6: add struct hop_jumbo_hdr definition Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03 18:53   ` Alexander H Duyck
  2022-02-03  1:51 ` [PATCH net-next 06/15] ipv6/gro: insert " Eric Dumazet
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The IPv6 TCP and GRO stacks will soon be able to build BIG TCP packets,
with an added temporary Hop-by-Hop header.

If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ipv6.h | 31 +++++++++++++++++++++++++++++++
 net/core/skbuff.c  | 21 ++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ea2a4351b654f8bc96503aae2b9adcd478e1f8b2..96e916fb933c3e7d4288e86790fcb2bb1353a261 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -464,6 +464,37 @@ bool ipv6_opt_accepted(const struct sock *sk, const struct sk_buff *skb,
 struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
 					   struct ipv6_txoptions *opt);
 
+/* This helper is specialized for BIG TCP needs.
+ * It assumes the hop_jumbo_hdr will immediately follow the IPV6 header.
+ * It assumes headers are already in skb->head, thus the skb argument is only read.
+ */
+static inline bool ipv6_has_hopopt_jumbo(const struct sk_buff *skb)
+{
+	struct hop_jumbo_hdr *jhdr;
+	struct ipv6hdr *nhdr;
+
+	if (likely(skb->len <= GRO_MAX_SIZE))
+		return false;
+
+	if (skb->protocol != htons(ETH_P_IPV6))
+		return false;
+
+	if (skb_network_offset(skb) +
+	    sizeof(struct ipv6hdr) +
+	    sizeof(struct hop_jumbo_hdr) > skb_headlen(skb))
+		return false;
+
+	nhdr = ipv6_hdr(skb);
+
+	if (nhdr->nexthdr != NEXTHDR_HOP)
+		return false;
+
+	jhdr = (struct hop_jumbo_hdr *) (nhdr + 1);
+	if (jhdr->tlv_type != IPV6_TLV_JUMBO || jhdr->hdrlen != 0)
+		return false;
+	return true;
+}
+
 static inline bool ipv6_accept_ra(struct inet6_dev *idev)
 {
 	/* If forwarding is enabled, RA are not accepted unless the special
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0118f0afaa4fce8da167ddf39de4c9f3880ca05b..53f17c7392311e7123628fcab4617efc169905a1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3959,8 +3959,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	skb_frag_t *frag = skb_shinfo(head_skb)->frags;
 	unsigned int mss = skb_shinfo(head_skb)->gso_size;
 	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
+	int hophdr_len = sizeof(struct hop_jumbo_hdr);
 	struct sk_buff *frag_skb = head_skb;
-	unsigned int offset = doffset;
+	unsigned int offset;
 	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
 	unsigned int partial_segs = 0;
 	unsigned int headroom;
@@ -3968,6 +3969,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	__be16 proto;
 	bool csum, sg;
 	int nfrags = skb_shinfo(head_skb)->nr_frags;
+	struct ipv6hdr *h6;
 	int err = -ENOMEM;
 	int i = 0;
 	int pos;
@@ -3992,6 +3994,23 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	}
 
 	__skb_push(head_skb, doffset);
+
+	if (ipv6_has_hopopt_jumbo(head_skb)) {
+		/* remove the HBH header.
+		 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+		 */
+		memmove(head_skb->data + hophdr_len,
+			head_skb->data,
+			ETH_HLEN + sizeof(struct ipv6hdr));
+		head_skb->data += hophdr_len;
+		head_skb->len -= hophdr_len;
+		head_skb->network_header += hophdr_len;
+		head_skb->mac_header += hophdr_len;
+		doffset -= hophdr_len;
+		h6 = (struct ipv6hdr *)(head_skb->data + ETH_HLEN);
+		h6->nexthdr = IPPROTO_TCP;
+	}
+	offset = doffset;
 	proto = skb_network_protocol(head_skb, NULL);
 	if (unlikely(!proto))
 		return ERR_PTR(-EINVAL);
-- 
2.35.0.rc2.247.g8bbb082509-goog
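The header surgery in skb_segment() above can be illustrated on a flat byte buffer (a simplified sketch: fixed 14-byte Ethernet and 40-byte IPv6 headers, no skb bookkeeping):

```python
ETH_HLEN = 14
IPV6_HLEN = 40
HOP_JUMBO_LEN = 8
NEXTHDR_HOP = 0
IPPROTO_TCP = 6

def strip_hopopt_jumbo(pkt: bytearray) -> bytearray:
    """Mimic the fixup above: shift the Ethernet + IPv6 headers right
    over the 8-byte HBH option, drop the now-dead prefix, and point
    nexthdr straight at TCP again."""
    assert pkt[ETH_HLEN + 6] == NEXTHDR_HOP        # ipv6hdr->nexthdr
    # memmove(data + hophdr_len, data, ETH_HLEN + sizeof(ipv6hdr))
    pkt[HOP_JUMBO_LEN:HOP_JUMBO_LEN + ETH_HLEN + IPV6_HLEN] = \
        pkt[:ETH_HLEN + IPV6_HLEN]
    pkt = pkt[HOP_JUMBO_LEN:]                      # data += hophdr_len
    pkt[ETH_HLEN + 6] = IPPROTO_TCP                # h6->nexthdr = IPPROTO_TCP
    return pkt

pkt = bytearray(ETH_HLEN + IPV6_HLEN + HOP_JUMBO_LEN) + b"payload"
out = strip_hopopt_jumbo(pkt)
print(len(out), out[-7:])  # 61 bytearray(b'payload')
```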



* [PATCH net-next 06/15] ipv6/gro: insert temporary HBH/jumbo header
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (4 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  9:19   ` Paolo Abeni
  2022-02-03  1:51 ` [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE Eric Dumazet
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The following patch will add GRO_IPV6_MAX_SIZE, allowing GRO to build
BIG TCP IPv6 packets (bigger than 64KB).

This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that the resulting packet can go through the IPv6/TCP stacks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/ip6_offload.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index d37a79a8554e92a1dcaa6fd023cafe2114841ece..dac6f60436e167a3d979fef02f25fc039c6ed37d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -318,15 +318,43 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
 INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 {
 	const struct net_offload *ops;
-	struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + nhoff);
+	struct ipv6hdr *iph;
 	int err = -ENOSYS;
+	u32 payload_len;
 
 	if (skb->encapsulation) {
 		skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
 		skb_set_inner_network_header(skb, nhoff);
 	}
 
-	iph->payload_len = htons(skb->len - nhoff - sizeof(*iph));
+	payload_len = skb->len - nhoff - sizeof(*iph);
+	if (unlikely(payload_len > IPV6_MAXPLEN)) {
+		struct hop_jumbo_hdr *hop_jumbo;
+		int hoplen = sizeof(*hop_jumbo);
+
+		/* Move network header left */
+		memmove(skb_mac_header(skb) - hoplen, skb_mac_header(skb),
+			skb->transport_header - skb->mac_header);
+		skb->data -= hoplen;
+		skb->len += hoplen;
+		skb->mac_header -= hoplen;
+		skb->network_header -= hoplen;
+		iph = (struct ipv6hdr *)(skb->data + nhoff);
+		hop_jumbo = (struct hop_jumbo_hdr *)(iph + 1);
+
+		/* Build hop-by-hop options */
+		hop_jumbo->nexthdr = iph->nexthdr;
+		hop_jumbo->hdrlen = 0;
+		hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
+		hop_jumbo->tlv_len = 4;
+		hop_jumbo->jumbo_payload_len = htonl(payload_len + hoplen);
+
+		iph->nexthdr = NEXTHDR_HOP;
+		iph->payload_len = 0;
+	} else {
+		iph = (struct ipv6hdr *)(skb->data + nhoff);
+		iph->payload_len = htons(payload_len);
+	}
 
 	nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
 	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
-- 
2.35.0.rc2.247.g8bbb082509-goog
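The insertion side mirrors the GSO removal: once the merged payload no longer fits in the 16-bit payload_len, the real length moves into a jumbo option. A simplified byte-buffer sketch of the logic above (fixed 40-byte IPv6 header at nhoff, no skb bookkeeping):

```python
import struct

IPV6_MAXPLEN = 65535
NEXTHDR_HOP = 0
IPV6_TLV_JUMBO = 0xC2

def ipv6_gro_complete_len(pkt: bytearray, nhoff: int) -> bytearray:
    """Mimic ipv6_gro_complete(): for oversized packets, insert an
    8-byte HBH/jumbo option after the IPv6 header, zero payload_len,
    and store the real length (option included) in the option."""
    payload_len = len(pkt) - nhoff - 40
    if payload_len <= IPV6_MAXPLEN:
        struct.pack_into("!H", pkt, nhoff + 4, payload_len)
        return pkt
    hop = struct.pack("!BBBBI", pkt[nhoff + 6], 0, IPV6_TLV_JUMBO, 4,
                      payload_len + 8)           # jumbo covers the option too
    pkt[nhoff + 6] = NEXTHDR_HOP                 # iph->nexthdr = NEXTHDR_HOP
    struct.pack_into("!H", pkt, nhoff + 4, 0)    # iph->payload_len = 0
    return pkt[:nhoff + 40] + hop + pkt[nhoff + 40:]

pkt = bytearray(14 + 40 + 70000)   # 70000-byte TCP payload after GRO
pkt[14 + 6] = 6                    # nexthdr = TCP
out = ipv6_gro_complete_len(pkt, nhoff=14)
print(len(out) - 70054)  # 8: the jumbo option was inserted
```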



* [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (5 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 06/15] ipv6/gro: insert " Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  2:18   ` Eric Dumazet
  2022-02-03 10:44   ` Paolo Abeni
  2022-02-03  1:51 ` [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Coco Li <lixiaoyan@google.com>

Enable GRO to have an IPv6-specific limit for the max packet size.

This patch introduces a new dev->gro_ipv6_max_size
that is modifiable through ip link:

ip link set dev eth0 gro_ipv6_max_size 185000

Note that this value is only considered if it is bigger than
gro_max_size, and only for non-encapsulated TCP/IPv6 packets.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h          | 10 ++++++++++
 include/uapi/linux/if_link.h       |  1 +
 net/core/dev.c                     |  1 +
 net/core/gro.c                     | 20 ++++++++++++++++++--
 net/core/rtnetlink.c               | 15 +++++++++++++++
 tools/include/uapi/linux/if_link.h |  1 +
 6 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2a563869ba44f7d48095d36b1395e3fbd8cfff87..a3a61cffd953add6f272a53f551a49a47d200c68 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1944,6 +1944,8 @@ enum netdev_ml_priv_type {
  *			keep a list of interfaces to be deleted.
  *	@gro_max_size:	Maximum size of aggregated packet in generic
  *			receive offload (GRO)
+ *	@gro_ipv6_max_size:	Maximum size of aggregated packet in generic
+ *				receive offload (GRO), for IPv6
  *
  *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
  *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
@@ -2137,6 +2139,7 @@ struct net_device {
 	int			napi_defer_hard_irqs;
 #define GRO_MAX_SIZE		65536
 	unsigned int		gro_max_size;
+	unsigned int		gro_ipv6_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
@@ -4840,6 +4843,13 @@ static inline void netif_set_gso_ipv6_max_size(struct net_device *dev,
 	WRITE_ONCE(dev->gso_ipv6_max_size, size);
 }
 
+static inline void netif_set_gro_ipv6_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	/* This pairs with the READ_ONCE() in skb_gro_receive() */
+	WRITE_ONCE(dev->gro_ipv6_max_size, size);
+}
+
 static inline void skb_gso_error_unwind(struct sk_buff *skb, __be16 protocol,
 					int pulled_hlen, u16 mac_offset,
 					int mac_len)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 024b3bd0467e1360917001dba6bcfd1f30391894..48fe85bed4a629df0dd7cc0ee3a5139370e2c94d 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -350,6 +350,7 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_IPV6_MAX_SIZE,
 	IFLA_GSO_IPV6_MAX_SIZE,
+	IFLA_GRO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 53c947e6fdb7c47e6cc92fd4e38b71e9b90d921c..e7df5c3f53d6e96d01ff06d081cef77d0c6d9d29 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10190,6 +10190,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->gro_max_size = GRO_MAX_SIZE;
 	dev->tso_ipv6_max_size = GSO_MAX_SIZE;
 	dev->gso_ipv6_max_size = GSO_MAX_SIZE;
+	dev->gro_ipv6_max_size = GRO_MAX_SIZE;
 
 	dev->upper_level = 1;
 	dev->lower_level = 1;
diff --git a/net/core/gro.c b/net/core/gro.c
index a11b286d149593827f1990fb8d06b0295fa72189..005a05468418f0373264e8019384e2daa13176eb 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -136,11 +136,27 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	unsigned int new_truesize;
 	struct sk_buff *lp;
 
+	if (unlikely(NAPI_GRO_CB(skb)->flush))
+		return -E2BIG;
+
 	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
 	gro_max_size = READ_ONCE(p->dev->gro_max_size);
 
-	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
-		return -E2BIG;
+	if (unlikely(p->len + len >= gro_max_size)) {
+		/* pairs with WRITE_ONCE() in netif_set_gro_ipv6_max_size() */
+		unsigned int gro6_max_size = READ_ONCE(p->dev->gro_ipv6_max_size);
+
+		if (gro6_max_size > gro_max_size &&
+		    p->protocol == htons(ETH_P_IPV6) &&
+		    skb_headroom(p) >= sizeof(struct hop_jumbo_hdr) &&
+		    ipv6_hdr(p)->nexthdr == IPPROTO_TCP &&
+		    !p->encapsulation)
+			gro_max_size = gro6_max_size;
+
+		if (p->len + len >= gro_max_size)
+			return -E2BIG;
+	}
+
 
 	lp = NAPI_GRO_CB(p)->last;
 	pinfo = skb_shinfo(lp);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 0a0b26261f6d9e4e40bf9cfbda31a29c1f2e3aaa..cb552d99682ab8498613f79df9bd6fbaad8c2d59 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1029,6 +1029,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_IPV6_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_GSO_IPV6_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GRO_IPV6_MAX_SIZE */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
@@ -1734,6 +1735,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_IPV6_MAX_SIZE, dev->tso_ipv6_max_size) ||
 	    nla_put_u32(skb, IFLA_GSO_IPV6_MAX_SIZE, dev->gso_ipv6_max_size) ||
+	    nla_put_u32(skb, IFLA_GRO_IPV6_MAX_SIZE, dev->gro_ipv6_max_size) ||
 #ifdef CONFIG_RPS
 	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 #endif
@@ -1889,6 +1891,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_GRO_MAX_SIZE]	= { .type = NLA_U32 },
 	[IFLA_TSO_IPV6_MAX_SIZE]	= { .type = NLA_U32 },
 	[IFLA_GSO_IPV6_MAX_SIZE]	= { .type = NLA_U32 },
+	[IFLA_GRO_IPV6_MAX_SIZE]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2784,6 +2787,15 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_GRO_IPV6_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tb[IFLA_GRO_IPV6_MAX_SIZE]);
+
+		if (dev->gro_ipv6_max_size ^ max_size) {
+			netif_set_gro_ipv6_max_size(dev, max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
 	if (tb[IFLA_GSO_MAX_SEGS]) {
 		u32 max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
 
@@ -3262,6 +3274,9 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
 	if (tb[IFLA_GSO_IPV6_MAX_SIZE])
 		netif_set_gso_ipv6_max_size(dev,
 			nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]));
+	if (tb[IFLA_GRO_IPV6_MAX_SIZE])
+		netif_set_gro_ipv6_max_size(dev,
+			nla_get_u32(tb[IFLA_GRO_IPV6_MAX_SIZE]));
 
 	return dev;
 }
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 024b3bd0467e1360917001dba6bcfd1f30391894..48fe85bed4a629df0dd7cc0ee3a5139370e2c94d 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -350,6 +350,7 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_IPV6_MAX_SIZE,
 	IFLA_GSO_IPV6_MAX_SIZE,
+	IFLA_GRO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (6 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  9:07   ` Paolo Abeni
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Coco Li <lixiaoyan@google.com>

Instead of simply forcing a 0 payload_len in the IPv6 header,
implement RFC 2675 and insert a custom extension header.

Note that only the TCP stack currently generates jumbograms,
and that this extension header is purely local: it won't be
sent on a physical link.

This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/ipv6.h  |  1 +
 net/ipv6/ip6_output.c | 22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 1e0f8a31f3de175659dca9ecee9f97d8b01e2b68..d3fb87e1589997570cde9cb5d92b2222008a229d 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -144,6 +144,7 @@ struct inet6_skb_parm {
 #define IP6SKB_L3SLAVE         64
 #define IP6SKB_JUMBOGRAM      128
 #define IP6SKB_SEG6	      256
+#define IP6SKB_FAKEJUMBO      512
 };
 
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 0c6c971ce0a58b50f8a9349b8507dffac9c7818c..f78ba145620560e5d7cb25aaf16fec61ddd9ed40 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -180,7 +180,9 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
 #endif
 
 	mtu = ip6_skb_dst_mtu(skb);
-	if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
+	if (skb_is_gso(skb) &&
+	    !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
+	    !skb_gso_validate_network_len(skb, mtu))
 		return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
 
 	if ((skb->len > mtu && !skb_is_gso(skb)) ||
@@ -251,6 +253,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
 	struct inet6_dev *idev = ip6_dst_idev(dst);
+	struct hop_jumbo_hdr *hop_jumbo;
+	int hoplen = sizeof(*hop_jumbo);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -258,7 +262,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
+	head_room = sizeof(struct ipv6hdr) + hoplen + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
@@ -281,6 +285,20 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 					     &fl6->saddr);
 	}
 
+	if (unlikely(seg_len > IPV6_MAXPLEN)) {
+		hop_jumbo = skb_push(skb, hoplen);
+
+		hop_jumbo->nexthdr = proto;
+		hop_jumbo->hdrlen = 0;
+		hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
+		hop_jumbo->tlv_len = 4;
+		hop_jumbo->jumbo_payload_len = htonl(seg_len + hoplen);
+
+		proto = IPPROTO_HOPOPTS;
+		seg_len = 0;
+		IP6CB(skb)->flags |= IP6SKB_FAKEJUMBO;
+	}
+
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (7 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  5:02   ` kernel test robot
                     ` (4 more replies)
  2022-02-03  1:51 ` [PATCH net-next 10/15] net: loopback: enable BIG TCP packets Eric Dumazet
                   ` (5 subsequent siblings)
  14 siblings, 5 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Currently, the MAX_SKB_FRAGS value is 17.

For standard tcp sendmsg() traffic this is not a problem, because
tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per frag.

But with zero copy, we use order-0 pages.

For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
to be able to fit 45 segments per skb.

This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with frag list.

We have used this MAX_SKB_FRAGS value for years at Google before
we deployed 4K MTU, with no adverse effect.
Back then, the goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a27bcc4f7e9a92ea4f4f4f6e5f454bb4f8099f66..08c12c41c5a5907dccc7389f396394d8132d962e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -323,18 +323,8 @@ enum skb_drop_reason {
 	SKB_DROP_REASON_MAX,
 };
 
-/* To allow 64K frame to be packed as single skb without frag_list we
- * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
- * buffers which do not start on a page boundary.
- *
- * Since GRO uses frags we allocate at least 16 regardless of page
- * size.
- */
-#if (65536/PAGE_SIZE + 1) < 16
-#define MAX_SKB_FRAGS 16UL
-#else
-#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
-#endif
+#define MAX_SKB_FRAGS 45UL
+
 extern int sysctl_max_skb_frags;
 
 /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 10/15] net: loopback: enable BIG TCP packets
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (8 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 11/15] bonding: update dev->tso_ipv6_max_size Eric Dumazet
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Set the driver limit to 512 KB per TSO IPv6 packet.

This allows the admin/user to set a GSO IPv6 limit up to this value.

Tested:

ip link set dev lo gso_ipv6_max_size 200000
netperf -H ::1 -t TCP_RR -l 100 -- -r 80000,80000 &

tcpdump shows :

18:28:42.962116 IP6 ::1 > ::1: HBH 40051 > 63780: Flags [P.], seq 3626480001:3626560001, ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000
18:28:42.962138 IP6 ::1.63780 > ::1.40051: Flags [.], ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 0
18:28:42.962152 IP6 ::1 > ::1: HBH 63780 > 40051: Flags [P.], seq 3626560001:3626640001, ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000
18:28:42.962157 IP6 ::1.40051 > ::1.63780: Flags [.], ack 3626640001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 0
18:28:42.962180 IP6 ::1 > ::1: HBH 40051 > 63780: Flags [P.], seq 3626560001:3626640001, ack 3626640001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000
18:28:42.962214 IP6 ::1.63780 > ::1.40051: Flags [.], ack 3626640001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179265], length 0
18:28:42.962228 IP6 ::1 > ::1: HBH 63780 > 40051: Flags [P.], seq 3626640001:3626720001, ack 3626640001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179265], length 80000
18:28:42.962233 IP6 ::1.40051 > ::1.63780: Flags [.], ack 3626720001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179266], length 0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/loopback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index ed0edf5884ef85bf49534ff85b7dca3d9c6aa3ab..0adb2eaaf6112d83ce067e49a4b62a28a67bfcf4 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -191,6 +191,8 @@ static void gen_lo_setup(struct net_device *dev,
 	dev->netdev_ops		= dev_ops;
 	dev->needs_free_netdev	= true;
 	dev->priv_destructor	= dev_destructor;
+
+	netif_set_tso_ipv6_max_size(dev, 512 * 1024);
 }
 
 /* The loopback device is special. There is only one instance
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 11/15] bonding: update dev->tso_ipv6_max_size
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (9 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 10/15] net: loopback: enable BIG TCP packets Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 12/15] macvlan: enable BIG TCP Packets Eric Dumazet
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Use the minimal value found in the set of lower devices.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/bonding/bond_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 238b56d77c369d9595d55bc681c2191c49dd2905..053ade451ab1647dc099b7ce1cfd89c333c1b60f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1419,6 +1419,7 @@ static void bond_compute_features(struct bonding *bond)
 	struct slave *slave;
 	unsigned short max_hard_header_len = ETH_HLEN;
 	unsigned int gso_max_size = GSO_MAX_SIZE;
+	unsigned int tso_ipv6_max_size = ~0U;
 	u16 gso_max_segs = GSO_MAX_SEGS;
 
 	if (!bond_has_slaves(bond))
@@ -1449,6 +1450,7 @@ static void bond_compute_features(struct bonding *bond)
 			max_hard_header_len = slave->dev->hard_header_len;
 
 		gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
+		tso_ipv6_max_size = min(tso_ipv6_max_size, slave->dev->tso_ipv6_max_size);
 		gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
 	}
 	bond_dev->hard_header_len = max_hard_header_len;
@@ -1464,6 +1466,7 @@ static void bond_compute_features(struct bonding *bond)
 	bond_dev->mpls_features = mpls_features;
 	netif_set_gso_max_segs(bond_dev, gso_max_segs);
 	netif_set_gso_max_size(bond_dev, gso_max_size);
+	netif_set_tso_ipv6_max_size(bond_dev, tso_ipv6_max_size);
 
 	bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
 	if ((bond_dev->priv_flags & IFF_XMIT_DST_RELEASE_PERM) &&
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 12/15] macvlan: enable BIG TCP Packets
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (10 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 11/15] bonding: update dev->tso_ipv6_max_size Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 13/15] ipvlan: " Eric Dumazet
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Inherit tso_ipv6_max_size from lower device.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/macvlan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 6ef5f77be4d0ad49b2ef1282413fa30f12072b58..ca2e828de5b09a62ebc9cf1c10506e6ecef34330 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -902,6 +902,7 @@ static int macvlan_init(struct net_device *dev)
 	dev->hw_enc_features    |= dev->features;
 	netif_set_gso_max_size(dev, lowerdev->gso_max_size);
 	netif_set_gso_max_segs(dev, lowerdev->gso_max_segs);
+	netif_set_tso_ipv6_max_size(dev, lowerdev->tso_ipv6_max_size);
 	dev->hard_header_len	= lowerdev->hard_header_len;
 	macvlan_set_lockdep_class(dev);
 
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 13/15] ipvlan: enable BIG TCP Packets
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (11 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 12/15] macvlan: enable BIG TCP Packets Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 14/15] mlx4: support BIG TCP packets Eric Dumazet
  2022-02-03  1:51 ` [PATCH net-next 15/15] mlx5: " Eric Dumazet
  14 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet

From: Coco Li <lixiaoyan@google.com>

Inherit tso_ipv6_max_size from physical device.

Tested:

eth0 tso_ipv6_max_size is set to 524288

ip link add link eth0 name ipvl1 type ipvlan
ip -d link show ipvl1
10: ipvl1@eth0:...
	ipvlan  mode l3 bridge addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 gro_max_size 65536 gso_ipv6_max_size 65535 tso_ipv6_max_size 524288 gro_ipv6_max_size 65536

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 696e245f6d009d4d5d4a9c3523e4aa1e5d0f8bb6..4de30df25f19b32a78a06d18c99e94662307b7fb 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -141,6 +141,7 @@ static int ipvlan_init(struct net_device *dev)
 	dev->hw_enc_features |= dev->features;
 	netif_set_gso_max_size(dev, phy_dev->gso_max_size);
 	netif_set_gso_max_segs(dev, phy_dev->gso_max_segs);
+	netif_set_tso_ipv6_max_size(dev, phy_dev->tso_ipv6_max_size);
 	dev->hard_header_len = phy_dev->hard_header_len;
 
 	netdev_lockdep_set_classes(dev);
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 14/15] mlx4: support BIG TCP packets
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (12 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 13/15] ipvlan: " Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03 13:04   ` Tariq Toukan
  2022-02-03  1:51 ` [PATCH net-next 15/15] mlx5: " Eric Dumazet
  14 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet, Tariq Toukan

From: Eric Dumazet <edumazet@google.com>

mlx4 supports LSOv2 just fine.

The IPv6 stack inserts a temporary Hop-by-Hop header
with a JUMBO TLV for big packets.

We need to ignore the HBH header when populating the TX descriptor.

Tested:

Before: (not enabling bigger TSO/GRO packets)

ip link set dev eth0 gso_ipv6_max_size 65536 gro_ipv6_max_size 65536

netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

262144 540000 70000   70000  10.00   6591.45  0.86   1.34   62.490  97.446
262144 540000

After: (enabling bigger TSO/GRO packets)

ip link set dev eth0 gso_ipv6_max_size 185000 gro_ipv6_max_size 185000

netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

262144 540000 70000   70000  10.00   8383.95  0.95   1.01   54.432  57.584
262144 540000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  3 ++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c    | 47 +++++++++++++++----
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c61dc7ae0c056a4dbcf24297549f6b1b5cc25d92..76cb93f5e5240c54f6f4c57e39739376206b4f34 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3417,6 +3417,9 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	dev->min_mtu = ETH_MIN_MTU;
 	dev->max_mtu = priv->max_mtu;
 
+	/* supports LSOv2 packets, 512KB limit has been tested. */
+	netif_set_tso_ipv6_max_size(dev, 512 * 1024);
+
 	mdev->pndev[port] = dev;
 	mdev->upper[port] = NULL;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 817f4154b86d599cd593876ec83529051d95fe2f..c89b3e8094e7d8cfb11aaa6cc4ad63bf3ad5934e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -44,6 +44,7 @@
 #include <linux/ipv6.h>
 #include <linux/moduleparam.h>
 #include <linux/indirect_call_wrapper.h>
+#include <net/ipv6.h>
 
 #include "mlx4_en.h"
 
@@ -635,19 +636,28 @@ static int get_real_size(const struct sk_buff *skb,
 			 struct net_device *dev,
 			 int *lso_header_size,
 			 bool *inline_ok,
-			 void **pfrag)
+			 void **pfrag,
+			 int *hopbyhop)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
 	if (shinfo->gso_size) {
 		*inline_ok = false;
-		if (skb->encapsulation)
+		*hopbyhop = 0;
+		if (skb->encapsulation) {
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
-		else
+		} else {
+			/* Detects large IPV6 TCP packets and prepares for removal of
+			 * HBH header that has been pushed by ip6_xmit(),
+			 * mainly so that tcpdump can dissect them.
+			 */
+			if (ipv6_has_hopopt_jumbo(skb))
+				*hopbyhop = sizeof(struct hop_jumbo_hdr);
 			*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
+		}
 		real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
-			ALIGN(*lso_header_size + 4, DS_SIZE);
+			ALIGN(*lso_header_size - *hopbyhop + 4, DS_SIZE);
 		if (unlikely(*lso_header_size != skb_headlen(skb))) {
 			/* We add a segment for the skb linear buffer only if
 			 * it contains data */
@@ -874,6 +884,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	int desc_size;
 	int real_size;
 	u32 index, bf_index;
+	struct ipv6hdr *h6;
 	__be32 op_own;
 	int lso_header_size;
 	void *fragptr = NULL;
@@ -882,6 +893,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool stop_queue;
 	bool inline_ok;
 	u8 data_offset;
+	int hopbyhop;
 	bool bf_ok;
 
 	tx_ind = skb_get_queue_mapping(skb);
@@ -891,7 +903,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 
 	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
-				  &inline_ok, &fragptr);
+				  &inline_ok, &fragptr, &hopbyhop);
 	if (unlikely(!real_size))
 		goto tx_drop_count;
 
@@ -944,7 +956,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		data = &tx_desc->data;
 		data_offset = offsetof(struct mlx4_en_tx_desc, data);
 	} else {
-		int lso_align = ALIGN(lso_header_size + 4, DS_SIZE);
+		int lso_align = ALIGN(lso_header_size - hopbyhop + 4, DS_SIZE);
 
 		data = (void *)&tx_desc->lso + lso_align;
 		data_offset = offsetof(struct mlx4_en_tx_desc, lso) + lso_align;
@@ -1009,14 +1021,31 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			((ring->prod & ring->size) ?
 				cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
 
+		lso_header_size -= hopbyhop;
 		/* Fill in the LSO prefix */
 		tx_desc->lso.mss_hdr_size = cpu_to_be32(
 			shinfo->gso_size << 16 | lso_header_size);
 
-		/* Copy headers;
-		 * note that we already verified that it is linear */
-		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
 
+		if (unlikely(hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			memcpy(tx_desc->lso.header, skb->data, ETH_HLEN + sizeof(*h6));
+			h6 = (struct ipv6hdr *)((char *)tx_desc->lso.header + ETH_HLEN);
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else {
+			/* Copy headers;
+			 * note that we already verified that it is linear
+			 */
+			memcpy(tx_desc->lso.header, skb->data, lso_header_size);
+		}
 		ring->tso_packets++;
 
 		i = shinfo->gso_segs;
-- 
2.35.0.rc2.247.g8bbb082509-goog


* [PATCH net-next 15/15] mlx5: support BIG TCP packets
  2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
                   ` (13 preceding siblings ...)
  2022-02-03  1:51 ` [PATCH net-next 14/15] mlx4: support BIG TCP packets Eric Dumazet
@ 2022-02-03  1:51 ` Eric Dumazet
  2022-02-03  7:27   ` Tariq Toukan
  2022-02-04  4:03   ` kernel test robot
  14 siblings, 2 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  1:51 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Eric Dumazet, Saeed Mahameed,
	Leon Romanovsky

From: Coco Li <lixiaoyan@google.com>

mlx5 supports LSOv2.

The IPv6 GRO/TCP stacks insert a temporary Hop-by-Hop header
with a JUMBO TLV for big packets.

We need to ignore/skip this HBH header when populating the TX descriptor.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 81 +++++++++++++++----
 2 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index bf80fb6124499fc4e6a0310ab92c91159b4ccbbb..1c4ce90e5d0f5186c402137b744258ff4ce6a348 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4888,6 +4888,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 
 	netdev->priv_flags       |= IFF_UNICAST_FLT;
 
+	netif_set_tso_ipv6_max_size(netdev, 512 * 1024);
 	mlx5e_set_netdev_dev_addr(netdev);
 	mlx5e_ipsec_build_netdev(priv);
 	mlx5e_tls_build_netdev(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 7fd33b356cc8d191413e8259acd0b26b3ebd6ba9..fc945bd8219dcb69950b1840bb492649c8749976 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -40,6 +40,7 @@
 #include "en_accel/en_accel.h"
 #include "en_accel/ipsec_rxtx.h"
 #include "en/ptp.h"
+#include <net/ipv6.h>
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
 {
@@ -241,8 +242,11 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		sq->stats->csum_none++;
 }
 
+/* Returns the number of header bytes that we plan
+ * to inline later in the transmit descriptor
+ */
 static inline u16
-mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
+mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 	u16 ihs;
@@ -252,15 +256,18 @@ mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
 		stats->tso_inner_packets++;
 		stats->tso_inner_bytes += skb->len - ihs;
 	} else {
-		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
 			ihs = skb_transport_offset(skb) + sizeof(struct udphdr);
-		else
+		} else {
+			if (ipv6_has_hopopt_jumbo(skb))
+				*hopbyhop = sizeof(struct hop_jumbo_hdr);
 			ihs = skb_transport_offset(skb) + tcp_hdrlen(skb);
+		}
 		stats->tso_packets++;
-		stats->tso_bytes += skb->len - ihs;
+		stats->tso_bytes += skb->len - ihs - *hopbyhop;
 	}
 
-	return ihs;
+	return ihs - *hopbyhop;
 }
 
 static inline int
@@ -319,6 +326,7 @@ struct mlx5e_tx_attr {
 	__be16 mss;
 	u16 insz;
 	u8 opcode;
+	u8 hopbyhop;
 };
 
 struct mlx5e_tx_wqe_attr {
@@ -355,14 +363,16 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5e_sq_stats *stats = sq->stats;
 
 	if (skb_is_gso(skb)) {
-		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
+		int hopbyhop;
+		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
 
 		*attr = (struct mlx5e_tx_attr) {
 			.opcode    = MLX5_OPCODE_LSO,
 			.mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
 			.ihs       = ihs,
 			.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
-			.headlen   = skb_headlen(skb) - ihs,
+			.headlen   = skb_headlen(skb) - ihs - hopbyhop,
+			.hopbyhop  = hopbyhop,
 		};
 
 		stats->packets += skb_shinfo(skb)->gso_segs;
@@ -476,7 +486,8 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5_wqe_eth_seg  *eseg;
 	struct mlx5_wqe_data_seg *dseg;
 	struct mlx5e_tx_wqe_info *wi;
-
+	u16 ihs = attr->ihs;
+	struct ipv6hdr *h6;
 	struct mlx5e_sq_stats *stats = sq->stats;
 	int num_dma;
 
@@ -490,15 +501,36 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	eseg->mss = attr->mss;
 
-	if (attr->ihs) {
-		if (skb_vlan_tag_present(skb)) {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
-			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
+	if (ihs) {
+		u8 *start = eseg->inline_hdr.start;
+
+		if (unlikely(attr->hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			if (skb_vlan_tag_present(skb)) {
+				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
+				ihs += VLAN_HLEN;
+				h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
+			} else {
+				memcpy(start, skb->data, ETH_HLEN + sizeof(*h6));
+				h6 = (struct ipv6hdr *)(start + ETH_HLEN);
+			}
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else if (skb_vlan_tag_present(skb)) {
+			mlx5e_insert_vlan(start, skb, ihs);
+			ihs += VLAN_HLEN;
 			stats->added_vlan_packets++;
 		} else {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
-			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
+			memcpy(start, skb->data, ihs);
 		}
+		eseg->inline_hdr.sz |= cpu_to_be16(ihs);
 		dseg += wqe_attr->ds_cnt_inl;
 	} else if (skb_vlan_tag_present(skb)) {
 		eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
@@ -509,7 +541,7 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	}
 
 	dseg += wqe_attr->ds_cnt_ids;
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs + attr->hopbyhop,
 					  attr->headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
@@ -1016,12 +1048,27 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	eseg->mss = attr.mss;
 
 	if (attr.ihs) {
-		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		if (unlikely(attr.hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));
+			h6 = (struct ipv6hdr *)((char *)eseg->inline_hdr.start + ETH_HLEN);
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else {
+			memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		}
 		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
 		dseg += wqe_attr.ds_cnt_inl;
 	}
 
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs + attr.hopbyhop,
 					  attr.headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE
  2022-02-03  1:51 ` [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE Eric Dumazet
@ 2022-02-03  2:18   ` Eric Dumazet
  2022-02-03 10:44   ` Paolo Abeni
  1 sibling, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  2:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, Jakub Kicinski, netdev, Coco Li

On Wed, Feb 2, 2022 at 5:52 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: "Signed-off-by: Coco Li" <lixiaoyan@google.com>

Small glitch here, it should be:

From: Coco Li <lixiaoyan@google.com>

Fixed in my tree :)

>
> Enable GRO to have IPv6 specific limit for max packet size.
>
> This patch introduces new dev->gro_ipv6_max_size
> that is modifiable through ip link.
>
> ip link set dev eth0 gro_ipv6_max_size 185000
>
> Note that this value is only considered if bigger than
> gro_max_size, and for non encapsulated TCP/ipv6 packets.
>
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---


* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
@ 2022-02-03  5:02   ` kernel test robot
  2022-02-03  5:20     ` Eric Dumazet
  2022-02-03  5:23   ` kernel test robot
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 58+ messages in thread
From: kernel test robot @ 2022-02-03  5:02 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: kbuild-all, netdev, Eric Dumazet, Coco Li

Hi Eric,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: arc-randconfig-r043-20220130 (https://download.01.org/0day-ci/archive/20220203/202202031206.1nNLT568-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/container_of.h:5,
                    from include/linux/list.h:5,
                    from include/linux/rculist.h:10,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from include/linux/ptrace.h:6,
                    from include/uapi/asm-generic/bpf_perf_event.h:4,
                    from ./arch/arc/include/generated/uapi/asm/bpf_perf_event.h:1,
                    from include/uapi/linux/bpf_perf_event.h:11,
                    from kernel/bpf/btf.c:6:
>> include/linux/build_bug.h:78:41: error: static assertion failed: "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/linux/skmsg.h:41:1: note: in expansion of macro 'static_assert'
      41 | static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
         | ^~~~~~~~~~~~~
   kernel/bpf/btf.c: In function 'btf_seq_show':
   kernel/bpf/btf.c:6049:29: warning: function 'btf_seq_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
    6049 |         seq_vprintf((struct seq_file *)show->target, fmt, args);
         |                             ^~~~~~~~
   kernel/bpf/btf.c: In function 'btf_snprintf_show':
   kernel/bpf/btf.c:6086:9: warning: function 'btf_snprintf_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
    6086 |         len = vsnprintf(show->target, ssnprintf->len_left, fmt, args);
         |         ^~~


vim +78 include/linux/build_bug.h

bc6245e5efd70c4 Ian Abbott       2017-07-10  60  
6bab69c65013bed Rasmus Villemoes 2019-03-07  61  /**
6bab69c65013bed Rasmus Villemoes 2019-03-07  62   * static_assert - check integer constant expression at build time
6bab69c65013bed Rasmus Villemoes 2019-03-07  63   *
6bab69c65013bed Rasmus Villemoes 2019-03-07  64   * static_assert() is a wrapper for the C11 _Static_assert, with a
6bab69c65013bed Rasmus Villemoes 2019-03-07  65   * little macro magic to make the message optional (defaulting to the
6bab69c65013bed Rasmus Villemoes 2019-03-07  66   * stringification of the tested expression).
6bab69c65013bed Rasmus Villemoes 2019-03-07  67   *
6bab69c65013bed Rasmus Villemoes 2019-03-07  68   * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
6bab69c65013bed Rasmus Villemoes 2019-03-07  69   * scope, but requires the expression to be an integer constant
6bab69c65013bed Rasmus Villemoes 2019-03-07  70   * expression (i.e., it is not enough that __builtin_constant_p() is
6bab69c65013bed Rasmus Villemoes 2019-03-07  71   * true for expr).
6bab69c65013bed Rasmus Villemoes 2019-03-07  72   *
6bab69c65013bed Rasmus Villemoes 2019-03-07  73   * Also note that BUILD_BUG_ON() fails the build if the condition is
6bab69c65013bed Rasmus Villemoes 2019-03-07  74   * true, while static_assert() fails the build if the expression is
6bab69c65013bed Rasmus Villemoes 2019-03-07  75   * false.
6bab69c65013bed Rasmus Villemoes 2019-03-07  76   */
6bab69c65013bed Rasmus Villemoes 2019-03-07  77  #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
6bab69c65013bed Rasmus Villemoes 2019-03-07 @78  #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
6bab69c65013bed Rasmus Villemoes 2019-03-07  79  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  5:02   ` kernel test robot
@ 2022-02-03  5:20     ` Eric Dumazet
  2022-02-03  5:31       ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  5:20 UTC (permalink / raw)
  To: kernel test robot
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, kbuild-all,
	netdev, Coco Li

On Wed, Feb 2, 2022 at 9:02 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Eric,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
> config: arc-randconfig-r043-20220130 (https://download.01.org/0day-ci/archive/20220203/202202031206.1nNLT568-lkp@intel.com/config)
> compiler: arc-elf-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
>         git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
>         # save the config file to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash kernel/bpf/
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>    In file included from include/linux/container_of.h:5,
>                     from include/linux/list.h:5,
>                     from include/linux/rculist.h:10,
>                     from include/linux/pid.h:5,
>                     from include/linux/sched.h:14,
>                     from include/linux/ptrace.h:6,
>                     from include/uapi/asm-generic/bpf_perf_event.h:4,
>                     from ./arch/arc/include/generated/uapi/asm/bpf_perf_event.h:1,
>                     from include/uapi/linux/bpf_perf_event.h:11,
>                     from kernel/bpf/btf.c:6:
> >> include/linux/build_bug.h:78:41: error: static assertion failed: "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
>       78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>          |                                         ^~~~~~~~~~~~~~
>    include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
>       77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
>          |                                  ^~~~~~~~~~~~~~~
>    include/linux/skmsg.h:41:1: note: in expansion of macro 'static_assert'
>       41 | static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);

Not clear why we have this assertion. Do we use a bitmap in an
"unsigned long" in skmsg?

We could still use the old 17 limit for 32bit arches/builds.

>          | ^~~~~~~~~~~~~
>    kernel/bpf/btf.c: In function 'btf_seq_show':
>    kernel/bpf/btf.c:6049:29: warning: function 'btf_seq_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
>     6049 |         seq_vprintf((struct seq_file *)show->target, fmt, args);
>          |                             ^~~~~~~~
>    kernel/bpf/btf.c: In function 'btf_snprintf_show':
>    kernel/bpf/btf.c:6086:9: warning: function 'btf_snprintf_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
>     6086 |         len = vsnprintf(show->target, ssnprintf->len_left, fmt, args);
>          |         ^~~
>
>
> vim +78 include/linux/build_bug.h
>
> bc6245e5efd70c4 Ian Abbott       2017-07-10  60
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  61  /**
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  62   * static_assert - check integer constant expression at build time
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  63   *
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  64   * static_assert() is a wrapper for the C11 _Static_assert, with a
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  65   * little macro magic to make the message optional (defaulting to the
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  66   * stringification of the tested expression).
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  67   *
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  68   * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  69   * scope, but requires the expression to be an integer constant
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  70   * expression (i.e., it is not enough that __builtin_constant_p() is
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  71   * true for expr).
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  72   *
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  73   * Also note that BUILD_BUG_ON() fails the build if the condition is
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  74   * true, while static_assert() fails the build if the expression is
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  75   * false.
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  76   */
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  77  #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
> 6bab69c65013bed Rasmus Villemoes 2019-03-07 @78  #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  79
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
  2022-02-03  5:02   ` kernel test robot
@ 2022-02-03  5:23   ` kernel test robot
  2022-02-03  5:43   ` kernel test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-02-03  5:23 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: llvm, kbuild-all, netdev, Eric Dumazet, Coco Li

Hi Eric,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: hexagon-randconfig-r045-20220130 (https://download.01.org/0day-ci/archive/20220203/202202031315.B425Ipe8-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project a73e4ce6a59b01f0e37037761c1e6889d539d233)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from kernel/bpf/btf.c:22:
>> include/linux/skmsg.h:41:1: error: static_assert failed due to requirement '32 >= (45UL + 1)' "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
   static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:41: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                           ^              ~~~~
   1 error generated.


vim +41 include/linux/skmsg.h

604326b41a6fb9 Daniel Borkmann 2018-10-13  25  
604326b41a6fb9 Daniel Borkmann 2018-10-13  26  struct sk_msg_sg {
604326b41a6fb9 Daniel Borkmann 2018-10-13  27  	u32				start;
604326b41a6fb9 Daniel Borkmann 2018-10-13  28  	u32				curr;
604326b41a6fb9 Daniel Borkmann 2018-10-13  29  	u32				end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  30  	u32				size;
604326b41a6fb9 Daniel Borkmann 2018-10-13  31  	u32				copybreak;
163ab96b52ae2b Jakub Kicinski  2019-10-06  32  	unsigned long			copy;
031097d9e079e4 Jakub Kicinski  2019-11-27  33  	/* The extra two elements:
031097d9e079e4 Jakub Kicinski  2019-11-27  34  	 * 1) used for chaining the front and sections when the list becomes
031097d9e079e4 Jakub Kicinski  2019-11-27  35  	 *    partitioned (e.g. end < start). The crypto APIs require the
031097d9e079e4 Jakub Kicinski  2019-11-27  36  	 *    chaining;
031097d9e079e4 Jakub Kicinski  2019-11-27  37  	 * 2) to chain tailer SG entries after the message.
d3b18ad31f93d0 John Fastabend  2018-10-13  38  	 */
031097d9e079e4 Jakub Kicinski  2019-11-27  39  	struct scatterlist		data[MAX_MSG_FRAGS + 2];
604326b41a6fb9 Daniel Borkmann 2018-10-13  40  };
031097d9e079e4 Jakub Kicinski  2019-11-27 @41  static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
604326b41a6fb9 Daniel Borkmann 2018-10-13  42  



* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  5:20     ` Eric Dumazet
@ 2022-02-03  5:31       ` Jakub Kicinski
  2022-02-03  6:35         ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2022-02-03  5:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kernel test robot, Eric Dumazet, David S . Miller, kbuild-all,
	netdev, Coco Li

On Wed, 2 Feb 2022 21:20:32 -0800 Eric Dumazet wrote:
> Not clear why we have this assertion. Do we use a bitmap in an
> "unsigned long" in skmsg ?
> 
> We could still use the old 17 limit for 32bit arches/builds.

git blame points at me but I just adjusted it. Looks like it's
struct sk_msg_sg::copy that's the reason. On a quick look we
can make it an array of unsigned longs without a problem.


* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
  2022-02-03  5:02   ` kernel test robot
  2022-02-03  5:23   ` kernel test robot
@ 2022-02-03  5:43   ` kernel test robot
  2022-02-03 16:01   ` Paolo Abeni
  2022-02-03 17:26   ` Alexander H Duyck
  4 siblings, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-02-03  5:43 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: kbuild-all, netdev, Eric Dumazet, Coco Li

Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: arc-allyesconfig (https://download.01.org/0day-ci/archive/20220203/202202031344.0FFfnywX-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash drivers/net/ethernet/3com/ drivers/net/ethernet/agere/ drivers/net/ethernet/mellanox/mlx5/core/ drivers/net/wireguard/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/net/wireguard/send.c: In function 'encrypt_packet':
>> drivers/net/wireguard/send.c:219:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=]
     219 | }
         | ^
--
   drivers/net/wireguard/receive.c: In function 'decrypt_packet':
>> drivers/net/wireguard/receive.c:299:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=]
     299 | }
         | ^
--
>> drivers/net/ethernet/3com/typhoon.c:142:2: warning: #warning Typhoon only supports 32 entries in its SG list for TSO, disabling TSO [-Wcpp]
     142 | #warning Typhoon only supports 32 entries in its SG list for TSO, disabling TSO
         |  ^~~~~~~


vim +219 drivers/net/wireguard/send.c

e7096c131e5161 Jason A. Donenfeld 2019-12-09  161  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  162  static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair)
e7096c131e5161 Jason A. Donenfeld 2019-12-09  163  {
e7096c131e5161 Jason A. Donenfeld 2019-12-09  164  	unsigned int padding_len, plaintext_len, trailer_len;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  165  	struct scatterlist sg[MAX_SKB_FRAGS + 8];
e7096c131e5161 Jason A. Donenfeld 2019-12-09  166  	struct message_data *header;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  167  	struct sk_buff *trailer;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  168  	int num_frags;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  169  
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  170  	/* Force hash calculation before encryption so that flow analysis is
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  171  	 * consistent over the inner packet.
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  172  	 */
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  173  	skb_get_hash(skb);
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  174  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  175  	/* Calculate lengths. */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  176  	padding_len = calculate_skb_padding(skb);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  177  	trailer_len = padding_len + noise_encrypted_len(0);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  178  	plaintext_len = skb->len + padding_len;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  179  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  180  	/* Expand data section to have room for padding and auth tag. */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  181  	num_frags = skb_cow_data(skb, trailer_len, &trailer);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  182  	if (unlikely(num_frags < 0 || num_frags > ARRAY_SIZE(sg)))
e7096c131e5161 Jason A. Donenfeld 2019-12-09  183  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  184  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  185  	/* Set the padding to zeros, and make sure it and the auth tag are part
e7096c131e5161 Jason A. Donenfeld 2019-12-09  186  	 * of the skb.
e7096c131e5161 Jason A. Donenfeld 2019-12-09  187  	 */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  188  	memset(skb_tail_pointer(trailer), 0, padding_len);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  189  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  190  	/* Expand head section to have room for our header and the network
e7096c131e5161 Jason A. Donenfeld 2019-12-09  191  	 * stack's headers.
e7096c131e5161 Jason A. Donenfeld 2019-12-09  192  	 */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  193  	if (unlikely(skb_cow_head(skb, DATA_PACKET_HEAD_ROOM) < 0))
e7096c131e5161 Jason A. Donenfeld 2019-12-09  194  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  195  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  196  	/* Finalize checksum calculation for the inner packet, if required. */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  197  	if (unlikely(skb->ip_summed == CHECKSUM_PARTIAL &&
e7096c131e5161 Jason A. Donenfeld 2019-12-09  198  		     skb_checksum_help(skb)))
e7096c131e5161 Jason A. Donenfeld 2019-12-09  199  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  200  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  201  	/* Only after checksumming can we safely add on the padding at the end
e7096c131e5161 Jason A. Donenfeld 2019-12-09  202  	 * and the header.
e7096c131e5161 Jason A. Donenfeld 2019-12-09  203  	 */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  204  	skb_set_inner_network_header(skb, 0);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  205  	header = (struct message_data *)skb_push(skb, sizeof(*header));
e7096c131e5161 Jason A. Donenfeld 2019-12-09  206  	header->header.type = cpu_to_le32(MESSAGE_DATA);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  207  	header->key_idx = keypair->remote_index;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  208  	header->counter = cpu_to_le64(PACKET_CB(skb)->nonce);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  209  	pskb_put(skb, trailer, trailer_len);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  210  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  211  	/* Now we can encrypt the scattergather segments */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  212  	sg_init_table(sg, num_frags);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  213  	if (skb_to_sgvec(skb, sg, sizeof(struct message_data),
e7096c131e5161 Jason A. Donenfeld 2019-12-09  214  			 noise_encrypted_len(plaintext_len)) <= 0)
e7096c131e5161 Jason A. Donenfeld 2019-12-09  215  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  216  	return chacha20poly1305_encrypt_sg_inplace(sg, plaintext_len, NULL, 0,
e7096c131e5161 Jason A. Donenfeld 2019-12-09  217  						   PACKET_CB(skb)->nonce,
e7096c131e5161 Jason A. Donenfeld 2019-12-09  218  						   keypair->sending.key);
e7096c131e5161 Jason A. Donenfeld 2019-12-09 @219  }
e7096c131e5161 Jason A. Donenfeld 2019-12-09  220  



* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  5:31       ` Jakub Kicinski
@ 2022-02-03  6:35         ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03  6:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kernel test robot, Eric Dumazet, David S . Miller, kbuild-all,
	netdev, Coco Li

On Wed, Feb 2, 2022 at 9:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 2 Feb 2022 21:20:32 -0800 Eric Dumazet wrote:
> > Not clear why we have this assertion. Do we use a bitmap in an
> > "unsigned long" in skmsg ?
> >
> > We could still use the old 17 limit for 32bit arches/builds.
>
> git blame points at me but I just adjusted it. Looks like its
> struct sk_msg_sg::copy that's the reason. On a quick look we
> can make it an array of unsigned longs without a problem.

Oh right, thanks for the pointer.


* Re: [PATCH net-next 15/15] mlx5: support BIG TCP packets
  2022-02-03  1:51 ` [PATCH net-next 15/15] mlx5: " Eric Dumazet
@ 2022-02-03  7:27   ` Tariq Toukan
  2022-02-04  4:03   ` kernel test robot
  1 sibling, 0 replies; 58+ messages in thread
From: Tariq Toukan @ 2022-02-03  7:27 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Saeed Mahameed, Leon Romanovsky,
	Tariq Toukan

Hi,

Thanks for your patch!

On 2/3/2022 3:51 AM, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> mlx5 supports LSOv2.
> 
> IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> with JUMBO TLV for big packets.
> 
> We need to ignore/skip this HBH header when populating TX descriptor.
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>
> ---
>   .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 81 +++++++++++++++----
>   2 files changed, 65 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index bf80fb6124499fc4e6a0310ab92c91159b4ccbbb..1c4ce90e5d0f5186c402137b744258ff4ce6a348 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -4888,6 +4888,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
>   
>   	netdev->priv_flags       |= IFF_UNICAST_FLT;
>   
> +	netif_set_tso_ipv6_max_size(netdev, 512 * 1024);
>   	mlx5e_set_netdev_dev_addr(netdev);
>   	mlx5e_ipsec_build_netdev(priv);
>   	mlx5e_tls_build_netdev(priv);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index 7fd33b356cc8d191413e8259acd0b26b3ebd6ba9..fc945bd8219dcb69950b1840bb492649c8749976 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -40,6 +40,7 @@
>   #include "en_accel/en_accel.h"
>   #include "en_accel/ipsec_rxtx.h"
>   #include "en/ptp.h"
> +#include <net/ipv6.h>
>   
>   static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
>   {
> @@ -241,8 +242,11 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>   		sq->stats->csum_none++;
>   }
>   
> +/* Returns the number of header bytes that we plan
> + * to inline later in the transmit descriptor
> + */
>   static inline u16
> -mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
> +mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
>   {
>   	struct mlx5e_sq_stats *stats = sq->stats;
>   	u16 ihs;
> @@ -252,15 +256,18 @@ mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
>   		stats->tso_inner_packets++;
>   		stats->tso_inner_bytes += skb->len - ihs;
>   	} else {
> -		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
> +		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
>   			ihs = skb_transport_offset(skb) + sizeof(struct udphdr);
> -		else
> +		} else {
> +			if (ipv6_has_hopopt_jumbo(skb))
> +				*hopbyhop = sizeof(struct hop_jumbo_hdr);
>   			ihs = skb_transport_offset(skb) + tcp_hdrlen(skb);
> +		}
>   		stats->tso_packets++;
> -		stats->tso_bytes += skb->len - ihs;
> +		stats->tso_bytes += skb->len - ihs - *hopbyhop;

AFAIU, *hopbyhop is already accounted for inside ihs, so why subtract it
once more?

Probably it'd be cleaner to assign/fix both ihs and hopbyhop inside the
ipv6_has_hopopt_jumbo() branch:

		ihs = skb_transport_offset(skb) + tcp_hdrlen(skb);
		if (ipv6_has_hopopt_jumbo(skb)) {
			*hopbyhop = sizeof(struct hop_jumbo_hdr);
			ihs -= sizeof(struct hop_jumbo_hdr);
		}
...
		stats->tso_bytes += skb->len - ihs - *hopbyhop;
...
		return ihs;

>   	}
>   
> -	return ihs;
> +	return ihs - *hopbyhop;
>   }
>   
>   static inline int
> @@ -319,6 +326,7 @@ struct mlx5e_tx_attr {
>   	__be16 mss;
>   	u16 insz;
>   	u8 opcode;
> +	u8 hopbyhop;
>   };
>   
>   struct mlx5e_tx_wqe_attr {
> @@ -355,14 +363,16 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>   	struct mlx5e_sq_stats *stats = sq->stats;
>   
>   	if (skb_is_gso(skb)) {
> -		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
> +		int hopbyhop;

Missing init to zero: mlx5e_tx_get_gso_ihs() doesn't always write to it.

> +		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
>   
>   		*attr = (struct mlx5e_tx_attr) {
>   			.opcode    = MLX5_OPCODE_LSO,
>   			.mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
>   			.ihs       = ihs,
>   			.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
> -			.headlen   = skb_headlen(skb) - ihs,
> +			.headlen   = skb_headlen(skb) - ihs - hopbyhop,
> +			.hopbyhop  = hopbyhop,
>   		};
>   
>   		stats->packets += skb_shinfo(skb)->gso_segs;
> @@ -476,7 +486,8 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>   	struct mlx5_wqe_eth_seg  *eseg;
>   	struct mlx5_wqe_data_seg *dseg;
>   	struct mlx5e_tx_wqe_info *wi;
> -
> +	u16 ihs = attr->ihs;
> +	struct ipv6hdr *h6;
>   	struct mlx5e_sq_stats *stats = sq->stats;
>   	int num_dma;
>   
> @@ -490,15 +501,36 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>   
>   	eseg->mss = attr->mss;
>   
> -	if (attr->ihs) {
> -		if (skb_vlan_tag_present(skb)) {
> -			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
> -			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
> +	if (ihs) {
> +		u8 *start = eseg->inline_hdr.start;
> +
> +		if (unlikely(attr->hopbyhop)) {
> +			/* remove the HBH header.
> +			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> +			 */
> +			if (skb_vlan_tag_present(skb)) {
> +				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
> +				ihs += VLAN_HLEN;
> +				h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
> +			} else {
> +				memcpy(start, skb->data, ETH_HLEN + sizeof(*h6));
> +				h6 = (struct ipv6hdr *)(start + ETH_HLEN);
> +			}
> +			h6->nexthdr = IPPROTO_TCP;
> +			/* Copy the TCP header after the IPv6 one */
> +			memcpy(h6 + 1,
> +			       skb->data + ETH_HLEN + sizeof(*h6) +
> +					sizeof(struct hop_jumbo_hdr),
> +			       tcp_hdrlen(skb));
> +			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */

You are not using ihs when preparing the inline part of the descriptor, 
so this might yield a mismatch between ihs and the sum of the sizes 
you're copying above. Is there a guarantee that this won't happen?

> +		} else if (skb_vlan_tag_present(skb)) {
> +			mlx5e_insert_vlan(start, skb, ihs);
> +			ihs += VLAN_HLEN;
>   			stats->added_vlan_packets++;
>   		} else {
> -			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
> -			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
> +			memcpy(start, skb->data, ihs);
>   		}
> +		eseg->inline_hdr.sz |= cpu_to_be16(ihs);
>   		dseg += wqe_attr->ds_cnt_inl;
>   	} else if (skb_vlan_tag_present(skb)) {
>   		eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
> @@ -509,7 +541,7 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>   	}
>   
>   	dseg += wqe_attr->ds_cnt_ids;
> -	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
> +	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs + attr->hopbyhop,
>   					  attr->headlen, dseg);
>   	if (unlikely(num_dma < 0))
>   		goto err_drop;
> @@ -1016,12 +1048,27 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>   	eseg->mss = attr.mss;
>   
>   	if (attr.ihs) {
> -		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
> +		if (unlikely(attr.hopbyhop)) {
> +			/* remove the HBH header.
> +			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> +			 */
> +			memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));
> +			h6 = (struct ipv6hdr *)((char *)eseg->inline_hdr.start + ETH_HLEN);
> +			h6->nexthdr = IPPROTO_TCP;
> +			/* Copy the TCP header after the IPv6 one */
> +			memcpy(h6 + 1,
> +			       skb->data + ETH_HLEN + sizeof(*h6) +
> +					sizeof(struct hop_jumbo_hdr),
> +			       tcp_hdrlen(skb));
> +			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
> +		} else {
> +			memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
> +		}
>   		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
>   		dseg += wqe_attr.ds_cnt_inl;
>   	}
>   
> -	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
> +	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs + attr.hopbyhop,
>   					  attr.headlen, dseg);
>   	if (unlikely(num_dma < 0))
>   		goto err_drop;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size
  2022-02-03  1:51 ` [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size Eric Dumazet
@ 2022-02-03  8:57   ` Paolo Abeni
  2022-02-03 15:34     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Abeni @ 2022-02-03  8:57 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

Hello,

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> This enables the TCP stack to build TSO packets bigger than
> 64KB if the driver is LSOv2 compatible.
> 
> This patch introduces new variable gso_ipv6_max_size
> that is modifiable through ip link.
> 
> ip link set dev eth0 gso_ipv6_max_size 185000
> 
> User input is capped by driver limit.
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h          | 12 ++++++++++++
>  include/uapi/linux/if_link.h       |  1 +
>  net/core/dev.c                     |  1 +
>  net/core/rtnetlink.c               | 15 +++++++++++++++
>  net/core/sock.c                    |  6 ++++++
>  tools/include/uapi/linux/if_link.h |  1 +
>  6 files changed, 36 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b1f68df2b37bc4b623f61cc2c6f0c02ba2afbe02..2a563869ba44f7d48095d36b1395e3fbd8cfff87 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1949,6 +1949,7 @@ enum netdev_ml_priv_type {
>   *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
>   *	@watchdog_dev_tracker:	refcount tracker used by watchdog.
>   *	@tso_ipv6_max_size:	Maximum size of IPv6 TSO packets (driver/NIC limit)
> + *	@gso_ipv6_max_size:	Maximum size of IPv6 GSO packets (user/admin limit)
>   *
>   *	FIXME: cleanup struct net_device such that network protocol info
>   *	moves out.
> @@ -2284,6 +2285,7 @@ struct net_device {
>  	netdevice_tracker	linkwatch_dev_tracker;
>  	netdevice_tracker	watchdog_dev_tracker;
>  	unsigned int		tso_ipv6_max_size;
> +	unsigned int		gso_ipv6_max_size;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>  
> @@ -4804,6 +4806,10 @@ static inline void netif_set_gso_max_size(struct net_device *dev,
>  {
>  	/* dev->gso_max_size is read locklessly from sk_setup_caps() */
>  	WRITE_ONCE(dev->gso_max_size, size);
> +
> +	/* legacy drivers want to lower gso_max_size, regardless of family. */
> +	size = min(size, dev->gso_ipv6_max_size);
> +	WRITE_ONCE(dev->gso_ipv6_max_size, size);
>  }
>  
>  static inline void netif_set_gso_max_segs(struct net_device *dev,
> @@ -4827,6 +4833,12 @@ static inline void netif_set_tso_ipv6_max_size(struct net_device *dev,
>  	dev->tso_ipv6_max_size = size;
>  }
>  
> +static inline void netif_set_gso_ipv6_max_size(struct net_device *dev,
> +					       unsigned int size)
> +{
> +	size = min(size, dev->tso_ipv6_max_size);
> +	WRITE_ONCE(dev->gso_ipv6_max_size, size);

Dumb questions on my side: should the above be limited to
tso_ipv6_max_size? Or does increasing gso_ipv6_max_size help even if the
egress NIC does not support LSOv2?

Should gso_ipv6_max_size be capped to some reasonable value (well lower
than 4G), to avoid the stack building very complex skbs?

Thanks!

Paolo



* Re: [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output
  2022-02-03  1:51 ` [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
@ 2022-02-03  9:07   ` Paolo Abeni
  2022-02-03 16:31     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Abeni @ 2022-02-03  9:07 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> Instead of simply forcing a 0 payload_len in IPv6 header,
> implement RFC 2675 and insert a custom extension header.
> 
> Note that only TCP stack is currently potentially generating
> jumbograms, and that this extension header is purely local,
> it won't be sent on a physical link.
> 
> This is needed so that packet capture (tcpdump and friends)
> can properly dissect these large packets.
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/ipv6.h  |  1 +
>  net/ipv6/ip6_output.c | 22 ++++++++++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 1e0f8a31f3de175659dca9ecee9f97d8b01e2b68..d3fb87e1589997570cde9cb5d92b2222008a229d 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -144,6 +144,7 @@ struct inet6_skb_parm {
>  #define IP6SKB_L3SLAVE         64
>  #define IP6SKB_JUMBOGRAM      128
>  #define IP6SKB_SEG6	      256
> +#define IP6SKB_FAKEJUMBO      512
>  };
>  
>  #if defined(CONFIG_NET_L3_MASTER_DEV)
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 0c6c971ce0a58b50f8a9349b8507dffac9c7818c..f78ba145620560e5d7cb25aaf16fec61ddd9ed40 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -180,7 +180,9 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
>  #endif
>  
>  	mtu = ip6_skb_dst_mtu(skb);
> -	if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
> +	if (skb_is_gso(skb) &&
> +	    !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
> +	    !skb_gso_validate_network_len(skb, mtu))
>  		return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);

If I read correctly, a jumbogram whose gso len does not fit the egress
device MTU will not be fragmented, as opposed to plain old GSO packets.
Am I correct? Why is fragmentation not needed for jumbograms?

Thanks!

Paolo



* Re: [PATCH net-next 06/15] ipv6/gro: insert temporary HBH/jumbo header
  2022-02-03  1:51 ` [PATCH net-next 06/15] ipv6/gro: insert " Eric Dumazet
@ 2022-02-03  9:19   ` Paolo Abeni
  2022-02-03 15:48     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Paolo Abeni @ 2022-02-03  9:19 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
> BIG TCP ipv6 packets (bigger than 64K).
> 
> This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
> so that resulting packet can go through IPv6/TCP stacks.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv6/ip6_offload.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index d37a79a8554e92a1dcaa6fd023cafe2114841ece..dac6f60436e167a3d979fef02f25fc039c6ed37d 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -318,15 +318,43 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
>  INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
>  {
>  	const struct net_offload *ops;
> -	struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + nhoff);
> +	struct ipv6hdr *iph;
>  	int err = -ENOSYS;
> +	u32 payload_len;
>  
>  	if (skb->encapsulation) {
>  		skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
>  		skb_set_inner_network_header(skb, nhoff);
>  	}
>  
> -	iph->payload_len = htons(skb->len - nhoff - sizeof(*iph));
> +	payload_len = skb->len - nhoff - sizeof(*iph);
> +	if (unlikely(payload_len > IPV6_MAXPLEN)) {
> +		struct hop_jumbo_hdr *hop_jumbo;
> +		int hoplen = sizeof(*hop_jumbo);
> +
> +		/* Move network header left */
> +		memmove(skb_mac_header(skb) - hoplen, skb_mac_header(skb),
> +			skb->transport_header - skb->mac_header);

I was wondering whether we should check for enough headroom, and what about
TCP over UDP tunnels; then I read the next patch ;)

I think a comment here referring to the constraint enforced by
skb_gro_receive() could help, or perhaps squashing the 2 patches?!?

Thanks!

Paolo



* Re: [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE
  2022-02-03  1:51 ` [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE Eric Dumazet
  2022-02-03  2:18   ` Eric Dumazet
@ 2022-02-03 10:44   ` Paolo Abeni
  1 sibling, 0 replies; 58+ messages in thread
From: Paolo Abeni @ 2022-02-03 10:44 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: "Signed-off-by: Coco Li" <lixiaoyan@google.com>
> 
> Enable GRO to have IPv6 specific limit for max packet size.
> 
> This patch introduces new dev->gro_ipv6_max_size
> that is modifiable through ip link.
> 
> ip link set dev eth0 gro_ipv6_max_size 185000
> 
> Note that this value is only considered if bigger than
> gro_max_size, and for non encapsulated TCP/ipv6 packets.
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h          | 10 ++++++++++
>  include/uapi/linux/if_link.h       |  1 +
>  net/core/dev.c                     |  1 +
>  net/core/gro.c                     | 20 ++++++++++++++++++--
>  net/core/rtnetlink.c               | 15 +++++++++++++++
>  tools/include/uapi/linux/if_link.h |  1 +
>  6 files changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 2a563869ba44f7d48095d36b1395e3fbd8cfff87..a3a61cffd953add6f272a53f551a49a47d200c68 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1944,6 +1944,8 @@ enum netdev_ml_priv_type {
>   *			keep a list of interfaces to be deleted.
>   *	@gro_max_size:	Maximum size of aggregated packet in generic
>   *			receive offload (GRO)
> + *	@gro_ipv6_max_size:	Maximum size of aggregated packet in generic
> + *				receive offload (GRO), for IPv6
>   *
>   *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
>   *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
> @@ -2137,6 +2139,7 @@ struct net_device {
>  	int			napi_defer_hard_irqs;
>  #define GRO_MAX_SIZE		65536
>  	unsigned int		gro_max_size;
> +	unsigned int		gro_ipv6_max_size;
>  	rx_handler_func_t __rcu	*rx_handler;
>  	void __rcu		*rx_handler_data;
>  
> @@ -4840,6 +4843,13 @@ static inline void netif_set_gso_ipv6_max_size(struct net_device *dev,
>  	WRITE_ONCE(dev->gso_ipv6_max_size, size);
>  }
>  
> +static inline void netif_set_gro_ipv6_max_size(struct net_device *dev,
> +					       unsigned int size)
> +{
> +	/* This pairs with the READ_ONCE() in skb_gro_receive() */
> +	WRITE_ONCE(dev->gro_ipv6_max_size, size);
> +}
> +
>  static inline void skb_gso_error_unwind(struct sk_buff *skb, __be16 protocol,
>  					int pulled_hlen, u16 mac_offset,
>  					int mac_len)
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 024b3bd0467e1360917001dba6bcfd1f30391894..48fe85bed4a629df0dd7cc0ee3a5139370e2c94d 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -350,6 +350,7 @@ enum {
>  	IFLA_GRO_MAX_SIZE,
>  	IFLA_TSO_IPV6_MAX_SIZE,
>  	IFLA_GSO_IPV6_MAX_SIZE,
> +	IFLA_GRO_IPV6_MAX_SIZE,
>  
>  	__IFLA_MAX
>  };
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 53c947e6fdb7c47e6cc92fd4e38b71e9b90d921c..e7df5c3f53d6e96d01ff06d081cef77d0c6d9d29 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -10190,6 +10190,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	dev->gro_max_size = GRO_MAX_SIZE;
>  	dev->tso_ipv6_max_size = GSO_MAX_SIZE;
>  	dev->gso_ipv6_max_size = GSO_MAX_SIZE;
> +	dev->gro_ipv6_max_size = GRO_MAX_SIZE;
>  
>  	dev->upper_level = 1;
>  	dev->lower_level = 1;
> diff --git a/net/core/gro.c b/net/core/gro.c
> index a11b286d149593827f1990fb8d06b0295fa72189..005a05468418f0373264e8019384e2daa13176eb 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -136,11 +136,27 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  	unsigned int new_truesize;
>  	struct sk_buff *lp;
>  
> +	if (unlikely(NAPI_GRO_CB(skb)->flush))
> +		return -E2BIG;
> +
>  	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
>  	gro_max_size = READ_ONCE(p->dev->gro_max_size);
>  
> -	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
> -		return -E2BIG;
> +	if (unlikely(p->len + len >= gro_max_size)) {
> +		/* pairs with WRITE_ONCE() in netif_set_gro_ipv6_max_size() */
> +		unsigned int gro6_max_size = READ_ONCE(p->dev->gro_ipv6_max_size);
> +
> +		if (gro6_max_size > gro_max_size &&
> +		    p->protocol == htons(ETH_P_IPV6) &&
> +		    skb_headroom(p) >= sizeof(struct hop_jumbo_hdr) &&
> +		    ipv6_hdr(p)->nexthdr == IPPROTO_TCP &&
> +		    !p->encapsulation)
> +			gro_max_size = gro6_max_size;
> +
> +		if (p->len + len >= gro_max_size)
> +			return -E2BIG;
> +	}
> +
>  
>  	lp = NAPI_GRO_CB(p)->last;
>  	pinfo = skb_shinfo(lp);

If I read correctly, a big GRO packet could be forwarded and/or
redirected to an egress device not supporting LSOv2 or with a lower
tso_ipv6_max_size. Don't we need to update netif_needs_gso() to take
care of such scenario? 
AFAICS we are not enforcing gso_max_size, so I'm wondering if that is
really a problem?!?

Thanks!

Paolo



* Re: [PATCH net-next 14/15] mlx4: support BIG TCP packets
  2022-02-03  1:51 ` [PATCH net-next 14/15] mlx4: support BIG TCP packets Eric Dumazet
@ 2022-02-03 13:04   ` Tariq Toukan
  2022-02-03 15:54     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Tariq Toukan @ 2022-02-03 13:04 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li, Tariq Toukan



On 2/3/2022 3:51 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> mlx4 supports LSOv2 just fine.
> 
> IPv6 stack inserts a temporary Hop-by-Hop header
> with JUMBO TLV for big packets.
> 
> We need to ignore the HBH header when populating TX descriptor.
> 
> Tested:
> 
> Before: (not enabling bigger TSO/GRO packets)
> 
> ip link set dev eth0 gso_ipv6_max_size 65536 gro_ipv6_max_size 65536
> 
> netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
> MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> 
> 262144 540000 70000   70000  10.00   6591.45  0.86   1.34   62.490  97.446
> 262144 540000
> 
> After: (enabling bigger TSO/GRO packets)
> 
> ip link set dev eth0 gso_ipv6_max_size 185000 gro_ipv6_max_size 185000
> 
> netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
> MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> 
> 262144 540000 70000   70000  10.00   8383.95  0.95   1.01   54.432  57.584
> 262144 540000
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@nvidia.com>
> ---
>   .../net/ethernet/mellanox/mlx4/en_netdev.c    |  3 ++
>   drivers/net/ethernet/mellanox/mlx4/en_tx.c    | 47 +++++++++++++++----
>   2 files changed, 41 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index c61dc7ae0c056a4dbcf24297549f6b1b5cc25d92..76cb93f5e5240c54f6f4c57e39739376206b4f34 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -3417,6 +3417,9 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
>   	dev->min_mtu = ETH_MIN_MTU;
>   	dev->max_mtu = priv->max_mtu;
>   
> +	/* supports LSOv2 packets, 512KB limit has been tested. */
> +	netif_set_tso_ipv6_max_size(dev, 512 * 1024);
> +
>   	mdev->pndev[port] = dev;
>   	mdev->upper[port] = NULL;
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 817f4154b86d599cd593876ec83529051d95fe2f..c89b3e8094e7d8cfb11aaa6cc4ad63bf3ad5934e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -44,6 +44,7 @@
>   #include <linux/ipv6.h>
>   #include <linux/moduleparam.h>
>   #include <linux/indirect_call_wrapper.h>
> +#include <net/ipv6.h>
>   
>   #include "mlx4_en.h"
>   
> @@ -635,19 +636,28 @@ static int get_real_size(const struct sk_buff *skb,
>   			 struct net_device *dev,
>   			 int *lso_header_size,
>   			 bool *inline_ok,
> -			 void **pfrag)
> +			 void **pfrag,
> +			 int *hopbyhop)
>   {
>   	struct mlx4_en_priv *priv = netdev_priv(dev);
>   	int real_size;
>   
>   	if (shinfo->gso_size) {
>   		*inline_ok = false;
> -		if (skb->encapsulation)
> +		*hopbyhop = 0;
> +		if (skb->encapsulation) {
>   			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
> -		else
> +		} else {
> +			/* Detects large IPV6 TCP packets and prepares for removal of
> +			 * HBH header that has been pushed by ip6_xmit(),
> +			 * mainly so that tcpdump can dissect them.
> +			 */
> +			if (ipv6_has_hopopt_jumbo(skb))
> +				*hopbyhop = sizeof(struct hop_jumbo_hdr);
>   			*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
> +		}
>   		real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
> -			ALIGN(*lso_header_size + 4, DS_SIZE);
> +			ALIGN(*lso_header_size - *hopbyhop + 4, DS_SIZE);
>   		if (unlikely(*lso_header_size != skb_headlen(skb))) {
>   			/* We add a segment for the skb linear buffer only if
>   			 * it contains data */
> @@ -874,6 +884,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   	int desc_size;
>   	int real_size;
>   	u32 index, bf_index;
> +	struct ipv6hdr *h6;
>   	__be32 op_own;
>   	int lso_header_size;
>   	void *fragptr = NULL;
> @@ -882,6 +893,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   	bool stop_queue;
>   	bool inline_ok;
>   	u8 data_offset;
> +	int hopbyhop;
>   	bool bf_ok;
>   
>   	tx_ind = skb_get_queue_mapping(skb);
> @@ -891,7 +903,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   		goto tx_drop;
>   
>   	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
> -				  &inline_ok, &fragptr);
> +				  &inline_ok, &fragptr, &hopbyhop);
>   	if (unlikely(!real_size))
>   		goto tx_drop_count;
>   
> @@ -944,7 +956,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   		data = &tx_desc->data;
>   		data_offset = offsetof(struct mlx4_en_tx_desc, data);
>   	} else {
> -		int lso_align = ALIGN(lso_header_size + 4, DS_SIZE);
> +		int lso_align = ALIGN(lso_header_size - hopbyhop + 4, DS_SIZE);
>   
>   		data = (void *)&tx_desc->lso + lso_align;
>   		data_offset = offsetof(struct mlx4_en_tx_desc, lso) + lso_align;
> @@ -1009,14 +1021,31 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   			((ring->prod & ring->size) ?
>   				cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
>   
> +		lso_header_size -= hopbyhop;
>   		/* Fill in the LSO prefix */
>   		tx_desc->lso.mss_hdr_size = cpu_to_be32(
>   			shinfo->gso_size << 16 | lso_header_size);
>   
> -		/* Copy headers;
> -		 * note that we already verified that it is linear */
> -		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
>   
> +		if (unlikely(hopbyhop)) {
> +			/* remove the HBH header.
> +			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> +			 */
> +			memcpy(tx_desc->lso.header, skb->data, ETH_HLEN + sizeof(*h6));
> +			h6 = (struct ipv6hdr *)((char *)tx_desc->lso.header + ETH_HLEN);
> +			h6->nexthdr = IPPROTO_TCP;
> +			/* Copy the TCP header after the IPv6 one */
> +			memcpy(h6 + 1,
> +			       skb->data + ETH_HLEN + sizeof(*h6) +
> +					sizeof(struct hop_jumbo_hdr),
> +			       tcp_hdrlen(skb));
> +			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */

Hi Eric,
Many thanks for your patches.
Impressive improvement indeed!

I am concerned about not using lso_header_size in this flow.
The number of bytes copied here might get out of sync with the value
provided in the descriptor (tx_desc->lso.mss_hdr_size).
Are the two values guaranteed to be equal?
I think this is an assumption that can get broken in the future by 
unaware patches to the kernel stack.

Thanks,
Tariq

> +		} else {
> +			/* Copy headers;
> +			 * note that we already verified that it is linear
> +			 */
> +			memcpy(tx_desc->lso.header, skb->data, lso_header_size);
> +		}
>   		ring->tso_packets++;
>   
>   		i = shinfo->gso_segs;


* Re: [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size
  2022-02-03  8:57   ` Paolo Abeni
@ 2022-02-03 15:34     ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 15:34 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 12:57 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > This enable TCP stack to build TSO packets bigger than
> > 64KB if the driver is LSOv2 compatible.
> >
> > This patch introduces new variable gso_ipv6_max_size
> > that is modifiable through ip link.
> >
> > ip link set dev eth0 gso_ipv6_max_size 185000
> >
> > User input is capped by driver limit.
> >
> > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/linux/netdevice.h          | 12 ++++++++++++
> >  include/uapi/linux/if_link.h       |  1 +
> >  net/core/dev.c                     |  1 +
> >  net/core/rtnetlink.c               | 15 +++++++++++++++
> >  net/core/sock.c                    |  6 ++++++
> >  tools/include/uapi/linux/if_link.h |  1 +
> >  6 files changed, 36 insertions(+)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b1f68df2b37bc4b623f61cc2c6f0c02ba2afbe02..2a563869ba44f7d48095d36b1395e3fbd8cfff87 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1949,6 +1949,7 @@ enum netdev_ml_priv_type {
> >   *   @linkwatch_dev_tracker: refcount tracker used by linkwatch.
> >   *   @watchdog_dev_tracker:  refcount tracker used by watchdog.
> >   *   @tso_ipv6_max_size:     Maximum size of IPv6 TSO packets (driver/NIC limit)
> > + *   @gso_ipv6_max_size:     Maximum size of IPv6 GSO packets (user/admin limit)
> >   *
> >   *   FIXME: cleanup struct net_device such that network protocol info
> >   *   moves out.
> > @@ -2284,6 +2285,7 @@ struct net_device {
> >       netdevice_tracker       linkwatch_dev_tracker;
> >       netdevice_tracker       watchdog_dev_tracker;
> >       unsigned int            tso_ipv6_max_size;
> > +     unsigned int            gso_ipv6_max_size;
> >  };
> >  #define to_net_dev(d) container_of(d, struct net_device, dev)
> >
> > @@ -4804,6 +4806,10 @@ static inline void netif_set_gso_max_size(struct net_device *dev,
> >  {
> >       /* dev->gso_max_size is read locklessly from sk_setup_caps() */
> >       WRITE_ONCE(dev->gso_max_size, size);
> > +
> > +     /* legacy drivers want to lower gso_max_size, regardless of family. */
> > +     size = min(size, dev->gso_ipv6_max_size);
> > +     WRITE_ONCE(dev->gso_ipv6_max_size, size);
> >  }
> >
> >  static inline void netif_set_gso_max_segs(struct net_device *dev,
> > @@ -4827,6 +4833,12 @@ static inline void netif_set_tso_ipv6_max_size(struct net_device *dev,
> >       dev->tso_ipv6_max_size = size;
> >  }
> >
> > +static inline void netif_set_gso_ipv6_max_size(struct net_device *dev,
> > +                                            unsigned int size)
> > +{
> > +     size = min(size, dev->tso_ipv6_max_size);
> > +     WRITE_ONCE(dev->gso_ipv6_max_size, size);
>
> Dumb questions on my side: should the above be limited to
> tso_ipv6_max_size ? or increasing gso_ipv6_max_size helps even if the
> egress NIC does not support LSOv2?

I thought that "size = min(size, dev->tso_ipv6_max_size);" was doing
exactly that?

I will fix the From: tag, because the patch author is Coco Li.

>
> Should gso_ipv6_max_size be capped to some reasonable value (well lower
> than 4G), to avoid the stack building very complex skbs?
>

Drivers are responsible for choosing the max value, then admins choose
optimal operational values based on their constraints (like device MTU)

Typical LSOv2 values are 256KB or 512KB, but we really tested BIG TCP
with 45 4K segments per packet.

> Thanks!
>
> Paolo
>


* Re: [PATCH net-next 06/15] ipv6/gro: insert temporary HBH/jumbo header
  2022-02-03  9:19   ` Paolo Abeni
@ 2022-02-03 15:48     ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 15:48 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 1:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
> > BIG TCP ipv6 packets (bigger than 64K).
> >
> > This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
> > so that resulting packet can go through IPv6/TCP stacks.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/ipv6/ip6_offload.c | 32 ++++++++++++++++++++++++++++++--
> >  1 file changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> > index d37a79a8554e92a1dcaa6fd023cafe2114841ece..dac6f60436e167a3d979fef02f25fc039c6ed37d 100644
> > --- a/net/ipv6/ip6_offload.c
> > +++ b/net/ipv6/ip6_offload.c
> > @@ -318,15 +318,43 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
> >  INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
> >  {
> >       const struct net_offload *ops;
> > -     struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + nhoff);
> > +     struct ipv6hdr *iph;
> >       int err = -ENOSYS;
> > +     u32 payload_len;
> >
> >       if (skb->encapsulation) {
> >               skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
> >               skb_set_inner_network_header(skb, nhoff);
> >       }
> >
> > -     iph->payload_len = htons(skb->len - nhoff - sizeof(*iph));
> > +     payload_len = skb->len - nhoff - sizeof(*iph);
> > +     if (unlikely(payload_len > IPV6_MAXPLEN)) {
> > +             struct hop_jumbo_hdr *hop_jumbo;
> > +             int hoplen = sizeof(*hop_jumbo);
> > +
> > +             /* Move network header left */
> > +             memmove(skb_mac_header(skb) - hoplen, skb_mac_header(skb),
> > +                     skb->transport_header - skb->mac_header);
>
> I was wondering if we should check for enough headroom and what about
> TCP over UDP tunnel, then I read the next patch ;)

The headroom check is provided in the following patch (ipv6: add
GRO_IPV6_MAX_SIZE), which allows the GRO stack to build packets bigger
than 64KB if drivers provide enough headroom (8 bytes).
They usually provide NET_SKB_PAD (64 bytes or more).

Before the next patch, this code is dead.

Also, the current patch set does not build BIG TCP packets for tunneled
traffic (look at the skb_gro_receive() changes in the following patch).


>
> I think a comment here referring to the constraint enforced by
> skb_gro_receive() could help, or perhaps squashing the 2 patches?!?

Well no, we spent time making small patches to ease review, and these patches
have different authors anyway.

>
> Thanks!
>
> Paolo
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 14/15] mlx4: support BIG TCP packets
  2022-02-03 13:04   ` Tariq Toukan
@ 2022-02-03 15:54     ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 15:54 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li,
	Tariq Toukan

On Thu, Feb 3, 2022 at 5:04 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 2/3/2022 3:51 AM, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > mlx4 supports LSOv2 just fine.
> >
> > IPv6 stack inserts a temporary Hop-by-Hop header
> > with JUMBO TLV for big packets.
> >
> > We need to ignore the HBH header when populating TX descriptor.
> >
> > Tested:
> >
> > Before: (not enabling bigger TSO/GRO packets)
> >
> > ip link set dev eth0 gso_ipv6_max_size 65536 gro_ipv6_max_size 65536
> >
> > netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
> > MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
> > Local /Remote
> > Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> > Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> > bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> >
> > 262144 540000 70000   70000  10.00   6591.45  0.86   1.34   62.490  97.446
> > 262144 540000
> >
> > After: (enabling bigger TSO/GRO packets)
> >
> > ip link set dev eth0 gso_ipv6_max_size 185000 gro_ipv6_max_size 185000
> >
> > netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
> > MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
> > Local /Remote
> > Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> > Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> > bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> >
> > 262144 540000 70000   70000  10.00   8383.95  0.95   1.01   54.432  57.584
> > 262144 540000
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Tariq Toukan <tariqt@nvidia.com>
> > ---
> >   .../net/ethernet/mellanox/mlx4/en_netdev.c    |  3 ++
> >   drivers/net/ethernet/mellanox/mlx4/en_tx.c    | 47 +++++++++++++++----
> >   2 files changed, 41 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > index c61dc7ae0c056a4dbcf24297549f6b1b5cc25d92..76cb93f5e5240c54f6f4c57e39739376206b4f34 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > @@ -3417,6 +3417,9 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
> >       dev->min_mtu = ETH_MIN_MTU;
> >       dev->max_mtu = priv->max_mtu;
> >
> > +     /* supports LSOv2 packets, 512KB limit has been tested. */
> > +     netif_set_tso_ipv6_max_size(dev, 512 * 1024);
> > +
> >       mdev->pndev[port] = dev;
> >       mdev->upper[port] = NULL;
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 817f4154b86d599cd593876ec83529051d95fe2f..c89b3e8094e7d8cfb11aaa6cc4ad63bf3ad5934e 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -44,6 +44,7 @@
> >   #include <linux/ipv6.h>
> >   #include <linux/moduleparam.h>
> >   #include <linux/indirect_call_wrapper.h>
> > +#include <net/ipv6.h>
> >
> >   #include "mlx4_en.h"
> >
> > @@ -635,19 +636,28 @@ static int get_real_size(const struct sk_buff *skb,
> >                        struct net_device *dev,
> >                        int *lso_header_size,
> >                        bool *inline_ok,
> > -                      void **pfrag)
> > +                      void **pfrag,
> > +                      int *hopbyhop)
> >   {
> >       struct mlx4_en_priv *priv = netdev_priv(dev);
> >       int real_size;
> >
> >       if (shinfo->gso_size) {
> >               *inline_ok = false;
> > -             if (skb->encapsulation)
> > +             *hopbyhop = 0;
> > +             if (skb->encapsulation) {
> >                       *lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
> > -             else
> > +             } else {
> > +                     /* Detects large IPV6 TCP packets and prepares for removal of
> > +                      * HBH header that has been pushed by ip6_xmit(),
> > +                      * mainly so that tcpdump can dissect them.
> > +                      */
> > +                     if (ipv6_has_hopopt_jumbo(skb))
> > +                             *hopbyhop = sizeof(struct hop_jumbo_hdr);
> >                       *lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
> > +             }
> >               real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
> > -                     ALIGN(*lso_header_size + 4, DS_SIZE);
> > +                     ALIGN(*lso_header_size - *hopbyhop + 4, DS_SIZE);
> >               if (unlikely(*lso_header_size != skb_headlen(skb))) {
> >                       /* We add a segment for the skb linear buffer only if
> >                        * it contains data */
> > @@ -874,6 +884,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >       int desc_size;
> >       int real_size;
> >       u32 index, bf_index;
> > +     struct ipv6hdr *h6;
> >       __be32 op_own;
> >       int lso_header_size;
> >       void *fragptr = NULL;
> > @@ -882,6 +893,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >       bool stop_queue;
> >       bool inline_ok;
> >       u8 data_offset;
> > +     int hopbyhop;
> >       bool bf_ok;
> >
> >       tx_ind = skb_get_queue_mapping(skb);
> > @@ -891,7 +903,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >               goto tx_drop;
> >
> >       real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
> > -                               &inline_ok, &fragptr);
> > +                               &inline_ok, &fragptr, &hopbyhop);
> >       if (unlikely(!real_size))
> >               goto tx_drop_count;
> >
> > @@ -944,7 +956,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >               data = &tx_desc->data;
> >               data_offset = offsetof(struct mlx4_en_tx_desc, data);
> >       } else {
> > -             int lso_align = ALIGN(lso_header_size + 4, DS_SIZE);
> > +             int lso_align = ALIGN(lso_header_size - hopbyhop + 4, DS_SIZE);
> >
> >               data = (void *)&tx_desc->lso + lso_align;
> >               data_offset = offsetof(struct mlx4_en_tx_desc, lso) + lso_align;
> > @@ -1009,14 +1021,31 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >                       ((ring->prod & ring->size) ?
> >                               cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
> >
> > +             lso_header_size -= hopbyhop;
> >               /* Fill in the LSO prefix */
> >               tx_desc->lso.mss_hdr_size = cpu_to_be32(
> >                       shinfo->gso_size << 16 | lso_header_size);
> >
> > -             /* Copy headers;
> > -              * note that we already verified that it is linear */
> > -             memcpy(tx_desc->lso.header, skb->data, lso_header_size);
> >
> > +             if (unlikely(hopbyhop)) {
> > +                     /* remove the HBH header.
> > +                      * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> > +                      */
> > +                     memcpy(tx_desc->lso.header, skb->data, ETH_HLEN + sizeof(*h6));
> > +                     h6 = (struct ipv6hdr *)((char *)tx_desc->lso.header + ETH_HLEN);
> > +                     h6->nexthdr = IPPROTO_TCP;
> > +                     /* Copy the TCP header after the IPv6 one */
> > +                     memcpy(h6 + 1,
> > +                            skb->data + ETH_HLEN + sizeof(*h6) +
> > +                                     sizeof(struct hop_jumbo_hdr),
> > +                            tcp_hdrlen(skb));
> > +                     /* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
>
> Hi Eric,
> Many thanks for your patches.
> Impressive improvement indeed!
>
> I am concerned about not using lso_header_size in this flow.
> The num of bytes copied here might be out-of-sync with the value
> provided in the descriptor (tx_desc->lso.mss_hdr_size).
> Are the two values guaranteed to be equal?

I think they are equal.

get_real_size() sets :

*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);

Also, BIG TCP supports only native IPv6 + TCP at the moment.

> I think this is an assumption that can get broken in the future by
> unaware patches to the kernel stack.

Changes are self-contained in drivers/net/ethernet/mellanox/mlx4/en_tx.c,
in the get_real_size() and mlx4_en_xmit() functions.

>
> Thanks,
> Tariq
>
> > +             } else {
> > +                     /* Copy headers;
> > +                      * note that we already verified that it is linear
> > +                      */
> > +                     memcpy(tx_desc->lso.header, skb->data, lso_header_size);
> > +             }
> >               ring->tso_packets++;
> >
> >               i = shinfo->gso_segs;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
                     ` (2 preceding siblings ...)
  2022-02-03  5:43   ` kernel test robot
@ 2022-02-03 16:01   ` Paolo Abeni
  2022-02-03 17:26   ` Alexander H Duyck
  4 siblings, 0 replies; 58+ messages in thread
From: Paolo Abeni @ 2022-02-03 16:01 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Currently, MAX_SKB_FRAGS value is 17.
> 
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
> 
> But with zero copy, we use order-0 pages.
> 
> For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> to be able to fit 45 segments per skb.
> 
> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> does not support skbs with frag list.
> 
> We have used this MAX_SKB_FRAGS value for years at Google before
> we deployed 4K MTU, with no adverse effect.
> Back then, goal was to be able to receive full size (64KB) GRO
> packets without the frag_list overhead.

IIRC, while backporting some changes to an older RHEL kernel, we had to
increase the skb overhead due to a kABI issue.

That caused some measurable regressions because some drivers (e.g.
ixgbe) were not able anymore to allocate multiple (skb) heads from
the same page. 

All the above is subject to some noise - it's a faint memory.

I'll try to do some tests with the H/W I have handy, but it could take
a little time due to conflicting scheduling here.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output
  2022-02-03  9:07   ` Paolo Abeni
@ 2022-02-03 16:31     ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 16:31 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 1:07 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> >
> > Instead of simply forcing a 0 payload_len in IPv6 header,
> > implement RFC 2675 and insert a custom extension header.
> >
> > Note that only TCP stack is currently potentially generating
> > jumbograms, and that this extension header is purely local,
> > it won't be sent on a physical link.
> >
> > This is needed so that packet capture (tcpdump and friends)
> > can properly dissect these large packets.
> >
> > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/linux/ipv6.h  |  1 +
> >  net/ipv6/ip6_output.c | 22 ++++++++++++++++++++--
> >  2 files changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > index 1e0f8a31f3de175659dca9ecee9f97d8b01e2b68..d3fb87e1589997570cde9cb5d92b2222008a229d 100644
> > --- a/include/linux/ipv6.h
> > +++ b/include/linux/ipv6.h
> > @@ -144,6 +144,7 @@ struct inet6_skb_parm {
> >  #define IP6SKB_L3SLAVE         64
> >  #define IP6SKB_JUMBOGRAM      128
> >  #define IP6SKB_SEG6        256
> > +#define IP6SKB_FAKEJUMBO      512
> >  };
> >
> >  #if defined(CONFIG_NET_L3_MASTER_DEV)
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 0c6c971ce0a58b50f8a9349b8507dffac9c7818c..f78ba145620560e5d7cb25aaf16fec61ddd9ed40 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -180,7 +180,9 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
> >  #endif
> >
> >       mtu = ip6_skb_dst_mtu(skb);
> > -     if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
> > +     if (skb_is_gso(skb) &&
> > +         !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
> > +         !skb_gso_validate_network_len(skb, mtu))
> >               return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
>
> If I read correctly, jumbograms with a gso len not fitting the egress
> device MTU will not be fragmented, as opposed to plain old GSO packets.
> Am I correct? Why is fragmentation not needed for jumbograms?

I guess we could add this validation in place.

Honestly, we do not expect BIG TCP to be deployed in hostile
environments (hosts having devices with different MTUs).

Fragmentation is evil and should be avoided at all costs.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute
  2022-02-03  1:51 ` [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute Eric Dumazet
@ 2022-02-03 16:34   ` Jakub Kicinski
  2022-02-03 16:56     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2022-02-03 16:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet, Coco Li

On Wed,  2 Feb 2022 17:51:26 -0800 Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Some NICs (or virtual devices) are LSOv2 compatible.
> 
> BIG TCP plans using the large LSOv2 feature for IPv6.
> 
> New netlink attribute IFLA_TSO_IPV6_MAX_SIZE is defined.
> 
> Drivers should use netif_set_tso_ipv6_max_size() to advertise their limit.
> 
> Unchanged drivers are not allowing big TSO packets to be sent.

Many drivers will have a limit on how many buffer descriptors they
can chain, not the size of the super frame, I'd think. Is that not
the case? We can't assume all pages but the first and last are full,
right?

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index e490b84732d1654bf067b30f2bb0b0825f88dea9..b1f68df2b37bc4b623f61cc2c6f0c02ba2afbe02 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1948,6 +1948,7 @@ enum netdev_ml_priv_type {
>   *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
>   *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
>   *	@watchdog_dev_tracker:	refcount tracker used by watchdog.
> + *	@tso_ipv6_max_size:	Maximum size of IPv6 TSO packets (driver/NIC limit)
>   *
>   *	FIXME: cleanup struct net_device such that network protocol info
>   *	moves out.
> @@ -2282,6 +2283,7 @@ struct net_device {
>  	u8 dev_addr_shadow[MAX_ADDR_LEN];
>  	netdevice_tracker	linkwatch_dev_tracker;
>  	netdevice_tracker	watchdog_dev_tracker;
> +	unsigned int		tso_ipv6_max_size;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>  
> @@ -4818,6 +4820,14 @@ static inline void netif_set_gro_max_size(struct net_device *dev,
>  	WRITE_ONCE(dev->gro_max_size, size);
>  }
>  
> +/* Used by drivers to give their hardware/firmware limit for LSOv2 packets */
> +static inline void netif_set_tso_ipv6_max_size(struct net_device *dev,
> +					       unsigned int size)
> +{
> +	dev->tso_ipv6_max_size = size;
> +}
> +
> +

nit: double new line

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute
  2022-02-03 16:34   ` Jakub Kicinski
@ 2022-02-03 16:56     ` Eric Dumazet
  2022-02-03 18:58       ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 16:56 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Eric Dumazet, David S . Miller, netdev, Coco Li

On Thu, Feb 3, 2022 at 8:34 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  2 Feb 2022 17:51:26 -0800 Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Some NICs (or virtual devices) are LSOv2 compatible.
> >
> > BIG TCP plans using the large LSOv2 feature for IPv6.
> >
> > New netlink attribute IFLA_TSO_IPV6_MAX_SIZE is defined.
> >
> > Drivers should use netif_set_tso_ipv6_max_size() to advertise their limit.
> >
> > Unchanged drivers are not allowing big TSO packets to be sent.
>
> Many drivers will have a limit on how many buffer descriptors they
> can chain, not the size of the super frame, I'd think. Is that not
> the case? We can't assume all pages but the first and last are full,
> right?

In our case, we have a 100Gbit Google NIC which has these limits:

- TX descriptor has a 16bit field filled with skb->len
- No more than 21 frags per 'packet'

In order to support BIG TCP on it, we had to split the bigger TCP packets
into smaller chunks, to satisfy both constraints (even if the second
constraint is hardly hit once you chop to ~60KB packets, given our 4K
MTU)

ndo_features_check() might help to take care of small oddities.

For instance I will insert the following in the next version of the series:

commit 26644be08edc2f14f6ec79f650cc4a5d380df498
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Feb 2 23:22:01 2022 -0800

    net: typhoon: implement ndo_features_check method

    Instead of disabling TSO if MAX_SKB_FRAGS > 32, implement
    ndo_features_check() method for this driver.

    If skb has more than 32 frags, use the following heuristic:

    1) force GSO for gso packets
    2) Otherwise force linearization.

    Most locally generated TCP packets will use a small number of fragments
    anyway.

    Signed-off-by: Eric Dumazet <edumazet@google.com>

diff --git a/drivers/net/ethernet/3com/typhoon.c b/drivers/net/ethernet/3com/typhoon.c
index 8aec5d9fbfef2803c181387537300502a937caf0..216e26a49e9c272ba7483bfa06941ff11ea40e3c 100644
--- a/drivers/net/ethernet/3com/typhoon.c
+++ b/drivers/net/ethernet/3com/typhoon.c
@@ -138,11 +138,6 @@ MODULE_PARM_DESC(use_mmio, "Use MMIO (1) or PIO(0) to access the NIC. "
 module_param(rx_copybreak, int, 0);
 module_param(use_mmio, int, 0);

-#if defined(NETIF_F_TSO) && MAX_SKB_FRAGS > 32
-#warning Typhoon only supports 32 entries in its SG list for TSO, disabling TSO
-#undef NETIF_F_TSO
-#endif
-
 #if TXLO_ENTRIES <= (2 * MAX_SKB_FRAGS)
 #error TX ring too small!
 #endif
@@ -2261,9 +2256,23 @@ typhoon_test_mmio(struct pci_dev *pdev)
        return mode;
 }

+static netdev_features_t typhoon_features_check(struct sk_buff *skb,
+                                               struct net_device *dev,
+                                               netdev_features_t features)
+{
+       if (skb_shinfo(skb)->nr_frags > 32) {
+               if (skb_is_gso(skb))
+                       features &= ~NETIF_F_GSO_MASK;
+               else
+                       features &= ~NETIF_F_SG;
+       }
+       return features;
+}
+
 static const struct net_device_ops typhoon_netdev_ops = {
        .ndo_open               = typhoon_open,
        .ndo_stop               = typhoon_close,
+       .ndo_features_check     = typhoon_features_check,
        .ndo_start_xmit         = typhoon_start_tx,
        .ndo_set_rx_mode        = typhoon_set_rx_mode,
        .ndo_tx_timeout         = typhoon_tx_timeout,

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
                     ` (3 preceding siblings ...)
  2022-02-03 16:01   ` Paolo Abeni
@ 2022-02-03 17:26   ` Alexander H Duyck
  2022-02-03 17:34     ` Eric Dumazet
  4 siblings, 1 reply; 58+ messages in thread
From: Alexander H Duyck @ 2022-02-03 17:26 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Currently, MAX_SKB_FRAGS value is 17.
> 
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
> 
> But with zero copy, we use order-0 pages.
> 
> For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> to be able to fit 45 segments per skb.
> 
> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> does not support skbs with frag list.
> 
> We have used this MAX_SKB_FRAGS value for years at Google before
> we deployed 4K MTU, with no adverse effect.
> Back then, goal was to be able to receive full size (64KB) GRO
> packets without the frag_list overhead.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

So a big issue I see with this patch is the potential queueing issues
it may introduce on Tx queues. I suspect it will cause a number of
performance regressions and deadlocks as it will change the Tx queueing
behavior for many NICs.

As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
the ingredients for DESC_NEEDED in order to determine if the Tx queue
needs to stop. With this change the value for igb for instance is
jumping from 21 to 49, and the wake threshold is twice that, 98. As
such the minimum Tx descriptor threshold for the driver would need to
be updated beyond 80 otherwise it is likely to deadlock the first time
it has to pause.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03 17:26   ` Alexander H Duyck
@ 2022-02-03 17:34     ` Eric Dumazet
  2022-02-03 17:56       ` Alexander Duyck
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 17:34 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 9:26 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Currently, MAX_SKB_FRAGS value is 17.
> >
> > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > attempts order-3 allocations, stuffing 32768 bytes per frag.
> >
> > But with zero copy, we use order-0 pages.
> >
> > For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> > to be able to fit 45 segments per skb.
> >
> > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > does not support skbs with frag list.
> >
> > We have used this MAX_SKB_FRAGS value for years at Google before
> > we deployed 4K MTU, with no adverse effect.
> > Back then, goal was to be able to receive full size (64KB) GRO
> > packets without the frag_list overhead.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> So a big issue I see with this patch is the potential queueing issues
> it may introduce on Tx queues. I suspect it will cause a number of
> performance regressions and deadlocks as it will change the Tx queueing
> behavior for many NICs.
>
> As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> the ingredients for DESC_NEEDED in order to determine if the Tx queue
> needs to stop. With this change the value for igb for instance is
> jumping from 21 to 49, and the wake threshold is twice that, 98. As
> such the minimum Tx descriptor threshold for the driver would need to
> be updated beyond 80 otherwise it is likely to deadlock the first time
> it has to pause.

Are these limits hard-coded in Intel drivers and firmware, or do you
think this can be changed?

I could make MAX_SKB_FRAGS a config option, and default to 17, until
all drivers have been fixed.

The alternative is that I remove this patch from the series and we
apply it to Google production kernels, as we did before.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03 17:34     ` Eric Dumazet
@ 2022-02-03 17:56       ` Alexander Duyck
  2022-02-03 19:18         ` Jakub Kicinski
  2022-02-04 10:18         ` David Laight
  0 siblings, 2 replies; 58+ messages in thread
From: Alexander Duyck @ 2022-02-03 17:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 9:34 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 9:26 AM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Currently, MAX_SKB_FRAGS value is 17.
> > >
> > > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > > attempts order-3 allocations, stuffing 32768 bytes per frag.
> > >
> > > But with zero copy, we use order-0 pages.
> > >
> > > For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> > > to be able to fit 45 segments per skb.
> > >
> > > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > > does not support skbs with frag list.
> > >
> > > We have used this MAX_SKB_FRAGS value for years at Google before
> > > we deployed 4K MTU, with no adverse effect.
> > > Back then, goal was to be able to receive full size (64KB) GRO
> > > packets without the frag_list overhead.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > So a big issue I see with this patch is the potential queueing issues
> > it may introduce on Tx queues. I suspect it will cause a number of
> > performance regressions and deadlocks as it will change the Tx queueing
> > behavior for many NICs.
> >
> > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > needs to stop. With this change the value for igb for instance is
> > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > such the minimum Tx descriptor threshold for the driver would need to
> > be updated beyond 80 otherwise it is likely to deadlock the first time
> > it has to pause.
>
> Are these limits hard-coded in Intel drivers and firmware, or do you
> think this can be changed?

This is all code in the drivers. Most drivers have this logic to avoid
having to return NETDEV_TX_BUSY. Basically the assumption is there is a
1:1 correlation between descriptors and individual frags. So most
drivers would need to increase the size of their Tx descriptor rings if
they were optimized for a lower value.

The other thing is that most of the tuning for things like interrupt
moderation assume a certain fill level on the queues and those would
likely need to be updated to account for this change.

> I could make  MAX_SKB_FRAGS a config option, and default to 17, until
> all drivers have been fixed.
>
> Alternative is that I remove this patch from the series and we apply
> it to Google production kernels,
> as we did before.

A config option would probably be preferred. The big issue as I see it
is that changing MAX_SKB_FRAGS is going to have ripples throughout the
ecosystem as the shared info size will be increasing and the queueing
behavior for most drivers will be modified as a result.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03  1:51 ` [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
@ 2022-02-03 18:53   ` Alexander H Duyck
  2022-02-03 19:17     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Alexander H Duyck @ 2022-02-03 18:53 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Eric Dumazet, Coco Li

On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> ipv6 tcp and gro stacks will soon be able to build big TCP packets,
> with an added temporary Hop By Hop header.
> 
> If GSO is involved for these large packets, we need to remove
> the temporary HBH header before segmentation happens.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/ipv6.h | 31 +++++++++++++++++++++++++++++++
>  net/core/skbuff.c  | 21 ++++++++++++++++++++-
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index ea2a4351b654f8bc96503aae2b9adcd478e1f8b2..96e916fb933c3e7d4288e86790fcb2bb1353a261 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -464,6 +464,37 @@ bool ipv6_opt_accepted(const struct sock *sk, const struct sk_buff *skb,
>  struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
>  					   struct ipv6_txoptions *opt);
>  
> +/* This helper is specialized for BIG TCP needs.
> + * It assumes the hop_jumbo_hdr will immediately follow the IPV6 header.
> + * It assumes headers are already in skb->head, thus the sk argument is only read.
> + */
> +static inline bool ipv6_has_hopopt_jumbo(const struct sk_buff *skb)
> +{
> +	struct hop_jumbo_hdr *jhdr;
> +	struct ipv6hdr *nhdr;
> +
> +	if (likely(skb->len <= GRO_MAX_SIZE))
> +		return false;
> +
> +	if (skb->protocol != htons(ETH_P_IPV6))
> +		return false;
> +
> +	if (skb_network_offset(skb) +
> +	    sizeof(struct ipv6hdr) +
> +	    sizeof(struct hop_jumbo_hdr) > skb_headlen(skb))
> +		return false;
> +
> +	nhdr = ipv6_hdr(skb);
> +
> +	if (nhdr->nexthdr != NEXTHDR_HOP)
> +		return false;
> +
> +	jhdr = (struct hop_jumbo_hdr *) (nhdr + 1);
> +	if (jhdr->tlv_type != IPV6_TLV_JUMBO || jhdr->hdrlen != 0)
> +		return false;
> +	return true;

Rather than having to perform all of these checks would it maybe make
sense to add SKB_GSO_JUMBOGRAM as a gso_type flag? Then it would make
it easier for drivers to indicate if they support the new offload or
not.

An added bonus is that it would probably make it easier to do something
like a GSO_PARTIAL for this since then it would just be a matter of
flagging it, stripping the extra hop-by-hop header, and chopping it
into gso_max_size chunks.

> +}
> +
>  static inline bool ipv6_accept_ra(struct inet6_dev *idev)
>  {
>  	/* If forwarding is enabled, RA are not accepted unless the special
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0118f0afaa4fce8da167ddf39de4c9f3880ca05b..53f17c7392311e7123628fcab4617efc169905a1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3959,8 +3959,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  	skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>  	unsigned int mss = skb_shinfo(head_skb)->gso_size;
>  	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
> +	int hophdr_len = sizeof(struct hop_jumbo_hdr);
>  	struct sk_buff *frag_skb = head_skb;
> -	unsigned int offset = doffset;
> +	unsigned int offset;
>  	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
>  	unsigned int partial_segs = 0;
>  	unsigned int headroom;
> @@ -3968,6 +3969,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  	__be16 proto;
>  	bool csum, sg;
>  	int nfrags = skb_shinfo(head_skb)->nr_frags;
> +	struct ipv6hdr *h6;
>  	int err = -ENOMEM;
>  	int i = 0;
>  	int pos;
> @@ -3992,6 +3994,23 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  	}
>  
>  	__skb_push(head_skb, doffset);
> +
> +	if (ipv6_has_hopopt_jumbo(head_skb)) {
> +		/* remove the HBH header.
> +		 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> +		 */
> +		memmove(head_skb->data + hophdr_len,
> +			head_skb->data,
> +			ETH_HLEN + sizeof(struct ipv6hdr));
> +		head_skb->data += hophdr_len;
> +		head_skb->len -= hophdr_len;
> +		head_skb->network_header += hophdr_len;
> +		head_skb->mac_header += hophdr_len;
> +		doffset -= hophdr_len;
> +		h6 = (struct ipv6hdr *)(head_skb->data + ETH_HLEN);
> +		h6->nexthdr = IPPROTO_TCP;
> +	}

Does it really make the most sense to be doing this here, or should
this be a part of the IPv6 processing? It seems kind of asymmetric when
compared with the change in the next patch to add the header in GRO.

> +	offset = doffset;
>  	proto = skb_network_protocol(head_skb, NULL);
>  	if (unlikely(!proto))
>  		return ERR_PTR(-EINVAL);



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute
  2022-02-03 16:56     ` Eric Dumazet
@ 2022-02-03 18:58       ` Jakub Kicinski
  2022-02-03 19:12         ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2022-02-03 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, Coco Li

On Thu, 3 Feb 2022 08:56:56 -0800 Eric Dumazet wrote:
> On Thu, Feb 3, 2022 at 8:34 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed,  2 Feb 2022 17:51:26 -0800 Eric Dumazet wrote:  
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Some NICs (or virtual devices) are LSOv2 compatible.
> > >
> > > BIG TCP plans to use the large LSOv2 feature for IPv6.
> > >
> > > New netlink attribute IFLA_TSO_IPV6_MAX_SIZE is defined.
> > >
> > > Drivers should use netif_set_tso_ipv6_max_size() to advertise their limit.
> > >
> > > Unchanged drivers are not allowing big TSO packets to be sent.  
> >
> > Many drivers will have a limit on how many buffer descriptors they
> > can chain, not the size of the super frame, I'd think. Is that not
> > the case? We can't assume all pages but the first and last are full,
> > right?  
> 
> In our case, we have a 100Gbit Google NIC which has these limits:
> 
> - TX descriptor has a 16bit field filled with skb->len
> - No more than 21 frags per 'packet'
> 
> In order to support BIG TCP on it, we had to split the bigger TCP packets
> into smaller chunks, to satisfy both constraints (even if the second
> constraint is hardly hit once you chop to ~60KB packets, given our 4K
> MTU)
> 
> ndo_features_check() might help to take care of small oddities.

Makes sense, I was curious if we can do more in the core so that fewer
changes are required in the drivers. Both so that drivers don't have to
strip the header and so that drivers with limitations can be served 
pre-cooked smaller skbs.

> For instance I will insert the following in the next version of the series:
> 
> commit 26644be08edc2f14f6ec79f650cc4a5d380df498
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed Feb 2 23:22:01 2022 -0800
> 
>     net: typhoon: implement ndo_features_check method
> 
>     Instead of disabling TSO if MAX_SKB_FRAGS > 32, implement
>     ndo_features_check() method for this driver.
> 
>     If skb has more than 32 frags, use the following heuristic:
> 
>     1) force GSO for gso packets
>     2) Otherwise force linearization.
> 
>     Most locally generated TCP packets will use a small number of fragments
>     anyway.
> 
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> diff --git a/drivers/net/ethernet/3com/typhoon.c b/drivers/net/ethernet/3com/typhoon.c
> index 8aec5d9fbfef2803c181387537300502a937caf0..216e26a49e9c272ba7483bfa06941ff11ea40e3c 100644
> --- a/drivers/net/ethernet/3com/typhoon.c
> +++ b/drivers/net/ethernet/3com/typhoon.c
> @@ -138,11 +138,6 @@ MODULE_PARM_DESC(use_mmio, "Use MMIO (1) or PIO(0) to access the NIC. "
>  module_param(rx_copybreak, int, 0);
>  module_param(use_mmio, int, 0);
> 
> -#if defined(NETIF_F_TSO) && MAX_SKB_FRAGS > 32
> -#warning Typhoon only supports 32 entries in its SG list for TSO, disabling TSO
> -#undef NETIF_F_TSO
> -#endif

I wonder how many drivers just assumed MAX_SKB_FRAGS will never 
change :S What do you think about a device-level check in the core 
for number of frags?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute
  2022-02-03 18:58       ` Jakub Kicinski
@ 2022-02-03 19:12         ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 19:12 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Eric Dumazet, David S . Miller, netdev, Coco Li

On Thu, Feb 3, 2022 at 10:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 3 Feb 2022 08:56:56 -0800 Eric Dumazet wrote:
> > On Thu, Feb 3, 2022 at 8:34 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Wed,  2 Feb 2022 17:51:26 -0800 Eric Dumazet wrote:
> > > > From: Eric Dumazet <edumazet@google.com>
> > > >
> > > > Some NIC (or virtual devices) are LSOv2 compatible.
> > > >
> > > > BIG TCP plans to use the large LSOv2 feature for IPv6.
> > > >
> > > > New netlink attribute IFLA_TSO_IPV6_MAX_SIZE is defined.
> > > >
> > > > Drivers should use netif_set_tso_ipv6_max_size() to advertise their limit.
> > > >
> > > > Unchanged drivers are not allowing big TSO packets to be sent.
> > >
> > > Many drivers will have a limit on how many buffer descriptors they
> > > can chain, not the size of the super frame, I'd think. Is that not
> > > the case? We can't assume all pages but the first and last are full,
> > > right?
> >
> > In our case, we have a 100Gbit Google NIC which has these limits:
> >
> > - TX descriptor has a 16bit field filled with skb->len
> > - No more than 21 frags per 'packet'
> >
> > In order to support BIG TCP on it, we had to split the bigger TCP packets
> > into smaller chunks, to satisfy both constraints (even if the second
> > constraint is hardly hit once you chop to ~60KB packets, given our 4K
> > MTU)
> >
> > ndo_features_check() might help to take care of small oddities.
>
> Makes sense, I was curious if we can do more in the core so that fewer
> changes are required in the drivers. Both so that drivers don't have to
> strip the header and so that drivers with limitations can be served
> pre-cooked smaller skbs.

I have on my plate to implement a helper to split 'big GRO/TSO' packets
into smaller chunks. I have avoided doing it in our Google NIC driver,
to avoid extra sk_buff/skb->head allocations for each BIG TCP packet.

Yes, core networking stack could use it.

> I wonder how many drivers just assumed MAX_SKB_FRAGS will never
> change :S What do you think about a device-level check in the core
> for number of frags?

I guess we could do this if the CONFIG_MAX_SKB_FRAGS > 17

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03 18:53   ` Alexander H Duyck
@ 2022-02-03 19:17     ` Eric Dumazet
  2022-02-03 19:45       ` Alexander Duyck
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 19:17 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 10:53 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
>
> Rather than having to perform all of these checks would it maybe make
> sense to add SKB_GSO_JUMBOGRAM as a gso_type flag? Then it would make
> it easier for drivers to indicate if they support the new offload or
> not.

Yes, this could be an option.

>
> An added bonus is that it would probably make it easier to do something
> like a GSO_PARTIAL for this since then it would just be a matter of
> flagging it, stripping the extra hop-by-hop header, and chopping it
> into gso_max_size chunks.
>
> > +}
> > +
> >  static inline bool ipv6_accept_ra(struct inet6_dev *idev)
> >  {
> >       /* If forwarding is enabled, RA are not accepted unless the special
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 0118f0afaa4fce8da167ddf39de4c9f3880ca05b..53f17c7392311e7123628fcab4617efc169905a1 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3959,8 +3959,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> >       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
> >       unsigned int mss = skb_shinfo(head_skb)->gso_size;
> >       unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
> > +     int hophdr_len = sizeof(struct hop_jumbo_hdr);
> >       struct sk_buff *frag_skb = head_skb;
> > -     unsigned int offset = doffset;
> > +     unsigned int offset;
> >       unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
> >       unsigned int partial_segs = 0;
> >       unsigned int headroom;
> > @@ -3968,6 +3969,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> >       __be16 proto;
> >       bool csum, sg;
> >       int nfrags = skb_shinfo(head_skb)->nr_frags;
> > +     struct ipv6hdr *h6;
> >       int err = -ENOMEM;
> >       int i = 0;
> >       int pos;
> > @@ -3992,6 +3994,23 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> >       }
> >
> >       __skb_push(head_skb, doffset);
> > +
> > +     if (ipv6_has_hopopt_jumbo(head_skb)) {
> > +             /* remove the HBH header.
> > +              * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> > +              */
> > +             memmove(head_skb->data + hophdr_len,
> > +                     head_skb->data,
> > +                     ETH_HLEN + sizeof(struct ipv6hdr));
> > +             head_skb->data += hophdr_len;
> > +             head_skb->len -= hophdr_len;
> > +             head_skb->network_header += hophdr_len;
> > +             head_skb->mac_header += hophdr_len;
> > +             doffset -= hophdr_len;
> > +             h6 = (struct ipv6hdr *)(head_skb->data + ETH_HLEN);
> > +             h6->nexthdr = IPPROTO_TCP;
> > +     }
>
> Does it really make the most sense to be doing this here, or should
> this be a part of the IPv6 processing? It seems kind of asymmetric when
> compared with the change in the next patch to add the header in GRO.
>

Not sure what you mean. We do have to strip the header here; I do not
see where else to do this.

> > +     offset = doffset;
> >       proto = skb_network_protocol(head_skb, NULL);
> >       if (unlikely(!proto))
> >               return ERR_PTR(-EINVAL);
>
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03 17:56       ` Alexander Duyck
@ 2022-02-03 19:18         ` Jakub Kicinski
  2022-02-03 19:20           ` Eric Dumazet
  2022-02-04 10:18         ` David Laight
  1 sibling, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2022-02-03 19:18 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, netdev, Coco Li

On Thu, 3 Feb 2022 09:56:42 -0800 Alexander Duyck wrote:
> > I could make  MAX_SKB_FRAGS a config option, and default to 17, until
> > all drivers have been fixed.
> >
> > Alternative is that I remove this patch from the series and we apply
> > it to Google production kernels,
> > as we did before.  
> 
> A config option would probably be preferred. The big issue as I see it
> is that changing MAX_SKB_FRAGS is going to have ripples throughout the
> ecosystem as the shared info size will be increasing and the queueing
> behavior for most drivers will be modified as a result.

I'd vote for making the change and dealing with the fall out. Unlikely
many people would turn this knob otherwise and it's a major difference.
Better not to fork the characteristics of the stack, IMHO.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03 19:18         ` Jakub Kicinski
@ 2022-02-03 19:20           ` Eric Dumazet
  2022-02-03 19:54             ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 19:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Eric Dumazet, David S . Miller, netdev, Coco Li

On Thu, Feb 3, 2022 at 11:18 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 3 Feb 2022 09:56:42 -0800 Alexander Duyck wrote:
> > > I could make  MAX_SKB_FRAGS a config option, and default to 17, until
> > > all drivers have been fixed.
> > >
> > > Alternative is that I remove this patch from the series and we apply
> > > it to Google production kernels,
> > > as we did before.
> >
> > A config option would probably be preferred. The big issue as I see it
> > is that changing MAX_SKB_FRAGS is going to have ripples throughout the
> > ecosystem as the shared info size will be increasing and the queueing
> > behavior for most drivers will be modified as a result.
>
> I'd vote for making the change and dealing with the fall out. Unlikely
> many people would turn this knob otherwise and it's a major difference.
> Better not to fork the characteristics of the stack, IMHO.

Another issue with CONFIG_ options is that they are integers.

Trying the following did not work

#define MAX_SKB_FRAGS  ((unsigned long)CONFIG_MAX_SKB_FRAGS)

Because in some places we have

#if    (   MAX_SKB_FRAGS > ...)

(MAX_SKB_FRAGS is currently UL; making it a plain integer might make
some signed/unsigned operations buggy)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03 19:17     ` Eric Dumazet
@ 2022-02-03 19:45       ` Alexander Duyck
  2022-02-03 19:59         ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Alexander Duyck @ 2022-02-03 19:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 11:17 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 10:53 AM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> >
> > Rather than having to perform all of these checks would it maybe make
> > sense to add SKB_GSO_JUMBOGRAM as a gso_type flag? Then it would make
> > it easier for drivers to indicate if they support the new offload or
> > not.
>
> Yes, this could be an option.
>
> >
> > An added bonus is that it would probably make it easier to do something
> > like a GSO_PARTIAL for this since then it would just be a matter of
> > flagging it, stripping the extra hop-by-hop header, and chopping it
> > into gso_max_size chunks.
> >
> > > +}
> > > +
> > >  static inline bool ipv6_accept_ra(struct inet6_dev *idev)
> > >  {
> > >       /* If forwarding is enabled, RA are not accepted unless the special
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 0118f0afaa4fce8da167ddf39de4c9f3880ca05b..53f17c7392311e7123628fcab4617efc169905a1 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3959,8 +3959,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> > >       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
> > >       unsigned int mss = skb_shinfo(head_skb)->gso_size;
> > >       unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
> > > +     int hophdr_len = sizeof(struct hop_jumbo_hdr);
> > >       struct sk_buff *frag_skb = head_skb;
> > > -     unsigned int offset = doffset;
> > > +     unsigned int offset;
> > >       unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
> > >       unsigned int partial_segs = 0;
> > >       unsigned int headroom;
> > > @@ -3968,6 +3969,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> > >       __be16 proto;
> > >       bool csum, sg;
> > >       int nfrags = skb_shinfo(head_skb)->nr_frags;
> > > +     struct ipv6hdr *h6;
> > >       int err = -ENOMEM;
> > >       int i = 0;
> > >       int pos;
> > > @@ -3992,6 +3994,23 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> > >       }
> > >
> > >       __skb_push(head_skb, doffset);
> > > +
> > > +     if (ipv6_has_hopopt_jumbo(head_skb)) {
> > > +             /* remove the HBH header.
> > > +              * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> > > +              */
> > > +             memmove(head_skb->data + hophdr_len,
> > > +                     head_skb->data,
> > > +                     ETH_HLEN + sizeof(struct ipv6hdr));
> > > +             head_skb->data += hophdr_len;
> > > +             head_skb->len -= hophdr_len;
> > > +             head_skb->network_header += hophdr_len;
> > > +             head_skb->mac_header += hophdr_len;
> > > +             doffset -= hophdr_len;
> > > +             h6 = (struct ipv6hdr *)(head_skb->data + ETH_HLEN);
> > > +             h6->nexthdr = IPPROTO_TCP;
> > > +     }
> >
> > Does it really make the most sense to be doing this here, or should
> > this be a part of the IPv6 processing? It seems kind of asymmetric when
> > compared with the change in the next patch to add the header in GRO.
> >
>
> Not sure what you mean. We do have to strip the header here; I do not
> see where else to do this.

It is the fact that you are adding IPv6 specific code to the
net/core/skbuff.c block here. Logically speaking if you are adding the
header in ipv6_gro_receive then it really seems like the logic to
remove the header really belongs in ipv6_gso_segment. I suppose this
is an attempt to optimize it though, since normally updates to the
header are done after segmentation instead of before.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03 19:20           ` Eric Dumazet
@ 2022-02-03 19:54             ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 19:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Eric Dumazet, David S . Miller, netdev, Coco Li

On Thu, Feb 3, 2022 at 11:20 AM Eric Dumazet <edumazet@google.com> wrote:

>
> Another issue with CONFIG_ options is that they are integers.
>
> Trying the following did not work
>
> #define MAX_SKB_FRAGS  ((unsigned long)CONFIG_MAX_SKB_FRAGS)
>
> Because in some places we have
>
> #if    (   MAX_SKB_FRAGS > ...)
>
> (MAX_SKB_FRAGS is currently UL; making it a plain integer might make
> some signed/unsigned operations buggy)

I came up with something like this; clearly it is a bit ugly.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 08c12c41c5a5907dccc7389f396394d8132d962e..cc3cac3ee109f95c8a51eb90ba4a3bf7bebe86eb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -323,7 +323,15 @@ enum skb_drop_reason {
        SKB_DROP_REASON_MAX,
 };

+#ifdef CONFIG_MAX_SKB_FRAGS_17
+#define MAX_SKB_FRAGS 17UL
+#endif
+#ifdef CONFIG_MAX_SKB_FRAGS_25
+#define MAX_SKB_FRAGS 25UL
+#endif
+#ifdef CONFIG_MAX_SKB_FRAGS_45
 #define MAX_SKB_FRAGS 45UL
+#endif

 extern int sysctl_max_skb_frags;

diff --git a/net/Kconfig b/net/Kconfig
index 8a1f9d0287de3c32040eee03b60114c6e6d150bc..d91027a654c2aad7bfa55152ef81c882bf394aff 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -253,6 +253,29 @@ config PCPU_DEV_REFCNT
          network device refcount are using per cpu variables if this option is set.
          This can be forced to N to detect underflows (with a performance drop).

+choice
+       prompt "Maximum number of fragments per skb_shared_info"
+       default MAX_SKB_FRAGS_17
+
+config MAX_SKB_FRAGS_17
+       bool "17 fragments per skb_shared_info"
+       help
+         Some drivers have assumptions about MAX_SKB_FRAGS being 17.
+         Until they are fixed, it is safe to adopt the old limit.
+
+config MAX_SKB_FRAGS_25
+       bool "25 fragments per skb_shared_info"
+       help
+         Helps BIG TCP workloads, but might expose bugs in some legacy drivers.
+
+config MAX_SKB_FRAGS_45
+       bool "45 fragments per skb_shared_info"
+       help
+         Helps BIG TCP workloads, but might expose bugs in some legacy drivers.
+         This also increases the memory overhead of small packets.
+
+endchoice
+
 config RPS
        bool
        depends on SMP && SYSFS

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03 19:45       ` Alexander Duyck
@ 2022-02-03 19:59         ` Eric Dumazet
  2022-02-03 21:08           ` Alexander H Duyck
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 19:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 11:45 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> It is the fact that you are adding IPv6 specific code to the
> net/core/skbuff.c block here. Logically speaking if you are adding the
> header in ipv6_gro_receive then it really seems like the logic to
> remove the header really belongs in ipv6_gso_segment. I suppose this
> is an attempt to optimize it though, since normally updates to the
> header are done after segmentation instead of before.

Right, doing this at the top level means we do the thing once only,
instead of 45 times if the skb has 45 segments.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03 19:59         ` Eric Dumazet
@ 2022-02-03 21:08           ` Alexander H Duyck
  2022-02-03 21:41             ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Alexander H Duyck @ 2022-02-03 21:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, 2022-02-03 at 11:59 -0800, Eric Dumazet wrote:
> On Thu, Feb 3, 2022 at 11:45 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> 
> > It is the fact that you are adding IPv6 specific code to the
> > net/core/skbuff.c block here. Logically speaking if you are adding the
> > header in ipv6_gro_receive then it really seems like the logic to
> > remove the header really belongs in ipv6_gso_segment. I suppose this
> > is an attempt to optimize it though, since normally updates to the
> > header are done after segmentation instead of before.
> 
> Right, doing this at the top level means we do the thing once only,
> instead of 45 times if the skb has 45 segments.

I'm just wondering if there is a way for us to do it in
ipv6_gso_segment directly instead though. With this we essentially end
up having to free the skb if the segmentation fails anyway since it
won't be able to go out on the wire.

If we assume the stack will successfully segment the frame then it
might make sense to just take care of the hop-by-hop header before we
start processing the L4 protocol.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03 21:08           ` Alexander H Duyck
@ 2022-02-03 21:41             ` Eric Dumazet
  2022-02-04  0:05               ` Alexander Duyck
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-03 21:41 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 1:08 PM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Thu, 2022-02-03 at 11:59 -0800, Eric Dumazet wrote:
> > On Thu, Feb 3, 2022 at 11:45 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >
> > > It is the fact that you are adding IPv6 specific code to the
> > > net/core/skbuff.c block here. Logically speaking if you are adding the
> > > header in ipv6_gro_receive then it really seems like the logic to
> > > remove the header really belongs in ipv6_gso_segment. I suppose this
> > > is an attempt to optimize it though, since normally updates to the
> > > header are done after segmentation instead of before.
> >
> > Right, doing this at the top level means we do the thing once only,
> > instead of 45 times if the skb has 45 segments.
>
> I'm just wondering if there is a way for us to do it in
> ipv6_gso_segment directly instead though. With this we essentially end
> up having to free the skb if the segmentation fails anyway since it
> won't be able to go out on the wire.
>

Having a HBH jumbo header in place while the current frame is MTU size
(typically MTU < 9000) would violate the specs. A HBH jumbo header
presence implies packet length > 64K.



> If we assume the stack will successfully segment the frame then it
> might make sense to just take care of the hop-by-hop header before we
> start processing the L4 protocol.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-03 21:41             ` Eric Dumazet
@ 2022-02-04  0:05               ` Alexander Duyck
  2022-02-04  0:27                 ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Alexander Duyck @ 2022-02-04  0:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 1:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 1:08 PM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Thu, 2022-02-03 at 11:59 -0800, Eric Dumazet wrote:
> > > On Thu, Feb 3, 2022 at 11:45 AM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > >
> > > > It is the fact that you are adding IPv6 specific code to the
> > > > net/core/skbuff.c block here. Logically speaking if you are adding the
> > > > header in ipv6_gro_receive then it really seems like the logic to
> > > > remove the header really belongs in ipv6_gso_segment. I suppose this
> > > > is an attempt to optimize it though, since normally updates to the
> > > > header are done after segmentation instead of before.
> > >
> > > Right, doing this at the top level means we do the thing once only,
> > > instead of 45 times if the skb has 45 segments.
> >
> > I'm just wondering if there is a way for us to do it in
> > ipv6_gso_segment directly instead though. With this we essentially end
> > up having to free the skb if the segmentation fails anyway since it
> > won't be able to go out on the wire.
> >
>
> Having a HBH jumbo header in place while the current frame is MTU size
> (typically MTU < 9000) would
> violate the specs. A HBH jumbo header presence implies packet length > 64K.

I get that. What I was getting at was that we might be able to process
it in ipv6_gso_segment before we hand it off to either TCP or UDP gso
handlers to segment.

The general idea being we keep the IPv6 specific bits in the IPv6
specific code instead of having the skb_segment function now have to
understand IPv6 packets. So what we would end up doing is having to do
an skb_cow to replace the skb->head if any clones might be holding on
to it, and then just chop off the HBH jumbo header before we start the
segmenting.

The risk would be that we waste cycles removing the HBH header for a
frame that is going to fail, but I am not sure how likely a scenario
that is or if we need to optimize for that.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-04  0:05               ` Alexander Duyck
@ 2022-02-04  0:27                 ` Eric Dumazet
  2022-02-04  1:14                   ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-04  0:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 4:05 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>

> I get that. What I was getting at was that we might be able to process
> it in ipv6_gso_segment before we hand it off to either TCP or UDP gso
> handlers to segment.
>
> The general idea being we keep the IPv6 specific bits in the IPv6
> specific code instead of having the skb_segment function now have to
> understand IPv6 packets. So what we would end up doing is having to do
> an skb_cow to replace the skb->head if any clones might be holding on
> it, and then just chop off the HBH jumbo header before we start the
> segmenting.
>
> The risk would be that we waste cycles removing the HBH header for a
> frame that is going to fail, but I am not sure how likely a scenario
> that is or if we need to optimize for that.

I guess I can try this for the next version, thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-04  0:27                 ` Eric Dumazet
@ 2022-02-04  1:14                   ` Eric Dumazet
  2022-02-04  1:48                     ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-04  1:14 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 4:27 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 4:05 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
>
> > I get that. What I was getting at was that we might be able to process
> > it in ipv6_gso_segment before we hand it off to either TCP or UDP gso
> > handlers to segment.
> >
> > The general idea being we keep the IPv6 specific bits in the IPv6
> > specific code instead of having the skb_segment function now have to
> > understand IPv6 packets. So what we would end up doing is having to do
> > an skb_cow to replace the skb->head if any clones might be holding on
> > it, and then just chop off the HBH jumbo header before we start the
> > segmenting.
> >
> > The risk would be that we waste cycles removing the HBH header for a
> > frame that is going to fail, but I am not sure how likely a scenario
> > that is or if we need to optimize for that.
>
> I guess I can try this for the next version, thanks.

I came up with:

commit 147f17169ccc6c2c38ea802e5728528ed54f492d
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Nov 20 16:49:35 2021 -0800

    ipv6/gso: remove temporary HBH/jumbo header

    ipv6 tcp and gro stacks will soon be able to build big TCP packets,
    with an added temporary Hop By Hop header.

    If GSO is involved for these large packets, we need to remove
    the temporary HBH header before segmentation happens.

    v2: perform HBH removal from ipv6_gso_segment() instead of
        skb_segment() (Alexander feedback)

    Signed-off-by: Eric Dumazet <edumazet@google.com>

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ea2a4351b654f8bc96503aae2b9adcd478e1f8b2..a850c18dae0dfedccb9d956bf1ec9fa6b0368c6b
100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -464,6 +464,38 @@ bool ipv6_opt_accepted(const struct sock *sk,
const struct sk_buff *skb,
 struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
                                           struct ipv6_txoptions *opt);

+/* This helper is specialized for BIG TCP needs.
+ * It assumes the hop_jumbo_hdr will immediately follow the IPV6 header.
+ * It assumes headers are already in skb->head, thus the sk argument
is only read.
+ */
+static inline bool ipv6_has_hopopt_jumbo(const struct sk_buff *skb)
+{
+       const struct hop_jumbo_hdr *jhdr;
+       const struct ipv6hdr *nhdr;
+
+       if (likely(skb->len <= GRO_MAX_SIZE))
+               return false;
+
+       if (skb->protocol != htons(ETH_P_IPV6))
+               return false;
+
+       if (skb_network_offset(skb) +
+           sizeof(struct ipv6hdr) +
+           sizeof(struct hop_jumbo_hdr) > skb_headlen(skb))
+               return false;
+
+       nhdr = ipv6_hdr(skb);
+
+       if (nhdr->nexthdr != NEXTHDR_HOP)
+               return false;
+
+       jhdr = (const struct hop_jumbo_hdr *) (nhdr + 1);
+       if (jhdr->tlv_type != IPV6_TLV_JUMBO || jhdr->hdrlen != 0 ||
+           jhdr->nexthdr != IPPROTO_TCP)
+               return false;
+       return true;
+}
+
 static inline bool ipv6_accept_ra(struct inet6_dev *idev)
 {
        /* If forwarding is enabled, RA are not accepted unless the special
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index d37a79a8554e92a1dcaa6fd023cafe2114841ece..7f65097c8f30fa19a8c9c265eb4f027e91848021
100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -87,6 +87,27 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
        bool gso_partial;

        skb_reset_network_header(skb);
+       if (ipv6_has_hopopt_jumbo(skb)) {
+               const int hophdr_len = sizeof(struct hop_jumbo_hdr);
+               int err;
+
+               err = skb_cow_head(skb, 0);
+               if (err < 0)
+                       return ERR_PTR(err);
+
+               /* remove the HBH header.
+                * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+                */
+               memmove(skb->data + hophdr_len,
+                       skb->data,
+                       ETH_HLEN + sizeof(struct ipv6hdr));
+               skb->data += hophdr_len;
+               skb->len -= hophdr_len;
+               skb->network_header += hophdr_len;
+               skb->mac_header += hophdr_len;
+               ipv6h = (struct ipv6hdr *)skb->data;
+               ipv6h->nexthdr = IPPROTO_TCP;
+       }
        nhoff = skb_network_header(skb) - skb_mac_header(skb);
        if (unlikely(!pskb_may_pull(skb, sizeof(*ipv6h))))
                goto out;


* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-04  1:14                   ` Eric Dumazet
@ 2022-02-04  1:48                     ` Eric Dumazet
  2022-02-04  2:15                       ` Eric Dumazet
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Dumazet @ 2022-02-04  1:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 5:14 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 4:27 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Feb 3, 2022 at 4:05 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> >
> > > I get that. What I was getting at was that we might be able to process
> > > it in ipv6_gso_segment before we hand it off to either TCP or UDP gso
> > > handlers to segment.
> > >
> > > The general idea being we keep the IPv6 specific bits in the IPv6
> > > specific code instead of having the skb_segment function now have to
> > > understand IPv6 packets. So what we would end up doing is having to do
> > > an skb_cow to replace the skb->head if any clones might be holding on
> > > it, and then just chop off the HBH jumbo header before we start the
> > > segmenting.
> > >
> > > The risk would be that we waste cycles removing the HBH header for a
> > > frame that is going to fail, but I am not sure how likely a scenario
> > > that is or if we need to optimize for that.
> >
> > I guess I can try this for the next version, thanks.
>
> I came up with:
>
> commit 147f17169ccc6c2c38ea802e5728528ed54f492d
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Sat Nov 20 16:49:35 2021 -0800
>
>     ipv6/gso: remove temporary HBH/jumbo header
>
>     ipv6 tcp and gro stacks will soon be able to build big TCP packets,
>     with an added temporary Hop By Hop header.
>
>     If GSO is involved for these large packets, we need to remove
>     the temporary HBH header before segmentation happens.
>
>     v2: perform HBH removal from ipv6_gso_segment() instead of
>         skb_segment() (Alexander feedback)
>
>     Signed-off-by: Eric Dumazet <edumazet@google.com>


Well, this does not work at all.




>  static inline bool ipv6_accept_ra(struct inet6_dev *idev)
>  {
>         /* If forwarding is enabled, RA are not accepted unless the special
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index d37a79a8554e92a1dcaa6fd023cafe2114841ece..7f65097c8f30fa19a8c9c265eb4f027e91848021
> 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -87,6 +87,27 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
>         bool gso_partial;
>
>         skb_reset_network_header(skb);
> +       if (ipv6_has_hopopt_jumbo(skb)) {
> +               const int hophdr_len = sizeof(struct hop_jumbo_hdr);
> +               int err;
> +
> +               err = skb_cow_head(skb, 0);
> +               if (err < 0)
> +                       return ERR_PTR(err);
> +
> +               /* remove the HBH header.
> +                * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> +                */
> +               memmove(skb->data + hophdr_len,
> +                       skb->data,
> +                       ETH_HLEN + sizeof(struct ipv6hdr));
> +               skb->data += hophdr_len;
> +               skb->len -= hophdr_len;
> +               skb->network_header += hophdr_len;
> +               skb->mac_header += hophdr_len;
> +               ipv6h = (struct ipv6hdr *)skb->data;
> +               ipv6h->nexthdr = IPPROTO_TCP;
> +       }
>         nhoff = skb_network_header(skb) - skb_mac_header(skb);
>         if (unlikely(!pskb_may_pull(skb, sizeof(*ipv6h))))
>                 goto out;


* Re: [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header
  2022-02-04  1:48                     ` Eric Dumazet
@ 2022-02-04  2:15                       ` Eric Dumazet
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Dumazet @ 2022-02-04  2:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

On Thu, Feb 3, 2022 at 5:48 PM Eric Dumazet <edumazet@google.com> wrote:

>
> Well, this does not work at all.
>
>
>
>
> >  static inline bool ipv6_accept_ra(struct inet6_dev *idev)
> >  {
> >         /* If forwarding is enabled, RA are not accepted unless the special
> > diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> > index d37a79a8554e92a1dcaa6fd023cafe2114841ece..7f65097c8f30fa19a8c9c265eb4f027e91848021
> > 100644
> > --- a/net/ipv6/ip6_offload.c
> > +++ b/net/ipv6/ip6_offload.c
> > @@ -87,6 +87,27 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
> >         bool gso_partial;
> >
> >         skb_reset_network_header(skb);
> > +       if (ipv6_has_hopopt_jumbo(skb)) {
> > +               const int hophdr_len = sizeof(struct hop_jumbo_hdr);
> > +               int err;
> > +
> > +               err = skb_cow_head(skb, 0);
> > +               if (err < 0)
> > +                       return ERR_PTR(err);
> > +
> > +               /* remove the HBH header.
> > +                * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> > +                */
> > +               memmove(skb->data + hophdr_len,
> > +                       skb->data,

Oh, I must use skb_mac_header() instead of skb->data, sorry for the noise.

> > +                       ETH_HLEN + sizeof(struct ipv6hdr));
> > +               skb->data += hophdr_len;
> > +               skb->len -= hophdr_len;
> > +               skb->network_header += hophdr_len;
> > +               skb->mac_header += hophdr_len;
> > +               ipv6h = (struct ipv6hdr *)skb->data;
> > +               ipv6h->nexthdr = IPPROTO_TCP;
> > +       }
> >         nhoff = skb_network_header(skb) - skb_mac_header(skb);
> >         if (unlikely(!pskb_may_pull(skb, sizeof(*ipv6h))))
> >                 goto out;


* Re: [PATCH net-next 15/15] mlx5: support BIG TCP packets
  2022-02-03  1:51 ` [PATCH net-next 15/15] mlx5: " Eric Dumazet
  2022-02-03  7:27   ` Tariq Toukan
@ 2022-02-04  4:03   ` kernel test robot
  1 sibling, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-02-04  4:03 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: kbuild-all, netdev, Eric Dumazet, Coco Li, Saeed Mahameed,
	Leon Romanovsky

Hi Eric,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: arc-allyesconfig (https://download.01.org/0day-ci/archive/20220204/202202041153.aALvQUP0-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7561f5d66d00583e6d88fa6b2fffd868dcc82b2e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 7561f5d66d00583e6d88fa6b2fffd868dcc82b2e
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/container_of.h:5,
                    from include/linux/kernel.h:21,
                    from include/linux/skbuff.h:13,
                    from include/linux/tcp.h:17,
                    from drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:33:
   include/linux/build_bug.h:78:41: error: static assertion failed: "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/linux/skmsg.h:41:1: note: in expansion of macro 'static_assert'
      41 | static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
         | ^~~~~~~~~~~~~
   drivers/net/ethernet/mellanox/mlx5/core/en_tx.c: In function 'mlx5i_sq_xmit':
>> drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:1055:86: error: 'h6' undeclared (first use in this function)
    1055 |                         memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));
         |                                                                                      ^~
   drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:1055:86: note: each undeclared identifier is reported only once for each function it appears in


vim +/h6 +1055 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c

  1011	
  1012	void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
  1013			   struct mlx5_av *av, u32 dqpn, u32 dqkey, bool xmit_more)
  1014	{
  1015		struct mlx5e_tx_wqe_attr wqe_attr;
  1016		struct mlx5e_tx_attr attr;
  1017		struct mlx5i_tx_wqe *wqe;
  1018	
  1019		struct mlx5_wqe_datagram_seg *datagram;
  1020		struct mlx5_wqe_ctrl_seg *cseg;
  1021		struct mlx5_wqe_eth_seg  *eseg;
  1022		struct mlx5_wqe_data_seg *dseg;
  1023		struct mlx5e_tx_wqe_info *wi;
  1024	
  1025		struct mlx5e_sq_stats *stats = sq->stats;
  1026		int num_dma;
  1027		u16 pi;
  1028	
  1029		mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
  1030		mlx5i_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
  1031	
  1032		pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
  1033		wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
  1034	
  1035		stats->xmit_more += xmit_more;
  1036	
  1037		/* fill wqe */
  1038		wi       = &sq->db.wqe_info[pi];
  1039		cseg     = &wqe->ctrl;
  1040		datagram = &wqe->datagram;
  1041		eseg     = &wqe->eth;
  1042		dseg     =  wqe->data;
  1043	
  1044		mlx5i_txwqe_build_datagram(av, dqpn, dqkey, datagram);
  1045	
  1046		mlx5e_txwqe_build_eseg_csum(sq, skb, NULL, eseg);
  1047	
  1048		eseg->mss = attr.mss;
  1049	
  1050		if (attr.ihs) {
  1051			if (unlikely(attr.hopbyhop)) {
  1052				/* remove the HBH header.
  1053				 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
  1054				 */
> 1055				memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* RE: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-03 17:56       ` Alexander Duyck
  2022-02-03 19:18         ` Jakub Kicinski
@ 2022-02-04 10:18         ` David Laight
  2022-02-04 15:46           ` Alexander Duyck
  1 sibling, 1 reply; 58+ messages in thread
From: David Laight @ 2022-02-04 10:18 UTC (permalink / raw)
  To: 'Alexander Duyck', Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, netdev, Coco Li

From: Alexander Duyck
> Sent: 03 February 2022 17:57
...
> > > So a big issue I see with this patch is the potential queueing issues
> > > it may introduce on Tx queues. I suspect it will cause a number of
> > > performance regressions and deadlocks as it will change the Tx queueing
> > > behavior for many NICs.
> > >
> > > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > > needs to stop. With this change the value for igb for instance is
> > > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > > such the minimum Tx descriptor threshold for the driver would need to
> > > be updated beyond 80 otherwise it is likely to deadlock the first time
> > > it has to pause.
> >
> > Are these limits hard coded in Intel drivers and firmware, or do you
> > think this can be changed ?
> 
> This is all code in the drivers. Most drivers have them as the logic
> is used to avoid having to return NETDEV_TX_BUSY. Basically the
> assumption is there is a 1:1 correlation between descriptors and
> individual frags. So most drivers would need to increase the size of
> their Tx descriptor rings if they were optimized for a lower value.

Maybe the drivers can be a little less conservative about the number
of fragments they expect in the next message?
There is little point requiring 49 free descriptors when the workload
never has more than 2 or 3 fragments.

Clearly you don't want to re-enable things unless there are enough
descriptors for an skb that has generated NETDEV_TX_BUSY, but the
current logic of 'trying to never actually return NETDEV_TX_BUSY'
is probably over cautious.

Does Linux allow skb to have a lot of short fragments?
If dma_map isn't cheap (probably anything with an iommu or non-coherent
> memory) then copying/merging short fragments into a pre-mapped
buffer can easily be faster.
Many years ago we found it was worth copying anything under 1k on
a sparc mbus+sbus system.
I don't think Linux can generate what I've seen elsewhere - the mac
> driver being asked to transmit something with 1000+ one-byte fragments!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS
  2022-02-04 10:18         ` David Laight
@ 2022-02-04 15:46           ` Alexander Duyck
  0 siblings, 0 replies; 58+ messages in thread
From: Alexander Duyck @ 2022-02-04 15:46 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, Jakub Kicinski,
	netdev, Coco Li

On Fri, Feb 4, 2022 at 2:18 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Alexander Duyck
> > Sent: 03 February 2022 17:57
> ...
> > > > So a big issue I see with this patch is the potential queueing issues
> > > > it may introduce on Tx queues. I suspect it will cause a number of
> > > > performance regressions and deadlocks as it will change the Tx queueing
> > > > behavior for many NICs.
> > > >
> > > > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > > > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > > > needs to stop. With this change the value for igb for instance is
> > > > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > > > such the minimum Tx descriptor threshold for the driver would need to
> > > > be updated beyond 80 otherwise it is likely to deadlock the first time
> > > > it has to pause.
> > >
> > > Are these limits hard coded in Intel drivers and firmware, or do you
> > > think this can be changed ?
> >
> > This is all code in the drivers. Most drivers have them as the logic
> > is used to avoid having to return NETDEV_TX_BUSY. Basically the
> > assumption is there is a 1:1 correlation between descriptors and
> > individual frags. So most drivers would need to increase the size of
> > their Tx descriptor rings if they were optimized for a lower value.
>
> Maybe the drivers can be a little less conservative about the number
> of fragments they expect in the next message?
> There is little point requiring 49 free descriptors when the workload
> never has more than 2 or 3 fragments.
>
> Clearly you don't want to re-enable things unless there are enough
> descriptors for an skb that has generated NETDEV_TX_BUSY, but the
> current logic of 'trying to never actually return NETDEV_TX_BUSY'
> is probably over cautious.

The problem is that NETDEV_TX_BUSY can cause all sorts of issues in
terms of the flow of packets. Basically when you start having to push
packets back from the device to the qdisc you can essentially create
head-of-line blocking type scenarios which can make things like
traffic shaping that much more difficult.

> Does Linux allow skb to have a lot of short fragments?
> If dma_map isn't cheap (probably anything with an iommu or non-coherent
> > memory) then copying/merging short fragments into a pre-mapped
> buffer can easily be faster.

I know Linux skbs can have a lot of short fragments. The i40e has a
workaround for cases where more than 8 fragments are needed to
transmit a single frame for instance, see __i40e_chk_linearize().

> Many years ago we found it was worth copying anything under 1k on
> a sparc mbus+sbus system.
> I don't think Linux can generate what I've seen elsewhere - the mac
> > driver being asked to transmit something with 1000+ one-byte fragments!
>
>         David

Linux cannot generate the 1000+ fragments, mainly because it is
limited by the frags. However as I pointed out above it isn't uncommon
to see an skb composed of a number of smaller fragments.

That said, I don't know if we really need to be rewriting the code for
NETDEV_TX_BUSY handling on the drivers. It would just be a matter of
reserving more memory in the descriptor rings since the counts would
be going from 42 to 98 in order to unblock a Tx queue in the case of
igb for instance, and currently the minimum ring size is 80. So in
this case it would just be a matter of increasing the minimum so that
it cannot be configured into a deadlock.

Ultimately that is the trade-off with this approach. What we are doing
is increasing the memory footprint of the drivers and skbs in order to
allow for more buffering in the skb to increase throughput. I wonder
if it wouldn't make sense to just make MAX_SKB_FRAGS a driver level
setting like gso_max_size so that low end NICs out there aren't having
to reserve a ton of memory to store fragments they will never use.

- Alex


end of thread, other threads:[~2022-02-04 15:46 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-03  1:51 [PATCH net-next 00/15] tcp: BIG TCP implementation Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 01/15] net: add netdev->tso_ipv6_max_size attribute Eric Dumazet
2022-02-03 16:34   ` Jakub Kicinski
2022-02-03 16:56     ` Eric Dumazet
2022-02-03 18:58       ` Jakub Kicinski
2022-02-03 19:12         ` Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 02/15] ipv6: add dev->gso_ipv6_max_size Eric Dumazet
2022-02-03  8:57   ` Paolo Abeni
2022-02-03 15:34     ` Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 03/15] tcp_cubic: make hystart_ack_delay() aware of BIG TCP Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 04/15] ipv6: add struct hop_jumbo_hdr definition Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 05/15] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
2022-02-03 18:53   ` Alexander H Duyck
2022-02-03 19:17     ` Eric Dumazet
2022-02-03 19:45       ` Alexander Duyck
2022-02-03 19:59         ` Eric Dumazet
2022-02-03 21:08           ` Alexander H Duyck
2022-02-03 21:41             ` Eric Dumazet
2022-02-04  0:05               ` Alexander Duyck
2022-02-04  0:27                 ` Eric Dumazet
2022-02-04  1:14                   ` Eric Dumazet
2022-02-04  1:48                     ` Eric Dumazet
2022-02-04  2:15                       ` Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 06/15] ipv6/gro: insert " Eric Dumazet
2022-02-03  9:19   ` Paolo Abeni
2022-02-03 15:48     ` Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 07/15] ipv6: add GRO_IPV6_MAX_SIZE Eric Dumazet
2022-02-03  2:18   ` Eric Dumazet
2022-02-03 10:44   ` Paolo Abeni
2022-02-03  1:51 ` [PATCH net-next 08/15] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
2022-02-03  9:07   ` Paolo Abeni
2022-02-03 16:31     ` Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS Eric Dumazet
2022-02-03  5:02   ` kernel test robot
2022-02-03  5:20     ` Eric Dumazet
2022-02-03  5:31       ` Jakub Kicinski
2022-02-03  6:35         ` Eric Dumazet
2022-02-03  5:23   ` kernel test robot
2022-02-03  5:43   ` kernel test robot
2022-02-03 16:01   ` Paolo Abeni
2022-02-03 17:26   ` Alexander H Duyck
2022-02-03 17:34     ` Eric Dumazet
2022-02-03 17:56       ` Alexander Duyck
2022-02-03 19:18         ` Jakub Kicinski
2022-02-03 19:20           ` Eric Dumazet
2022-02-03 19:54             ` Eric Dumazet
2022-02-04 10:18         ` David Laight
2022-02-04 15:46           ` Alexander Duyck
2022-02-03  1:51 ` [PATCH net-next 10/15] net: loopback: enable BIG TCP packets Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 11/15] bonding: update dev->tso_ipv6_max_size Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 12/15] macvlan: enable BIG TCP Packets Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 13/15] ipvlan: " Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 14/15] mlx4: support BIG TCP packets Eric Dumazet
2022-02-03 13:04   ` Tariq Toukan
2022-02-03 15:54     ` Eric Dumazet
2022-02-03  1:51 ` [PATCH net-next 15/15] mlx5: " Eric Dumazet
2022-02-03  7:27   ` Tariq Toukan
2022-02-04  4:03   ` kernel test robot
