* [PATCH v4 net-next 00/12] tcp: BIG TCP implementation
@ 2022-05-06 15:30 Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 01/12] net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes Eric Dumazet
                   ` (11 more replies)
  0 siblings, 12 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

This series implements BIG TCP as presented in netdev 0x15:

https://netdevconf.info/0x15/session.html?BIG-TCP

Jonathan Corbet made a nice summary: https://lwn.net/Articles/884104/

The standard TSO/GRO packet limit is 64KB.

With BIG TCP, we allow bigger TSO/GRO packet sizes for IPv6 traffic.

Note that this feature is not enabled by default, because it might
break some eBPF programs that assume the TCP header immediately follows
the IPv6 header.

While tcpdump recognizes the HBH/Jumbo header, standard pcap filters
are unable to skip over IPv6 extension headers.
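
To illustrate what such programs must handle, here is a minimal C
sketch (hypothetical, not part of this series) of locating the TCP
header when the HBH/Jumbo option is present; it assumes linear headers
and the hop_jumbo_hdr layout introduced later in this series, and real
eBPF or pcap code would also need bounds checks:

static struct tcphdr *find_tcp_header(struct ipv6hdr *ip6)
{
	void *p = ip6 + 1;

	if (ip6->nexthdr == IPPROTO_TCP)	/* common case */
		return p;
	if (ip6->nexthdr == NEXTHDR_HOP)	/* BIG TCP: skip the 8-byte HBH/Jumbo header */
		return p + sizeof(struct hop_jumbo_hdr);
	return NULL;				/* other extension headers: give up */
}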

Reducing the number of packets traversing the networking stack usually
improves performance, as shown in this experiment using a 100Gbit NIC and a 4K MTU.

'Standard' performance with the current (74KB) limits:
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
77           138          183          8542.19    
79           143          178          8215.28    
70           117          164          9543.39    
80           144          176          8183.71    
78           126          155          9108.47    
80           146          184          8115.19    
71           113          165          9510.96    
74           113          164          9518.74    
79           137          178          8575.04    
73           111          171          9561.73    

Now enable BIG TCP on both hosts.

ip link set dev eth0 gro_ipv6_max_size 185000 gso_ipv6_max_size 185000
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
57           83           117          13871.38   
64           118          155          11432.94   
65           116          148          11507.62   
60           105          136          12645.15   
60           103          135          12760.34   
60           102          134          12832.64   
62           109          132          10877.68   
58           82           115          14052.93   
57           83           124          14212.58   
57           82           119          14196.01   

We see an increase in transactions per second, as well as lower latencies.

v4: Rebased on top of Jakub's series (Merge branch 'tso-gso-limit-split');
    tso_max_size is now family independent.

v3: Fixed a typo in RFC number (Alexander)
    Added Reviewed-by: tags from Tariq on mlx4/mlx5 parts.

v2: Removed the MAX_SKB_FRAGS change; this belongs to a different series.
    Addressed feedback from Alexander and the NVIDIA folks.


Coco Li (4):
  ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  ipv6: add IFLA_GRO_IPV6_MAX_SIZE
  ipv6: Add hop-by-hop header to jumbograms in ip6_output
  mlx5: support BIG TCP packets

Eric Dumazet (8):
  net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes
  tcp_cubic: make hystart_ack_delay() aware of BIG TCP
  ipv6: add struct hop_jumbo_hdr definition
  ipv6/gso: remove temporary HBH/jumbo header
  ipv6/gro: insert temporary HBH/jumbo header
  net: loopback: enable BIG TCP packets
  veth: enable BIG TCP packets
  mlx4: support BIG TCP packets

 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  3 +
 drivers/net/ethernet/mellanox/mlx4/en_tx.c    | 47 +++++++++--
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 84 +++++++++++++++----
 drivers/net/loopback.c                        |  2 +
 drivers/net/veth.c                            |  1 +
 include/linux/ipv6.h                          |  1 +
 include/linux/netdevice.h                     |  5 ++
 include/net/ipv6.h                            | 44 ++++++++++
 include/uapi/linux/if_link.h                  |  4 +
 net/core/dev.c                                |  3 +
 net/core/gro.c                                | 20 ++++-
 net/core/rtnetlink.c                          | 51 +++++++++++
 net/core/sock.c                               |  8 ++
 net/ipv4/tcp_cubic.c                          |  4 +-
 net/ipv6/ip6_offload.c                        | 56 ++++++++++++-
 net/ipv6/ip6_output.c                         | 22 ++++-
 tools/include/uapi/linux/if_link.h            |  4 +
 18 files changed, 326 insertions(+), 34 deletions(-)

-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 01/12] net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE Eric Dumazet
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS
are used to report the device TSO limits to user space.

ip -d link sh dev eth1
...
   tso_max_size 65536 tso_max_segs 65535
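
For illustration, a minimal user-space sketch (hypothetical, not part
of this patch) that reads the new attributes over a raw NETLINK_ROUTE
socket; it assumes if_link.h headers updated with this patch, and
omits error handling and multi-part message handling for brevity:

#include <linux/rtnetlink.h>
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifi;
	} req = {
		.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
		.nlh.nlmsg_type  = RTM_GETLINK,
		.nlh.nlmsg_flags = NLM_F_REQUEST,
		.ifi.ifi_family  = AF_UNSPEC,
		.ifi.ifi_index   = if_nametoindex("eth1"),
	};
	char buf[16384];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	ssize_t n;

	send(fd, &req, req.nlh.nlmsg_len, 0);
	n = recv(fd, buf, sizeof(buf), 0);

	for (struct nlmsghdr *nh = (void *)buf; NLMSG_OK(nh, n);
	     nh = NLMSG_NEXT(nh, n)) {
		struct ifinfomsg *ifi = NLMSG_DATA(nh);
		int len = nh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
		struct rtattr *rta;

		for (rta = IFLA_RTA(ifi); RTA_OK(rta, len);
		     rta = RTA_NEXT(rta, len)) {
			if (rta->rta_type == IFLA_TSO_MAX_SIZE)
				printf("tso_max_size %u\n",
				       *(unsigned int *)RTA_DATA(rta));
			else if (rta->rta_type == IFLA_TSO_MAX_SEGS)
				printf("tso_max_segs %u\n",
				       *(unsigned int *)RTA_DATA(rta));
		}
	}
	close(fd);
	return 0;
}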

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/uapi/linux/if_link.h       | 2 ++
 net/core/rtnetlink.c               | 6 ++++++
 tools/include/uapi/linux/if_link.h | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index d1e600816b82c2e73c3e0684c66ddf9841a75b04..5f58dcfe2787f308bb2aa5777cca0816dd32bbb9 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -368,6 +368,8 @@ enum {
 	IFLA_PARENT_DEV_NAME,
 	IFLA_PARENT_DEV_BUS_NAME,
 	IFLA_GRO_MAX_SIZE,
+	IFLA_TSO_MAX_SIZE,
+	IFLA_TSO_MAX_SEGS,
 
 	__IFLA_MAX
 };
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6d4b9272995b7a9b9f83b8868845222f4785edf..512ed661204e0c66c8dcfaddc3001500d10f63ab 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1064,6 +1064,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SEGS */
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_TSO_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_TSO_MAX_SEGS */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
@@ -1769,6 +1771,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GSO_MAX_SEGS, dev->gso_max_segs) ||
 	    nla_put_u32(skb, IFLA_GSO_MAX_SIZE, dev->gso_max_size) ||
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
+	    nla_put_u32(skb, IFLA_TSO_MAX_SIZE, dev->tso_max_size) ||
+	    nla_put_u32(skb, IFLA_TSO_MAX_SEGS, dev->tso_max_segs) ||
 #ifdef CONFIG_RPS
 	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 #endif
@@ -1922,6 +1926,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NEW_IFINDEX]	= NLA_POLICY_MIN(NLA_S32, 1),
 	[IFLA_PARENT_DEV_NAME]	= { .type = NLA_NUL_STRING },
 	[IFLA_GRO_MAX_SIZE]	= { .type = NLA_U32 },
+	[IFLA_TSO_MAX_SIZE]	= { .type = NLA_REJECT },
+	[IFLA_TSO_MAX_SEGS]	= { .type = NLA_REJECT },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index e1ba2d51b717b7ac7f06e94ac9791cf4c8a5ab6f..b339bf2196ca160ed3040615ae624b9a028562fb 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -348,6 +348,8 @@ enum {
 	IFLA_PARENT_DEV_NAME,
 	IFLA_PARENT_DEV_BUS_NAME,
 	IFLA_GRO_MAX_SIZE,
+	IFLA_TSO_MAX_SIZE,
+	IFLA_TSO_MAX_SEGS,
 
 	__IFLA_MAX
 };
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 01/12] net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 20:48   ` Alexander H Duyck
  2022-05-06 15:30 ` [PATCH v4 net-next 03/12] tcp_cubic: make hystart_ack_delay() aware of BIG TCP Eric Dumazet
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Coco Li <lixiaoyan@google.com>

This enables the IPv6/TCP stacks to build TSO packets bigger than
64KB if the driver is LSOv2 compatible.

This patch introduces a new variable, gso_ipv6_max_size,
that is modifiable through ip link:

ip link set dev eth0 gso_ipv6_max_size 185000

User input is capped by the driver limit (tso_max_size).

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h          |  2 ++
 include/uapi/linux/if_link.h       |  1 +
 net/core/dev.c                     |  2 ++
 net/core/rtnetlink.c               | 23 +++++++++++++++++++++++
 net/core/sock.c                    |  8 ++++++++
 tools/include/uapi/linux/if_link.h |  1 +
 6 files changed, 37 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8cf0ac616cb9b7279e643cf6630f42c807742244..47f413dac12e901700045f4b73d47ecdca0f4f3c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1917,6 +1917,7 @@ enum netdev_ml_priv_type {
  *	@rtnl_link_ops:	Rtnl_link_ops
  *
  *	@gso_max_size:	Maximum size of generic segmentation offload
+ *	@gso_ipv6_max_size:	Maximum size of IPv6 GSO packets (user/admin limit)
  *	@tso_max_size:	Device (as in HW) limit on the max TSO request size
  *	@gso_max_segs:	Maximum number of segments that can be passed to the
  *			NIC for GSO
@@ -2264,6 +2265,7 @@ struct net_device {
 	/* for setting kernel sock attribute on TCP connection setup */
 #define GSO_MAX_SIZE		65536
 	unsigned int		gso_max_size;
+	unsigned int		gso_ipv6_max_size;
 #define TSO_LEGACY_MAX_SIZE	65536
 #define TSO_MAX_SIZE		UINT_MAX
 	unsigned int		tso_max_size;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 5f58dcfe2787f308bb2aa5777cca0816dd32bbb9..aa05fc9cc23f4ccf92f4cbba57f43472749cd42a 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -370,6 +370,7 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
+	IFLA_GSO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index f036ccb61da4da3ffc52c4f2402427054b831e8a..aa8757215b2a9f14683f95086732668eb99a875b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10607,6 +10607,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->gro_max_size = GRO_MAX_SIZE;
 	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
 	dev->tso_max_segs = TSO_MAX_SEGS;
+	dev->gso_ipv6_max_size = GSO_MAX_SIZE;
+
 	dev->upper_level = 1;
 	dev->lower_level = 1;
 #ifdef CONFIG_LOCKDEP
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 512ed661204e0c66c8dcfaddc3001500d10f63ab..847cf80f81754451e5f220f846db734a7625695b 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1066,6 +1066,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SEGS */
+	       + nla_total_size(4) /* IFLA_GSO_IPV6_MAX_SIZE */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
@@ -1773,6 +1774,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SIZE, dev->tso_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SEGS, dev->tso_max_segs) ||
+	    nla_put_u32(skb, IFLA_GSO_IPV6_MAX_SIZE, dev->gso_ipv6_max_size) ||
 #ifdef CONFIG_RPS
 	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 #endif
@@ -1928,6 +1930,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_GRO_MAX_SIZE]	= { .type = NLA_U32 },
 	[IFLA_TSO_MAX_SIZE]	= { .type = NLA_REJECT },
 	[IFLA_TSO_MAX_SEGS]	= { .type = NLA_REJECT },
+	[IFLA_GSO_IPV6_MAX_SIZE] = { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2644,6 +2647,14 @@ static int do_set_proto_down(struct net_device *dev,
 	return 0;
 }
 
+static void netif_set_gso_ipv6_max_size(struct net_device *dev,
+					unsigned int size)
+{
+	size = min(size, dev->tso_max_size);
+	/* Paired with READ_ONCE() in sk_setup_caps() */
+	WRITE_ONCE(dev->gso_ipv6_max_size, size);
+}
+
 #define DO_SETLINK_MODIFIED	0x01
 /* notify flag means notify + modified. */
 #define DO_SETLINK_NOTIFY	0x03
@@ -2820,6 +2831,15 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_GSO_IPV6_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]);
+
+		if (dev->gso_ipv6_max_size ^ max_size) {
+			netif_set_gso_ipv6_max_size(dev, max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
 	if (tb[IFLA_GSO_MAX_SEGS]) {
 		u32 max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
 
@@ -3283,6 +3303,9 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
 		netif_set_gso_max_segs(dev, nla_get_u32(tb[IFLA_GSO_MAX_SEGS]));
 	if (tb[IFLA_GRO_MAX_SIZE])
 		netif_set_gro_max_size(dev, nla_get_u32(tb[IFLA_GRO_MAX_SIZE]));
+	if (tb[IFLA_GSO_IPV6_MAX_SIZE])
+		netif_set_gso_ipv6_max_size(dev,
+			nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]));
 
 	return dev;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 6b287eb5427b32865d25fc22122fefeff3a4ccf5..4a29c3bf6b95f76280d8e32e903a0916322d5c4f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2312,6 +2312,14 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
 			sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
+#if IS_ENABLED(CONFIG_IPV6)
+			if (sk->sk_family == AF_INET6 &&
+			    sk_is_tcp(sk) &&
+			    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) {
+				/* Paired with WRITE_ONCE() in netif_set_gso_ipv6_max_size() */
+				sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_ipv6_max_size);
+			}
+#endif
 			sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
 			max_segs = max_t(u32, READ_ONCE(dst->dev->gso_max_segs), 1);
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index b339bf2196ca160ed3040615ae624b9a028562fb..443eddd285f37198566fa1357f0d394ec5270ab9 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -350,6 +350,7 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
+	IFLA_GSO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 03/12] tcp_cubic: make hystart_ack_delay() aware of BIG TCP
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 01/12] net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 04/12] ipv6: add struct hop_jumbo_hdr definition Eric Dumazet
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

hystart_ack_delay() assumed that a TSO packet
would not be bigger than GSO_MAX_SIZE.

This will no longer be true.

Use sk->sk_gso_max_size instead.

This reduces the chances of spurious Hystart ACK-train detections.
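
For a concrete (hypothetical) example: with sk_pacing_rate around
1.25e9 bytes/sec (~10Gbit), the old constant gives a cushion of about
65536 * 4 * 10^6 / 1.25e9 ~= 210 usec, while sk_gso_max_size raised to
185000 gives ~= 592 usec, still capped at 1 ms.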

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cubic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index b0918839bee7cf0264ec3bbcdfc1417daa86d197..68178e7280ce24c26a48e48a51518d759e4d1718 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -372,7 +372,7 @@ static void cubictcp_state(struct sock *sk, u8 new_state)
  * We apply another 100% factor because @rate is doubled at this point.
  * We cap the cushion to 1ms.
  */
-static u32 hystart_ack_delay(struct sock *sk)
+static u32 hystart_ack_delay(const struct sock *sk)
 {
 	unsigned long rate;
 
@@ -380,7 +380,7 @@ static u32 hystart_ack_delay(struct sock *sk)
 	if (!rate)
 		return 0;
 	return min_t(u64, USEC_PER_MSEC,
-		     div64_ul((u64)GSO_MAX_SIZE * 4 * USEC_PER_SEC, rate));
+		     div64_ul((u64)sk->sk_gso_max_size * 4 * USEC_PER_SEC, rate));
 }
 
 static void hystart_update(struct sock *sk, u32 delay)
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 04/12] ipv6: add struct hop_jumbo_hdr definition
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (2 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 03/12] tcp_cubic: make hystart_ack_delay() aware of BIG TCP Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 05/12] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The following patches will need to add and remove local IPv6 jumbogram
options to enable BIG TCP.
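
For reference, a minimal sketch (not part of this patch) of how the
RFC 2675 option is filled in, matching what later patches in this
series do in ipv6_gro_complete() and ip6_xmit():

static void build_hop_jumbo(struct hop_jumbo_hdr *h, u8 proto, u32 plen)
{
	h->nexthdr = proto;		/* protocol following the HBH header */
	h->hdrlen = 0;			/* HBH length: (0 + 1) * 8 = 8 bytes */
	h->tlv_type = IPV6_TLV_JUMBO;	/* 0xC2 */
	h->tlv_len = 4;			/* 4 bytes of option data */
	h->jumbo_payload_len = htonl(plen);
}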

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ipv6.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 213612f1680c7c39f4c07f0c05b4e6cf34a7878e..63d019953c47ea03d3b723a58c25e83c249489a9 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -151,6 +151,17 @@ struct frag_hdr {
 	__be32	identification;
 };
 
+/*
+ * Jumbo payload option, as described in RFC 2675, section 2.
+ */
+struct hop_jumbo_hdr {
+	u8	nexthdr;
+	u8	hdrlen;
+	u8	tlv_type;	/* IPV6_TLV_JUMBO, 0xC2 */
+	u8	tlv_len;	/* 4 */
+	__be32	jumbo_payload_len;
+};
+
 #define	IP6_MF		0x0001
 #define	IP6_OFFSET	0xFFF8
 
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 05/12] ipv6/gso: remove temporary HBH/jumbo header
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (3 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 04/12] ipv6: add struct hop_jumbo_hdr definition Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 06/12] ipv6/gro: insert " Eric Dumazet
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The IPv6 TCP and GRO stacks will soon be able to build big TCP packets,
with an added temporary Hop-by-Hop header.

If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.

v2: perform HBH removal from ipv6_gso_segment() instead of
    skb_segment() (Alexander feedback)
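
For illustration, the removal turns:

 [Ethernet][IPv6 (nexthdr=NEXTHDR_HOP)][HBH/Jumbo][TCP][payload]

back into:

 [Ethernet][IPv6 (nexthdr=IPPROTO_TCP)][TCP][payload]

by moving the Ethernet and IPv6 headers forward by 8 bytes before the
regular segmentation code runs.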

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ipv6.h     | 33 +++++++++++++++++++++++++++++++++
 net/ipv6/ip6_offload.c | 24 +++++++++++++++++++++++-
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 63d019953c47ea03d3b723a58c25e83c249489a9..b6df0314aa02dd1c4094620145ccb24da7195b2b 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -467,6 +467,39 @@ bool ipv6_opt_accepted(const struct sock *sk, const struct sk_buff *skb,
 struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
 					   struct ipv6_txoptions *opt);
 
+/* This helper is specialized for BIG TCP needs.
+ * It assumes the hop_jumbo_hdr will immediately follow the IPV6 header.
+ * It assumes headers are already in skb->head.
+ * Returns 0, or IPPROTO_TCP if a BIG TCP packet is there.
+ */
+static inline int ipv6_has_hopopt_jumbo(const struct sk_buff *skb)
+{
+	const struct hop_jumbo_hdr *jhdr;
+	const struct ipv6hdr *nhdr;
+
+	if (likely(skb->len <= GRO_MAX_SIZE))
+		return 0;
+
+	if (skb->protocol != htons(ETH_P_IPV6))
+		return 0;
+
+	if (skb_network_offset(skb) +
+	    sizeof(struct ipv6hdr) +
+	    sizeof(struct hop_jumbo_hdr) > skb_headlen(skb))
+		return 0;
+
+	nhdr = ipv6_hdr(skb);
+
+	if (nhdr->nexthdr != NEXTHDR_HOP)
+		return 0;
+
+	jhdr = (const struct hop_jumbo_hdr *) (nhdr + 1);
+	if (jhdr->tlv_type != IPV6_TLV_JUMBO || jhdr->hdrlen != 0 ||
+	    jhdr->nexthdr != IPPROTO_TCP)
+		return 0;
+	return jhdr->nexthdr;
+}
+
 static inline bool ipv6_accept_ra(struct inet6_dev *idev)
 {
 	/* If forwarding is enabled, RA are not accepted unless the special
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index c4fc03c1ac99dbecd92e2b47b2db65374197434d..a6a6c1539c28d242ef8c35fcd5ce900512ce912d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -77,7 +77,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 	struct sk_buff *segs = ERR_PTR(-EINVAL);
 	struct ipv6hdr *ipv6h;
 	const struct net_offload *ops;
-	int proto;
+	int proto, nexthdr;
 	struct frag_hdr *fptr;
 	unsigned int payload_len;
 	u8 *prevhdr;
@@ -87,6 +87,28 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 	bool gso_partial;
 
 	skb_reset_network_header(skb);
+	nexthdr = ipv6_has_hopopt_jumbo(skb);
+	if (nexthdr) {
+		const int hophdr_len = sizeof(struct hop_jumbo_hdr);
+		int err;
+
+		err = skb_cow_head(skb, 0);
+		if (err < 0)
+			return ERR_PTR(err);
+
+		/* remove the HBH header.
+		 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+		 */
+		memmove(skb_mac_header(skb) + hophdr_len,
+			skb_mac_header(skb),
+			ETH_HLEN + sizeof(struct ipv6hdr));
+		skb->data += hophdr_len;
+		skb->len -= hophdr_len;
+		skb->network_header += hophdr_len;
+		skb->mac_header += hophdr_len;
+		ipv6h = (struct ipv6hdr *)skb->data;
+		ipv6h->nexthdr = nexthdr;
+	}
 	nhoff = skb_network_header(skb) - skb_mac_header(skb);
 	if (unlikely(!pskb_may_pull(skb, sizeof(*ipv6h))))
 		goto out;
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 06/12] ipv6/gro: insert temporary HBH/jumbo header
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (4 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 05/12] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE Eric Dumazet
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The following patch will add IFLA_GRO_IPV6_MAX_SIZE, allowing GRO to
build BIG TCP IPv6 packets (bigger than 64KB).

This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that the resulting packet can go through the IPv6/TCP stacks.
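
For illustration, once the aggregated payload exceeds IPV6_MAXPLEN,
the (purely local) packet layout becomes:

 [Ethernet][IPv6 (nexthdr=NEXTHDR_HOP, payload_len=0)]
 [HBH/Jumbo (jumbo_payload_len=payload+8)][TCP][payload]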

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/ip6_offload.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index a6a6c1539c28d242ef8c35fcd5ce900512ce912d..d12dba2dd5354dbb79bb80df4038dec2544cddeb 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -342,15 +342,43 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
 INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 {
 	const struct net_offload *ops;
-	struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + nhoff);
+	struct ipv6hdr *iph;
 	int err = -ENOSYS;
+	u32 payload_len;
 
 	if (skb->encapsulation) {
 		skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
 		skb_set_inner_network_header(skb, nhoff);
 	}
 
-	iph->payload_len = htons(skb->len - nhoff - sizeof(*iph));
+	payload_len = skb->len - nhoff - sizeof(*iph);
+	if (unlikely(payload_len > IPV6_MAXPLEN)) {
+		struct hop_jumbo_hdr *hop_jumbo;
+		int hoplen = sizeof(*hop_jumbo);
+
+		/* Move network header left */
+		memmove(skb_mac_header(skb) - hoplen, skb_mac_header(skb),
+			skb->transport_header - skb->mac_header);
+		skb->data -= hoplen;
+		skb->len += hoplen;
+		skb->mac_header -= hoplen;
+		skb->network_header -= hoplen;
+		iph = (struct ipv6hdr *)(skb->data + nhoff);
+		hop_jumbo = (struct hop_jumbo_hdr *)(iph + 1);
+
+		/* Build hop-by-hop options */
+		hop_jumbo->nexthdr = iph->nexthdr;
+		hop_jumbo->hdrlen = 0;
+		hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
+		hop_jumbo->tlv_len = 4;
+		hop_jumbo->jumbo_payload_len = htonl(payload_len + hoplen);
+
+		iph->nexthdr = NEXTHDR_HOP;
+		iph->payload_len = 0;
+	} else {
+		iph = (struct ipv6hdr *)(skb->data + nhoff);
+		iph->payload_len = htons(payload_len);
+	}
 
 	nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
 	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (5 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 06/12] ipv6/gro: insert " Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 21:06   ` Alexander H Duyck
  2022-05-06 15:30 ` [PATCH v4 net-next 08/12] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Coco Li <lixiaoyan@google.com>

Enable GRO to have an IPv6-specific limit for the max packet size.

This patch introduces a new dev->gro_ipv6_max_size
that is modifiable through ip link:

ip link set dev eth0 gro_ipv6_max_size 185000

Note that this value is only considered if bigger than gro_max_size,
and only for non-encapsulated TCP/IPv6 packets. For example, with
gro_max_size left at 65536 and gro_ipv6_max_size set to 185000, only
plain TCP/IPv6 flows are aggregated beyond 64KB.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h          |  3 +++
 include/uapi/linux/if_link.h       |  1 +
 net/core/dev.c                     |  1 +
 net/core/gro.c                     | 20 ++++++++++++++++++--
 net/core/rtnetlink.c               | 22 ++++++++++++++++++++++
 tools/include/uapi/linux/if_link.h |  1 +
 6 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 47f413dac12e901700045f4b73d47ecdca0f4f3c..df12c9843d94cb847e0ce5ba1b3b36bde7d476ed 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1962,6 +1962,8 @@ enum netdev_ml_priv_type {
  *			keep a list of interfaces to be deleted.
  *	@gro_max_size:	Maximum size of aggregated packet in generic
  *			receive offload (GRO)
+ *	@gro_ipv6_max_size:	Maximum size of aggregated packet in generic
+ *				receive offload (GRO), for IPv6
  *
  *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
  *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
@@ -2154,6 +2156,7 @@ struct net_device {
 	int			napi_defer_hard_irqs;
 #define GRO_MAX_SIZE		65536
 	unsigned int		gro_max_size;
+	unsigned int		gro_ipv6_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index aa05fc9cc23f4ccf92f4cbba57f43472749cd42a..9ece3a391105c171057cc491c1458ee8a45e07e0 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -371,6 +371,7 @@ enum {
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
 	IFLA_GSO_IPV6_MAX_SIZE,
+	IFLA_GRO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index aa8757215b2a9f14683f95086732668eb99a875b..582b7fe052a6fb06437f95bd6a451b79e188cc57 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10608,6 +10608,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
 	dev->tso_max_segs = TSO_MAX_SEGS;
 	dev->gso_ipv6_max_size = GSO_MAX_SIZE;
+	dev->gro_ipv6_max_size = GRO_MAX_SIZE;
 
 	dev->upper_level = 1;
 	dev->lower_level = 1;
diff --git a/net/core/gro.c b/net/core/gro.c
index 78110edf5d4b36d2fa6f8a2676096efe0112aa0e..8b35403dd7e909a8d7df591d952a4600c13f360b 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -161,11 +161,27 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	unsigned int new_truesize;
 	struct sk_buff *lp;
 
+	if (unlikely(NAPI_GRO_CB(skb)->flush))
+		return -E2BIG;
+
 	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
 	gro_max_size = READ_ONCE(p->dev->gro_max_size);
 
-	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
-		return -E2BIG;
+	if (unlikely(p->len + len >= gro_max_size)) {
+		/* pairs with WRITE_ONCE() in netif_set_gro_ipv6_max_size() */
+		unsigned int gro6_max_size = READ_ONCE(p->dev->gro_ipv6_max_size);
+
+		if (gro6_max_size > gro_max_size &&
+		    p->protocol == htons(ETH_P_IPV6) &&
+		    skb_headroom(p) >= sizeof(struct hop_jumbo_hdr) &&
+		    ipv6_hdr(p)->nexthdr == IPPROTO_TCP &&
+		    !p->encapsulation)
+			gro_max_size = gro6_max_size;
+
+		if (p->len + len >= gro_max_size)
+			return -E2BIG;
+	}
+
 
 	lp = NAPI_GRO_CB(p)->last;
 	pinfo = skb_shinfo(lp);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 847cf80f81754451e5f220f846db734a7625695b..5fa3ff835aaf6601c31458ec88e88837d353eabd 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1067,6 +1067,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SEGS */
 	       + nla_total_size(4) /* IFLA_GSO_IPV6_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GRO_IPV6_MAX_SIZE */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
@@ -1775,6 +1776,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_TSO_MAX_SIZE, dev->tso_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SEGS, dev->tso_max_segs) ||
 	    nla_put_u32(skb, IFLA_GSO_IPV6_MAX_SIZE, dev->gso_ipv6_max_size) ||
+	    nla_put_u32(skb, IFLA_GRO_IPV6_MAX_SIZE, dev->gro_ipv6_max_size) ||
 #ifdef CONFIG_RPS
 	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 #endif
@@ -1931,6 +1933,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_TSO_MAX_SIZE]	= { .type = NLA_REJECT },
 	[IFLA_TSO_MAX_SEGS]	= { .type = NLA_REJECT },
 	[IFLA_GSO_IPV6_MAX_SIZE] = { .type = NLA_U32 },
+	[IFLA_GRO_IPV6_MAX_SIZE] = { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2655,6 +2658,13 @@ static void netif_set_gso_ipv6_max_size(struct net_device *dev,
 	WRITE_ONCE(dev->gso_ipv6_max_size, size);
 }
 
+static void netif_set_gro_ipv6_max_size(struct net_device *dev,
+					unsigned int size)
+{
+	/* This pairs with the READ_ONCE() in skb_gro_receive() */
+	WRITE_ONCE(dev->gro_ipv6_max_size, size);
+}
+
 #define DO_SETLINK_MODIFIED	0x01
 /* notify flag means notify + modified. */
 #define DO_SETLINK_NOTIFY	0x03
@@ -2840,6 +2850,15 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_GRO_IPV6_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tb[IFLA_GRO_IPV6_MAX_SIZE]);
+
+		if (dev->gro_ipv6_max_size ^ max_size) {
+			netif_set_gro_ipv6_max_size(dev, max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
 	if (tb[IFLA_GSO_MAX_SEGS]) {
 		u32 max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
 
@@ -3306,6 +3325,9 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
 	if (tb[IFLA_GSO_IPV6_MAX_SIZE])
 		netif_set_gso_ipv6_max_size(dev,
 			nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]));
+	if (tb[IFLA_GRO_IPV6_MAX_SIZE])
+		netif_set_gro_ipv6_max_size(dev,
+			nla_get_u32(tb[IFLA_GRO_IPV6_MAX_SIZE]));
 
 	return dev;
 }
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 443eddd285f37198566fa1357f0d394ec5270ab9..5aead1be6b99623fb6ffd31cfcfd44976eb8794f 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -351,6 +351,7 @@ enum {
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
 	IFLA_GSO_IPV6_MAX_SIZE,
+	IFLA_GRO_IPV6_MAX_SIZE,
 
 	__IFLA_MAX
 };
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 08/12] ipv6: Add hop-by-hop header to jumbograms in ip6_output
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (6 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 09/12] net: loopback: enable BIG TCP packets Eric Dumazet
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Coco Li <lixiaoyan@google.com>

Instead of simply forcing a payload_len of 0 in the IPv6 header,
implement RFC 2675 and insert a custom extension header.

Note that only the TCP stack currently generates
jumbograms, and that this extension header is purely local;
it won't be sent on a physical link.

This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
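
For a concrete (hypothetical) example: an 80000-byte TSO write gives a
seg_len larger than IPV6_MAXPLEN (65535), so ip6_xmit() switches proto
to IPPROTO_HOPOPTS, leaves the IPv6 payload_len at 0, and stores
seg_len + 8 in the option's jumbo_payload_len field.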

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/ipv6.h  |  1 +
 net/ipv6/ip6_output.c | 22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ec5ca392eaa31e83a022b1124fae6b607ba168cd..38c8203d52cbf39e523c43fe630a7b184b9991aa 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -145,6 +145,7 @@ struct inet6_skb_parm {
 #define IP6SKB_L3SLAVE         64
 #define IP6SKB_JUMBOGRAM      128
 #define IP6SKB_SEG6	      256
+#define IP6SKB_FAKEJUMBO      512
 };
 
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index afa5bd4ad167c4a40878f33773d43be85e89c32f..4081b12a01ff22ecf94a6490aef0665808407a6e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -182,7 +182,9 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
 #endif
 
 	mtu = ip6_skb_dst_mtu(skb);
-	if (skb_is_gso(skb) && !skb_gso_validate_network_len(skb, mtu))
+	if (skb_is_gso(skb) &&
+	    !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
+	    !skb_gso_validate_network_len(skb, mtu))
 		return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
 
 	if ((skb->len > mtu && !skb_is_gso(skb)) ||
@@ -252,6 +254,8 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	struct dst_entry *dst = skb_dst(skb);
 	struct net_device *dev = dst->dev;
 	struct inet6_dev *idev = ip6_dst_idev(dst);
+	struct hop_jumbo_hdr *hop_jumbo;
+	int hoplen = sizeof(*hop_jumbo);
 	unsigned int head_room;
 	struct ipv6hdr *hdr;
 	u8  proto = fl6->flowi6_proto;
@@ -259,7 +263,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	int hlimit = -1;
 	u32 mtu;
 
-	head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dev);
+	head_room = sizeof(struct ipv6hdr) + hoplen + LL_RESERVED_SPACE(dev);
 	if (opt)
 		head_room += opt->opt_nflen + opt->opt_flen;
 
@@ -282,6 +286,20 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 					     &fl6->saddr);
 	}
 
+	if (unlikely(seg_len > IPV6_MAXPLEN)) {
+		hop_jumbo = skb_push(skb, hoplen);
+
+		hop_jumbo->nexthdr = proto;
+		hop_jumbo->hdrlen = 0;
+		hop_jumbo->tlv_type = IPV6_TLV_JUMBO;
+		hop_jumbo->tlv_len = 4;
+		hop_jumbo->jumbo_payload_len = htonl(seg_len + hoplen);
+
+		proto = IPPROTO_HOPOPTS;
+		seg_len = 0;
+		IP6CB(skb)->flags |= IP6SKB_FAKEJUMBO;
+	}
+
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 09/12] net: loopback: enable BIG TCP packets
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (7 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 08/12] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 10/12] veth: " Eric Dumazet
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Set the driver limit to 512 KB per TSO ipv6 packet.

This allows the admin/user to set a GSO ipv6 limit up to this value.

Tested:

ip link set dev lo gso_ipv6_max_size 200000
netperf -H ::1 -t TCP_RR -l 100 -- -r 80000,80000 &

tcpdump shows:

18:28:42.962116 IP6 ::1 > ::1: HBH 40051 > 63780: Flags [P.], seq 3626480001:3626560001, ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000
18:28:42.962138 IP6 ::1.63780 > ::1.40051: Flags [.], ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 0
18:28:42.962152 IP6 ::1 > ::1: HBH 63780 > 40051: Flags [P.], seq 3626560001:3626640001, ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000
18:28:42.962157 IP6 ::1.40051 > ::1.63780: Flags [.], ack 3626640001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 0
18:28:42.962180 IP6 ::1 > ::1: HBH 40051 > 63780: Flags [P.], seq 3626560001:3626640001, ack 3626640001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000
18:28:42.962214 IP6 ::1.63780 > ::1.40051: Flags [.], ack 3626640001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179265], length 0
18:28:42.962228 IP6 ::1 > ::1: HBH 63780 > 40051: Flags [P.], seq 3626640001:3626720001, ack 3626640001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179265], length 80000
18:28:42.962233 IP6 ::1.40051 > ::1.63780: Flags [.], ack 3626720001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179266], length 0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/loopback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 720394c0639b20a2fd6262e4ee9d5813c02802f1..f504dd0817092d63f0e929500ec49cfdda562c4b 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -191,6 +191,8 @@ static void gen_lo_setup(struct net_device *dev,
 	dev->netdev_ops		= dev_ops;
 	dev->needs_free_netdev	= true;
 	dev->priv_destructor	= dev_destructor;
+
+	netif_set_tso_max_size(dev, 512 * 1024);
 }
 
 /* The loopback device is special. There is only one instance
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 10/12] veth: enable BIG TCP packets
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (8 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 09/12] net: loopback: enable BIG TCP packets Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 22:33   ` Jakub Kicinski
  2022-05-06 15:30 ` [PATCH v4 net-next 11/12] mlx4: support " Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 12/12] mlx5: " Eric Dumazet
  11 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Set the driver limit to 512 KB per TSO ipv6 packet.

This allows the admin/user to set a GSO ipv6 limit up to this value.

ip link set dev veth10 gso_ipv6_max_size 200000

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/veth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f474e79a774580e4cb67da44b5f0c796c3ce8abb..989248b0f0c64349494a54735bb5abac66a42a93 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1647,6 +1647,7 @@ static void veth_setup(struct net_device *dev)
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
+	netif_set_tso_max_size(dev, 512 * 1024);
 }
 
 /*
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 11/12] mlx4: support BIG TCP packets
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (9 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 10/12] veth: " Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 15:30 ` [PATCH v4 net-next 12/12] mlx5: " Eric Dumazet
  11 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet, Tariq Toukan

From: Eric Dumazet <edumazet@google.com>

mlx4 supports LSOv2 just fine.

The IPv6 stack inserts a temporary Hop-by-Hop header
with a JUMBO TLV for big packets.

We need to ignore the HBH header when populating the TX descriptor.

Tested:

Before: (not enabling bigger TSO/GRO packets)

ip link set dev eth0 gso_ipv6_max_size 65536 gro_ipv6_max_size 65536

netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

262144 540000 70000   70000  10.00   6591.45  0.86   1.34   62.490  97.446
262144 540000

After: (enabling bigger TSO/GRO packets)

ip link set dev eth0 gso_ipv6_max_size 185000 gro_ipv6_max_size 185000

netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

262144 540000 70000   70000  10.00   8383.95  0.95   1.01   54.432  57.584
262144 540000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  3 ++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c    | 47 +++++++++++++++----
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c61dc7ae0c056a4dbcf24297549f6b1b5cc25d92..e9b854c563ab7cf7a5f037b65d7f734577369f52 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3417,6 +3417,9 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	dev->min_mtu = ETH_MIN_MTU;
 	dev->max_mtu = priv->max_mtu;
 
+	/* supports LSOv2 packets, 512KB limit has been tested. */
+	netif_set_tso_max_size(dev, 512 * 1024);
+
 	mdev->pndev[port] = dev;
 	mdev->upper[port] = NULL;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f777151d226fb601f52366850f8c86358e214032..af3b2b59a2a6940a2839b277815ec7c3b4af1008 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -43,6 +43,7 @@
 #include <linux/ip.h>
 #include <linux/ipv6.h>
 #include <linux/indirect_call_wrapper.h>
+#include <net/ipv6.h>
 
 #include "mlx4_en.h"
 
@@ -634,19 +635,28 @@ static int get_real_size(const struct sk_buff *skb,
 			 struct net_device *dev,
 			 int *lso_header_size,
 			 bool *inline_ok,
-			 void **pfrag)
+			 void **pfrag,
+			 int *hopbyhop)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int real_size;
 
 	if (shinfo->gso_size) {
 		*inline_ok = false;
-		if (skb->encapsulation)
+		*hopbyhop = 0;
+		if (skb->encapsulation) {
 			*lso_header_size = (skb_inner_transport_header(skb) - skb->data) + inner_tcp_hdrlen(skb);
-		else
+		} else {
+			/* Detects large IPV6 TCP packets and prepares for removal of
+			 * HBH header that has been pushed by ip6_xmit(),
+			 * mainly so that tcpdump can dissect them.
+			 */
+			if (ipv6_has_hopopt_jumbo(skb))
+				*hopbyhop = sizeof(struct hop_jumbo_hdr);
 			*lso_header_size = skb_transport_offset(skb) + tcp_hdrlen(skb);
+		}
 		real_size = CTRL_SIZE + shinfo->nr_frags * DS_SIZE +
-			ALIGN(*lso_header_size + 4, DS_SIZE);
+			ALIGN(*lso_header_size - *hopbyhop + 4, DS_SIZE);
 		if (unlikely(*lso_header_size != skb_headlen(skb))) {
 			/* We add a segment for the skb linear buffer only if
 			 * it contains data */
@@ -873,6 +883,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	int desc_size;
 	int real_size;
 	u32 index, bf_index;
+	struct ipv6hdr *h6;
 	__be32 op_own;
 	int lso_header_size;
 	void *fragptr = NULL;
@@ -881,6 +892,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool stop_queue;
 	bool inline_ok;
 	u8 data_offset;
+	int hopbyhop;
 	bool bf_ok;
 
 	tx_ind = skb_get_queue_mapping(skb);
@@ -890,7 +902,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 
 	real_size = get_real_size(skb, shinfo, dev, &lso_header_size,
-				  &inline_ok, &fragptr);
+				  &inline_ok, &fragptr, &hopbyhop);
 	if (unlikely(!real_size))
 		goto tx_drop_count;
 
@@ -943,7 +955,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		data = &tx_desc->data;
 		data_offset = offsetof(struct mlx4_en_tx_desc, data);
 	} else {
-		int lso_align = ALIGN(lso_header_size + 4, DS_SIZE);
+		int lso_align = ALIGN(lso_header_size - hopbyhop + 4, DS_SIZE);
 
 		data = (void *)&tx_desc->lso + lso_align;
 		data_offset = offsetof(struct mlx4_en_tx_desc, lso) + lso_align;
@@ -1008,14 +1020,31 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			((ring->prod & ring->size) ?
 				cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
 
+		lso_header_size -= hopbyhop;
 		/* Fill in the LSO prefix */
 		tx_desc->lso.mss_hdr_size = cpu_to_be32(
 			shinfo->gso_size << 16 | lso_header_size);
 
-		/* Copy headers;
-		 * note that we already verified that it is linear */
-		memcpy(tx_desc->lso.header, skb->data, lso_header_size);
 
+		if (unlikely(hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			memcpy(tx_desc->lso.header, skb->data, ETH_HLEN + sizeof(*h6));
+			h6 = (struct ipv6hdr *)((char *)tx_desc->lso.header + ETH_HLEN);
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else {
+			/* Copy headers;
+			 * note that we already verified that it is linear
+			 */
+			memcpy(tx_desc->lso.header, skb->data, lso_header_size);
+		}
 		ring->tso_packets++;
 
 		i = shinfo->gso_segs;
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
                   ` (10 preceding siblings ...)
  2022-05-06 15:30 ` [PATCH v4 net-next 11/12] mlx4: support " Eric Dumazet
@ 2022-05-06 15:30 ` Eric Dumazet
  2022-05-06 22:34   ` Jakub Kicinski
  11 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 15:30 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet, Eric Dumazet, Tariq Toukan,
	Saeed Mahameed, Leon Romanovsky

From: Coco Li <lixiaoyan@google.com>

mlx5 supports LSOv2.

The IPv6 GRO/TCP stacks insert a temporary Hop-by-Hop header
with a JUMBO TLV for big packets.

We need to ignore/skip this HBH header when populating the TX descriptor.

Note that ipv6_has_hopopt_jumbo() only recognizes a very specific packet
layout, so mlx5e_sq_xmit_wqe() handles only that layout.

v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 84 +++++++++++++++----
 2 files changed, 69 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d27986869b8ba070d1a4f8bcdc7e14ab54ae984e..226825410a1aa55b5b7941a7389a78abdb800521 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4920,6 +4920,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 
 	netdev->priv_flags       |= IFF_UNICAST_FLT;
 
+	netif_set_tso_max_size(netdev, 512 * 1024);
 	mlx5e_set_netdev_dev_addr(netdev);
 	mlx5e_ipsec_build_netdev(priv);
 	mlx5e_ktls_build_netdev(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2dc48406cd08d21ff94f665cd61ab9227f351215..b4fc45ba1b347fb9ad0f46b9c091cc45e4d3d84f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -40,6 +40,7 @@
 #include "en_accel/en_accel.h"
 #include "en_accel/ipsec_rxtx.h"
 #include "en/ptp.h"
+#include <net/ipv6.h>
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
 {
@@ -130,23 +131,32 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		sq->stats->csum_none++;
 }
 
+/* Returns the number of header bytes that we plan
+ * to inline later in the transmit descriptor
+ */
 static inline u16
-mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
+mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 	u16 ihs;
 
+	*hopbyhop = 0;
 	if (skb->encapsulation) {
 		ihs = skb_inner_transport_offset(skb) + inner_tcp_hdrlen(skb);
 		stats->tso_inner_packets++;
 		stats->tso_inner_bytes += skb->len - ihs;
 	} else {
-		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
 			ihs = skb_transport_offset(skb) + sizeof(struct udphdr);
-		else
+		} else {
 			ihs = skb_transport_offset(skb) + tcp_hdrlen(skb);
+			if (ipv6_has_hopopt_jumbo(skb)) {
+				*hopbyhop = sizeof(struct hop_jumbo_hdr);
+				ihs -= sizeof(struct hop_jumbo_hdr);
+			}
+		}
 		stats->tso_packets++;
-		stats->tso_bytes += skb->len - ihs;
+		stats->tso_bytes += skb->len - ihs - *hopbyhop;
 	}
 
 	return ihs;
@@ -208,6 +218,7 @@ struct mlx5e_tx_attr {
 	__be16 mss;
 	u16 insz;
 	u8 opcode;
+	u8 hopbyhop;
 };
 
 struct mlx5e_tx_wqe_attr {
@@ -244,14 +255,16 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5e_sq_stats *stats = sq->stats;
 
 	if (skb_is_gso(skb)) {
-		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
+		int hopbyhop;
+		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
 
 		*attr = (struct mlx5e_tx_attr) {
 			.opcode    = MLX5_OPCODE_LSO,
 			.mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
 			.ihs       = ihs,
 			.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
-			.headlen   = skb_headlen(skb) - ihs,
+			.headlen   = skb_headlen(skb) - ihs - hopbyhop,
+			.hopbyhop  = hopbyhop,
 		};
 
 		stats->packets += skb_shinfo(skb)->gso_segs;
@@ -365,7 +378,8 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5_wqe_eth_seg  *eseg;
 	struct mlx5_wqe_data_seg *dseg;
 	struct mlx5e_tx_wqe_info *wi;
-
+	u16 ihs = attr->ihs;
+	struct ipv6hdr *h6;
 	struct mlx5e_sq_stats *stats = sq->stats;
 	int num_dma;
 
@@ -379,15 +393,36 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	eseg->mss = attr->mss;
 
-	if (attr->ihs) {
-		if (skb_vlan_tag_present(skb)) {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
-			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
+	if (ihs) {
+		u8 *start = eseg->inline_hdr.start;
+
+		if (unlikely(attr->hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			if (skb_vlan_tag_present(skb)) {
+				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
+				ihs += VLAN_HLEN;
+				h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
+			} else {
+				memcpy(start, skb->data, ETH_HLEN + sizeof(*h6));
+				h6 = (struct ipv6hdr *)(start + ETH_HLEN);
+			}
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else if (skb_vlan_tag_present(skb)) {
+			mlx5e_insert_vlan(start, skb, ihs);
+			ihs += VLAN_HLEN;
 			stats->added_vlan_packets++;
 		} else {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
-			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
+			memcpy(start, skb->data, ihs);
 		}
+		eseg->inline_hdr.sz |= cpu_to_be16(ihs);
 		dseg += wqe_attr->ds_cnt_inl;
 	} else if (skb_vlan_tag_present(skb)) {
 		eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
@@ -398,7 +433,7 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	}
 
 	dseg += wqe_attr->ds_cnt_ids;
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs + attr->hopbyhop,
 					  attr->headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
@@ -918,12 +953,29 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	eseg->mss = attr.mss;
 
 	if (attr.ihs) {
-		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		if (unlikely(attr.hopbyhop)) {
+			struct ipv6hdr *h6;
+
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));
+			h6 = (struct ipv6hdr *)((char *)eseg->inline_hdr.start + ETH_HLEN);
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else {
+			memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		}
 		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
 		dseg += wqe_attr.ds_cnt_inl;
 	}
 
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs + attr.hopbyhop,
 					  attr.headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
-- 
2.36.0.512.ge40c2bad7a-goog



* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 15:30 ` [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE Eric Dumazet
@ 2022-05-06 20:48   ` Alexander H Duyck
  2022-05-06 21:20     ` Eric Dumazet
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander H Duyck @ 2022-05-06 20:48 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet

On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> This enables ipv6/TCP stacks to build TSO packets bigger than
> 64KB if the driver is LSOv2 compatible.
> 
> This patch introduces new variable gso_ipv6_max_size
> that is modifiable through ip link.
> 
> ip link set dev eth0 gso_ipv6_max_size 185000
> 
> User input is capped by driver limit (tso_max_size)
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

So I am still not a fan of adding all this extra tooling to make an
attribute that is just being applied to one protocol. Why not just
allow gso_max_size to extend beyond 64K and only limit it by
tso_max_size?

Doing that would make this patch much simpler as most of the code below
could be dropped.

> ---
>  include/linux/netdevice.h          |  2 ++
>  include/uapi/linux/if_link.h       |  1 +
>  net/core/dev.c                     |  2 ++
>  net/core/rtnetlink.c               | 23 +++++++++++++++++++++++
>  net/core/sock.c                    |  8 ++++++++
>  tools/include/uapi/linux/if_link.h |  1 +
>  6 files changed, 37 insertions(+)

<snip>

> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 512ed661204e0c66c8dcfaddc3001500d10f63ab..847cf80f81754451e5f220f846db734a7625695b 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c

<snip>

> @@ -2820,6 +2831,15 @@ static int do_setlink(const struct sk_buff *skb,
>  		}
>  	}
>  
> +	if (tb[IFLA_GSO_IPV6_MAX_SIZE]) {
> +		u32 max_size = nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]);
> +
> +		if (dev->gso_ipv6_max_size ^ max_size) {
> +			netif_set_gso_ipv6_max_size(dev, max_size);
> +			status |= DO_SETLINK_MODIFIED;
> +		}
> +	}
> +
>  	if (tb[IFLA_GSO_MAX_SEGS]) {
>  		u32 max_segs = nla_get_u32(tb[IFLA_GSO_MAX_SEGS]);
>  

So this code wouldn't be needed, but the block above where we are
doing the check for max_size > GSO_MAX_SIZE could be removed.

> @@ -3283,6 +3303,9 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
>  		netif_set_gso_max_segs(dev, nla_get_u32(tb[IFLA_GSO_MAX_SEGS]));
>  	if (tb[IFLA_GRO_MAX_SIZE])
>  		netif_set_gro_max_size(dev, nla_get_u32(tb[IFLA_GRO_MAX_SIZE]));
> +	if (tb[IFLA_GSO_IPV6_MAX_SIZE])
> +		netif_set_gso_ipv6_max_size(dev,
> +			nla_get_u32(tb[IFLA_GSO_IPV6_MAX_SIZE]));
>  
>  	return dev;
>  }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6b287eb5427b32865d25fc22122fefeff3a4ccf5..4a29c3bf6b95f76280d8e32e903a0916322d5c4f 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2312,6 +2312,14 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
>  			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
>  			/* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
>  			sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
> +#if IS_ENABLED(CONFIG_IPV6)
> +			if (sk->sk_family == AF_INET6 &&
> +			    sk_is_tcp(sk) &&
> +			    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) {
> +				/* Paired with WRITE_ONCE() in netif_set_gso_ipv6_max_size() */
> +				sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_ipv6_max_size);
> +			}
> +#endif
>  			sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
>  			/* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
>  			max_segs = max_t(u32, READ_ONCE(dst->dev->gso_max_segs), 1);

This block here could then be rewritten as:
if (sk->sk_gso_max_size > GSO_MAX_SIZE &&
    (!IS_ENABLED(CONFIG_IPV6) || sk->sk_family != AF_INET6 ||
     !sk_is_tcp(sk) || ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)))
	sk->sk_gso_max_size = GSO_MAX_SIZE;

Then if we need protocol specific knobs in the future we could always
come back and make them act as caps instead of outright replacements
for the gso_max_size value.




* Re: [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE
  2022-05-06 15:30 ` [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE Eric Dumazet
@ 2022-05-06 21:06   ` Alexander H Duyck
  2022-05-06 21:22     ` Eric Dumazet
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander H Duyck @ 2022-05-06 21:06 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Coco Li, Eric Dumazet

On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> Enable GRO to have IPv6 specific limit for max packet size.
> 
> This patch introduces new dev->gro_ipv6_max_size
> that is modifiable through ip link.
> 
> ip link set dev eth0 gro_ipv6_max_size 185000
> 
> Note that this value is only considered if bigger than
> gro_max_size, and for non encapsulated TCP/ipv6 packets.
> 
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

This is another spot where it doesn't make much sense to me to add yet
another control. Instead it would make much more sense to simply remove
the cap from the existing control and simply add a check that caps the
non-IPv6 protocols at GRO_MAX_SIZE.

> ---
>  include/linux/netdevice.h          |  3 +++
>  include/uapi/linux/if_link.h       |  1 +
>  net/core/dev.c                     |  1 +
>  net/core/gro.c                     | 20 ++++++++++++++++++--
>  net/core/rtnetlink.c               | 22 ++++++++++++++++++++++
>  tools/include/uapi/linux/if_link.h |  1 +
>  6 files changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 47f413dac12e901700045f4b73d47ecdca0f4f3c..df12c9843d94cb847e0ce5ba1b3b36bde7d476ed 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1962,6 +1962,8 @@ enum netdev_ml_priv_type {
>   *			keep a list of interfaces to be deleted.
>   *	@gro_max_size:	Maximum size of aggregated packet in generic
>   *			receive offload (GRO)
> + *	@gro_ipv6_max_size:	Maximum size of aggregated packet in generic
> + *				receive offload (GRO), for IPv6
>   *
>   *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
>   *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
> @@ -2154,6 +2156,7 @@ struct net_device {
>  	int			napi_defer_hard_irqs;
>  #define GRO_MAX_SIZE		65536
>  	unsigned int		gro_max_size;
> +	unsigned int		gro_ipv6_max_size;
>  	rx_handler_func_t __rcu	*rx_handler;
>  	void __rcu		*rx_handler_data;
>  
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index aa05fc9cc23f4ccf92f4cbba57f43472749cd42a..9ece3a391105c171057cc491c1458ee8a45e07e0 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -371,6 +371,7 @@ enum {
>  	IFLA_TSO_MAX_SIZE,
>  	IFLA_TSO_MAX_SEGS,
>  	IFLA_GSO_IPV6_MAX_SIZE,
> +	IFLA_GRO_IPV6_MAX_SIZE,
>  
>  	__IFLA_MAX
>  };
> diff --git a/net/core/dev.c b/net/core/dev.c
> index aa8757215b2a9f14683f95086732668eb99a875b..582b7fe052a6fb06437f95bd6a451b79e188cc57 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -10608,6 +10608,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
>  	dev->tso_max_segs = TSO_MAX_SEGS;
>  	dev->gso_ipv6_max_size = GSO_MAX_SIZE;
> +	dev->gro_ipv6_max_size = GRO_MAX_SIZE;
>  
>  	dev->upper_level = 1;
>  	dev->lower_level = 1;
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 78110edf5d4b36d2fa6f8a2676096efe0112aa0e..8b35403dd7e909a8d7df591d952a4600c13f360b 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -161,11 +161,27 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  	unsigned int new_truesize;
>  	struct sk_buff *lp;
>  
> +	if (unlikely(NAPI_GRO_CB(skb)->flush))
> +		return -E2BIG;
> +
>  	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
>  	gro_max_size = READ_ONCE(p->dev->gro_max_size);
>  
> -	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
> -		return -E2BIG;

So if we just overwrite the existing gro_max_size we could skip the
changes above and all the extra netlink overhead.

> +	if (unlikely(p->len + len >= gro_max_size)) {
> +		/* pairs with WRITE_ONCE() in netif_set_gro_ipv6_max_size() */
> +		unsigned int gro6_max_size = READ_ONCE(p->dev->gro_ipv6_max_size);
> +
> +		if (gro6_max_size > gro_max_size &&
> +		    p->protocol == htons(ETH_P_IPV6) &&
> +		    skb_headroom(p) >= sizeof(struct hop_jumbo_hdr) &&
> +		    ipv6_hdr(p)->nexthdr == IPPROTO_TCP &&
> +		    !p->encapsulation)
> +			gro_max_size = gro6_max_size;
> +
> +		if (p->len + len >= gro_max_size)
> +			return -E2BIG;
> +	}
> +

Instead all we would need to do is add an extra section here along the
lines of:
	if (p->len + len > GRO_MAX_SIZE &&
	    (p->protocol != htons(ETH_P_IPV6) ||
	     skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
	     ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
	     p->encapsulation))
		return -E2BIG;




* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 20:48   ` Alexander H Duyck
@ 2022-05-06 21:20     ` Eric Dumazet
  2022-05-06 21:37       ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 21:20 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 1:48 PM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> >
> > This enables ipv6/TCP stacks to build TSO packets bigger than
> > 64KB if the driver is LSOv2 compatible.
> >
> > This patch introduces new variable gso_ipv6_max_size
> > that is modifiable through ip link.
> >
> > ip link set dev eth0 gso_ipv6_max_size 185000
> >
> > User input is capped by driver limit (tso_max_size)
> >
> > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> So I am still not a fan of adding all this extra tooling to make an
> attribute that is just being applied to one protocol. Why not just
> allow gso_max_size to extend beyond 64K and only limit it by
> tso_max_size?

Answer is easy, and documented in our paper. Please read it.

We do not want to enable BIG TCP for IPv4, this breaks user space badly.

I do not want to break tcpdump just because some people think TCP just works.

>
> Doing that would make this patch much simpler as most of the code below
> could be dropped.
>

Sure, but no thanks.


* Re: [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE
  2022-05-06 21:06   ` Alexander H Duyck
@ 2022-05-06 21:22     ` Eric Dumazet
  2022-05-06 22:01       ` Alexander Duyck
  2022-05-09 18:17       ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Alexander Duyck
  0 siblings, 2 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 21:22 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 2:06 PM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> >
> > Enable GRO to have IPv6 specific limit for max packet size.
> >
> > This patch introduces new dev->gro_ipv6_max_size
> > that is modifiable through ip link.
> >
> > ip link set dev eth0 gro_ipv6_max_size 185000
> >
> > Note that this value is only considered if bigger than
> > gro_max_size, and for non encapsulated TCP/ipv6 packets.
> >
> > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> This is another spot where it doesn't make much sense to me to add yet
> another control. Instead it would make much more sense to simply remove
> the cap from the existing control and simply add a check that caps the
> non-IPv6 protocols at GRO_MAX_SIZE.

Can you please send a diff on top of our patch series?

It is kind of hard to see what you want, and _why_ you want this.

Note that GRO_MAX_SIZE has been replaced by dev->gro_max_size last year.

Yes, yet another control, but some people want more control than others I guess.


* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 21:20     ` Eric Dumazet
@ 2022-05-06 21:37       ` Alexander Duyck
  2022-05-06 21:50         ` Eric Dumazet
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2022-05-06 21:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 2:20 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, May 6, 2022 at 1:48 PM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > > From: Coco Li <lixiaoyan@google.com>
> > >
> > > This enables ipv6/TCP stacks to build TSO packets bigger than
> > > 64KB if the driver is LSOv2 compatible.
> > >
> > > This patch introduces new variable gso_ipv6_max_size
> > > that is modifiable through ip link.
> > >
> > > ip link set dev eth0 gso_ipv6_max_size 185000
> > >
> > > User input is capped by driver limit (tso_max_size)
> > >
> > > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > So I am still not a fan of adding all this extra tooling to make an
> > attribute that is just being applied to one protocol. Why not just
> > allow gso_max_size to extend beyond 64K and only limit it by
> > tso_max_size?
>
> Answer is easy, and documented in our paper. Please read it.

I have read it.

> We do not want to enable BIG TCP for IPv4, this breaks user space badly.
>
> I do not want to break tcpdump just because some people think TCP just works.

The changes I suggested don't enable it for IPv4. What your current
code does is use dev->gso_max_size, and only for the right IPv6 type
does it use dev->gso_ipv6_max_size. What I am saying is that instead
of adding yet another netdev control, you should retain the existing
behavior when gso_max_size is at or below GSO_MAX_SIZE; once it
exceeds GSO_MAX_SIZE, all non-IPv6 types max out at GSO_MAX_SIZE and
only IPv6 packets use the larger gso_max_size.

The big thing I am not a fan of is adding protocol-level controls down
in the link-level interface. It makes things confusing. For example,
say somebody has existing scripts to limit gso_max_size, they were
using IPv6, and your new control is added. Suddenly the gso_max_size
limitations they were setting won't be applied, because you split
things up at the protocol level.


* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 21:37       ` Alexander Duyck
@ 2022-05-06 21:50         ` Eric Dumazet
  2022-05-06 22:16           ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 21:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 2:37 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, May 6, 2022 at 2:20 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, May 6, 2022 at 1:48 PM Alexander H Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > > > From: Coco Li <lixiaoyan@google.com>
> > > >
> > > > This enables ipv6/TCP stacks to build TSO packets bigger than
> > > > 64KB if the driver is LSOv2 compatible.
> > > >
> > > > This patch introduces new variable gso_ipv6_max_size
> > > > that is modifiable through ip link.
> > > >
> > > > ip link set dev eth0 gso_ipv6_max_size 185000
> > > >
> > > > User input is capped by driver limit (tso_max_size)
> > > >
> > > > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >
> > > So I am still not a fan of adding all this extra tooling to make an
> > > attribute that is just being applied to one protocol. Why not just
> > > allow gso_max_size to extend beyond 64K and only limit it by
> > > tso_max_size?
> >
> > Answer is easy, and documented in our paper. Please read it.
>
> I have read it.
>
> > We do not want to enable BIG TCP for IPv4, this breaks user space badly.
> >
> > I do not want to break tcpdump just because some people think TCP just works.
>
> The changes I suggested don't enable it for IPv4. What your current
> code is doing now is using dev->gso_max_size and if it is the correct
> IPv6 type you are using dev->gso_ipv6_max_size. What I am saying is
> that instead of adding yet another netdev control you should just make
> it so that we retain existing behavior when gso_max_size is less than
> GSO_MAX_SIZE, and when it exceeds it all non-ipv6 types max out at
> GSO_MAX_SIZE and only the ipv6 type packets use gso_max_size when you
> exceed GSO_MAX_SIZE.

gso_max_size can not exceed GSO_MAX_SIZE.
This will break many drivers.
I do not want to change hundreds of them.

Look, we chose this implementation so that chances of breaking things
are very small.
I understand this is frustrating, but I suggest you take the
responsibility for breaking things,
rather than putting it on us.


* Re: [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE
  2022-05-06 21:22     ` Eric Dumazet
@ 2022-05-06 22:01       ` Alexander Duyck
  2022-05-06 22:08         ` Eric Dumazet
  2022-05-09 18:17       ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Alexander Duyck
  1 sibling, 1 reply; 47+ messages in thread
From: Alexander Duyck @ 2022-05-06 22:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 2:22 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, May 6, 2022 at 2:06 PM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > > From: Coco Li <lixiaoyan@google.com>
> > >
> > > Enable GRO to have IPv6 specific limit for max packet size.
> > >
> > > This patch introduces new dev->gro_ipv6_max_size
> > > that is modifiable through ip link.
> > >
> > > ip link set dev eth0 gro_ipv6_max_size 185000
> > >
> > > Note that this value is only considered if bigger than
> > > gro_max_size, and for non encapsulated TCP/ipv6 packets.
> > >
> > > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > This is another spot where it doesn't make much sense to me to add yet
> > another control. Instead it would make much more sense to simply remove
> > the cap from the existing control and simply add a check that caps the
> > non-IPv6 protocols at GRO_MAX_SIZE.
>
> Can you please send a diff on top of our patch series?

I would rather not as it would essentially just be a revert of the two
problematic patches since what I am suggesting is significantly
smaller.

> It is kind of hard to see what you want, and _why_ you want this.
>
> Note that GRO_MAX_SIZE has been replaced by dev->gro_max_size last year.

I am using GRO_MAX_SIZE as a legacy value for everything that is not
IPv6. If it would help you could go back and take a look at Jakub's
patch series and see what he did with TSO_LEGACY_MAX_SIZE. You could
think of my use here as GRO_LEGACY_MAX_SIZE. What I am doing is
capping all the non-ipv6/tcp flows at the default maximum limit for
legacy setups.

> Yes, yet another control, but some people want more control than others I guess.

Basically these patches are reducing functionality from an existing
control. The g[sr]o_max_size values were applied to all incoming or
outgoing traffic. The patches are adding a special control that only
applies to a subset of ipv6 traffic. Instead of taking that route I
would rather have the max_size values allowed to exceed the legacy
limits, and in those cases that cannot support the new sizes we
default back to the legacy maxes. Doing that, I feel we would get
much more consistent behavior, and somebody wanting to use these
values for their original intended purpose, which was limiting
traffic, will be able to affect all traffic, not just the
non-ipv6/tcp traffic.


* Re: [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE
  2022-05-06 22:01       ` Alexander Duyck
@ 2022-05-06 22:08         ` Eric Dumazet
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 22:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 3:01 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, May 6, 2022 at 2:22 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, May 6, 2022 at 2:06 PM Alexander H Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > > > From: Coco Li <lixiaoyan@google.com>
> > > >
> > > > Enable GRO to have IPv6 specific limit for max packet size.
> > > >
> > > > This patch introduces new dev->gro_ipv6_max_size
> > > > that is modifiable through ip link.
> > > >
> > > > ip link set dev eth0 gro_ipv6_max_size 185000
> > > >
> > > > Note that this value is only considered if bigger than
> > > > gro_max_size, and for non encapsulated TCP/ipv6 packets.
> > > >
> > > > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >
> > > This is another spot where it doesn't make much sense to me to add yet
> > > another control. Instead it would make much more sense to simply remove
> > > the cap from the existing control and simply add a check that caps the
> > > non-IPv6 protocols at GRO_MAX_SIZE.
> >
> > Can you please send a diff on top of our patch series?
>
> I would rather not as it would essentially just be a revert of the two
> problematic patches since what I am suggesting is significantly
> smaller.
>
> > It is kind of hard to see what you want, and _why_ you want this.
> >
> > Note that GRO_MAX_SIZE has been replaced by dev->gro_max_size last year.
>
> I am using GRO_MAX_SIZE as a legacy value for everything that is not
> IPv6. If it would help you could go back and take a look at Jakub's
> patch series and see what he did with TSO_LEGACY_MAX_SIZE.

Yes, I was the one suggesting this TSO_LEGACY_MAX_SIZE.

> You could
> think of my use here as GRO_LEGACY_MAX_SIZE. What I am doing is
> capping all the non-ipv6/tcp flows at the default maximum limit for
> legacy setups.
>
> > Yes, yet another control, but some people want more control than others I guess.
>
> Basically these patches are reducing functionality from an existing
> control. The g[sr]o_max_size values were applied to all incoming or
> outgoing traffic.

Yes, and we need to change that, otherwise we are stuck at 65536,
because legacy.

> The patches are adding a special control that only applies to a subset of ipv6 traffic.

Exactly. This is not an accident.

> Instead of taking that route I
> would rather have the max_size values allowed to exceed the legacy
> limits, and in those cases that cannot support the new sizes we
> default back to the legacy maxes.

Please send a tested patch. I think it will break drivers.

We spent months doing extensive tests, and I do not see any reason to
spend more time on a suggestion that I feel is wrong.

> Doing that, I feel we would get
> much more consistent behavior, and somebody wanting to use these
> values for their original intended purpose, which was limiting
> traffic, will be able to affect all traffic, not just the
> non-ipv6/tcp traffic.

Some people (not us) want to add BIG-TCP with IPv4 as well in a future
evolution.


* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 21:50         ` Eric Dumazet
@ 2022-05-06 22:16           ` Alexander Duyck
  2022-05-06 22:25             ` Eric Dumazet
  2022-05-06 22:26             ` Jakub Kicinski
  0 siblings, 2 replies; 47+ messages in thread
From: Alexander Duyck @ 2022-05-06 22:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 2:50 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, May 6, 2022 at 2:37 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Fri, May 6, 2022 at 2:20 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, May 6, 2022 at 1:48 PM Alexander H Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > > > > From: Coco Li <lixiaoyan@google.com>
> > > > >
> > > > > This enables ipv6/TCP stacks to build TSO packets bigger than
> > > > > 64KB if the driver is LSOv2 compatible.
> > > > >
> > > > > This patch introduces new variable gso_ipv6_max_size
> > > > > that is modifiable through ip link.
> > > > >
> > > > > ip link set dev eth0 gso_ipv6_max_size 185000
> > > > >
> > > > > User input is capped by driver limit (tso_max_size)
> > > > >
> > > > > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > >
> > > > So I am still not a fan of adding all this extra tooling to make an
> > > > attribute that is just being applied to one protocol. Why not just
> > > > allow gso_max_size to extend beyond 64K and only limit it by
> > > > tso_max_size?
> > >
> > > Answer is easy, and documented in our paper. Please read it.
> >
> > I have read it.
> >
> > > We do not want to enable BIG TCP for IPv4, this breaks user space badly.
> > >
> > > I do not want to break tcpdump just because some people think TCP just works.
> >
> > The changes I suggested don't enable it for IPv4. What your current
> > code does is use dev->gso_max_size, and only for the right IPv6 type
> > does it use dev->gso_ipv6_max_size. What I am saying is that instead
> > of adding yet another netdev control, you should retain the existing
> > behavior when gso_max_size is at or below GSO_MAX_SIZE; once it
> > exceeds GSO_MAX_SIZE, all non-IPv6 types max out at GSO_MAX_SIZE and
> > only IPv6 packets use the larger gso_max_size.
>
> gso_max_size can not exceed GSO_MAX_SIZE.
> This will break many drivers.
> I do not want to change hundreds of them.

Most drivers will not be impacted because they cannot exceed
tso_max_size. The tso_max_size is the limit, not GSO_MAX_SIZE. Last I
knew, this patch set was overwriting that value to increase it beyond
the legacy limits.

Right now the check is:
if (max_size > GSO_MAX_SIZE || max_size > dev->tso_max_size)

What I am suggesting is that tso_max_size be used as the only limit,
which is already defaulted to cap out at TSO_LEGACY_MAX_SIZE. So just
remove the "max_size > GSO_MAX_SIZE ||" portion of the call. Then when
you call netif_set_tso_max_size in the driver to enable jumbograms you
are good to set gso_max_size to something larger than the standard
65536.

> Look, we chose this implementation so that chances of breaking things
> are very small.
> I understand this is frustrating, but I suggest you take the
> responsibility for breaking things,
> rather than putting it on us.

What I have been trying to point out is your patch set will break things.

For all those cases out there where people are using gso_max_size to
limit things, you just poked a hole in that for IPv6. What I am
suggesting is that we don't do that, as it is likely to trigger a
number of problems for people.

The primary reason gso_max_size was added was because there are cases
out there where doing too big of a TSO was breaking things. For
devices that are being used for LSOv2 I highly doubt they need to
worry about cases less than 65536. As such they can just max out at
65536 for all non-IPv6 traffic and instead use gso_max_size as the
limit for the IPv6/TSO case.


* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 22:16           ` Alexander Duyck
@ 2022-05-06 22:25             ` Eric Dumazet
  2022-05-06 22:26             ` Jakub Kicinski
  1 sibling, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-06 22:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 3:16 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, May 6, 2022 at 2:50 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, May 6, 2022 at 2:37 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Fri, May 6, 2022 at 2:20 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, May 6, 2022 at 1:48 PM Alexander H Duyck
> > > > <alexander.duyck@gmail.com> wrote:
> > > > >
> > > > > On Fri, 2022-05-06 at 08:30 -0700, Eric Dumazet wrote:
> > > > > > From: Coco Li <lixiaoyan@google.com>
> > > > > >
> > > > > > This enables ipv6/TCP stacks to build TSO packets bigger than
> > > > > > 64KB if the driver is LSOv2 compatible.
> > > > > >
> > > > > > This patch introduces new variable gso_ipv6_max_size
> > > > > > that is modifiable through ip link.
> > > > > >
> > > > > > ip link set dev eth0 gso_ipv6_max_size 185000
> > > > > >
> > > > > > User input is capped by driver limit (tso_max_size)
> > > > > >
> > > > > > Signed-off-by: Coco Li <lixiaoyan@google.com>
> > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > So I am still not a fan of adding all this extra tooling to make an
> > > > > attribute that is just being applied to one protocol. Why not just
> > > > > allow gso_max_size to extend beyond 64K and only limit it by
> > > > > tso_max_size?
> > > >
> > > > Answer is easy, and documented in our paper. Please read it.
> > >
> > > I have read it.
> > >
> > > > We do not want to enable BIG TCP for IPv4, this breaks user space badly.
> > > >
> > > > I do not want to break tcpdump just because some people think TCP just works.
> > >
> > > The changes I suggested don't enable it for IPv4. What your current
> > > code does is use dev->gso_max_size, and only for the right IPv6 type
> > > does it use dev->gso_ipv6_max_size. What I am saying is that instead
> > > of adding yet another netdev control, you should retain the existing
> > > behavior when gso_max_size is at or below GSO_MAX_SIZE; once it
> > > exceeds GSO_MAX_SIZE, all non-IPv6 types max out at GSO_MAX_SIZE and
> > > only IPv6 packets use the larger gso_max_size.
> >
> > gso_max_size can not exceed GSO_MAX_SIZE.
> > This will break many drivers.
> > I do not want to change hundreds of them.
>
> Most drivers will not be impacted because they cannot exceed
> tso_max_size. The tso_max_size is the limit, not GSO_MAX_SIZE. Last I
> knew this patch set is overwriting that value to increase it beyond
> the legacy limits.
>
> Right now the check is:
> if (max_size > GSO_MAX_SIZE || || max_size > dev->tso_max_size)
>

This is Jakub's patch, already merged. Nothing to do with BIG TCP?


> What I am suggesting is that tso_max_size be used as the only limit,
> which is already defaulted to cap out at TSO_LEGACY_MAX_SIZE. So just
> remove the "max_size > GSO_MAX_SIZE ||" portion of the call. Then when
> you call netif_set_tso_max_size in the driver to enable jumbograms you
> are good to set gso_max_size to something larger than the standard
> 65536.

Again, this was Jakub's patch, right?

>
> > Look, we chose this implementation so that chances of breaking things
> > are very small.
> > I understand this is frustrating, but I suggest you take the
> > responsibility of breaking things,
> > and not add this on us.
>
> What I have been trying to point out is your patch set will break things.
>

Can you give a concrete example?

> For all those cases out there where people are using gso_max_size to
> limit things you just poked a hole in that for IPv6 cases. What I am
> suggesting is that we don't do that as it will be likely to trigger a
> number of problems for people.

No, because legacy drivers won't use BIG TCP.

dev->tso_max_size will be <= 65536 for them.

Look at netif_set_gso_ipv6_max_size() logic.
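
Roughly, it does this (sketch, modulo the exact body in the series):

static inline void netif_set_gso_ipv6_max_size(struct net_device *dev,
					       unsigned int size)
{
	/* user input is capped by the driver limit (tso_max_size) */
	size = min(size, dev->tso_max_size);
	WRITE_ONCE(dev->gso_ipv6_max_size, size);
}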

>
> The primary reason gso_max_size was added was because there are cases
> out there where doing too big of a TSO was breaking things. For
> devices that are being used for LSOv2 I highly doubt they need to
> worry about cases less than 65536. As such they can just max out at
> 65536 for all non-IPv6 traffic and instead use gso_max_size as the
> limit for the IPv6/TSO case.

I think we disagree completely.


* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 22:16           ` Alexander Duyck
  2022-05-06 22:25             ` Eric Dumazet
@ 2022-05-06 22:26             ` Jakub Kicinski
  2022-05-06 22:46               ` Alexander Duyck
  1 sibling, 1 reply; 47+ messages in thread
From: Jakub Kicinski @ 2022-05-06 22:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li

On Fri, 6 May 2022 15:16:21 -0700 Alexander Duyck wrote:
> On Fri, May 6, 2022 at 2:50 PM Eric Dumazet <edumazet@google.com> wrote:
> > gso_max_size can not exceed GSO_MAX_SIZE.
> > This will break many drivers.
> > I do not want to change hundreds of them.  
> 
> Most drivers will not be impacted because they cannot exceed
> tso_max_size. The tso_max_size is the limit, not GSO_MAX_SIZE. Last I
> knew, this patch set was overwriting that value to increase it beyond
> the legacy limits.
> 
> Right now the check is:
> if (max_size > GSO_MAX_SIZE || max_size > dev->tso_max_size)
> 
> What I am suggesting is that tso_max_size be used as the only limit,
> which is already defaulted to cap out at TSO_LEGACY_MAX_SIZE. So just
> remove the "max_size > GSO_MAX_SIZE ||" portion of the call. Then when
> you call netif_set_tso_max_size in the driver to enable jumbograms you
> are good to set gso_max_size to something larger than the standard
> 65536.

TBH that was my expectation as well.

Drivers should not pay any attention to dev->gso_* any longer.

> > Look, we chose this implementation so that chances of breaking things
> > are very small.
> > I understand this is frustrating, but I suggest you take the
> > responsibility for breaking things,
> > rather than putting it on us.
> 
> What I have been trying to point out is your patch set will break things.
> 
> For all those cases out there where people are using gso_max_size to
> limit things, you just poked a hole in that for IPv6. What I am
> suggesting is that we don't do that, as it is likely to trigger a
> number of problems for people.
> 
> The primary reason gso_max_size was added was because there are cases
> out there where doing too big of a TSO was breaking things. For
> devices that are being used for LSOv2 I highly doubt they need to
> worry about cases less than 65536. As such they can just max out at
> 65536 for all non-IPv6 traffic and instead use gso_max_size as the
> limit for the IPv6/TSO case.

Good point. GSO limit is expected to be a cap, so we shouldn't go above
it. At the same time, there's nothing wrong with IPv4 continuing to
generate 64k GSOs after the user raises the limit.


* Re: [PATCH v4 net-next 10/12] veth: enable BIG TCP packets
  2022-05-06 15:30 ` [PATCH v4 net-next 10/12] veth: " Eric Dumazet
@ 2022-05-06 22:33   ` Jakub Kicinski
  0 siblings, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2022-05-06 22:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, Paolo Abeni, netdev, Coco Li, Eric Dumazet

On Fri,  6 May 2022 08:30:46 -0700 Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Set the driver limit to 512 KB per TSO ipv6 packet.
> 
> This allows the admin/user to set a GSO ipv6 limit up to this value.
> 
> ip link set dev veth10 gso_ipv6_max_size 200000
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  drivers/net/veth.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index f474e79a774580e4cb67da44b5f0c796c3ce8abb..989248b0f0c64349494a54735bb5abac66a42a93 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1647,6 +1647,7 @@ static void veth_setup(struct net_device *dev)
>  	dev->hw_features = VETH_FEATURES;
>  	dev->hw_enc_features = VETH_FEATURES;
>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
> +	netif_set_tso_max_size(dev, 512 * 1024);

Should we have a define for a "good SW device limit"?
Or set it to infinity (TSO_MAX_SIZE)?

>  }


* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-06 15:30 ` [PATCH v4 net-next 12/12] mlx5: " Eric Dumazet
@ 2022-05-06 22:34   ` Jakub Kicinski
  2022-05-07  0:32     ` Eric Dumazet
  0 siblings, 1 reply; 47+ messages in thread
From: Jakub Kicinski @ 2022-05-06 22:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, netdev, Coco Li, Eric Dumazet,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Fri,  6 May 2022 08:30:48 -0700 Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
> 
> mlx5 supports LSOv2.
> 
> IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> with JUMBO TLV for big packets.
> 
> We need to ignore/skip this HBH header when populating TX descriptor.
> 
> Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> 
> v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y

In file included from ../include/linux/string.h:253,
                 from ../arch/x86/include/asm/page_32.h:22,
                 from ../arch/x86/include/asm/page.h:14,
                 from ../arch/x86/include/asm/processor.h:19,
                 from ../arch/x86/include/asm/timex.h:5,
                 from ../include/linux/timex.h:65,
                 from ../include/linux/time32.h:13,
                 from ../include/linux/time.h:60,
                 from ../include/linux/skbuff.h:15,
                 from ../include/linux/tcp.h:17,
                 from ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:33:
In function ‘fortify_memcpy_chk’,
    inlined from ‘mlx5e_insert_vlan’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:104:2,
    inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:404:5:
../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
  328 |                         __write_overflow_field(p_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘fortify_memcpy_chk’,
    inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
  328 |                         __write_overflow_field(p_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘fortify_memcpy_chk’,
    inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
  328 |                         __write_overflow_field(p_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


* Re: [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE
  2022-05-06 22:26             ` Jakub Kicinski
@ 2022-05-06 22:46               ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2022-05-06 22:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li

On Fri, May 6, 2022 at 3:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 6 May 2022 15:16:21 -0700 Alexander Duyck wrote:
> > On Fri, May 6, 2022 at 2:50 PM Eric Dumazet <edumazet@google.com> wrote:
> > > gso_max_size can not exceed GSO_MAX_SIZE.
> > > This will break many drivers.
> > > I do not want to change hundreds of them.
> >
> > Most drivers will not be impacted because they cannot exceed
> > tso_max_size. The tso_max_size is the limit, not GSO_MAX_SIZE. Last I
> > knew, this patch set was overwriting that value to increase it beyond
> > the legacy limits.
> >
> > Right now the check is:
> > if (max_size > GSO_MAX_SIZE || max_size > dev->tso_max_size)
> >
> > What I am suggesting is that tso_max_size be used as the only limit,
> > which is already defaulted to cap out at TSO_LEGACY_MAX_SIZE. So just
> > remove the "max_size > GSO_MAX_SIZE ||" portion of the call. Then when
> > you call netif_set_tso_max_size in the driver to enable jumbograms you
> > are good to set gso_max_size to something larger than the standard
> > 65536.
>
> TBH that was my expectation as well.
>
> Drivers should not pay any attention to dev->gso_* any longer.

Right. However, it looks like a few protocol spots do need to be
addressed, as SCTP and FCoE appear to access it raw without any
wrappers. So there will be more work than what I called out, since
those would probably need to be wrapped in a min() call against the
legacy max.
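
Something along these lines at each raw reader (untested sketch;
gso_max is a local):

	/* clamp raw users of gso_max_size to the legacy 64K limit */
	gso_max = min_t(u32, READ_ONCE(dev->gso_max_size), GSO_MAX_SIZE);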

I can take a look at generating the patches if we really want to go
down that route, but it will take me a day or two to get it coded up
as I don't have a ton of free time to work on side projects.


* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-06 22:34   ` Jakub Kicinski
@ 2022-05-07  0:32     ` Eric Dumazet
  2022-05-07  1:54       ` Jakub Kicinski
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-07  0:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev, Coco Li,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri,  6 May 2022 08:30:48 -0700 Eric Dumazet wrote:
> > From: Coco Li <lixiaoyan@google.com>
> >
> > mlx5 supports LSOv2.
> >
> > IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> > with JUMBO TLV for big packets.
> >
> > We need to ignore/skip this HBH header when populating TX descriptor.
> >
> > Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> > layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> >
> > v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> > v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
>
> In file included from ../include/linux/string.h:253,
>                  from ../arch/x86/include/asm/page_32.h:22,
>                  from ../arch/x86/include/asm/page.h:14,
>                  from ../arch/x86/include/asm/processor.h:19,
>                  from ../arch/x86/include/asm/timex.h:5,
>                  from ../include/linux/timex.h:65,
>                  from ../include/linux/time32.h:13,
>                  from ../include/linux/time.h:60,
>                  from ../include/linux/skbuff.h:15,
>                  from ../include/linux/tcp.h:17,
>                  from ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:33:
> In function ‘fortify_memcpy_chk’,
>     inlined from ‘mlx5e_insert_vlan’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:104:2,
>     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:404:5:
> ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
>   328 |                         __write_overflow_field(p_size_field, size);
>       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> In function ‘fortify_memcpy_chk’,
>     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
>   328 |                         __write_overflow_field(p_size_field, size);
>       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> In function ‘fortify_memcpy_chk’,
>     inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
> ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
>   328 |                         __write_overflow_field(p_size_field, size);
>       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I guess these warnings show up before this BIG TCP patch?

I do not see any struct_group() being used in mlx5

May I ask which compiler is used here, and what CONFIG_ option needs to be set?

Thanks.


* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  0:32     ` Eric Dumazet
@ 2022-05-07  1:54       ` Jakub Kicinski
  2022-05-07  1:54         ` Jakub Kicinski
                           ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Jakub Kicinski @ 2022-05-07  1:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev, Coco Li,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Kees Cook

On Fri, 6 May 2022 17:32:43 -0700 Eric Dumazet wrote:
> On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri,  6 May 2022 08:30:48 -0700 Eric Dumazet wrote:  
> > > From: Coco Li <lixiaoyan@google.com>
> > >
> > > mlx5 supports LSOv2.
> > >
> > > IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> > > with JUMBO TLV for big packets.
> > >
> > > We need to ignore/skip this HBH header when populating TX descriptor.
> > >
> > > Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> > > layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> > >
> > > v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> > > v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y  
> >
> > In file included from ../include/linux/string.h:253,
> >                  from ../arch/x86/include/asm/page_32.h:22,
> >                  from ../arch/x86/include/asm/page.h:14,
> >                  from ../arch/x86/include/asm/processor.h:19,
> >                  from ../arch/x86/include/asm/timex.h:5,
> >                  from ../include/linux/timex.h:65,
> >                  from ../include/linux/time32.h:13,
> >                  from ../include/linux/time.h:60,
> >                  from ../include/linux/skbuff.h:15,
> >                  from ../include/linux/tcp.h:17,
> >                  from ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:33:
> > In function ‘fortify_memcpy_chk’,
> >     inlined from ‘mlx5e_insert_vlan’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:104:2,
> >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:404:5:
> > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> >   328 |                         __write_overflow_field(p_size_field, size);
> >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > In function ‘fortify_memcpy_chk’,
> >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> >   328 |                         __write_overflow_field(p_size_field, size);
> >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > In function ‘fortify_memcpy_chk’,
> >     inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
> > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> >   328 |                         __write_overflow_field(p_size_field, size);
> >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  
> 
> I guess these warnings show up before this BIG TCP patch?
> 
> I do not see any struct_group() being used in mlx5
> 
> May I ask which compiler is used here, and what CONFIG_ option needs to be set?
> 
> Thanks.

Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
cleanly. Gotta be the new W=1 field overflow warnings, let's bother
Kees.

I believe this is the code in question:

@@ -379,15 +393,36 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,

+		u8 *start = eseg->inline_hdr.start;
+
+		if (unlikely(attr->hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			if (skb_vlan_tag_present(skb)) {
+				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));

Unhappiness #1 ^^^

Where mlx5e_insert_vlan() is:

static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
{
	struct vlan_ethhdr *vhdr = (struct vlan_ethhdr *)start;
	int cpy1_sz = 2 * ETH_ALEN;
	int cpy2_sz = ihs - cpy1_sz;

	memcpy(&vhdr->addrs, skb->data, cpy1_sz);
	vhdr->h_vlan_proto = skb->vlan_proto;
	vhdr->h_vlan_TCI = cpu_to_be16(skb_vlan_tag_get(skb));
	memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz);
}

indeed ihs == ETH_HLEN + sizeof(*h6) will make cpy2_sz come out as something
much bigger than the vhdr->h_vlan_encapsulated_proto field.
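
Plugging in the sizes for this path (no VLAN tag in skb->data):

	ihs     = ETH_HLEN + sizeof(struct ipv6hdr)	/* 14 + 40 = 54 */
	cpy1_sz = 2 * ETH_ALEN				/* 12 */
	cpy2_sz = ihs - cpy1_sz				/* 42 */

so fortify sees a 42-byte memcpy() into h_vlan_encapsulated_proto,
a 2-byte __be16 field.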

+				ihs += VLAN_HLEN;
+				h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
+			} else {
+				memcpy(start, skb->data, ETH_HLEN + sizeof(*h6));

Unhappiness #2 ^^^

Again, ETH_HLEN + sizeof(*h6) will be larger than eseg->inline_hdr.start
which is what start is pointing at.

+				h6 = (struct ipv6hdr *)(start + ETH_HLEN);
+			}

I didn't look where #3 is.


* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  1:54       ` Jakub Kicinski
@ 2022-05-07  1:54         ` Jakub Kicinski
  2022-05-07  2:10         ` Eric Dumazet
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2022-05-07  1:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev, Coco Li,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Kees Cook

On Fri, 6 May 2022 18:54:05 -0700 Jakub Kicinski wrote:
> Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds

s/our/your/


* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  1:54       ` Jakub Kicinski
  2022-05-07  1:54         ` Jakub Kicinski
@ 2022-05-07  2:10         ` Eric Dumazet
  2022-05-07  2:37           ` Jakub Kicinski
  2022-05-07  6:57         ` Kees Cook
  2022-05-07  7:46         ` Kees Cook
  3 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-07  2:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev, Coco Li,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Kees Cook

On Fri, May 6, 2022 at 6:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 6 May 2022 17:32:43 -0700 Eric Dumazet wrote:
> > On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Fri,  6 May 2022 08:30:48 -0700 Eric Dumazet wrote:
> > > > From: Coco Li <lixiaoyan@google.com>
> > > >
> > > > mlx5 supports LSOv2.
> > > >
> > > > IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> > > > with JUMBO TLV for big packets.
> > > >
> > > > We need to ignore/skip this HBH header when populating TX descriptor.
> > > >
> > > > Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> > > > layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> > > >
> > > > v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> > > > v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
> > >
> > > In file included from ../include/linux/string.h:253,
> > >                  from ../arch/x86/include/asm/page_32.h:22,
> > >                  from ../arch/x86/include/asm/page.h:14,
> > >                  from ../arch/x86/include/asm/processor.h:19,
> > >                  from ../arch/x86/include/asm/timex.h:5,
> > >                  from ../include/linux/timex.h:65,
> > >                  from ../include/linux/time32.h:13,
> > >                  from ../include/linux/time.h:60,
> > >                  from ../include/linux/skbuff.h:15,
> > >                  from ../include/linux/tcp.h:17,
> > >                  from ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:33:
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5e_insert_vlan’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:104:2,
> > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:404:5:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > I guess these warnings show up before this BIG TCP patch ?
> >
> > I do not see any struct_group() being used in mlx5
> >
> > May I ask which compiler is used here, and what CONFIG_ option needs to be set ?
> >
> > Thanks.
>
> Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
> cleanly. Gotta be the new W=1 field overflow warnings, let's bother
> Kees.

Note that inline_hdr.start is a 2 byte array.

Obviously mlx5 driver copies more than 2 bytes of inlined headers.

mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs)
is called already with attr->ihs > 2

So it should already complain ?

static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
{
   struct vlan_ethhdr *vhdr = (struct vlan_ethhdr *)start;
   int cpy1_sz = 2 * ETH_ALEN;
   int cpy2_sz = ihs - cpy1_sz;

    memcpy(&vhdr->addrs, skb->data, cpy1_sz);
    vhdr->h_vlan_proto = skb->vlan_proto;
    vhdr->h_vlan_TCI = cpu_to_be16(skb_vlan_tag_get(skb));
    memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz,
cpy2_sz);  // Here, more than 2 bytes are copied already
}
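
For reference, the destination field being written here is the tiny
inline-header window of the mlx5 eth segment; a paraphrased sketch of
the relevant layout (illustrative, not the verbatim kernel definition):

struct sketch_wqe_eth_seg {
	__be16	inline_hdr_sz;		/* total length of inlined headers */
	u8	inline_hdr_start[2];	/* the copy deliberately continues
					 * past this 2-byte field into the
					 * data segments that follow */
};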

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  2:10         ` Eric Dumazet
@ 2022-05-07  2:37           ` Jakub Kicinski
  2022-05-07  2:43             ` Eric Dumazet
  2022-05-07  7:23             ` Kees Cook
  0 siblings, 2 replies; 47+ messages in thread
From: Jakub Kicinski @ 2022-05-07  2:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev, Coco Li,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Kees Cook

On Fri, 6 May 2022 19:10:48 -0700 Eric Dumazet wrote:
> On Fri, May 6, 2022 at 6:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
> > cleanly. Gotta be the new W=1 field overflow warnings, let's bother
> > Kees.  
> 
> Note that inline_hdr.start is a 2 byte array.
> 
> Obviously mlx5 driver copies more than 2 bytes of inlined headers.
> 
> mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs)
> is called already with attr->ihs > 2
> 
> So it should already complain ?

It's a static checker, I presume it ignores attr->ihs because 
it can't prove its value is indeed > 2. Unpleasant :/

> static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
> {
>    struct vlan_ethhdr *vhdr = (struct vlan_ethhdr *)start;
>    int cpy1_sz = 2 * ETH_ALEN;
>    int cpy2_sz = ihs - cpy1_sz;
> 
>     memcpy(&vhdr->addrs, skb->data, cpy1_sz);
>     vhdr->h_vlan_proto = skb->vlan_proto;
>     vhdr->h_vlan_TCI = cpu_to_be16(skb_vlan_tag_get(skb));
>     memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz,
> cpy2_sz);  // Here, more than 2 bytes are copied already
> }


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  2:37           ` Jakub Kicinski
@ 2022-05-07  2:43             ` Eric Dumazet
  2022-05-07  7:16               ` Kees Cook
  2022-05-07  7:23             ` Kees Cook
  1 sibling, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-07  2:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, netdev, Coco Li,
	Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Kees Cook

On Fri, May 6, 2022 at 7:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 6 May 2022 19:10:48 -0700 Eric Dumazet wrote:
> > On Fri, May 6, 2022 at 6:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
> > > cleanly. Gotta be the new W=1 field overflow warnings, let's bother
> > > Kees.
> >
> > Note that inline_hdr.start is a 2 byte array.
> >
> > Obviously mlx5 driver copies more than 2 bytes of inlined headers.
> >
> > mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs)
> > is called already with attr->ihs > 2
> >
> > So it should already complain ?
>
> It's a static checker, I presume it ignores attr->ihs because
> it can't prove its value is indeed > 2. Unpleasant :/

Well, the unpleasant thing is that I do not see a way to get rid of
this warning.
Networking is full of variable sized headers.

>
> > static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
> > {
> >    struct vlan_ethhdr *vhdr = (struct vlan_ethhdr *)start;
> >    int cpy1_sz = 2 * ETH_ALEN;
> >    int cpy2_sz = ihs - cpy1_sz;
> >
> >     memcpy(&vhdr->addrs, skb->data, cpy1_sz);
> >     vhdr->h_vlan_proto = skb->vlan_proto;
> >     vhdr->h_vlan_TCI = cpu_to_be16(skb_vlan_tag_get(skb));
> >     memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz,
> > cpy2_sz);  // Here, more than 2 bytes are copied already
> > }
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  1:54       ` Jakub Kicinski
  2022-05-07  1:54         ` Jakub Kicinski
  2022-05-07  2:10         ` Eric Dumazet
@ 2022-05-07  6:57         ` Kees Cook
  2022-05-07  7:46         ` Kees Cook
  3 siblings, 0 replies; 47+ messages in thread
From: Kees Cook @ 2022-05-07  6:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Fri, May 06, 2022 at 06:54:05PM -0700, Jakub Kicinski wrote:
> On Fri, 6 May 2022 17:32:43 -0700 Eric Dumazet wrote:
> > On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Fri,  6 May 2022 08:30:48 -0700 Eric Dumazet wrote:  
> > > > From: Coco Li <lixiaoyan@google.com>
> > > >
> > > > mlx5 supports LSOv2.
> > > >
> > > > IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> > > > with JUMBO TLV for big packets.
> > > >
> > > > We need to ignore/skip this HBH header when populating TX descriptor.
> > > >
> > > > Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> > > > layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
> > > >
> > > > v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
> > > > v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y  
> > >
> > > In file included from ../include/linux/string.h:253,
> > >                  from ../arch/x86/include/asm/page_32.h:22,
> > >                  from ../arch/x86/include/asm/page.h:14,
> > >                  from ../arch/x86/include/asm/processor.h:19,
> > >                  from ../arch/x86/include/asm/timex.h:5,
> > >                  from ../include/linux/timex.h:65,
> > >                  from ../include/linux/time32.h:13,
> > >                  from ../include/linux/time.h:60,
> > >                  from ../include/linux/skbuff.h:15,
> > >                  from ../include/linux/tcp.h:17,
> > >                  from ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:33:
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5e_insert_vlan’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:104:2,
> > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:404:5:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  
> > 
> > I guess these warnings show up before this BIG TCP patch ?
> > 
> > I do not see any struct_group() being used in mlx5
> > 
> > May I ask which compiler is used here, and what CONFIG_ option needs to be set ?
> > 
> > Thanks.
> 
> Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
> cleanly. Gotta be the new W=1 field overflow warnings, let's bother
> Kees.

Hello!

These aren't from W=1. The read overflows are hidden behind W=1. I
imagine this is due to gcc getting smarter and being able to introspect
the possible values of ihs during inlining.
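
A minimal sketch of the pattern that trips it (illustrative names, not
the driver's code): under CONFIG_FORTIFY_SOURCE, a memcpy() whose
destination is a fixed-size struct field warns once the compiler can
prove the length exceeds that field:

struct demo {
	u8 start[2];	/* stands in for inline_hdr.start */
	u8 rest[62];
};

static void demo_copy(struct demo *d, const void *src, u16 len)
{
	/* If GCC's value-range tracking proves len > 2 at an inlined
	 * call site, fortify_memcpy_chk() calls
	 * __write_overflow_field(): the copy is bounded by the 2-byte
	 * field, not by the whole struct. */
	memcpy(d->start, src, len);
}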

> I believe this is the code in question:
> 
> @@ -379,15 +393,36 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> 
> +		u8 *start = eseg->inline_hdr.start;
> +
> +		if (unlikely(attr->hopbyhop)) {
> +			/* remove the HBH header.
> +			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
> +			 */
> +			if (skb_vlan_tag_present(skb)) {
> +				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
> 
> Unhappiness #1 ^^^
> 
> Where mlx5e_insert_vlan() is:
> 
> static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
> {
> 	struct vlan_ethhdr *vhdr = (struct vlan_ethhdr *)start;
> 	int cpy1_sz = 2 * ETH_ALEN;
> 	int cpy2_sz = ihs - cpy1_sz;

Why are these "int"? Seems like they should be u16?

> 
> 	memcpy(&vhdr->addrs, skb->data, cpy1_sz);
               ^^^^^ this line was actually fixed earlier.

> 	vhdr->h_vlan_proto = skb->vlan_proto;
> 	vhdr->h_vlan_TCI = cpu_to_be16(skb_vlan_tag_get(skb));
> 	memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz);
               ^^^^^
This one, though, is the new problem. The lack of annotation in the
struct made me miss it -- this code is asking the compiler to
potentially copy beyond the end of the struct declaration. If this is
intentional, I could suggest a solution, but ...

> }
> 
> indeed ihs == ETH_HLEN + sizeof(*h6) will make cpy2_sz come out as something
> much bigger than the vhdr->h_vlan_encapsulated_proto field.

It sounds like it's not. In which case, I would ask: "what validates the
size of ihs?" because neither I nor the compiler can see it. :P If
nothing validates it, then this looks like a potential heap overflow,
though I haven't studied how these are laid out in memory. Maybe it's
harmless, but I never assume that. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  2:43             ` Eric Dumazet
@ 2022-05-07  7:16               ` Kees Cook
  0 siblings, 0 replies; 47+ messages in thread
From: Kees Cook @ 2022-05-07  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Kicinski, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Fri, May 06, 2022 at 07:43:13PM -0700, Eric Dumazet wrote:
> On Fri, May 6, 2022 at 7:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 6 May 2022 19:10:48 -0700 Eric Dumazet wrote:
> > > On Fri, May 6, 2022 at 6:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
> > > > cleanly. Gotta be the new W=1 field overflow warnings, let's bother
> > > > Kees.
> > >
> > > Note that inline_hdr.start is a 2 byte array.
> > >
> > > Obviously mlx5 driver copies more than 2 bytes of inlined headers.
> > >
> > > mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs)
> > > is called already with attr->ihs > 2
> > >
> > > So it should already complain ?
> >
> > It's a static checker, I presume it ignores attr->ihs because
> > it can't prove its value is indeed > 2. Unpleasant :/
> 
> Well, the unpleasant thing is that I do not see a way to get rid of
> this warning.
> Networking is full of variable sized headers.

So... this _is_ supposed to be copying off the end of struct vlan_ethhdr?
In that case, either don't use the vhdr cast, or add a flex array to
the end of the header. e.g. (untested):

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2dc48406cd08..990476b2e595 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -94,13 +94,18 @@ static inline u16 mlx5e_calc_min_inline(enum mlx5_inline_modes mode,
 static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
 {
 	struct vlan_ethhdr *vhdr = (struct vlan_ethhdr *)start;
-	int cpy1_sz = 2 * ETH_ALEN;
-	int cpy2_sz = ihs - cpy1_sz;
+	void *data = skb->data;
+	const u16 cpy1_sz = sizeof(vhdr->addrs);
+	const u16 cpy2_sz = sizeof(vhdr->h_vlan_encapsulated_proto);
+	const u16 cpy3_sz = ihs - cpy1_sz - cpy2_sz;
 
-	memcpy(&vhdr->addrs, skb->data, cpy1_sz);
+	memcpy(&vhdr->addrs, data, cpy1_sz);
+	data += cpy1_sz;
 	vhdr->h_vlan_proto = skb->vlan_proto;
 	vhdr->h_vlan_TCI = cpu_to_be16(skb_vlan_tag_get(skb));
-	memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz);
+	memcpy(&vhdr->h_vlan_encapsulated_proto, data, cpy2_sz);
+	data += cpy2_sz;
+	memcpy(&vhdr->h_vlan_contents, data, cpy3_sz);
 }
 
 static inline void
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 2be4dd7e90a9..8178e20ce5b3 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -44,6 +44,7 @@ struct vlan_hdr {
  *	@h_vlan_proto: ethernet protocol
  *	@h_vlan_TCI: priority and VLAN ID
  *	@h_vlan_encapsulated_proto: packet type ID or len
+ *	@h_vlan_contents: The rest of the packet
  */
 struct vlan_ethhdr {
 	struct_group(addrs,
@@ -53,6 +54,7 @@ struct vlan_ethhdr {
 	__be16		h_vlan_proto;
 	__be16		h_vlan_TCI;
 	__be16		h_vlan_encapsulated_proto;
+	u8		h_vlan_contents[];
 };
 
 #include <linux/skbuff.h>


I'm still learning the skb helpers, but shouldn't this be using something
similar to skb_pull() that would do bounds checking, etc? Open-coded
accesses of skb->data have shown a repeated pattern of being a source
of flaws:
https://github.com/KSPP/linux/issues/140

And speaking to the existing code, even if skb->data were
bounds-checked, what are the bounds of "start"?

-- 
Kees Cook

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  2:37           ` Jakub Kicinski
  2022-05-07  2:43             ` Eric Dumazet
@ 2022-05-07  7:23             ` Kees Cook
  1 sibling, 0 replies; 47+ messages in thread
From: Kees Cook @ 2022-05-07  7:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Fri, May 06, 2022 at 07:37:34PM -0700, Jakub Kicinski wrote:
> On Fri, 6 May 2022 19:10:48 -0700 Eric Dumazet wrote:
> > On Fri, May 6, 2022 at 6:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Without our patches drivers/net/ethernet/mellanox/mlx5/core/ builds
> > > cleanly. Gotta be the new W=1 field overflow warnings, let's bother
> > > Kees.  
> > 
> > Note that inline_hdr.start is a 2 byte array.
> > 
> > Obviously mlx5 driver copies more than 2 bytes of inlined headers.
> > 
> > mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs)
> > is called already with attr->ihs > 2
> > 
> > So it should already complain ?
> 
> It's a static checker, I presume it ignores attr->ihs because 
> it can't prove its value is indeed > 2. Unpleasant :/

I think it's actually the reverse. GCC keeps getting better about tracking
potential variable value ranges. In this case it thinks ihs WILL be > 2.
And this is bumping up against the kernel's lack of "intentional overflow"
annotations in source (i.e. structure layouts). But we can't protect
against unintentional overflow unless we've been able to explicitly
describe to the compiler what is intended.
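
For the first memcpy() that was fixed earlier, the annotation looks
like this; a sketch showing how struct_group() describes an intentional
multi-field copy to the compiler (the addrs group is already in
mainline's struct vlan_ethhdr):

struct vlan_ethhdr {
	struct_group(addrs,
		unsigned char	h_dest[ETH_ALEN];
		unsigned char	h_source[ETH_ALEN];
	);
	__be16		h_vlan_proto;
	__be16		h_vlan_TCI;
	__be16		h_vlan_encapsulated_proto;
};

/* memcpy(&vhdr->addrs, skb->data, 2 * ETH_ALEN) now targets the
 * 12-byte annotated group, so FORTIFY checks the group size instead
 * of warning about a copy past h_dest. */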

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  1:54       ` Jakub Kicinski
                           ` (2 preceding siblings ...)
  2022-05-07  6:57         ` Kees Cook
@ 2022-05-07  7:46         ` Kees Cook
  2022-05-07 11:19           ` Eric Dumazet
  3 siblings, 1 reply; 47+ messages in thread
From: Kees Cook @ 2022-05-07  7:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Fri, May 06, 2022 at 06:54:05PM -0700, Jakub Kicinski wrote:
> On Fri, 6 May 2022 17:32:43 -0700 Eric Dumazet wrote:
> > On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ah, my old friend, inline_hdr.start. Looks a lot like another one I fixed
earlier in ad5185735f7d ("net/mlx5e: Avoid field-overflowing memcpy()"):

        if (attr->ihs) {
                if (skb_vlan_tag_present(skb)) {
                        eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
                        mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
                        stats->added_vlan_packets++;
                } else {
                        eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
                        memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                }
                dseg += wqe_attr->ds_cnt_inl;

This is actually two regions, 2 bytes in eseg and everything else in
dseg. Splitting the memcpy() will work:

	memcpy(eseg->inline_hdr.start, skb->data, sizeof(eseg->inline_hdr.start));
	memcpy(dseg, skb->data + sizeof(eseg->inline_hdr.start), ihs - sizeof(eseg->inline_hdr.start));

But this begs the question, what is validating that ihs - 2 is equal to
wqe_attr->ds_cnt_inl * sizeof(*dseg)?

And how is wqe bounds checked?


> > > In function ‘fortify_memcpy_chk’,
> > >     inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
> > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > >   328 |                         __write_overflow_field(p_size_field, size);
> > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  

And moar inline_hdr.start:

	if (attr.ihs) {
		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
		dseg += wqe_attr.ds_cnt_inl;
	}

again, a split:

	memcpy(eseg->inline_hdr.start, skb->data, sizeof(eseg->inline_hdr.start));
	eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
	memcpy(dseg, skb->data + sizeof(eseg->inline_hdr.start), ihs - sizeof(eseg->inline_hdr.start));
	dseg += wqe_attr.ds_cnt_inl;

And the same bounds questions come up.

It'd be really nice to get some kind of generalized "copy out of
skb->data with bounds checking that may likely all get reduced to
constant checks".
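
Something like this hypothetical helper (not an existing API, just a
sketch of the shape I mean) would give both sides of the copy an
explicit bound:

static inline int skb_copy_hdrs_checked(void *dst, size_t dst_len,
					const struct sk_buff *skb,
					size_t len)
{
	if (unlikely(len > dst_len))
		return -E2BIG;
	/* skb_copy_bits() bounds-checks the skb side and copes with
	 * non-linear data; for compile-time-constant dst_len and len
	 * the checks should reduce to constants. */
	return skb_copy_bits(skb, 0, dst, len);
}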

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07  7:46         ` Kees Cook
@ 2022-05-07 11:19           ` Eric Dumazet
  2022-05-09  8:05             ` David Laight
  2022-05-09 23:20             ` Kees Cook
  0 siblings, 2 replies; 47+ messages in thread
From: Eric Dumazet @ 2022-05-07 11:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jakub Kicinski, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Sat, May 7, 2022 at 12:46 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, May 06, 2022 at 06:54:05PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 May 2022 17:32:43 -0700 Eric Dumazet wrote:
> > > On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > In function ‘fortify_memcpy_chk’,
> > > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> > > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > > >   328 |                         __write_overflow_field(p_size_field, size);
> > > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Ah, my old friend, inline_hdr.start. Looks a lot like another one I fixed
> earlier in ad5185735f7d ("net/mlx5e: Avoid field-overflowing memcpy()"):
>
>         if (attr->ihs) {
>                 if (skb_vlan_tag_present(skb)) {
>                         eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
>                         mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
>                         stats->added_vlan_packets++;
>                 } else {
>                         eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
>                         memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 }
>                 dseg += wqe_attr->ds_cnt_inl;
>
> This is actually two regions, 2 bytes in eseg and everything else in
> dseg. Splitting the memcpy() will work:
>
>         memcpy(eseg->inline_hdr.start, skb->data, sizeof(eseg->inline_hdr.start));
>         memcpy(dseg, skb->data + sizeof(eseg->inline_hdr.start), ihs - sizeof(eseg->inline_hdr.start));
>
> But this begs the question, what is validating that ihs - 2 is equal to
> wqe_attr->ds_cnt_inl * sizeof(*dseg)?
>
> And how is wqe bounds checked?

Look at the definition of struct mlx5i_tx_wqe

Then mlx5i_sq_calc_wqe_attr() computes the number of ds_cnt units
(16-byte granularity) needed.

Then look at mlx5e_txqsq_get_next_pi()

I doubt a compiler can infer that the driver is correct.

Basically this is a variable-length structure, quite common in the NIC
world, given that the number of DMA descriptors can vary from 1 to XX
and the headers are of variable size. (Typically, fast NICs want the
headers inlined in the TX descriptor.)
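
A sketch of the shape (illustrative field names, not the mlx5
definitions):

struct sketch_data_seg {		/* one 16-byte ds_cnt unit */
	__be32	byte_count;
	__be32	lkey;
	__be64	addr;
};

struct sketch_tx_wqe {
	__be16	inline_hdr_sz;		/* length of inlined headers */
	u8	inline_hdr_start[2];	/* headers overflow into dseg[] */
	struct sketch_data_seg dseg[];	/* ds_cnt entries: the rest of
					 * the inlined headers, then the
					 * DMA pointers */
};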


>
>
> > > > In function ‘fortify_memcpy_chk’,
> > > >     inlined from ‘mlx5i_sq_xmit’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:962:4:
> > > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > > >   328 |                         __write_overflow_field(p_size_field, size);
> > > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> And moar inline_hdr.start:
>
>         if (attr.ihs) {
>                 memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
>                 eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
>                 dseg += wqe_attr.ds_cnt_inl;
>         }
>
> again, a split:
>
>         memcpy(eseg->inline_hdr.start, skb->data, sizeof(eseg->inline_hdr.start));
>         eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
>         memcpy(dseg, skb->data + sizeof(eseg->inline_hdr.start), ihs - sizeof(eseg->inline_hdr.start));
>         dseg += wqe_attr.ds_cnt_inl;
>
> And the same bounds questions come up.
>
> It'd be really nice to get some kind of generalized "copy out of
> skb->data with bounds checking that may likely all get reduced to
> constant checks".


NIC drivers send millions of packets per second.
We cannot really afford to copy each component of a frame one byte at a time.

The memcpy() here can typically copy IPv6 header (40 bytes) + TCP
header (up to 60 bytes), plus more headers if encapsulation is added.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07 11:19           ` Eric Dumazet
@ 2022-05-09  8:05             ` David Laight
  2022-05-09 23:20             ` Kees Cook
  1 sibling, 0 replies; 47+ messages in thread
From: David Laight @ 2022-05-09  8:05 UTC (permalink / raw)
  To: 'Eric Dumazet', Kees Cook
  Cc: Jakub Kicinski, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

From: Eric Dumazet
> Sent: 07 May 2022 12:19
....
> NIC drivers send millions of packets per second.
> We cannot really afford to copy each component of a frame one byte at a time.

Any run-time checks on memcpy() are also going to kill performance.

The 'user copy hardening' tests already have a measurable
effect on system calls like sendmsg().

	David
 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series
  2022-05-06 21:22     ` Eric Dumazet
  2022-05-06 22:01       ` Alexander Duyck
@ 2022-05-09 18:17       ` Alexander Duyck
  2022-05-09 18:17         ` [PATCH 1/2] net: Allow gso_max_size to exceed 65536 Alexander Duyck
                           ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Alexander Duyck @ 2022-05-09 18:17 UTC (permalink / raw)
  To: edumazet
  Cc: alexander.duyck, davem, eric.dumazet, kuba, lixiaoyan, netdev, pabeni

This patch set is meant to replace patches 2 and 7 in the Big TCP series.
From what I can tell it looks like they can just be dropped from the series
and these two patches could be added to the end of the set.

With these patches I have verified that both the loopback and mlx5 drivers
are able to send and receive IPv6 jumbogram frames when configured with a
g[sr]o_max_size value larger than 64K.

Note I had to make one minor change to iproute2 to allow submitting a value
larger than 64K in that I removed a check that was limiting gso_max_size to
no more than 65536. In the future an alternative might be to fetch the
IFLA_TSO_MAX_SIZE attribute if it exists and use that, and if not then use
65536 as the limit.
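
(A sketch of that alternative on the iproute2 side, untested and with
hypothetical placement; rta_getattr_u32() and invarg() are existing
iproute2 helpers:)

	__u32 tso_max = 65536;	/* legacy limit as the fallback */

	if (tb[IFLA_TSO_MAX_SIZE])
		tso_max = rta_getattr_u32(tb[IFLA_TSO_MAX_SIZE]);

	if (gso_max_size > tso_max)
		invarg("Invalid \"gso_max_size\" value", *argv);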

---

Alexander Duyck (2):
      net: Allow gso_max_size to exceed 65536
      net: Allow gro_max_size to exceed 65536


 drivers/net/ethernet/amd/xgbe/xgbe.h            |  3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |  2 +-
 drivers/net/ethernet/sfc/ef100_nic.c            |  3 ++-
 drivers/net/ethernet/sfc/falcon/tx.c            |  3 ++-
 drivers/net/ethernet/sfc/tx_common.c            |  3 ++-
 drivers/net/ethernet/synopsys/dwc-xlgmac.h      |  3 ++-
 drivers/net/hyperv/rndis_filter.c               |  2 +-
 drivers/scsi/fcoe/fcoe.c                        |  2 +-
 include/linux/netdevice.h                       |  6 ++++--
 include/net/ipv6.h                              |  2 +-
 net/bpf/test_run.c                              |  2 +-
 net/core/dev.c                                  |  7 ++++---
 net/core/gro.c                                  |  8 ++++++++
 net/core/rtnetlink.c                            | 10 +---------
 net/core/sock.c                                 |  4 ++++
 net/ipv4/tcp_bbr.c                              |  2 +-
 net/ipv4/tcp_output.c                           |  2 +-
 net/sctp/output.c                               |  3 ++-
 18 files changed, 40 insertions(+), 27 deletions(-)

--


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/2] net: Allow gso_max_size to exceed 65536
  2022-05-09 18:17       ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Alexander Duyck
@ 2022-05-09 18:17         ` Alexander Duyck
  2022-05-09 18:17         ` [PATCH 2/2] net: Allow gro_max_size " Alexander Duyck
  2022-05-09 18:54         ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Eric Dumazet
  2 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2022-05-09 18:17 UTC (permalink / raw)
  To: edumazet
  Cc: alexander.duyck, davem, eric.dumazet, kuba, lixiaoyan, netdev, pabeni

From: Alexander Duyck <alexanderduyck@fb.com>

The code for gso_max_size was originally added to allow for debugging
and working around buggy devices that couldn't support TSO with blocks
64K in size. The original reason for limiting it to 64K was that it
matched the existing limits of the IPv4 and non-jumbogram IPv6 length
fields.

With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.

So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe.h            |    3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    2 +-
 drivers/net/ethernet/sfc/ef100_nic.c            |    3 ++-
 drivers/net/ethernet/sfc/falcon/tx.c            |    3 ++-
 drivers/net/ethernet/sfc/tx_common.c            |    3 ++-
 drivers/net/ethernet/synopsys/dwc-xlgmac.h      |    3 ++-
 drivers/net/hyperv/rndis_filter.c               |    2 +-
 drivers/scsi/fcoe/fcoe.c                        |    2 +-
 include/linux/netdevice.h                       |    3 ++-
 net/bpf/test_run.c                              |    2 +-
 net/core/dev.c                                  |    5 +++--
 net/core/rtnetlink.c                            |    2 +-
 net/core/sock.c                                 |    4 ++++
 net/ipv4/tcp_bbr.c                              |    2 +-
 net/ipv4/tcp_output.c                           |    2 +-
 net/sctp/output.c                               |    3 ++-
 16 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe.h b/drivers/net/ethernet/amd/xgbe/xgbe.h
index 607a2c90513b..d9547552ceef 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe.h
@@ -151,7 +151,8 @@
 #define XGBE_TX_MAX_BUF_SIZE	(0x3fff & ~(64 - 1))
 
 /* Descriptors required for maximum contiguous TSO/GSO packet */
-#define XGBE_TX_MAX_SPLIT	((GSO_MAX_SIZE / XGBE_TX_MAX_BUF_SIZE) + 1)
+#define XGBE_TX_MAX_SPLIT	\
+	((GSO_LEGACY_MAX_SIZE / XGBE_TX_MAX_BUF_SIZE) + 1)
 
 /* Maximum possible descriptors needed for an SKB:
  * - Maximum number of SKB frags
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index fb11081001a0..838870bc6dbd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2038,7 +2038,7 @@ mlx5e_hw_gro_skb_has_enough_space(struct sk_buff *skb, u16 data_bcnt)
 {
 	int nr_frags = skb_shinfo(skb)->nr_frags;
 
-	return PAGE_SIZE * nr_frags + data_bcnt <= GSO_MAX_SIZE;
+	return PAGE_SIZE * nr_frags + data_bcnt <= GRO_MAX_SIZE;
 }
 
 static void
diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c
index a69d756e09b9..b2536d2c218a 100644
--- a/drivers/net/ethernet/sfc/ef100_nic.c
+++ b/drivers/net/ethernet/sfc/ef100_nic.c
@@ -1008,7 +1008,8 @@ static int ef100_process_design_param(struct efx_nic *efx,
 		}
 		return 0;
 	case ESE_EF100_DP_GZ_TSO_MAX_PAYLOAD_LEN:
-		nic_data->tso_max_payload_len = min_t(u64, reader->value, GSO_MAX_SIZE);
+		nic_data->tso_max_payload_len = min_t(u64, reader->value,
+						      GSO_LEGACY_MAX_SIZE);
 		netif_set_tso_max_size(efx->net_dev,
 				       nic_data->tso_max_payload_len);
 		return 0;
diff --git a/drivers/net/ethernet/sfc/falcon/tx.c b/drivers/net/ethernet/sfc/falcon/tx.c
index f7306e93a8b8..b9369483758c 100644
--- a/drivers/net/ethernet/sfc/falcon/tx.c
+++ b/drivers/net/ethernet/sfc/falcon/tx.c
@@ -98,7 +98,8 @@ unsigned int ef4_tx_max_skb_descs(struct ef4_nic *efx)
 	/* Possibly more for PCIe page boundaries within input fragments */
 	if (PAGE_SIZE > EF4_PAGE_SIZE)
 		max_descs += max_t(unsigned int, MAX_SKB_FRAGS,
-				   DIV_ROUND_UP(GSO_MAX_SIZE, EF4_PAGE_SIZE));
+				   DIV_ROUND_UP(GSO_LEGACY_MAX_SIZE,
+						EF4_PAGE_SIZE));
 
 	return max_descs;
 }
diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c
index 9bc8281b7f5b..658ea2d34070 100644
--- a/drivers/net/ethernet/sfc/tx_common.c
+++ b/drivers/net/ethernet/sfc/tx_common.c
@@ -416,7 +416,8 @@ unsigned int efx_tx_max_skb_descs(struct efx_nic *efx)
 	/* Possibly more for PCIe page boundaries within input fragments */
 	if (PAGE_SIZE > EFX_PAGE_SIZE)
 		max_descs += max_t(unsigned int, MAX_SKB_FRAGS,
-				   DIV_ROUND_UP(GSO_MAX_SIZE, EFX_PAGE_SIZE));
+				   DIV_ROUND_UP(GSO_LEGACY_MAX_SIZE,
+						EFX_PAGE_SIZE));
 
 	return max_descs;
 }
diff --git a/drivers/net/ethernet/synopsys/dwc-xlgmac.h b/drivers/net/ethernet/synopsys/dwc-xlgmac.h
index 98e3a271e017..a848e10f3ea4 100644
--- a/drivers/net/ethernet/synopsys/dwc-xlgmac.h
+++ b/drivers/net/ethernet/synopsys/dwc-xlgmac.h
@@ -38,7 +38,8 @@
 #define XLGMAC_RX_DESC_MAX_DIRTY	(XLGMAC_RX_DESC_CNT >> 3)
 
 /* Descriptors required for maximum contiguous TSO/GSO packet */
-#define XLGMAC_TX_MAX_SPLIT	((GSO_MAX_SIZE / XLGMAC_TX_MAX_BUF_SIZE) + 1)
+#define XLGMAC_TX_MAX_SPLIT	\
+	((GSO_LEGACY_MAX_SIZE / XLGMAC_TX_MAX_BUF_SIZE) + 1)
 
 /* Maximum possible descriptors needed for a SKB */
 #define XLGMAC_TX_MAX_DESC_NR	(MAX_SKB_FRAGS + XLGMAC_TX_MAX_SPLIT + 2)
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 866af2cc27a3..6da36cb8af80 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1349,7 +1349,7 @@ static int rndis_netdev_set_hwcaps(struct rndis_device *rndis_device,
 	struct net_device_context *net_device_ctx = netdev_priv(net);
 	struct ndis_offload hwcaps;
 	struct ndis_offload_params offloads;
-	unsigned int gso_max_size = GSO_MAX_SIZE;
+	unsigned int gso_max_size = GSO_LEGACY_MAX_SIZE;
 	int ret;
 
 	/* Find HW offload capabilities */
diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 44ca6110213c..79b2827e4081 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -667,7 +667,7 @@ static void fcoe_netdev_features_change(struct fc_lport *lport,
 
 	if (netdev->features & NETIF_F_FSO) {
 		lport->seq_offload = 1;
-		lport->lso_max = netdev->gso_max_size;
+		lport->lso_max = min(netdev->gso_max_size, GSO_LEGACY_MAX_SIZE);
 		FCOE_NETDEV_DBG(netdev, "Supports LSO for max len 0x%x\n",
 				lport->lso_max);
 	} else {
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8cf0ac616cb9..da063cb37759 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2262,7 +2262,8 @@ struct net_device {
 	const struct rtnl_link_ops *rtnl_link_ops;
 
 	/* for setting kernel sock attribute on TCP connection setup */
-#define GSO_MAX_SIZE		65536
+#define GSO_LEGACY_MAX_SIZE	65536u
+#define GSO_MAX_SIZE		UINT_MAX
 	unsigned int		gso_max_size;
 #define TSO_LEGACY_MAX_SIZE	65536
 #define TSO_MAX_SIZE		UINT_MAX
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 8d54fef9a568..9b5a1f630bb0 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1001,7 +1001,7 @@ static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb)
 		cb->pkt_len = skb->len;
 	} else {
 		if (__skb->wire_len < skb->len ||
-		    __skb->wire_len > GSO_MAX_SIZE)
+		    __skb->wire_len > GSO_LEGACY_MAX_SIZE)
 			return -EINVAL;
 		cb->pkt_len = __skb->wire_len;
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index f036ccb61da4..a1bbe000953f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2998,7 +2998,8 @@ EXPORT_SYMBOL(netif_set_real_num_queues);
  * @size:	max skb->len of a TSO frame
  *
  * Set the limit on the size of TSO super-frames the device can handle.
- * Unless explicitly set the stack will assume the value of %GSO_MAX_SIZE.
+ * Unless explicitly set the stack will assume the value of
+ * %GSO_LEGACY_MAX_SIZE.
  */
 void netif_set_tso_max_size(struct net_device *dev, unsigned int size)
 {
@@ -10602,7 +10603,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 
 	dev_net_set(dev, &init_net);
 
-	dev->gso_max_size = GSO_MAX_SIZE;
+	dev->gso_max_size = GSO_LEGACY_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
 	dev->gro_max_size = GRO_MAX_SIZE;
 	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 512ed661204e..c5b44de41088 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2809,7 +2809,7 @@ static int do_setlink(const struct sk_buff *skb,
 	if (tb[IFLA_GSO_MAX_SIZE]) {
 		u32 max_size = nla_get_u32(tb[IFLA_GSO_MAX_SIZE]);
 
-		if (max_size > GSO_MAX_SIZE || max_size > dev->tso_max_size) {
+		if (max_size > dev->tso_max_size) {
 			err = -EINVAL;
 			goto errout;
 		}
diff --git a/net/core/sock.c b/net/core/sock.c
index 6b287eb5427b..f7c3171078b6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2312,6 +2312,10 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
 			sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
+			if (sk->sk_gso_max_size > GSO_LEGACY_MAX_SIZE &&
+			    (!IS_ENABLED(CONFIG_IPV6) || sk->sk_family != AF_INET6 ||
+			     !sk_is_tcp(sk) || ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)))
+				sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
 			sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
 			max_segs = max_t(u32, READ_ONCE(dst->dev->gso_max_segs), 1);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index c7d30a3bbd81..075e744bfb48 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -310,7 +310,7 @@ static u32 bbr_tso_segs_goal(struct sock *sk)
 	 */
 	bytes = min_t(unsigned long,
 		      sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
-		      GSO_MAX_SIZE - 1 - MAX_TCP_HEADER);
+		      GSO_LEGACY_MAX_SIZE - 1 - MAX_TCP_HEADER);
 	segs = max_t(u32, bytes / tp->mss_cache, bbr_min_tso_segs(sk));
 
 	return min(segs, 0x7FU);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b092228e4342..b4b2284ed4a2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1553,7 +1553,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	 * SO_SNDBUF values.
 	 * Also allow first and last skb in retransmit queue to be split.
 	 */
-	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
+	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_LEGACY_MAX_SIZE);
 	if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
 		     tcp_queue != TCP_FRAG_IN_WRITE_QUEUE &&
 		     skb != tcp_rtx_queue_head(sk) &&
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 72fe6669c50d..a63df055ac57 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -134,7 +134,8 @@ void sctp_packet_config(struct sctp_packet *packet, __u32 vtag,
 		dst_hold(tp->dst);
 		sk_setup_caps(sk, tp->dst);
 	}
-	packet->max_size = sk_can_gso(sk) ? READ_ONCE(tp->dst->dev->gso_max_size)
+	packet->max_size = sk_can_gso(sk) ? min(READ_ONCE(tp->dst->dev->gso_max_size),
+						GSO_LEGACY_MAX_SIZE)
 					  : asoc->pathmtu;
 	rcu_read_unlock();
 }



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 2/2] net: Allow gro_max_size to exceed 65536
  2022-05-09 18:17       ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Alexander Duyck
  2022-05-09 18:17         ` [PATCH 1/2] net: Allow gso_max_size to exceed 65536 Alexander Duyck
@ 2022-05-09 18:17         ` Alexander Duyck
  2022-05-09 18:54         ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Eric Dumazet
  2 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2022-05-09 18:17 UTC (permalink / raw)
  To: edumazet
  Cc: alexander.duyck, davem, eric.dumazet, kuba, lixiaoyan, netdev, pabeni

From: Alexander Duyck <alexanderduyck@fb.com>

Allow gro_max_size to exceed 65536.

There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16-bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    2 +-
 include/linux/netdevice.h                       |    3 ++-
 include/net/ipv6.h                              |    2 +-
 net/core/dev.c                                  |    2 +-
 net/core/gro.c                                  |    8 ++++++++
 net/core/rtnetlink.c                            |    8 --------
 6 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 838870bc6dbd..24de37b79f5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2038,7 +2038,7 @@ mlx5e_hw_gro_skb_has_enough_space(struct sk_buff *skb, u16 data_bcnt)
 {
 	int nr_frags = skb_shinfo(skb)->nr_frags;
 
-	return PAGE_SIZE * nr_frags + data_bcnt <= GRO_MAX_SIZE;
+	return PAGE_SIZE * nr_frags + data_bcnt <= GRO_LEGACY_MAX_SIZE;
 }
 
 static void
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index da063cb37759..b78c41e664bd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2151,7 +2151,8 @@ struct net_device {
 	struct bpf_prog __rcu	*xdp_prog;
 	unsigned long		gro_flush_timeout;
 	int			napi_defer_hard_irqs;
-#define GRO_MAX_SIZE		65536
+#define GRO_LEGACY_MAX_SIZE	65536u
+#define GRO_MAX_SIZE		UINT_MAX
 	unsigned int		gro_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index b6df0314aa02..5b38bf1a586b 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -477,7 +477,7 @@ static inline int ipv6_has_hopopt_jumbo(const struct sk_buff *skb)
 	const struct hop_jumbo_hdr *jhdr;
 	const struct ipv6hdr *nhdr;
 
-	if (likely(skb->len <= GRO_MAX_SIZE))
+	if (likely(skb->len <= GRO_LEGACY_MAX_SIZE))
 		return 0;
 
 	if (skb->protocol != htons(ETH_P_IPV6))
diff --git a/net/core/dev.c b/net/core/dev.c
index a1bbe000953f..7349f75891d5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10605,7 +10605,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 
 	dev->gso_max_size = GSO_LEGACY_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
-	dev->gro_max_size = GRO_MAX_SIZE;
+	dev->gro_max_size = GRO_LEGACY_MAX_SIZE;
 	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
 	dev->tso_max_segs = TSO_MAX_SEGS;
 	dev->upper_level = 1;
diff --git a/net/core/gro.c b/net/core/gro.c
index 78110edf5d4b..b4190eb08467 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -167,6 +167,14 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
 		return -E2BIG;
 
+	if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
+		if (p->protocol != htons(ETH_P_IPV6) ||
+		    skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
+		    ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
+		    p->encapsulation)
+			return -E2BIG;
+	}
+
 	lp = NAPI_GRO_CB(p)->last;
 	pinfo = skb_shinfo(lp);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index c5b44de41088..15b1b3092a7f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2347,14 +2347,6 @@ static int validate_linkmsg(struct net_device *dev, struct nlattr *tb[],
 		}
 	}
 
-	if (tb[IFLA_GRO_MAX_SIZE]) {
-		u32 gro_max_size = nla_get_u32(tb[IFLA_GRO_MAX_SIZE]);
-
-		if (gro_max_size > GRO_MAX_SIZE) {
-			NL_SET_ERR_MSG(extack, "too big gro_max_size");
-			return -EINVAL;
-		}
-	}
 	return 0;
 }
 



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series
  2022-05-09 18:17       ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Alexander Duyck
  2022-05-09 18:17         ` [PATCH 1/2] net: Allow gso_max_size to exceed 65536 Alexander Duyck
  2022-05-09 18:17         ` [PATCH 2/2] net: Allow gro_max_size " Alexander Duyck
@ 2022-05-09 18:54         ` Eric Dumazet
  2022-05-09 20:21           ` Alexander H Duyck
  2 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-09 18:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Eric Dumazet, Jakub Kicinski, Coco Li, netdev, Paolo Abeni

On Mon, May 9, 2022 at 11:17 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> This patch set is meant to replace patches 2 and 7 in the Big TCP series.
> From what I can tell it looks like they can just be dropped from the series
> and these two patches could be added to the end of the set.
>
> With these patches I have verified that both the loopback and mlx5 drivers
> are able to send and receive IPv6 jumbogram frames when configured with a
> g[sr]o_max_size value larger than 64K.
>
> Note I had to make one minor change to iproute2 to allow submitting a value
> larger than 64K in that I removed a check that was limiting gso_max_size to
> no more than 65536. In the future an alternative might be to fetch the
> IFLA_TSO_MAX_SIZE attribute if it exists and use that, and if not then use
> 65536 as the limit.

OK, thanks.

My remarks are :

1) Adding these enablers at the end of the series will not be
bisection friendly.

2) Lots more changes, and more backport conflicts for us.

I do not care really, it seems you absolutely hate the new attributes,
I can live with that,
but honestly this makes the BIG TCP patch series quite invasive.


>
> ---
>
> Alexander Duyck (2):
>       net: Allow gso_max_size to exceed 65536
>       net: Allow gro_max_size to exceed 65536
>
>
>  drivers/net/ethernet/amd/xgbe/xgbe.h            |  3 ++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |  2 +-
>  drivers/net/ethernet/sfc/ef100_nic.c            |  3 ++-
>  drivers/net/ethernet/sfc/falcon/tx.c            |  3 ++-
>  drivers/net/ethernet/sfc/tx_common.c            |  3 ++-
>  drivers/net/ethernet/synopsys/dwc-xlgmac.h      |  3 ++-
>  drivers/net/hyperv/rndis_filter.c               |  2 +-
>  drivers/scsi/fcoe/fcoe.c                        |  2 +-
>  include/linux/netdevice.h                       |  6 ++++--
>  include/net/ipv6.h                              |  2 +-
>  net/bpf/test_run.c                              |  2 +-
>  net/core/dev.c                                  |  7 ++++---
>  net/core/gro.c                                  |  8 ++++++++
>  net/core/rtnetlink.c                            | 10 +---------
>  net/core/sock.c                                 |  4 ++++
>  net/ipv4/tcp_bbr.c                              |  2 +-
>  net/ipv4/tcp_output.c                           |  2 +-
>  net/sctp/output.c                               |  3 ++-
>  18 files changed, 40 insertions(+), 27 deletions(-)
>
> --
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series
  2022-05-09 18:54         ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Eric Dumazet
@ 2022-05-09 20:21           ` Alexander H Duyck
  2022-05-09 20:31             ` Eric Dumazet
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander H Duyck @ 2022-05-09 20:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Eric Dumazet, Jakub Kicinski, Coco Li, netdev, Paolo Abeni

On Mon, 2022-05-09 at 11:54 -0700, Eric Dumazet wrote:
> On Mon, May 9, 2022 at 11:17 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > 
> > This patch set is meant to replace patches 2 and 7 in the Big TCP series.
> > From what I can tell it looks like they can just be dropped from the series
> > and these two patches could be added to the end of the set.
> > 
> > With these patches I have verified that both the loopback and mlx5 drivers
> > are able to send and receive IPv6 jumbogram frames when configured with a
> > g[sr]o_max_size value larger than 64K.
> > 
> > Note I had to make one minor change to iproute2 to allow submitting a value
> > larger than 64K in that I removed a check that was limiting gso_max_size to
> > no more than 65536. In the future an alternative might be to fetch the
> > IFLA_TSO_MAX_SIZE attribute if it exists and use that, and if not then use
> > 65536 as the limit.
> 
> OK, thanks.
> 
> My remarks are :
> 
> 1) Adding these enablers at the end of the series will not be
> bisection friendly.

They don't have to be added at the end, but essentially they could be
drop-in replacements for the two patches called out. I just called it
out that way as that is what I ended up doing in order to test the
patches, and to make it easier to just send them as a pair instead of
sending the entire set. I moved them to the end of the list and was
swapping between the 2 sets in my testing. I was able to reorder them
without any issues. So if you wanted you could place these two patches
as patches 2 and 7 in your series.

> 2) Lots more changes, and more backport conflicts for us.
> 
> I do not care really, it seems you absolutely hate the new attributes,
> I can live with that,
> but honestly this makes the BIG TCP patch series quite invasive.

As it stands the BIG TCP patch series breaks things since it is
outright overriding the gso_max_size value in the case of IPv6/TCP
frames. As I mentioned before this is going to force people to have to
update scripts if they are reducing gso_max_size as they would also now
need to update gso_ipv6_max_size.

It makes much more sense to me to allow people to push up the value
from 64K to whatever value it is you want to allow for the IPv6/TCP GSO
and then just cap the protocols if they cannot support it.

As far as the backport/kcompat work goes, it should be pretty
straightforward. You just replace the references in the driver to GSO_MAX_SIZE
with GSO_LEGACY_MAX_SIZE and then do a check in a header file somewhere
via #ifndef and if it doesn't exist you define it.
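
For example, a compat shim along these lines (a sketch; the header it
lives in is up to the backport tree):

/* kcompat.h */
#ifndef GSO_LEGACY_MAX_SIZE
#define GSO_LEGACY_MAX_SIZE	65536u
#endif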


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series
  2022-05-09 20:21           ` Alexander H Duyck
@ 2022-05-09 20:31             ` Eric Dumazet
  2022-05-09 21:05               ` Alexander Duyck
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2022-05-09 20:31 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: David Miller, Eric Dumazet, Jakub Kicinski, Coco Li, netdev, Paolo Abeni

On Mon, May 9, 2022 at 1:22 PM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Mon, 2022-05-09 at 11:54 -0700, Eric Dumazet wrote:
> > On Mon, May 9, 2022 at 11:17 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > This patch set is meant to replace patches 2 and 7 in the Big TCP series.
> > > From what I can tell it looks like they can just be dropped from the series
> > > and these two patches could be added to the end of the set.
> > >
> > > With these patches I have verified that both the loopback and mlx5 drivers
> > > are able to send and receive IPv6 jumbogram frames when configured with a
> > > g[sr]o_max_size value larger than 64K.
> > >
> > > Note I had to make one minor change to iproute2 to allow submitting a value
> > > larger than 64K in that I removed a check that was limiting gso_max_size to
> > > no more than 65536. In the future an alternative might be to fetch the
> > > IFLA_TSO_MAX_SIZE attribute if it exists and use that, and if not then use
> > > 65536 as the limit.
> >
> > OK, thanks.
> >
> > My remarks are :
> >
> > 1) Adding these enablers at the end of the series will not be
> > bisection friendly.
>
> They don't have to be added at the end, but essentially they could be
> drop-in replacements for the two patches called out. I just called it
> out that way as that is what I ended up doing in order to test the
> patches, and to make it easier to just send them as a pair instead of
> sending the entire set. I moved them to the end of the list and was
> swapping between the 2 sets in my testing. I was able to reorder them
> without any issues. So if you wanted you could place these two patches
> as patches 2 and 7 in your series.
>
> > 2) Lots more changes, and more backport conflicts for us.
> >
> > I do not care really, it seems you absolutely hate the new attributes,
> > I can live with that,
> > but honestly this makes the BIG TCP patch series quite invasive.
>
> As it stands the BIG TCP patch series breaks things since it is
> > outright overriding the gso_max_size value in the case of IPv6/TCP
> frames. As I mentioned before this is going to force people to have to
> update scripts if they are reducing gso_max_size as they would also now
> need to update gso_ipv6_max_size.

If they never set gso_ipv6_max_size, they do not have to change it.
If they set it, well, they get what they wanted.
Also, the driver value caps gso_ipv6_max_size, so our patches broke nothing.

Some people could actually decide to limit IPv4 TSO packets to 40KB,
and yet limit IPv6 packets to 128KB.
Their choice.
Apparently you think this is not a valid choice.


>
> It makes much more sense to me to allow people to push up the value
> from 64K to whatever value it is you want to allow for the IPv6/TCP GSO
> and then just cap the protocols if they cannot support it.
>
> As far as the backport/kcompat work goes, it should be pretty
> straightforward. You just replace the references in the driver to GSO_MAX_SIZE
> with GSO_LEGACY_MAX_SIZE and then do a check in a header file somewhere
> via #ifndef and if it doesn't exist you define it.

Well, this is the kind of stuff that Intel loves to do in their
out-of-tree driver, which is kind of horrible.

Look, I will spend a few days rebasing and testing a new series
including your patches, no need to answer this email.

We will live with future merge conflicts, and errors, because you
wanted to change GSO_MAX_SIZE instead of making a clean change.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series
  2022-05-09 20:31             ` Eric Dumazet
@ 2022-05-09 21:05               ` Alexander Duyck
  0 siblings, 0 replies; 47+ messages in thread
From: Alexander Duyck @ 2022-05-09 21:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Eric Dumazet, Jakub Kicinski, Coco Li, netdev, Paolo Abeni

On Mon, May 9, 2022 at 1:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, May 9, 2022 at 1:22 PM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Mon, 2022-05-09 at 11:54 -0700, Eric Dumazet wrote:
> > > On Mon, May 9, 2022 at 11:17 AM Alexander Duyck
> > > <alexander.duyck@gmail.com> wrote:
> > > >
> > > > This patch set is meant to replace patches 2 and 7 in the Big TCP series.
> > > > From what I can tell it looks like they can just be dropped from the series
> > > > and these two patches could be added to the end of the set.
> > > >
> > > > With these patches I have verified that both the loopback and mlx5 drivers
> > > > are able to send and receive IPv6 jumbogram frames when configured with a
> > > > g[sr]o_max_size value larger than 64K.
> > > >
> > > > Note I had to make one minor change to iproute2 to allow submitting a value
> > > > larger than 64K: I removed a check that was limiting gso_max_size to
> > > > no more than 65536. In the future an alternative might be to fetch the
> > > > IFLA_TSO_MAX_SIZE attribute if it exists and use that, and if not then use
> > > > 65536 as the limit.
> > >
> > > OK, thanks.
> > >
> > > My remarks are:
> > >
> > > 1) Adding these enablers at the end of the series will not be
> > > bisection-friendly.
> >
> > They don't have to be added at the end, but essentially they could be
> > drop-in replacements for the two patches called out. I just called it
> > out that way as that is what I ended up doing in order to test the
> > patches, and to make it easier to just send them as a pair instead of
> > sending the entire set. I moved them to the end of the list and was
> > swapping between the 2 sets in my testing. I was able to reorder them
> > without any issues. So if you wanted you could place these two patches
> > as patches 2 and 7 in your series.
> >
> > > 2) Lots more changes, and more backport conflicts for us.
> > >
> > > I do not really care; it seems you absolutely hate the new attributes,
> > > and I can live with that, but honestly this makes the BIG TCP patch
> > > series quite invasive.
> >
> > As it stands the BIG TCP patch series breaks things since it is
> > outright overriding the gso_max_size value in the case of IPv6/TCP
> > frames. As I mentioned before this is going to force people to have to
> > update scripts if they are reducing gso_max_size as they would also now
> > need to update gso_ipv6_max_size.
>
> If they never set gso_ipv6_max_size, they do not have to change it.
> If they set it, well, they get what they wanted.
> Also, the driver value caps gso_ipv6_max_size, so our patches broke nothing.

I agree that the driver value caps it now that the patches from Jakub
are in. My concern is more that if they are reducing it to address some
other issue on their NIC, they now have to update two controls instead
of just one.

> Some people could actually decide to limit IPv4 TSO packets to 40KB,
> and yet limit IPv6 packets to 128KB.
> Their choice.
> Apparently you think this is not a valid choice.

That would be a perfectly valid choice, but limiting it at the NIC
doesn't make sense to me since the NIC is an L2 device and what you
are talking about doing is making modifications up at the L3 layer. It
might make more sense to associate something like that with either a
sysctl at the protocol layer, or maybe even as some sort of attribute
to associate with a routing destination.
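
For illustration, the sysctl flavor of that idea might look roughly like
this sketch (entirely hypothetical -- no such knob exists, and the names
are made up):

        static unsigned int ipv6_gso_max_size __read_mostly = 65536;

        static struct ctl_table ipv6_gso_table[] = {
                {
                        .procname     = "gso_max_size",
                        .data         = &ipv6_gso_max_size,
                        .maxlen       = sizeof(unsigned int),
                        .mode         = 0644,
                        .proc_handler = proc_douintvec,
                },
                { }
        };

        /* registered once at init time, e.g.:
         * register_net_sysctl(&init_net, "net/ipv6", ipv6_gso_table);
         */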

I would say the best comparison is device MTU and PMTU or MSS. The
device MTU is a hardware-specific value. Nothing larger than that gets
through that specific interface. The PMTU or MSS is what defines your
value from one end to another and is usually stored away in the
routing and/or socket layers. Quite often the PMTU or MSS is smaller
than the device MTU and is tuned in order to get optimal throughput to
the network destination.

> >
> > It makes much more sense to me to allow people to push up the value
> > from 64K to whatever value you want to allow for IPv6/TCP GSO
> > and then just cap the protocols if they cannot support it.
> >
> > As far as the backport/kcompat work goes, it should be pretty
> > straightforward. You just replace the references in the driver to
> > GSO_MAX_SIZE with GSO_LEGACY_MAX_SIZE and then do a check in a header
> > file somewhere via #ifndef; if it doesn't exist, you define it.
>
> Well, this is the kind of stuff that Intel loves to do in their
> out-of-tree driver, which is kind of horrible.
>
> Look, I will spend a few days rebasing and testing a new series
> including your patches, no need to answer this email.
>
> We will live with future merge conflicts, and with errors, because you
> wanted to change GSO_MAX_SIZE instead of making a clean change.

I appreciate all the effort you and the team at Google put into this,
and I am looking forward to seeing it accepted.

Thanks,

- Alex


* Re: [PATCH v4 net-next 12/12] mlx5: support BIG TCP packets
  2022-05-07 11:19           ` Eric Dumazet
  2022-05-09  8:05             ` David Laight
@ 2022-05-09 23:20             ` Kees Cook
  1 sibling, 0 replies; 47+ messages in thread
From: Kees Cook @ 2022-05-09 23:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Kicinski, Eric Dumazet, David S . Miller, Paolo Abeni,
	netdev, Coco Li, Tariq Toukan, Saeed Mahameed, Leon Romanovsky

On Sat, May 07, 2022 at 04:19:06AM -0700, Eric Dumazet wrote:
> On Sat, May 7, 2022 at 12:46 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Fri, May 06, 2022 at 06:54:05PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 May 2022 17:32:43 -0700 Eric Dumazet wrote:
> > > > On Fri, May 6, 2022 at 3:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > In function ‘fortify_memcpy_chk’,
> > > > >     inlined from ‘mlx5e_sq_xmit_wqe’ at ../drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:408:5:
> > > > > ../include/linux/fortify-string.h:328:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
> > > > >   328 |                         __write_overflow_field(p_size_field, size);
> > > > >       |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > Ah, my old friend, inline_hdr.start. Looks a lot like another one I fixed
> > earlier in ad5185735f7d ("net/mlx5e: Avoid field-overflowing memcpy()"):
> >
> >         if (attr->ihs) {
> >                 if (skb_vlan_tag_present(skb)) {
> >                         eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
> >                         mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
> >                         stats->added_vlan_packets++;
> >                 } else {
> >                         eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
> >                         memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
> >                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                 }
> >                 dseg += wqe_attr->ds_cnt_inl;
> >
> > This is actually two regions, 2 bytes in eseg and everything else in
> > dseg. Splitting the memcpy() will work:
> >
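> >         /* first, the 2 bytes that fit in eseg->inline_hdr.start: */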
> >         memcpy(eseg->inline_hdr.start, skb->data, sizeof(eseg->inline_hdr.start));
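> >         /* then the remaining ihs - 2 bytes, which land in dseg: */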
> >         memcpy(dseg, skb->data + sizeof(eseg->inline_hdr.start), ihs - sizeof(eseg->inline_hdr.start));
> >
> > But this raises the question: what is validating that ihs - 2 is
> > equal to wqe_attr->ds_cnt_inl * sizeof(*dseg)?
> >
> > And how is wqe bounds checked?
> 
> Look at the definition of struct mlx5i_tx_wqe.
> 
> Then mlx5i_sq_calc_wqe_attr() computes the number of ds_cnt units
> needed (16-byte granularity).
> 
> Then look at mlx5e_txqsq_get_next_pi().

Thanks! I'll study the paths.

> I doubt a compiler can infer that the driver is correct.

Agreed; this layering visibility is a bit strange to deal with. I'll see
if I can come up with a sane solution that doesn't split the memcpy but
establishes some way to do compile-time (or run-time) bounds checking.
If I can't, I suspect I'll have to create an "unsafe_memcpy" wrapper
that expressly ignores the structure layouts, etc. That's basically what
memcpy() currently is, so it's not a regression from that perspective.
I'd just prefer to find a way to refactor things so that the compiler
can actually help us do the bounds checking.
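
Such a wrapper might be as small as this sketch (only a sketch -- the
justification argument would exist purely as documentation):

        /* Like memcpy(), but opts out of the FORTIFY field-bounds
         * check; callers record why the cross-field copy is safe. */
        #define unsafe_memcpy(dst, src, bytes, justification) \
                __builtin_memcpy(dst, src, bytes)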

> Basically this is a variable-length structure, quite common in the NIC
> world, given that the number of DMA descriptors can vary from 1 to XX
> and the headers are of variable size. (Typically, fast NICs want the
> headers inlined in the TX descriptor.)
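
In flattened form, the kind of layout being described is roughly this
sketch (illustrative only -- not the actual mlx5 structures; u8 and
__be16 are the usual <linux/types.h> types):

        struct tx_desc {
                __be16 inline_hdr_sz;       /* total inlined header bytes */
                u8     inline_hdr_start[2]; /* first 2 header bytes       */
                u8     ds[][16];            /* remaining headers plus DMA
                                             * pointers in 16-byte units;
                                             * the count varies per packet */
        };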

Yup; most of the refactoring patches I've sent for the memcpy bounds
checking have been in networking. :) (But then, also, all the recent
security flaws with memcpy overflows have also been in networking,
so no real surprise, I guess.)

> NIC drivers send millions of packets per second.
> We cannot really afford copying each component of a frame one byte at a time.
> 
> The memcpy() here can typically copy the IPv6 header (40 bytes) plus the
> TCP header (up to 60 bytes), and more headers if encapsulation is added.

Right; I need to make sure this gets fixed without wrecking performance.
:)

-- 
Kees Cook


end of thread

Thread overview: 47+ messages
2022-05-06 15:30 [PATCH v4 net-next 00/12] tcp: BIG TCP implementation Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 01/12] net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 02/12] ipv6: add IFLA_GSO_IPV6_MAX_SIZE Eric Dumazet
2022-05-06 20:48   ` Alexander H Duyck
2022-05-06 21:20     ` Eric Dumazet
2022-05-06 21:37       ` Alexander Duyck
2022-05-06 21:50         ` Eric Dumazet
2022-05-06 22:16           ` Alexander Duyck
2022-05-06 22:25             ` Eric Dumazet
2022-05-06 22:26             ` Jakub Kicinski
2022-05-06 22:46               ` Alexander Duyck
2022-05-06 15:30 ` [PATCH v4 net-next 03/12] tcp_cubic: make hystart_ack_delay() aware of BIG TCP Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 04/12] ipv6: add struct hop_jumbo_hdr definition Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 05/12] ipv6/gso: remove temporary HBH/jumbo header Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 06/12] ipv6/gro: insert " Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 07/12] ipv6: add IFLA_GRO_IPV6_MAX_SIZE Eric Dumazet
2022-05-06 21:06   ` Alexander H Duyck
2022-05-06 21:22     ` Eric Dumazet
2022-05-06 22:01       ` Alexander Duyck
2022-05-06 22:08         ` Eric Dumazet
2022-05-09 18:17       ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Alexander Duyck
2022-05-09 18:17         ` [PATCH 1/2] net: Allow gso_max_size to exceed 65536 Alexander Duyck
2022-05-09 18:17         ` [PATCH 2/2] net: Allow gro_max_size " Alexander Duyck
2022-05-09 18:54         ` [PATCH 0/2] Replacements for patches 2 and 7 in Big TCP series Eric Dumazet
2022-05-09 20:21           ` Alexander H Duyck
2022-05-09 20:31             ` Eric Dumazet
2022-05-09 21:05               ` Alexander Duyck
2022-05-06 15:30 ` [PATCH v4 net-next 08/12] ipv6: Add hop-by-hop header to jumbograms in ip6_output Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 09/12] net: loopback: enable BIG TCP packets Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 10/12] veth: " Eric Dumazet
2022-05-06 22:33   ` Jakub Kicinski
2022-05-06 15:30 ` [PATCH v4 net-next 11/12] mlx4: support " Eric Dumazet
2022-05-06 15:30 ` [PATCH v4 net-next 12/12] mlx5: " Eric Dumazet
2022-05-06 22:34   ` Jakub Kicinski
2022-05-07  0:32     ` Eric Dumazet
2022-05-07  1:54       ` Jakub Kicinski
2022-05-07  1:54         ` Jakub Kicinski
2022-05-07  2:10         ` Eric Dumazet
2022-05-07  2:37           ` Jakub Kicinski
2022-05-07  2:43             ` Eric Dumazet
2022-05-07  7:16               ` Kees Cook
2022-05-07  7:23             ` Kees Cook
2022-05-07  6:57         ` Kees Cook
2022-05-07  7:46         ` Kees Cook
2022-05-07 11:19           ` Eric Dumazet
2022-05-09  8:05             ` David Laight
2022-05-09 23:20             ` Kees Cook
