* [PATCHv4 net-next 00/10] net: support ipv4 big tcp
@ 2023-01-28 15:58 Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
                   ` (12 more replies)
  0 siblings, 13 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This is similar to the BIG TCP patchset added by Eric for IPv6:

  https://lwn.net/Articles/895398/

Unlike IPv6, the IPv4 tot_len field is only 16 bits long, and the IPv4
header has no exthdrs (options) that could carry a BIG TCP packet's
length. To keep it simple, as David and Paolo suggested, we set IPv4
tot_len to 0 to indicate that this might be a BIG TCP packet and use
skb->len as the real IPv4 total length.

This works safely because all BIG TCP packets are GSO/GRO packets and
are processed on the same host where they were created. There is no
padding in GSO/GRO packets, so skb->len - network_offset is exactly the
IPv4 packet total length. Also, before the feature is implemented, all
the places that may read iph tot_len from BIG TCP packets are converted
to use some new APIs:

Patch 1 adds some APIs for setting and getting iph tot_len, which
Patches 2-7 use in all the places that IPv4 BIG TCP packets may reach.
Patch 8 adds a GSO_TCP tp_status for af_packet users, Patch 9 adds new
netlink attributes to make IPv4 BIG TCP configuration independent of
IPv6 BIG TCP, and Patch 10 implements the feature.

Note that changes similar to those in Patches 2-6 are also needed for
IPv6 BIG TCP packets and will be addressed in another patchset.

A similar performance test was run for IPv4 BIG TCP with a 25Gbit NIC
and 1.5K MTU:

No BIG TCP:
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
168          322          337          3776.49
143          236          277          4654.67
128          258          288          4772.83
171          229          278          4645.77
175          228          243          4678.93
149          239          279          4599.86
164          234          268          4606.94
155          276          289          4235.82
180          255          268          4418.95
168          241          249          4417.82

Enable BIG TCP:
ip link set dev ens1f0np0 gro_ipv4_max_size 128000 gso_ipv4_max_size 128000
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
161          241          252          4821.73
174          205          217          5098.28
167          208          220          5001.43
164          228          249          4883.98
150          233          249          4914.90
180          233          244          4819.66
154          208          219          5004.92
157          209          247          4999.78
160          218          246          4842.31
174          206          217          5080.99

Thanks for the feedback from Eric and David Ahern.

v1->v2:
  - remove the fixes and the selftest for IPv6 BIG TCP, will do it in
    another patchset.
  - add GSO_TCP for tp_status in packet sockets to tell the af_packet
    users that this is a TCP GSO packet in Patch 8.
  - also check skb_is_gso() when checking if it's a GSO TCP packet in
    Patch 1.
v2->v3:
  - add gso/gro_ipv4_max_size per device and netlink attributes for them
    in Patch 9, so that we can selectively enable BIG TCP for IPv6, and
    not for IPv4, as Eric required.
  - remove the selftest, as it requires userspace iproute2 change after
    making IPv4 BIG TCP independent from IPv6 BIG TCP on configuration.
v3->v4:
  - put gso/gro_ipv4_max_size close to other related fields, so that we
    do not need an extra cache line miss, as Eric suggested.
  - also check ipv6_addr_v4mapped() when reading gso(_ipv4)_max_size in
    sk_setup_caps(), as Eric noticed.

Xin Long (10):
  net: add a couple of helpers for iph tot_len
  bridge: use skb_ip_totlen in br netfilter
  openvswitch: use skb_ip_totlen in conntrack
  net: sched: use skb_ip_totlen and iph_totlen
  netfilter: use skb_ip_totlen and iph_totlen
  cipso_ipv4: use iph_set_totlen in skbuff_setattr
  ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
  packet: add TP_STATUS_GSO_TCP for tp_status
  net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  net: add support for ipv4 big tcp

 drivers/net/ipvlan/ipvlan_core.c           |  2 +-
 include/linux/ip.h                         | 21 ++++++++++++++
 include/linux/netdevice.h                  |  6 ++++
 include/net/netfilter/nf_tables_ipv4.h     |  4 +--
 include/net/route.h                        |  3 --
 include/uapi/linux/if_link.h               |  3 ++
 include/uapi/linux/if_packet.h             |  1 +
 net/bridge/br_netfilter_hooks.c            |  2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |  4 +--
 net/core/dev.c                             |  4 +++
 net/core/dev.h                             | 18 ++++++++++++
 net/core/gro.c                             | 12 ++++----
 net/core/rtnetlink.c                       | 33 ++++++++++++++++++++++
 net/core/sock.c                            | 26 +++++++++--------
 net/ipv4/af_inet.c                         |  7 +++--
 net/ipv4/cipso_ipv4.c                      |  2 +-
 net/ipv4/ip_input.c                        |  2 +-
 net/ipv4/ip_output.c                       |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c            |  2 +-
 net/netfilter/nf_log_syslog.c              |  2 +-
 net/netfilter/xt_length.c                  |  2 +-
 net/openvswitch/conntrack.c                |  2 +-
 net/packet/af_packet.c                     |  4 +++
 net/sched/act_ct.c                         |  2 +-
 net/sched/sch_cake.c                       |  2 +-
 25 files changed, 130 insertions(+), 38 deletions(-)

-- 
2.31.1



* [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-02-01 15:31   ` David Ahern
  2023-01-28 15:58 ` [PATCHv4 net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This patch adds three APIs to replace the setting and getting of
iph->tot_len in all the places that IPv4 BIG TCP packets may reach.
They will be used in the following patches.

Note that iph_totlen() is meant to be used when iph is not in the
linear data of the skb.
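
The intended behavior of the getter and setter can be modeled in
userspace roughly as below. This is only a sketch under stated
assumptions: struct fake_skb, its is_gso_tcp flag and the model_*
function names are hypothetical stand-ins for struct sk_buff, the
skb_is_gso() && skb_is_gso_tcp() check and the real helpers, and
tot_len is kept in host byte order so that ntohs()/htons() are left
out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IP_MAX_MTU 0xFFFFU	/* largest value the 16-bit tot_len can hold */

/* Hypothetical stand-in for the sk_buff state the helpers look at. */
struct fake_skb {
	uint32_t len;			/* models skb->len */
	uint32_t network_offset;	/* models skb_network_offset(skb) */
	bool is_gso_tcp;		/* models skb_is_gso() && skb_is_gso_tcp() */
	uint16_t tot_len;		/* models iph->tot_len, host byte order here */
};

/* Setter: a length that does not fit in 16 bits is stored as 0. */
static inline void model_iph_set_totlen(struct fake_skb *skb, uint32_t len)
{
	skb->tot_len = len <= MODEL_IP_MAX_MTU ? (uint16_t)len : 0;
}

/* Getter: trust tot_len unless it is 0 on a TCP GSO packet. */
static inline uint32_t model_iph_totlen(const struct fake_skb *skb)
{
	uint32_t len = skb->tot_len;

	return (len || !skb->is_gso_tcp) ? len : skb->len - skb->network_offset;
}
```

In this model, a 128000-byte packet stores tot_len as 0 and the getter
recovers the length from len - network_offset, while a 1500-byte
packet round-trips through tot_len unchanged.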

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/ip.h  | 21 +++++++++++++++++++++
 include/net/route.h |  3 ---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 3d9c6750af62..d11c25f5030a 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -35,4 +35,25 @@ static inline unsigned int ip_transport_len(const struct sk_buff *skb)
 {
 	return ntohs(ip_hdr(skb)->tot_len) - skb_network_header_len(skb);
 }
+
+static inline unsigned int iph_totlen(const struct sk_buff *skb, const struct iphdr *iph)
+{
+	u32 len = ntohs(iph->tot_len);
+
+	return (len || !skb_is_gso(skb) || !skb_is_gso_tcp(skb)) ?
+	       len : skb->len - skb_network_offset(skb);
+}
+
+static inline unsigned int skb_ip_totlen(const struct sk_buff *skb)
+{
+	return iph_totlen(skb, ip_hdr(skb));
+}
+
+/* IPv4 datagram length is stored into 16bit field (tot_len) */
+#define IP_MAX_MTU	0xFFFFU
+
+static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
+{
+	iph->tot_len = len <= IP_MAX_MTU ? htons(len) : 0;
+}
 #endif	/* _LINUX_IP_H */
diff --git a/include/net/route.h b/include/net/route.h
index 6e92dd5bcd61..fe00b0a2e475 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -35,9 +35,6 @@
 #include <linux/cache.h>
 #include <linux/security.h>
 
-/* IPv4 datagram length is stored into 16bit field (tot_len) */
-#define IP_MAX_MTU	0xFFFFU
-
 #define RTO_ONLINK	0x01
 
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
-- 
2.31.1



* [PATCHv4 net-next 02/10] bridge: use skb_ip_totlen in br netfilter
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-01-31 15:01   ` Nikolay Aleksandrov
  2023-01-28 15:58 ` [PATCHv4 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

These three places in bridge netfilter are called on the RX path after
GRO, so IPv4 TCP GSO packets may come through them. Replace the iph
tot_len accesses there with skb_ip_totlen().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/bridge/br_netfilter_hooks.c            | 2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index f20f4373ff40..b67c9c98effa 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -214,7 +214,7 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
 		goto csum_error;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < len) {
 		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index 5c5dd437f1c2..71056ee84773 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -212,7 +212,7 @@ static int nf_ct_br_ip_check(const struct sk_buff *skb)
 	    iph->version != 4)
 		return -1;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < nhoff + len ||
 	    len < (iph->ihl * 4))
                 return -1;
@@ -256,7 +256,7 @@ static unsigned int nf_ct_bridge_pre(void *priv, struct sk_buff *skb,
 		if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 			return NF_ACCEPT;
 
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		if (pskb_trim_rcsum(skb, len))
 			return NF_ACCEPT;
 
-- 
2.31.1



* [PATCHv4 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-02-01 13:29   ` Aaron Conole
  2023-01-28 15:58 ` [PATCHv4 net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

IPv4 GSO packets may get processed in ovs_skb_network_trim(),
so we need to use skb_ip_totlen() to get the iph tot_len there.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/openvswitch/conntrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c8b137649ca4..2172930b1f17 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1103,7 +1103,7 @@ static int ovs_skb_network_trim(struct sk_buff *skb)
 
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		break;
 	case htons(ETH_P_IPV6):
 		len = sizeof(struct ipv6hdr)
-- 
2.31.1



* [PATCHv4 net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (2 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 05/10] netfilter: " Xin Long
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

One action and one qdisc may process IPv4 TCP GSO packets and access
iph->tot_len. Replace those accesses with skb_ip_totlen() and
iph_totlen() accordingly.

Note that we don't need to replace the one in tcf_csum_ipv4(), as
tcf_csum_ipv4_tcp() returns early for TCP GSO packets.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sched/act_ct.c   | 2 +-
 net/sched/sch_cake.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 0ca2bb8ed026..d68bb5dbf0dc 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -707,7 +707,7 @@ static int tcf_ct_skb_network_trim(struct sk_buff *skb, int family)
 
 	switch (family) {
 	case NFPROTO_IPV4:
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		break;
 	case NFPROTO_IPV6:
 		len = sizeof(struct ipv6hdr)
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 3ed0c3342189..7970217b565a 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1209,7 +1209,7 @@ static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
 			    iph_check->daddr != iph->daddr)
 				continue;
 
-			seglen = ntohs(iph_check->tot_len) -
+			seglen = iph_totlen(skb, iph_check) -
 				       (4 * iph_check->ihl);
 		} else if (iph_check->version == 6) {
 			ipv6h = (struct ipv6hdr *)iph;
-- 
2.31.1



* [PATCHv4 net-next 05/10] netfilter: use skb_ip_totlen and iph_totlen
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (3 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

There are also quite a few places in netfilter that may process IPv4
TCP GSO packets, and we need to convert them too.

In length_mt(), we have to use u_int32_t/int to hold the return value
of skb_ip_totlen(), otherwise it may overflow and mismatch. This change
will also help us add a selftest for IPv4 BIG TCP in a following patch.
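
The mismatch can be reproduced with a small userspace sketch. The
function names and the sample range are illustrative assumptions, not
code from xt_length.c, and the 16-bit bounds are assumed to mirror the
min/max fields of struct xt_length_info. A 128000-byte length stored in
a 16-bit variable wraps to 62464 and lands inside a range it should not
match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old behavior: the packet length is narrowed to 16 bits before the compare. */
static inline bool match_len_u16(uint32_t totlen, uint16_t min, uint16_t max)
{
	uint16_t pktlen = (uint16_t)totlen;	/* 128000 wraps to 62464 */

	return pktlen >= min && pktlen <= max;
}

/* New behavior: keep the full 32-bit length, as skb_ip_totlen() returns it. */
static inline bool match_len_u32(uint32_t totlen, uint16_t min, uint16_t max)
{
	return totlen >= min && totlen <= max;
}
```

With a rule range of 60000-65000, the 16-bit variant wrongly matches a
128000-byte packet, while the 32-bit variant does not.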

Note that we don't need to replace the one in tcpmss_tg4(), as
tcpmss_mangle_packet() returns if there is data after the tcphdr. The
same applies to mangle_contents() in nf_nat_helper.c, which returns
false when skb->len + extra > 65535 in enlarge_skb().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/netfilter/nf_tables_ipv4.h | 4 ++--
 net/netfilter/ipvs/ip_vs_xmit.c        | 2 +-
 net/netfilter/nf_log_syslog.c          | 2 +-
 net/netfilter/xt_length.c              | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_tables_ipv4.h b/include/net/netfilter/nf_tables_ipv4.h
index 112708f7a6b4..947973623dc7 100644
--- a/include/net/netfilter/nf_tables_ipv4.h
+++ b/include/net/netfilter/nf_tables_ipv4.h
@@ -29,7 +29,7 @@ static inline int __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt)
 	if (iph->ihl < 5 || iph->version != 4)
 		return -1;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(pkt->skb, iph);
 	thoff = iph->ihl * 4;
 	if (pkt->skb->len < len)
 		return -1;
@@ -64,7 +64,7 @@ static inline int nft_set_pktinfo_ipv4_ingress(struct nft_pktinfo *pkt)
 	if (iph->ihl < 5 || iph->version != 4)
 		goto inhdr_error;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(pkt->skb, iph);
 	thoff = iph->ihl * 4;
 	if (pkt->skb->len < len) {
 		__IP_INC_STATS(nft_net(pkt), IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 029171379884..80448885c3d7 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -994,7 +994,7 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
 		old_dsfield = ipv4_get_dsfield(old_iph);
 		*ttl = old_iph->ttl;
 		if (payload_len)
-			*payload_len = ntohs(old_iph->tot_len);
+			*payload_len = skb_ip_totlen(skb);
 	}
 
 	/* Implement full-functionality option for ECN encapsulation */
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index cb894f0d63e9..c66689ad2b49 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -322,7 +322,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 
 	/* Max length: 46 "LEN=65535 TOS=0xFF PREC=0xFF TTL=255 ID=65535 " */
 	nf_log_buf_add(m, "LEN=%u TOS=0x%02X PREC=0x%02X TTL=%u ID=%u ",
-		       ntohs(ih->tot_len), ih->tos & IPTOS_TOS_MASK,
+		       iph_totlen(skb, ih), ih->tos & IPTOS_TOS_MASK,
 		       ih->tos & IPTOS_PREC_MASK, ih->ttl, ntohs(ih->id));
 
 	/* Max length: 6 "CE DF MF " */
diff --git a/net/netfilter/xt_length.c b/net/netfilter/xt_length.c
index 1873da3a945a..b3d623a52885 100644
--- a/net/netfilter/xt_length.c
+++ b/net/netfilter/xt_length.c
@@ -21,7 +21,7 @@ static bool
 length_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_length_info *info = par->matchinfo;
-	u_int16_t pktlen = ntohs(ip_hdr(skb)->tot_len);
+	u32 pktlen = skb_ip_totlen(skb);
 
 	return (pktlen >= info->min && pktlen <= info->max) ^ info->invert;
 }
-- 
2.31.1



* [PATCHv4 net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (4 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 05/10] netfilter: " Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

cipso_v4_skbuff_setattr() may process IPv4 TCP GSO packets, so the
iph->tot_len update there should use iph_set_totlen().

Note that for non-GSO packets, the new iph tot_len with the extra iph
option length added may become greater than 65535. The old code casts
it to 16 bits and sets iph->tot_len to the truncated value, which is a
bug. In theory, iph options shouldn't be added to such big packets
here, and a fix may be needed in the future. For now, this patch only
sets iph->tot_len to 0 when that happens.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/ipv4/cipso_ipv4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 6cd3b6c559f0..79ae7204e8ed 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -2222,7 +2222,7 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
 		memset((char *)(iph + 1) + buf_len, 0, opt_len - buf_len);
 	if (len_delta != 0) {
 		iph->ihl = 5 + (opt_len >> 2);
-		iph->tot_len = htons(skb->len);
+		iph_set_totlen(iph, skb->len);
 	}
 	ip_send_check(iph);
 
-- 
2.31.1



* [PATCHv4 net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (5 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-01-28 15:58 ` [PATCHv4 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status Xin Long
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

ipvlan devices call netif_inherit_tso_max() to get the tso_max_size/segs
from the lower device, so when the lower device supports BIG TCP, the
ipvlan devices support it too. Their iph tot_len access therefore needs
to be converted as well.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/ipvlan/ipvlan_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index bb1c298c1e78..460b3d4f2245 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -157,7 +157,7 @@ void *ipvlan_get_L3_hdr(struct ipvl_port *port, struct sk_buff *skb, int *type)
 			return NULL;
 
 		ip4h = ip_hdr(skb);
-		pktlen = ntohs(ip4h->tot_len);
+		pktlen = skb_ip_totlen(skb);
 		if (ip4h->ihl < 5 || ip4h->version != 4)
 			return NULL;
 		if (skb->len < pktlen || pktlen < (ip4h->ihl * 4))
-- 
2.31.1



* [PATCHv4 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (6 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-02-01 15:32   ` David Ahern
  2023-01-28 15:58 ` [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

Introduce TP_STATUS_GSO_TCP tp_status flag to tell the af_packet user
that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets in
tcpdump/libpcap, it can use tp_len as the IPv4 packet len when this
flag is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets.
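
A capture tool could pick the length along these lines. This is a
hedged sketch, not libpcap code: capture_ipv4_len() and its parameters
are hypothetical, tot_len is assumed to be already converted to host
byte order, and l2_len is an assumption for setups where tp_len also
counts a link-layer header (pass 0 if it does not).

```c
#include <stdint.h>

#ifndef TP_STATUS_GSO_TCP
#define TP_STATUS_GSO_TCP (1 << 8)	/* value added by this patch */
#endif

/*
 * Hypothetical helper for an af_packet user. tot_len is the IPv4 header
 * field in host byte order; l2_len is the link-layer header length that
 * the caller assumes is included in tp_len.
 */
static inline uint32_t capture_ipv4_len(uint16_t tot_len, uint32_t tp_status,
					uint32_t tp_len, uint32_t l2_len)
{
	if (tot_len == 0 && (tp_status & TP_STATUS_GSO_TCP) && tp_len >= l2_len)
		return tp_len - l2_len;	/* BIG TCP: derive the length from tp_len */

	return tot_len;			/* ordinary packet: use the header field */
}
```

The extra tot_len == 0 check is a choice made for this sketch, so that
ordinary TCP GSO packets with a valid tot_len keep using the header
field.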

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/if_packet.h | 1 +
 net/packet/af_packet.c         | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index a8516b3594a4..78c981d6a9d4 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -115,6 +115,7 @@ struct tpacket_auxdata {
 #define TP_STATUS_BLK_TMO		(1 << 5)
 #define TP_STATUS_VLAN_TPID_VALID	(1 << 6) /* auxdata has valid tp_vlan_tpid */
 #define TP_STATUS_CSUM_VALID		(1 << 7)
+#define TP_STATUS_GSO_TCP		(1 << 8)
 
 /* Tx ring - header status */
 #define TP_STATUS_AVAILABLE	      0
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b5ab98ca2511..8ffb19c643ab 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2296,6 +2296,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	else if (skb->pkt_type != PACKET_OUTGOING &&
 		 skb_csum_unnecessary(skb))
 		status |= TP_STATUS_CSUM_VALID;
+	if (skb_is_gso(skb) && skb_is_gso_tcp(skb))
+		status |= TP_STATUS_GSO_TCP;
 
 	if (snaplen > res)
 		snaplen = res;
@@ -3522,6 +3524,8 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 		else if (skb->pkt_type != PACKET_OUTGOING &&
 			 skb_csum_unnecessary(skb))
 			aux.tp_status |= TP_STATUS_CSUM_VALID;
+		if (skb_is_gso(skb) && skb_is_gso_tcp(skb))
+			aux.tp_status |= TP_STATUS_GSO_TCP;
 
 		aux.tp_len = origlen;
 		aux.tp_snaplen = skb->len;
-- 
2.31.1



* [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (7 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-01-31 14:59   ` Paolo Abeni
  2023-02-01 15:36   ` David Ahern
  2023-01-28 15:58 ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp Xin Long
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
per device and adds netlink attributes for them, so that IPv4
BIG TCP can be guarded by a separate tunable in the next patch.

To avoid breaking old applications that use "gso/gro_max_size" for
IPv4 GSO packets, this patch also updates "gso/gro_ipv4_max_size"
in netif_set_gso/gro_max_size() if the new size isn't greater
than GSO_LEGACY_MAX_SIZE, so that nothing changes even if
userspace isn't aware of the new netlink attributes.
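
The compatibility rule for the GSO side can be modeled in userspace as
below (the GRO side follows the same pattern). This is a sketch only:
struct fake_dev and the model_* names are hypothetical stand-ins for
struct net_device and the real setters, the WRITE_ONCE() annotations
are left out, and 65536 is assumed to be the value of
GSO_LEGACY_MAX_SIZE.

```c
#include <stdint.h>

#define MODEL_GSO_LEGACY_MAX_SIZE 65536u	/* assumed value of GSO_LEGACY_MAX_SIZE */

/* Hypothetical stand-in for the two net_device fields involved. */
struct fake_dev {
	uint32_t gso_max_size;
	uint32_t gso_ipv4_max_size;
};

/* Models netif_set_gso_max_size() after this patch. */
static inline void model_set_gso_max_size(struct fake_dev *dev, uint32_t size)
{
	dev->gso_max_size = size;
	/* Legacy-sized values keep driving IPv4; larger ones leave it alone. */
	if (size <= MODEL_GSO_LEGACY_MAX_SIZE)
		dev->gso_ipv4_max_size = size;
}

/* Models netif_set_gso_ipv4_max_size(): the IPv4-only tunable. */
static inline void model_set_gso_ipv4_max_size(struct fake_dev *dev, uint32_t size)
{
	dev->gso_ipv4_max_size = size;
}
```

In this model, lowering gso_max_size to 32768 also lowers the IPv4
limit, raising gso_max_size to 128000 leaves the IPv4 limit untouched,
and only the IPv4-specific setter can move it above the legacy size.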

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/netdevice.h    |  6 ++++++
 include/uapi/linux/if_link.h |  3 +++
 net/core/dev.c               |  4 ++++
 net/core/dev.h               | 18 ++++++++++++++++++
 net/core/rtnetlink.c         | 33 +++++++++++++++++++++++++++++++++
 5 files changed, 64 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2466afa25078..d5ef4c1fedd2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1964,6 +1964,8 @@ enum netdev_ml_priv_type {
  *	@gso_max_segs:	Maximum number of segments that can be passed to the
  *			NIC for GSO
  *	@tso_max_segs:	Device (as in HW) limit on the max TSO segment count
+ * 	@gso_ipv4_max_size:	Maximum size of generic segmentation offload,
+ * 				for IPv4.
  *
  *	@dcbnl_ops:	Data Center Bridging netlink ops
  *	@num_tc:	Number of traffic classes in the net device
@@ -2004,6 +2006,8 @@ enum netdev_ml_priv_type {
  *			keep a list of interfaces to be deleted.
  *	@gro_max_size:	Maximum size of aggregated packet in generic
  *			receive offload (GRO)
+ * 	@gro_ipv4_max_size:	Maximum size of aggregated packet in generic
+ * 				receive offload (GRO), for IPv4.
  *
  *	@dev_addr_shadow:	Copy of @dev_addr to catch direct writes.
  *	@linkwatch_dev_tracker:	refcount tracker used by linkwatch.
@@ -2207,6 +2211,7 @@ struct net_device {
  */
 #define GRO_MAX_SIZE		(8 * 65535u)
 	unsigned int		gro_max_size;
+	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
@@ -2330,6 +2335,7 @@ struct net_device {
 	u16			gso_max_segs;
 #define TSO_MAX_SEGS		U16_MAX
 	u16			tso_max_segs;
+	unsigned int		gso_ipv4_max_size;
 
 #ifdef CONFIG_DCB
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1021a7e47a86..02b87e4c65be 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -374,6 +374,9 @@ enum {
 
 	IFLA_DEVLINK_PORT,
 
+	IFLA_GSO_IPV4_MAX_SIZE,
+	IFLA_GRO_IPV4_MAX_SIZE,
+
 	__IFLA_MAX
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f72f5c4ee7e2..bb42150a38ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3001,6 +3001,8 @@ void netif_set_tso_max_size(struct net_device *dev, unsigned int size)
 	dev->tso_max_size = min(GSO_MAX_SIZE, size);
 	if (size < READ_ONCE(dev->gso_max_size))
 		netif_set_gso_max_size(dev, size);
+	if (size < READ_ONCE(dev->gso_ipv4_max_size))
+		netif_set_gso_ipv4_max_size(dev, size);
 }
 EXPORT_SYMBOL(netif_set_tso_max_size);
 
@@ -10614,6 +10616,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->gso_max_size = GSO_LEGACY_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
 	dev->gro_max_size = GRO_LEGACY_MAX_SIZE;
+	dev->gso_ipv4_max_size = GSO_LEGACY_MAX_SIZE;
+	dev->gro_ipv4_max_size = GRO_LEGACY_MAX_SIZE;
 	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
 	dev->tso_max_segs = TSO_MAX_SEGS;
 	dev->upper_level = 1;
diff --git a/net/core/dev.h b/net/core/dev.h
index 814ed5b7b960..a065b7571441 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -100,6 +100,8 @@ static inline void netif_set_gso_max_size(struct net_device *dev,
 {
 	/* dev->gso_max_size is read locklessly from sk_setup_caps() */
 	WRITE_ONCE(dev->gso_max_size, size);
+	if (size <= GSO_LEGACY_MAX_SIZE)
+		WRITE_ONCE(dev->gso_ipv4_max_size, size);
 }
 
 static inline void netif_set_gso_max_segs(struct net_device *dev,
@@ -114,6 +116,22 @@ static inline void netif_set_gro_max_size(struct net_device *dev,
 {
 	/* This pairs with the READ_ONCE() in skb_gro_receive() */
 	WRITE_ONCE(dev->gro_max_size, size);
+	if (size <= GRO_LEGACY_MAX_SIZE)
+		WRITE_ONCE(dev->gro_ipv4_max_size, size);
+}
+
+static inline void netif_set_gso_ipv4_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	/* dev->gso_ipv4_max_size is read locklessly from sk_setup_caps() */
+	WRITE_ONCE(dev->gso_ipv4_max_size, size);
+}
+
+static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	/* This pairs with the READ_ONCE() in skb_gro_receive() */
+	WRITE_ONCE(dev->gro_ipv4_max_size, size);
 }
 
 #endif
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 64289bc98887..b9f584955b77 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1074,6 +1074,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SEGS */
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GSO_IPV4_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GRO_IPV4_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SEGS */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
@@ -1807,6 +1809,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GSO_MAX_SEGS, dev->gso_max_segs) ||
 	    nla_put_u32(skb, IFLA_GSO_MAX_SIZE, dev->gso_max_size) ||
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
+	    nla_put_u32(skb, IFLA_GSO_IPV4_MAX_SIZE, dev->gso_ipv4_max_size) ||
+	    nla_put_u32(skb, IFLA_GRO_IPV4_MAX_SIZE, dev->gro_ipv4_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SIZE, dev->tso_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SEGS, dev->tso_max_segs) ||
 #ifdef CONFIG_RPS
@@ -1968,6 +1972,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_TSO_MAX_SIZE]	= { .type = NLA_REJECT },
 	[IFLA_TSO_MAX_SEGS]	= { .type = NLA_REJECT },
 	[IFLA_ALLMULTI]		= { .type = NLA_REJECT },
+	[IFLA_GSO_IPV4_MAX_SIZE]	= { .type = NLA_U32 },
+	[IFLA_GRO_IPV4_MAX_SIZE]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2883,6 +2889,29 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_GSO_IPV4_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tb[IFLA_GSO_IPV4_MAX_SIZE]);
+
+		if (max_size > dev->tso_max_size) {
+			err = -EINVAL;
+			goto errout;
+		}
+
+		if (dev->gso_ipv4_max_size ^ max_size) {
+			netif_set_gso_ipv4_max_size(dev, max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
+	if (tb[IFLA_GRO_IPV4_MAX_SIZE]) {
+		u32 gro_max_size = nla_get_u32(tb[IFLA_GRO_IPV4_MAX_SIZE]);
+
+		if (dev->gro_ipv4_max_size ^ gro_max_size) {
+			netif_set_gro_ipv4_max_size(dev, gro_max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
 	if (tb[IFLA_OPERSTATE])
 		set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
 
@@ -3325,6 +3354,10 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
 		netif_set_gso_max_segs(dev, nla_get_u32(tb[IFLA_GSO_MAX_SEGS]));
 	if (tb[IFLA_GRO_MAX_SIZE])
 		netif_set_gro_max_size(dev, nla_get_u32(tb[IFLA_GRO_MAX_SIZE]));
+	if (tb[IFLA_GSO_IPV4_MAX_SIZE])
+		netif_set_gso_ipv4_max_size(dev, nla_get_u32(tb[IFLA_GSO_IPV4_MAX_SIZE]));
+	if (tb[IFLA_GRO_IPV4_MAX_SIZE])
+		netif_set_gro_ipv4_max_size(dev, nla_get_u32(tb[IFLA_GRO_IPV4_MAX_SIZE]));
 
 	return dev;
 }
-- 
2.31.1
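
A minimal userspace sketch of the setter behaviour in the net/core/dev.h
hunk and of the IFLA_GSO_IPV4_MAX_SIZE branch in do_setlink() above. The
struct, the function names and the SKETCH_ constant are hypothetical
stand-ins, and the value 65536 for GSO_LEGACY_MAX_SIZE is an assumption to
check against include/linux/netdevice.h. WRITE_ONCE() and the
DO_SETLINK_MODIFIED status flag are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's GSO_LEGACY_MAX_SIZE. */
#define SKETCH_GSO_LEGACY_MAX_SIZE 65536u

/* Hypothetical stand-in for the net_device fields touched by the patch. */
struct sketch_dev {
	uint32_t gso_max_size;
	uint32_t gso_ipv4_max_size;
	uint32_t tso_max_size;
};

/* Models netif_set_gso_max_size(): a legacy-sized value is mirrored into
 * the IPv4 field, a larger one leaves the IPv4 field untouched.
 */
static inline void sketch_set_gso_max_size(struct sketch_dev *dev,
					   uint32_t size)
{
	assert(dev != NULL);
	dev->gso_max_size = size;
	if (size <= SKETCH_GSO_LEGACY_MAX_SIZE)
		dev->gso_ipv4_max_size = size;
}

/* Models the IFLA_GSO_IPV4_MAX_SIZE branch of do_setlink().
 * Returns -EINVAL above the TSO limit, 1 if the value changed, else 0.
 */
static inline int sketch_setlink_gso_ipv4(struct sketch_dev *dev,
					  uint32_t max_size)
{
	assert(dev != NULL);
	if (max_size > dev->tso_max_size)
		return -EINVAL;
	if (dev->gso_ipv4_max_size == max_size)
		return 0;
	dev->gso_ipv4_max_size = max_size;
	return 1;
}
```

With this model, an old application that only sets gso_max_size to a
legacy-sized value still changes the IPv4 limit, while raising
gso_max_size above the legacy size leaves the IPv4 limit alone.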



* [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (8 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
@ 2023-01-28 15:58 ` Xin Long
  2023-02-01 15:38   ` David Ahern
  2023-02-02  9:24   ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp: manual merge Matthieu Baerts
  2023-02-01  8:53 ` [PATCHv4 net-next 00/10] net: support ipv4 big tcp Eric Dumazet
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 23+ messages in thread
From: Xin Long @ 2023-01-28 15:58 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

Similar to Eric's IPv6 BIG TCP, this patch enables IPv4 BIG TCP.

First, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE for IPv4 TCP sockets: sk_trim_gso_size() is replaced
by sk_dst_gso_max_size(), which no longer trims gso_max_size for them.
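
As a rough illustration of the size selection described above, here is a
userspace sketch modelled on sk_dst_gso_max_size() from the net/core/sock.c
hunk of this patch. The function name, the boolean parameters and both
SKETCH_ constants are hypothetical: MAX_TCP_HEADER depends on the kernel
configuration, so 320 is only a placeholder, and the assert() guard against
underflow is an addition that the kernel code does not have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the kernel's GSO_LEGACY_MAX_SIZE. */
#define SKETCH_GSO_LEGACY_MAX_SIZE 65536u
/* Placeholder only: the kernel's MAX_TCP_HEADER is config-dependent. */
#define SKETCH_MAX_TCP_HEADER 320u

/* is_ipv6 must be false for AF_INET sockets and for v4-mapped AF_INET6
 * sockets, matching the ipv6_addr_v4mapped() check in the patch.
 */
static inline uint32_t sketch_sk_gso_max_size(bool is_ipv6, bool is_tcp,
					      uint32_t dev_gso_max_size,
					      uint32_t dev_gso_ipv4_max_size)
{
	uint32_t max_size = is_ipv6 ? dev_gso_max_size
				    : dev_gso_ipv4_max_size;

	/* Only TCP sockets may go beyond the legacy limit. */
	if (max_size > SKETCH_GSO_LEGACY_MAX_SIZE && !is_tcp)
		max_size = SKETCH_GSO_LEGACY_MAX_SIZE;

	/* Sketch-only guard so the subtraction below cannot wrap. */
	assert(max_size > SKETCH_MAX_TCP_HEADER + 1u);

	return max_size - (SKETCH_MAX_TCP_HEADER + 1u);
}
```

The per-family choice is what lets gso_ipv4_max_size be raised without
touching IPv6 sockets, and the other way around.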

Then, on the TX path, set the IP header tot_len to 0 when skb->len >
IP_MAX_MTU in __ip_local_out() so that BIG TCP packets can be sent; this
implies that skb->len is the length of the IPv4 packet. On the RX path,
use skb->len as the length of the IPv4 packet when the IP header tot_len
is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As iph_set_totlen() and
iph_totlen() already implement this, __ip_local_out() and ip_rcv_core()
only need to switch to these helpers.
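
The tot_len convention described above can be sketched in userspace as
follows. This is a model under stated assumptions, not the kernel helpers
themselves: struct sketch_pkt and both function names are hypothetical,
l3_len stands in for skb->len - skb_network_offset(skb), is_gso_tcp stands
in for the skb_is_gso()/skb_is_gso_tcp() checks, and 0xFFFF is assumed to
be the value of IP_MAX_MTU.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's IP_MAX_MTU. */
#define SKETCH_IP_MAX_MTU 0xFFFFu

/* Hypothetical stand-in for the skb/iphdr state the helpers look at. */
struct sketch_pkt {
	uint16_t tot_len_be;	/* iph->tot_len, network byte order */
	uint32_t l3_len;	/* skb->len - skb_network_offset(skb) */
	bool is_gso_tcp;	/* skb_is_gso() && skb_is_gso_tcp() */
};

/* Setter: lengths that do not fit in 16 bits are encoded as 0. */
static inline void sketch_set_totlen(struct sketch_pkt *p, uint32_t len)
{
	assert(p != NULL);
	p->tot_len_be = len <= SKETCH_IP_MAX_MTU ? htons((uint16_t)len) : 0;
}

/* Getter: a zero tot_len on a TCP GSO packet means "use the skb length". */
static inline uint32_t sketch_totlen(const struct sketch_pkt *p)
{
	uint32_t len;

	assert(p != NULL);
	len = ntohs(p->tot_len_be);
	return (len || !p->is_gso_tcp) ? len : p->l3_len;
}
```

A zero tot_len on a packet that is not TCP GSO is still reported as 0, so
such a packet is not mistaken for a BIG TCP packet.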

Also, in GRO receive, add a check for ETH_P_IP/IPPROTO_TCP and allow a
merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In GRO
complete, set the IP header tot_len to 0 via iph_set_totlen() when the
merged packet size is greater than IP_MAX_MTU, so that it can be processed
on the RX path.
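
The size gate in skb_gro_receive() after this patch can be modelled roughly
as below. Everything here is a hypothetical userspace stand-in: the struct
and function names are invented, 65536 is assumed for GRO_LEGACY_MAX_SIZE,
and 8 is assumed for sizeof(struct hop_jumbo_hdr). The NAPI_GRO_CB(skb)->flush
test that shares the first condition in the kernel is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's GRO_LEGACY_MAX_SIZE. */
#define SKETCH_GRO_LEGACY_MAX_SIZE 65536u
/* Assumed to match sizeof(struct hop_jumbo_hdr). */
#define SKETCH_HOP_JUMBO_HDR_LEN 8u

/* Hypothetical summary of what the check reads from p, skb and p->dev. */
struct sketch_gro_ctx {
	bool is_ipv6;		/* p->protocol == htons(ETH_P_IPV6) */
	bool is_tcp;		/* NAPI_GRO_CB(skb)->proto == IPPROTO_TCP */
	bool encapsulated;	/* p->encapsulation */
	uint32_t headroom;	/* skb_headroom(p) */
	uint32_t gro_max_size;
	uint32_t gro_ipv4_max_size;
};

/* Returns 0 if a merged packet of merged_len bytes is acceptable,
 * -E2BIG otherwise.
 */
static inline int sketch_gro_size_check(const struct sketch_gro_ctx *c,
					uint32_t merged_len)
{
	uint32_t limit;

	assert(c != NULL);
	/* Anything that is not IPv6 uses the IPv4 limit, as in the patch. */
	limit = c->is_ipv6 ? c->gro_max_size : c->gro_ipv4_max_size;
	if (merged_len >= limit)
		return -E2BIG;

	if (merged_len >= SKETCH_GRO_LEGACY_MAX_SIZE) {
		/* Beyond the legacy size: TCP only, no encapsulation, and
		 * IPv6 needs room for the hop-by-hop jumbo header.
		 */
		if (!c->is_tcp ||
		    (c->is_ipv6 &&
		     c->headroom < SKETCH_HOP_JUMBO_HDR_LEN) ||
		    c->encapsulated)
			return -E2BIG;
	}
	return 0;
}
```

The headroom requirement applies to IPv6 only, because an IPv4 BIG TCP
packet carries no extra header and signals its size through tot_len == 0.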

Note that checking skb_is_gso_tcp() in iph_totlen() makes it safe for
this implementation to use iph->tot_len == 0 to indicate IPv4 BIG TCP
packets.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/core/gro.c       | 12 +++++++-----
 net/core/sock.c      | 26 ++++++++++++++------------
 net/ipv4/af_inet.c   |  7 ++++---
 net/ipv4/ip_input.c  |  2 +-
 net/ipv4/ip_output.c |  2 +-
 5 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/net/core/gro.c b/net/core/gro.c
index 506f83d715f8..b15f85546bdd 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -162,16 +162,18 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	struct sk_buff *lp;
 	int segs;
 
-	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
-	gro_max_size = READ_ONCE(p->dev->gro_max_size);
+	/* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
+	gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
+			READ_ONCE(p->dev->gro_max_size) :
+				READ_ONCE(p->dev->gro_ipv4_max_size);
 
 	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
 		return -E2BIG;
 
 	if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
-		if (p->protocol != htons(ETH_P_IPV6) ||
-		    skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
-		    ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
+		if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
+		    (p->protocol == htons(ETH_P_IPV6) &&
+		     skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
 		    p->encapsulation)
 			return -E2BIG;
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 7ba4891460ad..f08b76acde9b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2373,17 +2373,22 @@ void sk_free_unlock_clone(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_free_unlock_clone);
 
-static void sk_trim_gso_size(struct sock *sk)
+static u32 sk_dst_gso_max_size(struct sock *sk, struct dst_entry *dst)
 {
-	if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
-		return;
+	bool is_ipv6 = false;
+	u32 max_size;
+
 #if IS_ENABLED(CONFIG_IPV6)
-	if (sk->sk_family == AF_INET6 &&
-	    sk_is_tcp(sk) &&
-	    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
-		return;
+	is_ipv6 = (sk->sk_family == AF_INET6 &&
+		   !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr));
 #endif
-	sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
+	/* pairs with the WRITE_ONCE() in netif_set_gso(_ipv4)_max_size() */
+	max_size = is_ipv6 ? READ_ONCE(dst->dev->gso_max_size) :
+			READ_ONCE(dst->dev->gso_ipv4_max_size);
+	if (max_size > GSO_LEGACY_MAX_SIZE && !sk_is_tcp(sk))
+		max_size = GSO_LEGACY_MAX_SIZE;
+
+	return max_size - (MAX_TCP_HEADER + 1);
 }
 
 void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
@@ -2403,10 +2408,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
 		} else {
 			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
-			/* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
-			sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
-			sk_trim_gso_size(sk);
-			sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
+			sk->sk_gso_max_size = sk_dst_gso_max_size(sk, dst);
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
 			max_segs = max_t(u32, READ_ONCE(dst->dev->gso_max_segs), 1);
 		}
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6c0ec2789943..2f992a323b95 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1485,6 +1485,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 	if (unlikely(ip_fast_csum((u8 *)iph, 5)))
 		goto out;
 
+	NAPI_GRO_CB(skb)->proto = proto;
 	id = ntohl(*(__be32 *)&iph->id);
 	flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
 	id >>= 16;
@@ -1618,9 +1619,9 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
-	__be16 newlen = htons(skb->len - nhoff);
 	struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
 	const struct net_offload *ops;
+	__be16 totlen = iph->tot_len;
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
@@ -1629,8 +1630,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 		skb_set_inner_network_header(skb, nhoff);
 	}
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
-	iph->tot_len = newlen;
+	iph_set_totlen(iph, skb->len - nhoff);
+	csum_replace2(&iph->check, totlen, iph->tot_len);
 
 	ops = rcu_dereference(inet_offloads[proto]);
 	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index e880ce77322a..fe9ead9ee863 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -511,7 +511,7 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
 		goto csum_error;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(skb, iph);
 	if (skb->len < len) {
 		drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
 		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 922c87ef1ab5..4e4e308c3230 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -100,7 +100,7 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct iphdr *iph = ip_hdr(skb);
 
-	iph->tot_len = htons(skb->len);
+	iph_set_totlen(iph, skb->len);
 	ip_send_check(iph);
 
 	/* if egress device is enslaved to an L3 master device pass the
-- 
2.31.1



* Re: [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  2023-01-28 15:58 ` [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
@ 2023-01-31 14:59   ` Paolo Abeni
  2023-01-31 17:55     ` Xin Long
  2023-02-01 15:36   ` David Ahern
  1 sibling, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2023-01-31 14:59 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, David Ahern, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On Sat, 2023-01-28 at 10:58 -0500, Xin Long wrote:
> This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
> per device and adds netlink attributes for them, so that IPV4
> BIG TCP can be guarded by a separate tunable in the next patch.
> 
> To not break the old application using "gso/gro_max_size" for
> IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
> in netif_set_gso/gro_max_size() if the new size isn't greater
> than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
> userspace doesn't realize the new netlink attributes.

Not a big deal, but I think it would be nice to include the pahole info
showing where the new fields are located and why those are good
locations.

No need to send a new version just for the above, unless Eric asks
otherwise ;)

Cheers,

Paolo



* Re: [PATCHv4 net-next 02/10] bridge: use skb_ip_totlen in br netfilter
  2023-01-28 15:58 ` [PATCHv4 net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
@ 2023-01-31 15:01   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 23+ messages in thread
From: Nikolay Aleksandrov @ 2023-01-31 15:01 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Mahesh Bandewar, Paul Moore, Guillaume Nault

On 28/01/2023 17:58, Xin Long wrote:
> These 3 places in bridge netfilter are called on RX path after GRO
> and IPv4 TCP GSO packets may come through, so replace iph tot_len
> accessing with skb_ip_totlen() in there.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_netfilter_hooks.c            | 2 +-
>  net/bridge/netfilter/nf_conntrack_bridge.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
> index f20f4373ff40..b67c9c98effa 100644
> --- a/net/bridge/br_netfilter_hooks.c
> +++ b/net/bridge/br_netfilter_hooks.c
> @@ -214,7 +214,7 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
>  	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
>  		goto csum_error;
>  
> -	len = ntohs(iph->tot_len);
> +	len = skb_ip_totlen(skb);
>  	if (skb->len < len) {
>  		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>  		goto drop;
> diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
> index 5c5dd437f1c2..71056ee84773 100644
> --- a/net/bridge/netfilter/nf_conntrack_bridge.c
> +++ b/net/bridge/netfilter/nf_conntrack_bridge.c
> @@ -212,7 +212,7 @@ static int nf_ct_br_ip_check(const struct sk_buff *skb)
>  	    iph->version != 4)
>  		return -1;
>  
> -	len = ntohs(iph->tot_len);
> +	len = skb_ip_totlen(skb);
>  	if (skb->len < nhoff + len ||
>  	    len < (iph->ihl * 4))
>                  return -1;
> @@ -256,7 +256,7 @@ static unsigned int nf_ct_bridge_pre(void *priv, struct sk_buff *skb,
>  		if (!pskb_may_pull(skb, sizeof(struct iphdr)))
>  			return NF_ACCEPT;
>  
> -		len = ntohs(ip_hdr(skb)->tot_len);
> +		len = skb_ip_totlen(skb);
>  		if (pskb_trim_rcsum(skb, len))
>  			return NF_ACCEPT;
>  

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>




* Re: [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  2023-01-31 14:59   ` Paolo Abeni
@ 2023-01-31 17:55     ` Xin Long
  0 siblings, 0 replies; 23+ messages in thread
From: Xin Long @ 2023-01-31 17:55 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: network dev, davem, kuba, Eric Dumazet, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Tue, Jan 31, 2023 at 9:59 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sat, 2023-01-28 at 10:58 -0500, Xin Long wrote:
> > This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
> > per device and adds netlink attributes for them, so that IPV4
> > BIG TCP can be guarded by a separate tunable in the next patch.
> >
> > To not break the old application using "gso/gro_max_size" for
> > IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
> > in netif_set_gso/gro_max_size() if the new size isn't greater
> > than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
> > userspace doesn't realize the new netlink attributes.
>
> Not a big deal, but I think it would be nice to include the pahole info
> showing where the new fields are located and why that are good
> locations.
>
> No need to send a new version for just for the above, unless Eric asks
> otherwise ;)
>
The pahole info without and with the patch is shown below:

- Without the Patch:

# pahole --hex -C net_device vmlinux
struct net_device {
...
long unsigned int          gro_flush_timeout;    /* 0x330   0x8 */
int                        napi_defer_hard_irqs; /* 0x338   0x4 */
unsigned int               gro_max_size;         /* 0x33c   0x4 */  <---------
/* --- cacheline 13 boundary (832 bytes) --- */
rx_handler_func_t *        rx_handler;           /* 0x340   0x8 */
void *                     rx_handler_data;      /* 0x348   0x8 */
struct mini_Qdisc *        miniq_ingress;        /* 0x350   0x8 */
struct netdev_queue *      ingress_queue;        /* 0x358   0x8 */
struct nf_hook_entries *   nf_hooks_ingress;     /* 0x360   0x8 */
unsigned char              broadcast[32];        /* 0x368  0x20 */
/* --- cacheline 14 boundary (896 bytes) was 8 bytes ago --- */
struct cpu_rmap *          rx_cpu_rmap;          /* 0x388   0x8 */
struct hlist_node          index_hlist;          /* 0x390  0x10 */

/* XXX 32 bytes hole, try to pack */

/* --- cacheline 15 boundary (960 bytes) --- */
struct netdev_queue *      _tx __attribute__((__aligned__(64))); /* 0x3c0   0x8 */
...

/* --- cacheline 32 boundary (2048 bytes) was 24 bytes ago --- */
const struct attribute_group  * sysfs_groups[4]; /* 0x818  0x20 */
const struct attribute_group  * sysfs_rx_queue_group; /* 0x838   0x8 */
/* --- cacheline 33 boundary (2112 bytes) --- */
const struct rtnl_link_ops  * rtnl_link_ops;     /* 0x840   0x8 */
unsigned int               gso_max_size;         /* 0x848   0x4 */
unsigned int               tso_max_size;         /* 0x84c   0x4 */
u16                        gso_max_segs;         /* 0x850   0x2 */
u16                        tso_max_segs;         /* 0x852   0x2 */   <---------

/* XXX 4 bytes hole, try to pack */

const struct dcbnl_rtnl_ops  * dcbnl_ops;        /* 0x858   0x8 */
s16                        num_tc;               /* 0x860   0x2 */
struct netdev_tc_txq       tc_to_txq[16];        /* 0x862  0x40 */
/* --- cacheline 34 boundary (2176 bytes) was 34 bytes ago --- */
u8                         prio_tc_map[16];      /* 0x8a2  0x10 */
...
}


- With the Patch:

For "gso_ipv4_max_size", it fills the 4-byte hole as expected.

/* --- cacheline 33 boundary (2112 bytes) --- */
const struct rtnl_link_ops  * rtnl_link_ops;     /* 0x840   0x8 */
unsigned int               gso_max_size;         /* 0x848   0x4 */
unsigned int               tso_max_size;         /* 0x84c   0x4 */
u16                        gso_max_segs;         /* 0x850   0x2 */
u16                        tso_max_segs;         /* 0x852   0x2 */
unsigned int               gso_ipv4_max_size;    /* 0x854   0x4 */ <-------
const struct dcbnl_rtnl_ops  * dcbnl_ops;        /* 0x858   0x8 */
s16                        num_tc;               /* 0x860   0x2 */
struct netdev_tc_txq       tc_to_txq[16];        /* 0x862  0x40 */
/* --- cacheline 34 boundary (2176 bytes) was 34 bytes ago --- */
u8                         prio_tc_map[16];      /* 0x8a2  0x10 */


For "gro_ipv4_max_size", there are no byte holes, so I just put it
in the "Cache lines mostly used on receive path" area, next to
gro_max_size.

long unsigned int          gro_flush_timeout;    /* 0x330   0x8 */
int                        napi_defer_hard_irqs; /* 0x338   0x4 */
unsigned int               gro_max_size;         /* 0x33c   0x4 */
/* --- cacheline 13 boundary (832 bytes) --- */
unsigned int               gro_ipv4_max_size;    /* 0x340   0x4 */  <------

/* XXX 4 bytes hole, try to pack */

rx_handler_func_t *        rx_handler;           /* 0x348   0x8 */
void *                     rx_handler_data;      /* 0x350   0x8 */
struct mini_Qdisc *        miniq_ingress;        /* 0x358   0x8 */
struct netdev_queue *      ingress_queue;        /* 0x360   0x8 */
struct nf_hook_entries *   nf_hooks_ingress;     /* 0x368   0x8 */
unsigned char              broadcast[32];        /* 0x370  0x20 */
/* --- cacheline 14 boundary (896 bytes) was 16 bytes ago --- */
struct cpu_rmap *          rx_cpu_rmap;          /* 0x390   0x8 */
struct hlist_node          index_hlist;          /* 0x398  0x10 */

/* XXX 24 bytes hole, try to pack */

/* --- cacheline 15 boundary (960 bytes) --- */
struct netdev_queue *      _tx __attribute__((__aligned__(64))); /* 0x3c0   0x8 */
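
The hole filling for gso_ipv4_max_size shown by pahole above can be
reproduced with a small stand-alone sketch. The two structs are hypothetical
excerpts that keep only the member order and sizes of the relevant
net_device fields, with the pointers reduced to const void *. The offsets
assume an LP64 ABI such as x86_64 (4-byte unsigned int, 8-byte pointers with
8-byte alignment); on other ABIs the layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt of the layout before the patch. */
struct sketch_before {
	const void *rtnl_link_ops;
	unsigned int gso_max_size;
	unsigned int tso_max_size;
	uint16_t gso_max_segs;
	uint16_t tso_max_segs;
	/* 4-byte hole on LP64: the next pointer is 8-byte aligned. */
	const void *dcbnl_ops;
};

/* Hypothetical excerpt of the layout with the new field in the hole. */
struct sketch_after {
	const void *rtnl_link_ops;
	unsigned int gso_max_size;
	unsigned int tso_max_size;
	uint16_t gso_max_segs;
	uint16_t tso_max_segs;
	unsigned int gso_ipv4_max_size;
	const void *dcbnl_ops;
};

/* Size of the padding between tso_max_segs and dcbnl_ops before the patch. */
static inline size_t sketch_hole_before(void)
{
	return offsetof(struct sketch_before, dcbnl_ops) -
	       (offsetof(struct sketch_before, tso_max_segs) +
		sizeof(uint16_t));
}

/* Checks that the new field costs no space and moves no later member. */
static inline void sketch_check_layout(void)
{
	assert(sizeof(struct sketch_before) == sizeof(struct sketch_after));
	assert(offsetof(struct sketch_after, dcbnl_ops) ==
	       offsetof(struct sketch_before, dcbnl_ops));
}
```

The relative offsets match the listing: the new field sits 0x14 bytes after
rtnl_link_ops (0x854 - 0x840) and dcbnl_ops stays at 0x18 (0x858 - 0x840).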


Thanks.


* Re: [PATCHv4 net-next 00/10] net: support ipv4 big tcp
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (9 preceding siblings ...)
  2023-01-28 15:58 ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp Xin Long
@ 2023-02-01  8:53 ` Eric Dumazet
  2023-02-01 15:39 ` David Ahern
  2023-02-02  5:10 ` patchwork-bot+netdevbpf
  12 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2023-02-01  8:53 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sat, Jan 28, 2023 at 4:58 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> This is similar to the BIG TCP patchset added by Eric for IPv6:
>
>   https://lwn.net/Articles/895398/
>
> Different from IPv6, IPv4 tot_len is 16-bit long only, and IPv4 header
> doesn't have exthdrs(options) for the BIG TCP packets' length. To make
> it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
> indicate this might be a BIG TCP packet and use skb->len as the real
> IPv4 total length.
>
> This will work safely, as all BIG TCP packets are GSO/GRO packets and
> processed on the same host as they were created; There is no padding
> in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4
> packet total length; Also, before implementing the feature, all those
> places that may get iph tot_len from BIG TCP packets are taken care
> with some new APIs:
>
> Patch 1 adds some APIs for iph tot_len setting and getting, which are
> used in all these places where IPv4 BIG TCP packets may reach in Patch
> 2-7, Patch 8 adds a GSO_TCP tp_status for af_packet users, and Patch 9
> add new netlink attributes to make IPv4 BIG TCP independent from IPv6
> BIG TCP on configuration, and Patch 10 implements this feature.
>
> Note that the similar change as in Patch 2-6 are also needed for IPv6
> BIG TCP packets, and will be addressed in another patchset.
>
> The similar performance test is done for IPv4 BIG TCP with 25Gbit NIC
> and 1.5K MTU:
>
> No BIG TCP:
> for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
> 168          322          337          3776.49
> 143          236          277          4654.67
> 128          258          288          4772.83
> 171          229          278          4645.77
> 175          228          243          4678.93
> 149          239          279          4599.86
> 164          234          268          4606.94
> 155          276          289          4235.82
> 180          255          268          4418.95
> 168          241          249          4417.82
>
> Enable BIG TCP:
> ip link set dev ens1f0np0 gro_ipv4_max_size 128000 gso_ipv4_max_size 128000
> for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
> 161          241          252          4821.73
> 174          205          217          5098.28
> 167          208          220          5001.43
> 164          228          249          4883.98
> 150          233          249          4914.90
> 180          233          244          4819.66
> 154          208          219          5004.92
> 157          209          247          4999.78
> 160          218          246          4842.31
> 174          206          217          5080.99
>
> Thanks for the feedback from Eric and David Ahern.
>
> v1->v2:
>   - remove the fixes and the selftest for IPv6 BIG TCP, will do it in
>     another patchset.
>   - add GSO_TCP for tp_status in packet sockets to tell the af_packet
>     users that this is a TCP GSO packet in Patch 8.
>   - also check skb_is_gso() when checking if it's a GSO TCP packet in
>     Patch 1.
> v2->v3:
>   - add gso/gro_ipv4_max_size per device and netlink attributes for them
>     in Patch 9, so that we can selectively enable BIG TCP for IPv6, and
>     not for IPv4, as Eric required.
>   - remove the selftest, as it requires userspace iproute2 change after
>     making IPv4 BIG TCP independent from IPv6 BIG TCP on configuration.
> v3->v4:
>   - put gso/gro_ipv4_max_size close to other related fields, so that we
>     do not need an extra cache line miss, as Eric suggested.
>   - also check ipv6_addr_v4mapped() when reading gso(_ipv4)_max_size in
>     sk_setup_caps(), as Eric noticed.

For the series:

Reviewed-by: Eric Dumazet <edumazet@google.com>

Please make sure to add needed changes to tcpdump/libpcap
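
For reference, a hedged sketch of what the capture-side handling could look
like, based on the description of TP_STATUS_GSO_TCP in patch 8 of this
series. It is not the actual tcpdump/libpcap change. The function name and
the l2_len parameter are hypothetical, the flag value (1 << 8) is an
assumption to verify against include/uapi/linux/if_packet.h, and subtracting
the link-layer header length from tp_len is an assumption about how the
frame was captured (pass 0 when tp_len already excludes it).

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Assumed flag value; the real one comes from <linux/if_packet.h>. */
#define SKETCH_TP_STATUS_GSO_TCP (1u << 8)

/* Returns the IPv4 packet length a capture tool might report.
 * tot_len_be: iph->tot_len as read from the frame (network byte order)
 * tp_status:  tp_status from the packet ring frame header
 * tp_len:     tp_len from the packet ring frame header
 * l2_len:     link-layer header bytes included in tp_len (hypothetical)
 */
static inline uint32_t sketch_capture_ip_len(uint16_t tot_len_be,
					     uint32_t tp_status,
					     uint32_t tp_len, uint32_t l2_len)
{
	uint32_t len = ntohs(tot_len_be);

	if (len == 0 && (tp_status & SKETCH_TP_STATUS_GSO_TCP)) {
		/* BIG TCP: tot_len carries no length, fall back to tp_len. */
		assert(tp_len >= l2_len);
		return tp_len - l2_len;
	}
	return len;
}
```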


* Re: [PATCHv4 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack
  2023-01-28 15:58 ` [PATCHv4 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
@ 2023-02-01 13:29   ` Aaron Conole
  0 siblings, 0 replies; 23+ messages in thread
From: Aaron Conole @ 2023-02-01 13:29 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Roopa Prabhu,
	Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

Xin Long <lucien.xin@gmail.com> writes:

> IPv4 GSO packets may get processed in ovs_skb_network_trim(),
> and we need to use skb_ip_totlen() to get iph totlen.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---

Reviewed-by: Aaron Conole <aconole@redhat.com>



* Re: [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len
  2023-01-28 15:58 ` [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
@ 2023-02-01 15:31   ` David Ahern
  0 siblings, 0 replies; 23+ messages in thread
From: David Ahern @ 2023-02-01 15:31 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/28/23 8:58 AM, Xin Long wrote:
> This patch adds three APIs to replace the iph->tot_len setting
> and getting in all places where IPv4 BIG TCP packets may reach,
> they will be used in the following patches.
> 
> Note that iph_totlen() will be used when iph is not in linear
> data of the skb.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  include/linux/ip.h  | 21 +++++++++++++++++++++
>  include/net/route.h |  3 ---
>  2 files changed, 21 insertions(+), 3 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>




* Re: [PATCHv4 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status
  2023-01-28 15:58 ` [PATCHv4 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status Xin Long
@ 2023-02-01 15:32   ` David Ahern
  0 siblings, 0 replies; 23+ messages in thread
From: David Ahern @ 2023-02-01 15:32 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/28/23 8:58 AM, Xin Long wrote:
> Introduce TP_STATUS_GSO_TCP tp_status flag to tell the af_packet user
> that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets in
> tcpdump/libpcap, it can use tp_len as the IPv4 packet len when this
> flag is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  include/uapi/linux/if_packet.h | 1 +
>  net/packet/af_packet.c         | 4 ++++
>  2 files changed, 5 insertions(+)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>




* Re: [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  2023-01-28 15:58 ` [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
  2023-01-31 14:59   ` Paolo Abeni
@ 2023-02-01 15:36   ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: David Ahern @ 2023-02-01 15:36 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/28/23 8:58 AM, Xin Long wrote:
> This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
> per device and adds netlink attributes for them, so that IPV4
> BIG TCP can be guarded by a separate tunable in the next patch.
> 
> To not break the old application using "gso/gro_max_size" for
> IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
> in netif_set_gso/gro_max_size() if the new size isn't greater
> than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
> userspace doesn't realize the new netlink attributes.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  include/linux/netdevice.h    |  6 ++++++
>  include/uapi/linux/if_link.h |  3 +++
>  net/core/dev.c               |  4 ++++
>  net/core/dev.h               | 18 ++++++++++++++++++
>  net/core/rtnetlink.c         | 33 +++++++++++++++++++++++++++++++++
>  5 files changed, 64 insertions(+)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>




* Re: [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp
  2023-01-28 15:58 ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp Xin Long
@ 2023-02-01 15:38   ` David Ahern
  2023-02-02  9:24   ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp: manual merge Matthieu Baerts
  1 sibling, 0 replies; 23+ messages in thread
From: David Ahern @ 2023-02-01 15:38 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/28/23 8:58 AM, Xin Long wrote:
> Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
> 
> Firstly, allow sk->sk_gso_max_size to be set to a value greater than
> GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
> for IPv4 TCP sockets.
> 
> Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
> in __ip_local_out() to allow sending BIG TCP packets, and this implies
> that skb->len is the length of an IPv4 packet; On RX path, use skb->len
> as the length of the IPv4 packet when the IP header tot_len is 0 and
> skb->len > IP_MAX_MTU in ip_rcv_core(). As the APIs iph_set_totlen() and
> skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
> need to update these APIs.
> 
> Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
> the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
> GRO complete, set IP header tot_len to 0 in iph_set_totlen() when the
> merged packet size is greater than IP_MAX_MTU so that it can be
> processed on RX path.
> 
> Note that checking skb_is_gso_tcp() in API iph_totlen() makes it safe
> for this implementation to use iph->tot_len == 0 to indicate IPv4 BIG
> TCP packets.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/core/gro.c       | 12 +++++++-----
>  net/core/sock.c      | 26 ++++++++++++++------------
>  net/ipv4/af_inet.c   |  7 ++++---
>  net/ipv4/ip_input.c  |  2 +-
>  net/ipv4/ip_output.c |  2 +-
>  5 files changed, 27 insertions(+), 22 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>
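
The tot_len == 0 convention described in the quoted text can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel helpers themselves: `struct pkt`, `pkt_set_totlen()` and `pkt_totlen()` are hypothetical names standing in for the skb/iphdr pair, `iph_set_totlen()` and `iph_totlen()`, and tot_len is kept in host byte order for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IP_MAX_MTU 0xFFFFu /* largest value a 16-bit tot_len can carry */

/* Hypothetical, simplified packet: only the fields the helpers need. */
struct pkt {
	uint16_t tot_len;  /* IPv4 header total length (host order here) */
	unsigned int len;  /* models skb->len minus the network offset */
	bool is_gso_tcp;   /* models the skb_is_gso_tcp() check */
};

/* Setter: a length that does not fit in 16 bits is encoded as 0. */
void pkt_set_totlen(struct pkt *p, unsigned int len)
{
	assert(p != NULL);
	p->tot_len = len <= IP_MAX_MTU ? (uint16_t)len : 0;
}

/* Getter: trust the header unless it is 0 on a GSO TCP packet, in which
 * case the buffer length is taken as the real IPv4 total length.
 */
unsigned int pkt_totlen(const struct pkt *p)
{
	assert(p != NULL);
	if (p->tot_len != 0 || !p->is_gso_tcp)
		return p->tot_len;
	return p->len;
}
```

The GSO TCP check in the getter is what keeps a malformed non-GSO packet with tot_len == 0 from being treated as a BIG TCP packet: in this model such a packet still reports a length of 0.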




* Re: [PATCHv4 net-next 00/10] net: support ipv4 big tcp
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (10 preceding siblings ...)
  2023-02-01  8:53 ` [PATCHv4 net-next 00/10] net: support ipv4 big tcp Eric Dumazet
@ 2023-02-01 15:39 ` David Ahern
  2023-02-02  5:10 ` patchwork-bot+netdevbpf
  12 siblings, 0 replies; 23+ messages in thread
From: David Ahern @ 2023-02-01 15:39 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/28/23 8:58 AM, Xin Long wrote:
> This is similar to the BIG TCP patchset added by Eric for IPv6:
> 
>   https://lwn.net/Articles/895398/
> 
> Unlike IPv6, IPv4 tot_len is only 16 bits long, and the IPv4 header
> doesn't have exthdrs(options) for the BIG TCP packets' length. To keep
> it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
> indicate this might be a BIG TCP packet and use skb->len as the real
> IPv4 total length.
> 
> This will work safely, as all BIG TCP packets are GSO/GRO packets and
> processed on the same host as they were created; There is no padding
> in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4
> packet total length; Also, before implementing the feature, all those
> places that may get iph tot_len from BIG TCP packets are taken care
> of with some new APIs:
> 

Thanks for working on this.



* Re: [PATCHv4 net-next 00/10] net: support ipv4 big tcp
  2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (11 preceding siblings ...)
  2023-02-01 15:39 ` David Ahern
@ 2023-02-02  5:10 ` patchwork-bot+netdevbpf
  12 siblings, 0 replies; 23+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-02-02  5:10 UTC (permalink / raw)
  To: Xin Long
  Cc: netdev, davem, kuba, edumazet, pabeni, dsahern, yoshfuji,
	pshelar, jhs, xiyou.wangcong, jiri, pablo, fw, marcelo.leitner,
	i.maximets, aconole, roopa, razor, maheshb, paul, gnault

Hello:

This series was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Sat, 28 Jan 2023 10:58:29 -0500 you wrote:
> This is similar to the BIG TCP patchset added by Eric for IPv6:
> 
>   https://lwn.net/Articles/895398/
> 
> Unlike IPv6, IPv4 tot_len is only 16 bits long, and the IPv4 header
> doesn't have exthdrs(options) for the BIG TCP packets' length. To keep
> it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
> indicate this might be a BIG TCP packet and use skb->len as the real
> IPv4 total length.
> 
> [...]

Here is the summary with links:
  - [PATCHv4,net-next,01/10] net: add a couple of helpers for iph tot_len
    https://git.kernel.org/netdev/net-next/c/058a8f7f73aa
  - [PATCHv4,net-next,02/10] bridge: use skb_ip_totlen in br netfilter
    https://git.kernel.org/netdev/net-next/c/46abd17302ba
  - [PATCHv4,net-next,03/10] openvswitch: use skb_ip_totlen in conntrack
    https://git.kernel.org/netdev/net-next/c/ec84c955a0d0
  - [PATCHv4,net-next,04/10] net: sched: use skb_ip_totlen and iph_totlen
    https://git.kernel.org/netdev/net-next/c/043e397e48c5
  - [PATCHv4,net-next,05/10] netfilter: use skb_ip_totlen and iph_totlen
    https://git.kernel.org/netdev/net-next/c/a13fbf5ed5b4
  - [PATCHv4,net-next,06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
    https://git.kernel.org/netdev/net-next/c/7eb072be41ba
  - [PATCHv4,net-next,07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
    https://git.kernel.org/netdev/net-next/c/50e6fb5c6efb
  - [PATCHv4,net-next,08/10] packet: add TP_STATUS_GSO_TCP for tp_status
    https://git.kernel.org/netdev/net-next/c/8e08bb75b60f
  - [PATCHv4,net-next,09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
    https://git.kernel.org/netdev/net-next/c/9eefedd58ae1
  - [PATCHv4,net-next,10/10] net: add support for ipv4 big tcp
    https://git.kernel.org/netdev/net-next/c/b1a78b9b9886

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp: manual merge
  2023-01-28 15:58 ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp Xin Long
  2023-02-01 15:38   ` David Ahern
@ 2023-02-02  9:24   ` Matthieu Baerts
  1 sibling, 0 replies; 23+ messages in thread
From: Matthieu Baerts @ 2023-02-02  9:24 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern, Hideaki YOSHIFUJI

[-- Attachment #1: Type: text/plain, Size: 2472 bytes --]

Hello,

(I reduced the Cc list to the maintainers of the files modified by this
patch)

On 28/01/2023 16:58, Xin Long wrote:
> Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
> 
> Firstly, allow sk->sk_gso_max_size to be set to a value greater than
> GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
> for IPv4 TCP sockets.
> 
> Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
> in __ip_local_out() to allow sending BIG TCP packets, and this implies
> that skb->len is the length of an IPv4 packet; On RX path, use skb->len
> as the length of the IPv4 packet when the IP header tot_len is 0 and
> skb->len > IP_MAX_MTU in ip_rcv_core(). As the APIs iph_set_totlen() and
> skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
> need to update these APIs.
> 
> Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
> the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
> GRO complete, set IP header tot_len to 0 in iph_set_totlen() when the
> merged packet size is greater than IP_MAX_MTU so that it can be
> processed on RX path.
> 
> Note that checking skb_is_gso_tcp() in API iph_totlen() makes it safe
> for this implementation to use iph->tot_len == 0 to indicate IPv4 BIG
> TCP packets.

(...)

> diff --git a/net/core/gro.c b/net/core/gro.c
> index 506f83d715f8..b15f85546bdd 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -162,16 +162,18 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  	struct sk_buff *lp;
>  	int segs;
>  
> -	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
> -	gro_max_size = READ_ONCE(p->dev->gro_max_size);
> +	/* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
> +	gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
> +			READ_ONCE(p->dev->gro_max_size) :
> +				READ_ONCE(p->dev->gro_ipv4_max_size);
>  
FYI, we got a small conflict when merging -net into net-next in the MPTCP
tree due to another patch from -net:

  7d2c89b32587 ("skb: Do mix page pool and page referenced frags in GRO")

and this one applied in net-next:

  b1a78b9b9886 ("net: add support for ipv4 big tcp")

The conflict has been resolved on our side[1] by keeping the
modifications from both sides and the resolution we suggest is attached
to this email.

Cheers,
Matt

[1] https://github.com/multipath-tcp/mptcp_net-next/commit/56e08652439a
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

[-- Attachment #2: 56e08652439ad5b87d5dc668e8a40ab934b58a45.patch --]
[-- Type: text/x-patch, Size: 992 bytes --]

diff --cc net/core/gro.c
index 4bac7ea6e025,b15f85546bdd..bb28f4038ed4
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@@ -162,17 -162,10 +162,19 @@@ int skb_gro_receive(struct sk_buff *p, 
  	struct sk_buff *lp;
  	int segs;
  
 +	/* Do not splice page pool based packets w/ non-page pool
 +	 * packets. This can result in reference count issues as page
 +	 * pool pages will not decrement the reference count and will
 +	 * instead be immediately returned to the pool or have frag
 +	 * count decremented.
 +	 */
 +	if (p->pp_recycle != skb->pp_recycle)
 +		return -ETOOMANYREFS;
 +
- 	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
- 	gro_max_size = READ_ONCE(p->dev->gro_max_size);
+ 	/* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
+ 	gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
+ 			READ_ONCE(p->dev->gro_max_size) :
+ 				READ_ONCE(p->dev->gro_ipv4_max_size);
  
  	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
  		return -E2BIG;
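
The per-protocol limit selection in the resolved hunk above can be modelled outside the kernel as a minimal sketch. The names `struct gro_limits` and `gro_size_check()` are hypothetical, the ethertype is compared in host byte order rather than against htons(), and the NAPI_GRO_CB(skb)->flush condition is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Ethertype values for IPv4 and IPv6, in host byte order for this sketch. */
#define ETH_P_IP   0x0800
#define ETH_P_IPV6 0x86DD

/* Hypothetical stand-in for the two per-device GRO limits. */
struct gro_limits {
	unsigned int gro_max_size;      /* used for IPv6 packets */
	unsigned int gro_ipv4_max_size; /* used for everything else */
};

/* Returns 0 when the merge may proceed, -E2BIG when the merged length
 * would reach the limit that applies to this protocol.
 */
int gro_size_check(const struct gro_limits *dev, uint16_t ethertype,
		   unsigned int held_len, unsigned int new_len)
{
	unsigned int gro_max_size;

	assert(dev != NULL);
	gro_max_size = ethertype == ETH_P_IPV6 ? dev->gro_max_size
					       : dev->gro_ipv4_max_size;
	if (held_len + new_len >= gro_max_size)
		return -E2BIG;
	return 0;
}
```

With limits of 200000 for IPv6 and 65536 for IPv4, merging 6000 bytes into a held 60000-byte packet is rejected for IPv4 but accepted for IPv6, which is the independence between the two settings that the series aims for.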


Thread overview: 23+ messages
2023-01-28 15:58 [PATCHv4 net-next 00/10] net: support ipv4 big tcp Xin Long
2023-01-28 15:58 ` [PATCHv4 net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
2023-02-01 15:31   ` David Ahern
2023-01-28 15:58 ` [PATCHv4 net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
2023-01-31 15:01   ` Nikolay Aleksandrov
2023-01-28 15:58 ` [PATCHv4 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
2023-02-01 13:29   ` Aaron Conole
2023-01-28 15:58 ` [PATCHv4 net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
2023-01-28 15:58 ` [PATCHv4 net-next 05/10] netfilter: " Xin Long
2023-01-28 15:58 ` [PATCHv4 net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
2023-01-28 15:58 ` [PATCHv4 net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
2023-01-28 15:58 ` [PATCHv4 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status Xin Long
2023-02-01 15:32   ` David Ahern
2023-01-28 15:58 ` [PATCHv4 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
2023-01-31 14:59   ` Paolo Abeni
2023-01-31 17:55     ` Xin Long
2023-02-01 15:36   ` David Ahern
2023-01-28 15:58 ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp Xin Long
2023-02-01 15:38   ` David Ahern
2023-02-02  9:24   ` [PATCHv4 net-next 10/10] net: add support for ipv4 big tcp: manual merge Matthieu Baerts
2023-02-01  8:53 ` [PATCHv4 net-next 00/10] net: support ipv4 big tcp Eric Dumazet
2023-02-01 15:39 ` David Ahern
2023-02-02  5:10 ` patchwork-bot+netdevbpf
