* [PATCHv3 net-next 00/10] net: support ipv4 big tcp
From: Xin Long @ 2023-01-27 15:59 UTC
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This is similar to the BIG TCP patchset added by Eric for IPv6:

  https://lwn.net/Articles/895398/

Unlike IPv6, the IPv4 tot_len field is only 16 bits long, and the IPv4
header has no extension headers (options) that can carry the length of
BIG TCP packets. To keep it simple, as David and Paolo suggested, we set
IPv4 tot_len to 0 to indicate that this might be a BIG TCP packet and
use skb->len as the real IPv4 total length.

This works safely because all BIG TCP packets are GSO/GRO packets and
are processed on the same host where they were created. There is no
padding in GSO/GRO packets, so skb->len - network_offset is exactly the
IPv4 packet total length. Also, before the feature is implemented, all
the places that may read iph tot_len from BIG TCP packets are converted
to some new APIs:

Patch 1 adds APIs for setting and getting iph tot_len, which Patches 2-7
use in all the places that IPv4 BIG TCP packets may reach. Patch 8 adds
a GSO_TCP tp_status for af_packet users, Patch 9 adds new netlink
attributes so that IPv4 BIG TCP can be configured independently from
IPv6 BIG TCP, and Patch 10 implements the feature.

Note that changes similar to those in Patches 2-6 are also needed for
IPv6 BIG TCP packets; they will be addressed in another patchset.

A similar performance test was run for IPv4 BIG TCP with a 25Gbit NIC
and a 1.5K MTU:

No BIG TCP:
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
168          322          337          3776.49
143          236          277          4654.67
128          258          288          4772.83
171          229          278          4645.77
175          228          243          4678.93
149          239          279          4599.86
164          234          268          4606.94
155          276          289          4235.82
180          255          268          4418.95
168          241          249          4417.82

Enable BIG TCP:
ip link set dev ens1f0np0 gro_ipv4_max_size 128000 gso_ipv4_max_size 128000
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
161          241          252          4821.73
174          205          217          5098.28
167          208          220          5001.43
164          228          249          4883.98
150          233          249          4914.90
180          233          244          4819.66
154          208          219          5004.92
157          209          247          4999.78
160          218          246          4842.31
174          206          217          5080.99
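
To make the two runs easier to compare, here is a minimal sketch that
averages the THROUGHPUT column of each run and reports the relative gain.
The numbers are copied from the output above; the function and array names
are only illustrative. With these samples the means come out near 4481
and 4947, a gain of roughly 10%.

```c
#include <stddef.h>

/* THROUGHPUT column copied from the two netperf runs above. */
static const double tput_legacy[] = {
	3776.49, 4654.67, 4772.83, 4645.77, 4678.93,
	4599.86, 4606.94, 4235.82, 4418.95, 4417.82,
};
static const double tput_big_tcp[] = {
	4821.73, 5098.28, 5001.43, 4883.98, 4914.90,
	4819.66, 5004.92, 4999.78, 4842.31, 5080.99,
};

/* Arithmetic mean of n samples; returns 0 for an empty set. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / n : 0.0;
}

/* Relative gain of the BIG TCP run over the baseline, in percent. */
static double big_tcp_gain_percent(void)
{
	double base = mean(tput_legacy,
			   sizeof(tput_legacy) / sizeof(tput_legacy[0]));
	double big = mean(tput_big_tcp,
			  sizeof(tput_big_tcp) / sizeof(tput_big_tcp[0]));

	return (big - base) / base * 100.0;
}
```

This only summarizes the ten samples per run shown here; it says nothing
about variance or about other workloads.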

Thanks for the feedback from Eric and David Ahern.

v1->v2:
  - remove the fixes and the selftest for IPv6 BIG TCP, will do it in
    another patchset.
  - add GSO_TCP for tp_status in packet sockets to tell the af_packet
    users that this is a TCP GSO packet in Patch 8.
  - also check skb_is_gso() when checking if it's a GSO TCP packet in
    Patch 1.
v2->v3:
  - add per-device gso/gro_ipv4_max_size and netlink attributes for them
    in Patch 9, so that BIG TCP can be enabled selectively for IPv6 and
    not for IPv4, as Eric requested.
  - remove the selftest, as it requires a userspace iproute2 change now
    that IPv4 BIG TCP is configured independently from IPv6 BIG TCP.

Xin Long (10):
  net: add a couple of helpers for iph tot_len
  bridge: use skb_ip_totlen in br netfilter
  openvswitch: use skb_ip_totlen in conntrack
  net: sched: use skb_ip_totlen and iph_totlen
  netfilter: use skb_ip_totlen and iph_totlen
  cipso_ipv4: use iph_set_totlen in skbuff_setattr
  ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
  packet: add TP_STATUS_GSO_TCP for tp_status
  net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  net: add support for ipv4 big tcp

 drivers/net/ipvlan/ipvlan_core.c           |  2 +-
 include/linux/ip.h                         | 21 ++++++++++++++
 include/linux/netdevice.h                  |  6 ++++
 include/net/netfilter/nf_tables_ipv4.h     |  4 +--
 include/net/route.h                        |  3 --
 include/uapi/linux/if_link.h               |  3 ++
 include/uapi/linux/if_packet.h             |  1 +
 net/bridge/br_netfilter_hooks.c            |  2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |  4 +--
 net/core/dev.c                             |  4 +++
 net/core/dev.h                             | 18 ++++++++++++
 net/core/gro.c                             | 12 ++++----
 net/core/rtnetlink.c                       | 33 ++++++++++++++++++++++
 net/core/sock.c                            |  8 ++++--
 net/ipv4/af_inet.c                         |  7 +++--
 net/ipv4/cipso_ipv4.c                      |  2 +-
 net/ipv4/ip_input.c                        |  2 +-
 net/ipv4/ip_output.c                       |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c            |  2 +-
 net/netfilter/nf_log_syslog.c              |  2 +-
 net/netfilter/xt_length.c                  |  2 +-
 net/openvswitch/conntrack.c                |  2 +-
 net/packet/af_packet.c                     |  4 +++
 net/sched/act_ct.c                         |  2 +-
 net/sched/sch_cake.c                       |  2 +-
 25 files changed, 122 insertions(+), 28 deletions(-)

-- 
2.31.1



* [PATCHv3 net-next 01/10] net: add a couple of helpers for iph tot_len
From: Xin Long @ 2023-01-27 15:59 UTC

This patch adds three APIs to replace the setting and getting of
iph->tot_len in all the places that IPv4 BIG TCP packets may reach.
They will be used in the following patches.

Note that iph_totlen() is meant for the cases where iph is not in the
linear data of the skb.
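
As a rough illustration of the encoding, the following userspace sketch
models the set/get pair. struct model_iphdr, struct model_skb and the
model_* functions are hypothetical stand-ins, not kernel types; the
is_gso_tcp flag stands for the skb_is_gso() && skb_is_gso_tcp() check.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <stdbool.h>
#include <stdint.h>

#define IP_MAX_MTU 0xFFFFU

/* Hypothetical stand-ins for struct iphdr and struct sk_buff. */
struct model_iphdr {
	uint16_t tot_len;	/* network byte order, as on the wire */
};

struct model_skb {
	unsigned int len;		/* bytes from the start of skb data */
	unsigned int network_offset;	/* offset of the IPv4 header */
	bool is_gso_tcp;		/* skb_is_gso() && skb_is_gso_tcp() */
};

/* Mirrors iph_set_totlen(): lengths above 65535 are stored as 0. */
static void model_iph_set_totlen(struct model_iphdr *iph, unsigned int len)
{
	iph->tot_len = len <= IP_MAX_MTU ? htons(len) : 0;
}

/* Mirrors iph_totlen(): trust tot_len unless it is 0 on a TCP GSO skb. */
static unsigned int model_iph_totlen(const struct model_skb *skb,
				     const struct model_iphdr *iph)
{
	unsigned int len = ntohs(iph->tot_len);

	return (len || !skb->is_gso_tcp) ? len :
	       skb->len - skb->network_offset;
}
```

With a 70014-byte TCP GSO skb whose network header starts at offset 14,
setting a length of 70000 stores 0 in tot_len and the getter recovers
70000 from the skb. A non-GSO skb with tot_len 0 still reads back 0.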

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/ip.h  | 21 +++++++++++++++++++++
 include/net/route.h |  3 ---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 3d9c6750af62..d11c25f5030a 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -35,4 +35,25 @@ static inline unsigned int ip_transport_len(const struct sk_buff *skb)
 {
 	return ntohs(ip_hdr(skb)->tot_len) - skb_network_header_len(skb);
 }
+
+static inline unsigned int iph_totlen(const struct sk_buff *skb, const struct iphdr *iph)
+{
+	u32 len = ntohs(iph->tot_len);
+
+	return (len || !skb_is_gso(skb) || !skb_is_gso_tcp(skb)) ?
+	       len : skb->len - skb_network_offset(skb);
+}
+
+static inline unsigned int skb_ip_totlen(const struct sk_buff *skb)
+{
+	return iph_totlen(skb, ip_hdr(skb));
+}
+
+/* IPv4 datagram length is stored into 16bit field (tot_len) */
+#define IP_MAX_MTU	0xFFFFU
+
+static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
+{
+	iph->tot_len = len <= IP_MAX_MTU ? htons(len) : 0;
+}
 #endif	/* _LINUX_IP_H */
diff --git a/include/net/route.h b/include/net/route.h
index 6e92dd5bcd61..fe00b0a2e475 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -35,9 +35,6 @@
 #include <linux/cache.h>
 #include <linux/security.h>
 
-/* IPv4 datagram length is stored into 16bit field (tot_len) */
-#define IP_MAX_MTU	0xFFFFU
-
 #define RTO_ONLINK	0x01
 
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
-- 
2.31.1



* [PATCHv3 net-next 02/10] bridge: use skb_ip_totlen in br netfilter
From: Xin Long @ 2023-01-27 15:59 UTC

These 3 places in bridge netfilter are called on the RX path after GRO,
where IPv4 TCP GSO packets may come through, so replace the iph tot_len
accesses there with skb_ip_totlen().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/bridge/br_netfilter_hooks.c            | 2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index f20f4373ff40..b67c9c98effa 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -214,7 +214,7 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
 		goto csum_error;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < len) {
 		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index 5c5dd437f1c2..71056ee84773 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -212,7 +212,7 @@ static int nf_ct_br_ip_check(const struct sk_buff *skb)
 	    iph->version != 4)
 		return -1;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < nhoff + len ||
 	    len < (iph->ihl * 4))
                 return -1;
@@ -256,7 +256,7 @@ static unsigned int nf_ct_bridge_pre(void *priv, struct sk_buff *skb,
 		if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 			return NF_ACCEPT;
 
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		if (pskb_trim_rcsum(skb, len))
 			return NF_ACCEPT;
 
-- 
2.31.1



* [PATCHv3 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack
From: Xin Long @ 2023-01-27 15:59 UTC

IPv4 GSO packets may get processed in ovs_skb_network_trim(), so use
skb_ip_totlen() there to get the iph tot_len.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/openvswitch/conntrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c8b137649ca4..2172930b1f17 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1103,7 +1103,7 @@ static int ovs_skb_network_trim(struct sk_buff *skb)
 
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		break;
 	case htons(ETH_P_IPV6):
 		len = sizeof(struct ipv6hdr)
-- 
2.31.1



* [PATCHv3 net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen
From: Xin Long @ 2023-01-27 15:59 UTC

There are one action and one qdisc that may process IPv4 TCP GSO packets
and access iph->tot_len. Replace those accesses with skb_ip_totlen() and
iph_totlen() accordingly.

Note that the one in tcf_csum_ipv4() doesn't need to be replaced, as it
returns for TCP GSO packets in tcf_csum_ipv4_tcp().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sched/act_ct.c   | 2 +-
 net/sched/sch_cake.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 0ca2bb8ed026..d68bb5dbf0dc 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -707,7 +707,7 @@ static int tcf_ct_skb_network_trim(struct sk_buff *skb, int family)
 
 	switch (family) {
 	case NFPROTO_IPV4:
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		break;
 	case NFPROTO_IPV6:
 		len = sizeof(struct ipv6hdr)
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 3ed0c3342189..7970217b565a 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1209,7 +1209,7 @@ static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
 			    iph_check->daddr != iph->daddr)
 				continue;
 
-			seglen = ntohs(iph_check->tot_len) -
+			seglen = iph_totlen(skb, iph_check) -
 				       (4 * iph_check->ihl);
 		} else if (iph_check->version == 6) {
 			ipv6h = (struct ipv6hdr *)iph;
-- 
2.31.1



* [PATCHv3 net-next 05/10] netfilter: use skb_ip_totlen and iph_totlen
From: Xin Long @ 2023-01-27 15:59 UTC

There are also quite a few places in netfilter that may process IPv4 TCP
GSO packets, and they need to be converted too.

In length_mt(), a u_int32_t/int is needed to hold the skb_ip_totlen()
return value, otherwise it may overflow and mismatch. This change will
also help us add a selftest for IPv4 BIG TCP in the following patch.

Note that the one in tcpmss_tg4() doesn't need to be replaced, as
tcpmss_mangle_packet() returns if there is data after the tcphdr. The
same applies to mangle_contents() in nf_nat_helper.c: enlarge_skb()
returns false when skb->len + extra > 65535.
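
To show why the 16-bit variable in length_mt() can mismatch, here is a
small userspace sketch. struct length_range and both match functions are
hypothetical and only modeled on the min/max compare; they are not the
xt_length code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical min/max pair, modeled on the range used by length_mt(). */
struct length_range {
	uint16_t min;
	uint16_t max;
};

/* Old behaviour: the length is squeezed into 16 bits before the compare. */
static bool match_u16(unsigned int totlen, const struct length_range *r)
{
	uint16_t pktlen = (uint16_t)totlen;	/* 70000 becomes 4464 */

	return pktlen >= r->min && pktlen <= r->max;
}

/* New behaviour: a 32-bit variable keeps the full length. */
static bool match_u32(unsigned int totlen, const struct length_range *r)
{
	uint32_t pktlen = totlen;

	return pktlen >= r->min && pktlen <= r->max;
}
```

For a rule matching lengths 4000 to 5000, a 70000-byte BIG TCP packet
wrongly matches in the 16-bit variant (70000 wraps to 4464) and correctly
fails to match in the 32-bit variant.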

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/netfilter/nf_tables_ipv4.h | 4 ++--
 net/netfilter/ipvs/ip_vs_xmit.c        | 2 +-
 net/netfilter/nf_log_syslog.c          | 2 +-
 net/netfilter/xt_length.c              | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_tables_ipv4.h b/include/net/netfilter/nf_tables_ipv4.h
index 112708f7a6b4..947973623dc7 100644
--- a/include/net/netfilter/nf_tables_ipv4.h
+++ b/include/net/netfilter/nf_tables_ipv4.h
@@ -29,7 +29,7 @@ static inline int __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt)
 	if (iph->ihl < 5 || iph->version != 4)
 		return -1;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(pkt->skb, iph);
 	thoff = iph->ihl * 4;
 	if (pkt->skb->len < len)
 		return -1;
@@ -64,7 +64,7 @@ static inline int nft_set_pktinfo_ipv4_ingress(struct nft_pktinfo *pkt)
 	if (iph->ihl < 5 || iph->version != 4)
 		goto inhdr_error;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(pkt->skb, iph);
 	thoff = iph->ihl * 4;
 	if (pkt->skb->len < len) {
 		__IP_INC_STATS(nft_net(pkt), IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 029171379884..80448885c3d7 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -994,7 +994,7 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
 		old_dsfield = ipv4_get_dsfield(old_iph);
 		*ttl = old_iph->ttl;
 		if (payload_len)
-			*payload_len = ntohs(old_iph->tot_len);
+			*payload_len = skb_ip_totlen(skb);
 	}
 
 	/* Implement full-functionality option for ECN encapsulation */
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index cb894f0d63e9..c66689ad2b49 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -322,7 +322,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 
 	/* Max length: 46 "LEN=65535 TOS=0xFF PREC=0xFF TTL=255 ID=65535 " */
 	nf_log_buf_add(m, "LEN=%u TOS=0x%02X PREC=0x%02X TTL=%u ID=%u ",
-		       ntohs(ih->tot_len), ih->tos & IPTOS_TOS_MASK,
+		       iph_totlen(skb, ih), ih->tos & IPTOS_TOS_MASK,
 		       ih->tos & IPTOS_PREC_MASK, ih->ttl, ntohs(ih->id));
 
 	/* Max length: 6 "CE DF MF " */
diff --git a/net/netfilter/xt_length.c b/net/netfilter/xt_length.c
index 1873da3a945a..b3d623a52885 100644
--- a/net/netfilter/xt_length.c
+++ b/net/netfilter/xt_length.c
@@ -21,7 +21,7 @@ static bool
 length_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_length_info *info = par->matchinfo;
-	u_int16_t pktlen = ntohs(ip_hdr(skb)->tot_len);
+	u32 pktlen = skb_ip_totlen(skb);
 
 	return (pktlen >= info->min && pktlen <= info->max) ^ info->invert;
 }
-- 
2.31.1



* [PATCHv3 net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
From: Xin Long @ 2023-01-27 15:59 UTC

cipso_v4_skbuff_setattr() may process IPv4 TCP GSO packets, so the
iph->tot_len update there should use iph_set_totlen().

Note that for non-GSO packets, the new iph tot_len with the extra iph
option length added may become greater than 65535. The old code
truncates it to 16 bits and stores the result in iph->tot_len, which
is a bug. In theory, iph options shouldn't be added to such big packets
here, and a fix may be needed in the future. For now, this patch only
sets iph->tot_len to 0 when that happens.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/ipv4/cipso_ipv4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 6cd3b6c559f0..79ae7204e8ed 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -2222,7 +2222,7 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
 		memset((char *)(iph + 1) + buf_len, 0, opt_len - buf_len);
 	if (len_delta != 0) {
 		iph->ihl = 5 + (opt_len >> 2);
-		iph->tot_len = htons(skb->len);
+		iph_set_totlen(iph, skb->len);
 	}
 	ip_send_check(iph);
 
-- 
2.31.1



* [PATCHv3 net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
From: Xin Long @ 2023-01-27 15:59 UTC

ipvlan devices call netif_inherit_tso_max() to get the tso_max_size/segs
from the lower device, so when the lower device supports BIG TCP, the
ipvlan devices support it too. Its iph tot_len access should therefore
be converted as well.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/ipvlan/ipvlan_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index bb1c298c1e78..460b3d4f2245 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -157,7 +157,7 @@ void *ipvlan_get_L3_hdr(struct ipvl_port *port, struct sk_buff *skb, int *type)
 			return NULL;
 
 		ip4h = ip_hdr(skb);
-		pktlen = ntohs(ip4h->tot_len);
+		pktlen = skb_ip_totlen(skb);
 		if (ip4h->ihl < 5 || ip4h->version != 4)
 			return NULL;
 		if (skb->len < pktlen || pktlen < (ip4h->ihl * 4))
-- 
2.31.1



* [PATCHv3 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status
From: Xin Long @ 2023-01-27 15:59 UTC

Introduce the TP_STATUS_GSO_TCP tp_status flag to tell af_packet users
that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets,
tcpdump/libpcap can use tp_len as the IPv4 packet length when this flag
is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets.
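
A capture tool could apply that rule roughly as sketched below.
capture_ipv4_len() and its l2_len parameter are hypothetical; the sketch
assumes the caller knows how many link-layer header bytes are counted in
tp_len (pass 0 if none), which depends on the socket type and is not
defined by this patch.

```c
#include <arpa/inet.h>	/* ntohs() */
#include <stdint.h>

#ifndef TP_STATUS_GSO_TCP
#define TP_STATUS_GSO_TCP (1 << 8)	/* value added by this patch */
#endif

/*
 * Hypothetical helper for a capture tool. l2_len is the link-layer header
 * length that the caller assumes is counted in tp_len (for example 14 for
 * untagged Ethernet).
 */
static uint32_t capture_ipv4_len(uint32_t tp_status, uint32_t tp_len,
				 uint32_t l2_len, uint16_t tot_len_be)
{
	uint32_t len = ntohs(tot_len_be);

	/* tot_len == 0 on a TCP GSO packet: fall back to the ring length. */
	if (len == 0 && (tp_status & TP_STATUS_GSO_TCP) && tp_len >= l2_len)
		return tp_len - l2_len;
	return len;
}
```

Packets with a non-zero tot_len keep using the header value, so the
fallback only changes the result for packets that carry tot_len 0
together with the new flag.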

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/if_packet.h | 1 +
 net/packet/af_packet.c         | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index a8516b3594a4..78c981d6a9d4 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -115,6 +115,7 @@ struct tpacket_auxdata {
 #define TP_STATUS_BLK_TMO		(1 << 5)
 #define TP_STATUS_VLAN_TPID_VALID	(1 << 6) /* auxdata has valid tp_vlan_tpid */
 #define TP_STATUS_CSUM_VALID		(1 << 7)
+#define TP_STATUS_GSO_TCP		(1 << 8)
 
 /* Tx ring - header status */
 #define TP_STATUS_AVAILABLE	      0
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b5ab98ca2511..8ffb19c643ab 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2296,6 +2296,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	else if (skb->pkt_type != PACKET_OUTGOING &&
 		 skb_csum_unnecessary(skb))
 		status |= TP_STATUS_CSUM_VALID;
+	if (skb_is_gso(skb) && skb_is_gso_tcp(skb))
+		status |= TP_STATUS_GSO_TCP;
 
 	if (snaplen > res)
 		snaplen = res;
@@ -3522,6 +3524,8 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 		else if (skb->pkt_type != PACKET_OUTGOING &&
 			 skb_csum_unnecessary(skb))
 			aux.tp_status |= TP_STATUS_CSUM_VALID;
+		if (skb_is_gso(skb) && skb_is_gso_tcp(skb))
+			aux.tp_status |= TP_STATUS_GSO_TCP;
 
 		aux.tp_len = origlen;
 		aux.tp_snaplen = skb->len;
-- 
2.31.1



* [PATCHv3 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
From: Xin Long @ 2023-01-27 15:59 UTC

This patch introduces per-device gso_ipv4_max_size and gro_ipv4_max_size
and adds netlink attributes for them, so that IPv4 BIG TCP can be
guarded by a separate tunable in the next patch.

To avoid breaking old applications that use "gso/gro_max_size" for
IPv4 GSO packets, this patch also updates "gso/gro_ipv4_max_size" in
netif_set_gso/gro_max_size() if the new size isn't greater than
GSO_LEGACY_MAX_SIZE, so that nothing changes even if userspace isn't
aware of the new netlink attributes.
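
The compatibility rule can be sketched in userspace as follows. struct
model_dev and the model_* functions are hypothetical simplifications of
struct net_device and the setters in the diff below, without the
WRITE_ONCE() annotations; the GSO_LEGACY_MAX_SIZE value of 65536 is an
assumption about the kernel's definition.

```c
#define GSO_LEGACY_MAX_SIZE 65536u	/* assumed to match the kernel's value */

/* Hypothetical subset of struct net_device. */
struct model_dev {
	unsigned int tso_max_size;
	unsigned int gso_max_size;
	unsigned int gso_ipv4_max_size;
};

/* Mirrors netif_set_gso_max_size() after this patch. */
static void model_set_gso_max_size(struct model_dev *dev, unsigned int size)
{
	dev->gso_max_size = size;
	/* Legacy-sized values keep driving the IPv4 limit as before. */
	if (size <= GSO_LEGACY_MAX_SIZE)
		dev->gso_ipv4_max_size = size;
}

/* Mirrors the IFLA_GSO_IPV4_MAX_SIZE branch in do_setlink(). */
static int model_set_gso_ipv4_max_size(struct model_dev *dev,
				       unsigned int size)
{
	if (size > dev->tso_max_size)
		return -1;	/* the kernel returns -EINVAL here */
	dev->gso_ipv4_max_size = size;
	return 0;
}
```

In this model, setting gso_max_size to 32768 also lowers the IPv4 limit
to 32768, while setting it to 128000 leaves the IPv4 limit untouched.
Raising the IPv4 limit above the legacy size requires the new attribute.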

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/netdevice.h    |  6 ++++++
 include/uapi/linux/if_link.h |  3 +++
 net/core/dev.c               |  4 ++++
 net/core/dev.h               | 18 ++++++++++++++++++
 net/core/rtnetlink.c         | 33 +++++++++++++++++++++++++++++++++
 5 files changed, 64 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 63b77cbc947e..ce075241ec47 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2010,6 +2010,9 @@ enum netdev_ml_priv_type {
  *			SET_NETDEV_DEVLINK_PORT macro. This pointer is static
  *			during the time netdevice is registered.
  *
+ * 	@gso_ipv4_max_size:	Maximum size of IPv4 GSO packets.
+ * 	@gro_ipv4_max_size:	Maximum size of IPv4 GRO packets.
+ *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
  */
@@ -2362,6 +2365,9 @@ struct net_device {
 	struct rtnl_hw_stats64	*offload_xstats_l3;
 
 	struct devlink_port	*devlink_port;
+
+	unsigned int		gso_ipv4_max_size;
+	unsigned int		gro_ipv4_max_size;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1021a7e47a86..02b87e4c65be 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -374,6 +374,9 @@ enum {
 
 	IFLA_DEVLINK_PORT,
 
+	IFLA_GSO_IPV4_MAX_SIZE,
+	IFLA_GRO_IPV4_MAX_SIZE,
+
 	__IFLA_MAX
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 9c60190fe352..45e955eadca4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3001,6 +3001,8 @@ void netif_set_tso_max_size(struct net_device *dev, unsigned int size)
 	dev->tso_max_size = min(GSO_MAX_SIZE, size);
 	if (size < READ_ONCE(dev->gso_max_size))
 		netif_set_gso_max_size(dev, size);
+	if (size < READ_ONCE(dev->gso_ipv4_max_size))
+		netif_set_gso_ipv4_max_size(dev, size);
 }
 EXPORT_SYMBOL(netif_set_tso_max_size);
 
@@ -10610,6 +10612,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->gso_max_size = GSO_LEGACY_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
 	dev->gro_max_size = GRO_LEGACY_MAX_SIZE;
+	dev->gso_ipv4_max_size = GSO_LEGACY_MAX_SIZE;
+	dev->gro_ipv4_max_size = GRO_LEGACY_MAX_SIZE;
 	dev->tso_max_size = TSO_LEGACY_MAX_SIZE;
 	dev->tso_max_segs = TSO_MAX_SEGS;
 	dev->upper_level = 1;
diff --git a/net/core/dev.h b/net/core/dev.h
index 814ed5b7b960..a065b7571441 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -100,6 +100,8 @@ static inline void netif_set_gso_max_size(struct net_device *dev,
 {
 	/* dev->gso_max_size is read locklessly from sk_setup_caps() */
 	WRITE_ONCE(dev->gso_max_size, size);
+	if (size <= GSO_LEGACY_MAX_SIZE)
+		WRITE_ONCE(dev->gso_ipv4_max_size, size);
 }
 
 static inline void netif_set_gso_max_segs(struct net_device *dev,
@@ -114,6 +116,22 @@ static inline void netif_set_gro_max_size(struct net_device *dev,
 {
 	/* This pairs with the READ_ONCE() in skb_gro_receive() */
 	WRITE_ONCE(dev->gro_max_size, size);
+	if (size <= GRO_LEGACY_MAX_SIZE)
+		WRITE_ONCE(dev->gro_ipv4_max_size, size);
+}
+
+static inline void netif_set_gso_ipv4_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	/* dev->gso_ipv4_max_size is read locklessly from sk_setup_caps() */
+	WRITE_ONCE(dev->gso_ipv4_max_size, size);
+}
+
+static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
+					       unsigned int size)
+{
+	/* This pairs with the READ_ONCE() in skb_gro_receive() */
+	WRITE_ONCE(dev->gro_ipv4_max_size, size);
 }
 
 #endif
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 64289bc98887..b9f584955b77 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1074,6 +1074,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SEGS */
 	       + nla_total_size(4) /* IFLA_GSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_GRO_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GSO_IPV4_MAX_SIZE */
+	       + nla_total_size(4) /* IFLA_GRO_IPV4_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SIZE */
 	       + nla_total_size(4) /* IFLA_TSO_MAX_SEGS */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
@@ -1807,6 +1809,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	    nla_put_u32(skb, IFLA_GSO_MAX_SEGS, dev->gso_max_segs) ||
 	    nla_put_u32(skb, IFLA_GSO_MAX_SIZE, dev->gso_max_size) ||
 	    nla_put_u32(skb, IFLA_GRO_MAX_SIZE, dev->gro_max_size) ||
+	    nla_put_u32(skb, IFLA_GSO_IPV4_MAX_SIZE, dev->gso_ipv4_max_size) ||
+	    nla_put_u32(skb, IFLA_GRO_IPV4_MAX_SIZE, dev->gro_ipv4_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SIZE, dev->tso_max_size) ||
 	    nla_put_u32(skb, IFLA_TSO_MAX_SEGS, dev->tso_max_segs) ||
 #ifdef CONFIG_RPS
@@ -1968,6 +1972,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_TSO_MAX_SIZE]	= { .type = NLA_REJECT },
 	[IFLA_TSO_MAX_SEGS]	= { .type = NLA_REJECT },
 	[IFLA_ALLMULTI]		= { .type = NLA_REJECT },
+	[IFLA_GSO_IPV4_MAX_SIZE]	= { .type = NLA_U32 },
+	[IFLA_GRO_IPV4_MAX_SIZE]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2883,6 +2889,29 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_GSO_IPV4_MAX_SIZE]) {
+		u32 max_size = nla_get_u32(tb[IFLA_GSO_IPV4_MAX_SIZE]);
+
+		if (max_size > dev->tso_max_size) {
+			err = -EINVAL;
+			goto errout;
+		}
+
+		if (dev->gso_ipv4_max_size ^ max_size) {
+			netif_set_gso_ipv4_max_size(dev, max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
+	if (tb[IFLA_GRO_IPV4_MAX_SIZE]) {
+		u32 gro_max_size = nla_get_u32(tb[IFLA_GRO_IPV4_MAX_SIZE]);
+
+		if (dev->gro_ipv4_max_size ^ gro_max_size) {
+			netif_set_gro_ipv4_max_size(dev, gro_max_size);
+			status |= DO_SETLINK_MODIFIED;
+		}
+	}
+
 	if (tb[IFLA_OPERSTATE])
 		set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
 
@@ -3325,6 +3354,10 @@ struct net_device *rtnl_create_link(struct net *net, const char *ifname,
 		netif_set_gso_max_segs(dev, nla_get_u32(tb[IFLA_GSO_MAX_SEGS]));
 	if (tb[IFLA_GRO_MAX_SIZE])
 		netif_set_gro_max_size(dev, nla_get_u32(tb[IFLA_GRO_MAX_SIZE]));
+	if (tb[IFLA_GSO_IPV4_MAX_SIZE])
+		netif_set_gso_ipv4_max_size(dev, nla_get_u32(tb[IFLA_GSO_IPV4_MAX_SIZE]));
+	if (tb[IFLA_GRO_IPV4_MAX_SIZE])
+		netif_set_gro_ipv4_max_size(dev, nla_get_u32(tb[IFLA_GRO_IPV4_MAX_SIZE]));
 
 	return dev;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCHv3 net-next 10/10] net: add support for ipv4 big tcp
  2023-01-27 15:59 [PATCHv3 net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (8 preceding siblings ...)
  2023-01-27 15:59 ` [PATCHv3 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
@ 2023-01-27 15:59 ` Xin Long
  2023-01-27 17:41   ` Eric Dumazet
  9 siblings, 1 reply; 15+ messages in thread
From: Xin Long @ 2023-01-27 15:59 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

Similar to Eric's IPv6 BIG TCP, this patch enables IPv4 BIG TCP.

Firstly, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
for IPv4 TCP sockets.

Then on the TX path, set the IP header tot_len to 0 when skb->len >
IP_MAX_MTU in __ip_local_out() so that BIG TCP packets can be sent; this
implies that skb->len is the length of the IPv4 packet. On the RX path,
use skb->len as the length of the IPv4 packet when the IP header tot_len
is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As iph_set_totlen() and
skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), only
these APIs need to be updated.

Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
the merged packet size to be >= GRO_LEGACY_MAX_SIZE in skb_gro_receive().
In GRO complete, set the IP header tot_len to 0 in iph_set_totlen() when
the merged packet size is greater than IP_MAX_MTU, so that it can be
processed on the RX path.

Note that checking skb_is_gso_tcp() in iph_totlen() makes it safe to
treat iph->tot_len == 0 as the indication of an IPv4 BIG TCP packet.
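
As a rough illustration of the tot_len convention described above, here
is a small user-space model. It is only a sketch: struct model_pkt, the
model_* helpers and MODEL_IP_MAX_MTU are hypothetical stand-ins, not the
kernel's sk_buff, iphdr or the helpers added in patch 1, and l3_len
plays the role of "skb->len minus the network offset".

```c
#include <arpa/inet.h>   /* htons(), ntohs() */
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IP_MAX_MTU 0xFFFFU  /* largest value a 16-bit tot_len can hold */

/* Hypothetical stand-in for the few packet fields that matter here. */
struct model_pkt {
	uint16_t tot_len_be;  /* IPv4 tot_len as stored in the header (network order) */
	uint32_t l3_len;      /* bytes from the IPv4 header to the end of the packet */
	bool     is_gso_tcp;  /* true for TCP GSO/GRO packets */
};

/* Writer side: store the length, or 0 when it does not fit in 16 bits. */
static inline void model_set_totlen(struct model_pkt *p, uint32_t len)
{
	p->l3_len = len;
	p->tot_len_be = len <= MODEL_IP_MAX_MTU ? htons((uint16_t)len) : 0;
}

/* Reader side: trust tot_len unless it is 0 on a TCP GSO packet. */
static inline uint32_t model_totlen(const struct model_pkt *p)
{
	uint32_t len = ntohs(p->tot_len_be);

	if (len == 0 && p->is_gso_tcp)
		return p->l3_len;
	return len;
}
```

In this model a 1500-byte packet round-trips through tot_len as usual,
while a 200000-byte TCP GSO packet is stored with tot_len == 0 and its
length is recovered from l3_len. A packet with tot_len == 0 that is not
TCP GSO still reads back as 0, which is why the GSO check matters.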

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/core/gro.c       | 12 +++++++-----
 net/core/sock.c      |  8 ++++++--
 net/ipv4/af_inet.c   |  7 ++++---
 net/ipv4/ip_input.c  |  2 +-
 net/ipv4/ip_output.c |  2 +-
 5 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/net/core/gro.c b/net/core/gro.c
index 506f83d715f8..b15f85546bdd 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -162,16 +162,18 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	struct sk_buff *lp;
 	int segs;
 
-	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
-	gro_max_size = READ_ONCE(p->dev->gro_max_size);
+	/* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
+	gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
+			READ_ONCE(p->dev->gro_max_size) :
+				READ_ONCE(p->dev->gro_ipv4_max_size);
 
 	if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
 		return -E2BIG;
 
 	if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
-		if (p->protocol != htons(ETH_P_IPV6) ||
-		    skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
-		    ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
+		if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
+		    (p->protocol == htons(ETH_P_IPV6) &&
+		     skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
 		    p->encapsulation)
 			return -E2BIG;
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 7ba4891460ad..c98f9a4eeff9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2383,6 +2383,8 @@ static void sk_trim_gso_size(struct sock *sk)
 	    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
 		return;
 #endif
+	if (sk->sk_family == AF_INET && sk_is_tcp(sk))
+		return;
 	sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
 }
 
@@ -2403,8 +2405,10 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
 		} else {
 			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
-			/* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
-			sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
+			/* pairs with the WRITE_ONCE() in netif_set_gso(_ipv4)_max_size() */
+			sk->sk_gso_max_size = sk->sk_family == AF_INET6 ?
+					READ_ONCE(dst->dev->gso_max_size) :
+						READ_ONCE(dst->dev->gso_ipv4_max_size);
 			sk_trim_gso_size(sk);
 			sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
 			/* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6c0ec2789943..2f992a323b95 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1485,6 +1485,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 	if (unlikely(ip_fast_csum((u8 *)iph, 5)))
 		goto out;
 
+	NAPI_GRO_CB(skb)->proto = proto;
 	id = ntohl(*(__be32 *)&iph->id);
 	flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
 	id >>= 16;
@@ -1618,9 +1619,9 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
-	__be16 newlen = htons(skb->len - nhoff);
 	struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
 	const struct net_offload *ops;
+	__be16 totlen = iph->tot_len;
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
@@ -1629,8 +1630,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 		skb_set_inner_network_header(skb, nhoff);
 	}
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
-	iph->tot_len = newlen;
+	iph_set_totlen(iph, skb->len - nhoff);
+	csum_replace2(&iph->check, totlen, iph->tot_len);
 
 	ops = rcu_dereference(inet_offloads[proto]);
 	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index e880ce77322a..0aa8c49b4e1b 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -511,7 +511,7 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
 		goto csum_error;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < len) {
 		drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
 		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 922c87ef1ab5..4e4e308c3230 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -100,7 +100,7 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct iphdr *iph = ip_hdr(skb);
 
-	iph->tot_len = htons(skb->len);
+	iph_set_totlen(iph, skb->len);
 	ip_send_check(iph);
 
 	/* if egress device is enslaved to an L3 master device pass the
-- 
2.31.1



* Re: [PATCHv3 net-next 10/10] net: add support for ipv4 big tcp
  2023-01-27 15:59 ` [PATCHv3 net-next 10/10] net: add support for ipv4 big tcp Xin Long
@ 2023-01-27 17:41   ` Eric Dumazet
  2023-01-27 18:37     ` Xin Long
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2023-01-27 17:41 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Fri, Jan 27, 2023 at 5:00 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
>
> Firstly, allow sk->sk_gso_max_size to be set to a value greater than
> GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
> for IPv4 TCP sockets.
>
> Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
> in __ip_local_out() to allow to send BIG TCP packets, and this implies
> that skb->len is the length of a IPv4 packet; On RX path, use skb->len
> as the length of the IPv4 packet when the IP header tot_len is 0 and
> skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
> skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
> need to update these APIs.
>
> Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows
> the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
> GRO complete, set IP header tot_len to 0 when the merged packet size
> greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
> on RX path.
>
> Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes
> this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP
> packets.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/core/gro.c       | 12 +++++++-----
>  net/core/sock.c      |  8 ++++++--
>  net/ipv4/af_inet.c   |  7 ++++---
>  net/ipv4/ip_input.c  |  2 +-
>  net/ipv4/ip_output.c |  2 +-
>  5 files changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 506f83d715f8..b15f85546bdd 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -162,16 +162,18 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>         struct sk_buff *lp;
>         int segs;
>
> -       /* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
> -       gro_max_size = READ_ONCE(p->dev->gro_max_size);
> +       /* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
> +       gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
> +                       READ_ONCE(p->dev->gro_max_size) :
> +                               READ_ONCE(p->dev->gro_ipv4_max_size);
>
>         if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
>                 return -E2BIG;
>
>         if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
> -               if (p->protocol != htons(ETH_P_IPV6) ||
> -                   skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
> -                   ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
> +               if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
> +                   (p->protocol == htons(ETH_P_IPV6) &&
> +                    skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
>                     p->encapsulation)
>                         return -E2BIG;
>         }
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 7ba4891460ad..c98f9a4eeff9 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2383,6 +2383,8 @@ static void sk_trim_gso_size(struct sock *sk)
>             !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
>                 return;
>  #endif
> +       if (sk->sk_family == AF_INET && sk_is_tcp(sk))
> +               return;

Or simply

diff --git a/net/core/sock.c b/net/core/sock.c
index 7ba4891460adbd6c13c0ce1dcdd7f23c8c1f0f5d..dcb8fff91fd9a9472267a2cf2fdc98114a7d2b7d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2375,14 +2375,9 @@ EXPORT_SYMBOL_GPL(sk_free_unlock_clone);

 static void sk_trim_gso_size(struct sock *sk)
 {
-       if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
+       if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE ||
+           sk_is_tcp(sk))
                return;
-#if IS_ENABLED(CONFIG_IPV6)
-       if (sk->sk_family == AF_INET6 &&
-           sk_is_tcp(sk) &&
-           !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
-               return;
-#endif
        sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
 }



>         sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
>  }
>
> @@ -2403,8 +2405,10 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
>                         sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
>                 } else {
>                         sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
> -                       /* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
> -                       sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
> +                       /* pairs with the WRITE_ONCE() in netif_set_gso(_ipv4)_max_size() */
> +                       sk->sk_gso_max_size = sk->sk_family == AF_INET6 ?
> +                                       READ_ONCE(dst->dev->gso_max_size) :
> +                                               READ_ONCE(dst->dev->gso_ipv4_max_size);
>                         sk_trim_gso_size(sk);
>                         sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
>                         /* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 6c0ec2789943..2f992a323b95 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1485,6 +1485,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
>         if (unlikely(ip_fast_csum((u8 *)iph, 5)))
>                 goto out;
>
> +       NAPI_GRO_CB(skb)->proto = proto;
>         id = ntohl(*(__be32 *)&iph->id);
>         flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
>         id >>= 16;
> @@ -1618,9 +1619,9 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
>
>  int inet_gro_complete(struct sk_buff *skb, int nhoff)
>  {
> -       __be16 newlen = htons(skb->len - nhoff);
>         struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
>         const struct net_offload *ops;
> +       __be16 totlen = iph->tot_len;
>         int proto = iph->protocol;
>         int err = -ENOSYS;
>
> @@ -1629,8 +1630,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
>                 skb_set_inner_network_header(skb, nhoff);
>         }
>
> -       csum_replace2(&iph->check, iph->tot_len, newlen);
> -       iph->tot_len = newlen;
> +       iph_set_totlen(iph, skb->len - nhoff);
> +       csum_replace2(&iph->check, totlen, iph->tot_len);
>
>         ops = rcu_dereference(inet_offloads[proto]);
>         if (WARN_ON(!ops || !ops->callbacks.gro_complete))
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index e880ce77322a..0aa8c49b4e1b 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -511,7 +511,7 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
>         if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
>                 goto csum_error;
>
> -       len = ntohs(iph->tot_len);
> +       len = skb_ip_totlen(skb);

len = iph_totlen(skb, iph);

>         if (skb->len < len) {
>                 drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>                 __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 922c87ef1ab5..4e4e308c3230 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -100,7 +100,7 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>         struct iphdr *iph = ip_hdr(skb);
>
> -       iph->tot_len = htons(skb->len);
> +       iph_set_totlen(iph, skb->len);
>         ip_send_check(iph);
>
>         /* if egress device is enslaved to an L3 master device pass the
> --
> 2.31.1
>


* Re: [PATCHv3 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device
  2023-01-27 15:59 ` [PATCHv3 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
@ 2023-01-27 17:48   ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-01-27 17:48 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Fri, Jan 27, 2023 at 5:00 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
> per device and adds netlink attributes for them, so that IPV4
> BIG TCP can be guarded by a separate tunable in the next patch.
>
> To not break the old application using "gso/gro_max_size" for
> IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
> in netif_set_gso/gro_max_size() if the new size isn't greater
> than GSO_LEGACY_MAX_SIZE, so that nothing will change even if
> userspace doesn't realize the new netlink attributes.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  include/linux/netdevice.h    |  6 ++++++
>  include/uapi/linux/if_link.h |  3 +++
>  net/core/dev.c               |  4 ++++
>  net/core/dev.h               | 18 ++++++++++++++++++
>  net/core/rtnetlink.c         | 33 +++++++++++++++++++++++++++++++++
>  5 files changed, 64 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 63b77cbc947e..ce075241ec47 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2010,6 +2010,9 @@ enum netdev_ml_priv_type {
>   *                     SET_NETDEV_DEVLINK_PORT macro. This pointer is static
>   *                     during the time netdevice is registered.
>   *
> + *     @gso_ipv4_max_size:     Maximum size of IPv4 GSO packets.
> + *     @gro_ipv4_max_size:     Maximum size of IPv4 GRO packets.
> + *
>   *     FIXME: cleanup struct net_device such that network protocol info
>   *     moves out.
>   */
> @@ -2362,6 +2365,9 @@ struct net_device {
>         struct rtnl_hw_stats64  *offload_xstats_l3;
>
>         struct devlink_port     *devlink_port;
> +
> +       unsigned int            gso_ipv4_max_size;
> +       unsigned int            gro_ipv4_max_size;

This seems a pretty bad choice in terms of data locality.

Field order in "struct net_device" is very important for performance.

Please put gro_ipv4_max_size close to other related fields, so that we
do not need an extra cache line miss.

Same for gso_ipv4_max_size.

Use "pahole --hex" to study how "struct net_device" is currently partitioned.
It seems we have a hole after tso_max_segs, which would be a good place
for gso_ipv4_max_size.
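
To make the locality point concrete, here is a small sketch that
compares two made-up layouts. Both structs and the MODEL_* macros are
hypothetical and far smaller than the real struct net_device; the
64-byte line size and the assumption that the struct starts on a cache
line boundary are also mine. For the real layout, pahole is the tool to
use.

```c
#include <stddef.h>  /* offsetof() */

#define MODEL_CACHE_LINE 64  /* assumed cache line size in bytes */

/* Hypothetical layout A: new fields appended after unrelated "cold" data. */
struct model_dev_appended {
	unsigned int gro_max_size;
	unsigned int gso_max_size;
	unsigned int tso_max_segs;
	char         cold[256];          /* stands in for unrelated fields */
	unsigned int gso_ipv4_max_size;
	unsigned int gro_ipv4_max_size;
};

/* Hypothetical layout B: each new field sits next to its IPv6 counterpart. */
struct model_dev_grouped {
	unsigned int gro_max_size;
	unsigned int gro_ipv4_max_size;
	unsigned int gso_max_size;
	unsigned int gso_ipv4_max_size;
	unsigned int tso_max_segs;
	char         cold[256];          /* stands in for unrelated fields */
};

/* Cache line index of a field, assuming the struct starts on a line boundary. */
#define MODEL_LINE_OF(type, field) (offsetof(type, field) / MODEL_CACHE_LINE)
```

With layout A, reading gro_max_size and gro_ipv4_max_size touches two
different lines; with layout B both land on the same line, so a path
that already loaded one gets the other for free.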


>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>
>


* Re: [PATCHv3 net-next 10/10] net: add support for ipv4 big tcp
  2023-01-27 17:41   ` Eric Dumazet
@ 2023-01-27 18:37     ` Xin Long
  2023-01-27 18:44       ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Xin Long @ 2023-01-27 18:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Fri, Jan 27, 2023 at 12:41 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 27, 2023 at 5:00 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
> >
> > Firstly, allow sk->sk_gso_max_size to be set to a value greater than
> > GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
> > for IPv4 TCP sockets.
> >
> > Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
> > in __ip_local_out() to allow to send BIG TCP packets, and this implies
> > that skb->len is the length of a IPv4 packet; On RX path, use skb->len
> > as the length of the IPv4 packet when the IP header tot_len is 0 and
> > skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
> > skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
> > need to update these APIs.
> >
> > Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows
> > the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
> > GRO complete, set IP header tot_len to 0 when the merged packet size
> > greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
> > on RX path.
> >
> > Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes
> > this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP
> > packets.
> >
> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > ---
> >  net/core/gro.c       | 12 +++++++-----
> >  net/core/sock.c      |  8 ++++++--
> >  net/ipv4/af_inet.c   |  7 ++++---
> >  net/ipv4/ip_input.c  |  2 +-
> >  net/ipv4/ip_output.c |  2 +-
> >  5 files changed, 19 insertions(+), 12 deletions(-)
> >
> > diff --git a/net/core/gro.c b/net/core/gro.c
> > index 506f83d715f8..b15f85546bdd 100644
> > --- a/net/core/gro.c
> > +++ b/net/core/gro.c
> > @@ -162,16 +162,18 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> >         struct sk_buff *lp;
> >         int segs;
> >
> > -       /* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
> > -       gro_max_size = READ_ONCE(p->dev->gro_max_size);
> > +       /* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
> > +       gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
> > +                       READ_ONCE(p->dev->gro_max_size) :
> > +                               READ_ONCE(p->dev->gro_ipv4_max_size);
> >
> >         if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
> >                 return -E2BIG;
> >
> >         if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
> > -               if (p->protocol != htons(ETH_P_IPV6) ||
> > -                   skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
> > -                   ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
> > +               if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
> > +                   (p->protocol == htons(ETH_P_IPV6) &&
> > +                    skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
> >                     p->encapsulation)
> >                         return -E2BIG;
> >         }
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 7ba4891460ad..c98f9a4eeff9 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -2383,6 +2383,8 @@ static void sk_trim_gso_size(struct sock *sk)
> >             !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
> >                 return;
> >  #endif
> > +       if (sk->sk_family == AF_INET && sk_is_tcp(sk))
> > +               return;
>
> Or simply
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 7ba4891460adbd6c13c0ce1dcdd7f23c8c1f0f5d..dcb8fff91fd9a9472267a2cf2fdc98114a7d2b7d 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2375,14 +2375,9 @@ EXPORT_SYMBOL_GPL(sk_free_unlock_clone);
>
>  static void sk_trim_gso_size(struct sock *sk)
>  {
> -       if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
> +       if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE ||
> +           sk_is_tcp(sk))
>                 return;
> -#if IS_ENABLED(CONFIG_IPV6)
> -       if (sk->sk_family == AF_INET6 &&
> -           sk_is_tcp(sk) &&
> -           !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
> -               return;
> -#endif
>         sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
>  }
There's a difference: an AF_INET6 TCP socket may send IPv4 packets when
it uses a v4-mapped address. If we don't check ipv6_addr_v4mapped(),
IPv4 GSO packets might be sized by the "gso_max_size" meant for IPv6.

I think we could use the change you wrote above, but we also need to
use dst->ops->family instead of sk->sk_family in sk_setup_caps():

+                       sk->sk_gso_max_size = dst->ops->family == AF_INET6 ?
+                                       READ_ONCE(dst->dev->gso_max_size) :
+                                               READ_ONCE(dst->dev->gso_ipv4_max_size);
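
A tiny user-space sketch of why the family used for the lookup matters.
struct model_caps and model_pick_cap() are hypothetical names for
illustration only, and the premise that a v4-mapped AF_INET6 socket
carries an AF_INET route is my reading of the discussion above.

```c
#include <sys/socket.h>  /* AF_INET, AF_INET6 */

/* Hypothetical per-device caps, standing in for the net_device fields. */
struct model_caps {
	unsigned int gso_max_size;       /* cap used for IPv6 */
	unsigned int gso_ipv4_max_size;  /* cap used for IPv4 */
};

/* Pick the cap from an address family: AF_INET6 gets the IPv6 cap. */
static inline unsigned int model_pick_cap(const struct model_caps *c, int family)
{
	return family == AF_INET6 ? c->gso_max_size : c->gso_ipv4_max_size;
}
```

For a v4-mapped socket, passing the socket family (AF_INET6) returns the
IPv6 cap even though the packets on the wire are IPv4; passing the route
family (AF_INET) returns gso_ipv4_max_size, which is the behaviour the
dst->ops->family variant aims for.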

>
>
>
> >         sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
> >  }
> >
> > @@ -2403,8 +2405,10 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
> >                         sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
> >                 } else {
> >                         sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
> > -                       /* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
> > -                       sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
> > +                       /* pairs with the WRITE_ONCE() in netif_set_gso(_ipv4)_max_size() */
> > +                       sk->sk_gso_max_size = sk->sk_family == AF_INET6 ?
> > +                                       READ_ONCE(dst->dev->gso_max_size) :
> > +                                               READ_ONCE(dst->dev->gso_ipv4_max_size);
> >                         sk_trim_gso_size(sk);
> >                         sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
> >                         /* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 6c0ec2789943..2f992a323b95 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1485,6 +1485,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
> >         if (unlikely(ip_fast_csum((u8 *)iph, 5)))
> >                 goto out;
> >
> > +       NAPI_GRO_CB(skb)->proto = proto;
> >         id = ntohl(*(__be32 *)&iph->id);
> >         flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
> >         id >>= 16;
> > @@ -1618,9 +1619,9 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
> >
> >  int inet_gro_complete(struct sk_buff *skb, int nhoff)
> >  {
> > -       __be16 newlen = htons(skb->len - nhoff);
> >         struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
> >         const struct net_offload *ops;
> > +       __be16 totlen = iph->tot_len;
> >         int proto = iph->protocol;
> >         int err = -ENOSYS;
> >
> > @@ -1629,8 +1630,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
> >                 skb_set_inner_network_header(skb, nhoff);
> >         }
> >
> > -       csum_replace2(&iph->check, iph->tot_len, newlen);
> > -       iph->tot_len = newlen;
> > +       iph_set_totlen(iph, skb->len - nhoff);
> > +       csum_replace2(&iph->check, totlen, iph->tot_len);
> >
> >         ops = rcu_dereference(inet_offloads[proto]);
> >         if (WARN_ON(!ops || !ops->callbacks.gro_complete))
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index e880ce77322a..0aa8c49b4e1b 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -511,7 +511,7 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
> >         if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> >                 goto csum_error;
> >
> > -       len = ntohs(iph->tot_len);
> > +       len = skb_ip_totlen(skb);
>
> len = iph_totlen(skb, iph);
OK, thanks.


* Re: [PATCHv3 net-next 10/10] net: add support for ipv4 big tcp
  2023-01-27 18:37     ` Xin Long
@ 2023-01-27 18:44       ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-01-27 18:44 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Fri, Jan 27, 2023 at 7:37 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 12:41 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 5:00 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
> > >
> > > Firstly, allow sk->sk_gso_max_size to be set to a value greater than
> > > GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
> > > for IPv4 TCP sockets.
> > >
> > > Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
> > > in __ip_local_out() to allow to send BIG TCP packets, and this implies
> > > that skb->len is the length of a IPv4 packet; On RX path, use skb->len
> > > as the length of the IPv4 packet when the IP header tot_len is 0 and
> > > skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
> > > skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
> > > need to update these APIs.
> > >
> > > Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows
> > > the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
> > > GRO complete, set IP header tot_len to 0 when the merged packet size
> > > greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
> > > on RX path.
> > >
> > > Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes
> > > this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP
> > > packets.
> > >
> > > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > > ---
> > >  net/core/gro.c       | 12 +++++++-----
> > >  net/core/sock.c      |  8 ++++++--
> > >  net/ipv4/af_inet.c   |  7 ++++---
> > >  net/ipv4/ip_input.c  |  2 +-
> > >  net/ipv4/ip_output.c |  2 +-
> > >  5 files changed, 19 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/net/core/gro.c b/net/core/gro.c
> > > index 506f83d715f8..b15f85546bdd 100644
> > > --- a/net/core/gro.c
> > > +++ b/net/core/gro.c
> > > @@ -162,16 +162,18 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> > >         struct sk_buff *lp;
> > >         int segs;
> > >
> > > -       /* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
> > > -       gro_max_size = READ_ONCE(p->dev->gro_max_size);
> > > +       /* pairs with WRITE_ONCE() in netif_set_gro(_ipv4)_max_size() */
> > > +       gro_max_size = p->protocol == htons(ETH_P_IPV6) ?
> > > +                       READ_ONCE(p->dev->gro_max_size) :
> > > +                               READ_ONCE(p->dev->gro_ipv4_max_size);
> > >
> > >         if (unlikely(p->len + len >= gro_max_size || NAPI_GRO_CB(skb)->flush))
> > >                 return -E2BIG;
> > >
> > >         if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
> > > -               if (p->protocol != htons(ETH_P_IPV6) ||
> > > -                   skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
> > > -                   ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
> > > +               if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
> > > +                   (p->protocol == htons(ETH_P_IPV6) &&
> > > +                    skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
> > >                     p->encapsulation)
> > >                         return -E2BIG;
> > >         }
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index 7ba4891460ad..c98f9a4eeff9 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -2383,6 +2383,8 @@ static void sk_trim_gso_size(struct sock *sk)
> > >             !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
> > >                 return;
> > >  #endif
> > > +       if (sk->sk_family == AF_INET && sk_is_tcp(sk))
> > > +               return;
> >
> > Or simply
> >
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 7ba4891460adbd6c13c0ce1dcdd7f23c8c1f0f5d..dcb8fff91fd9a9472267a2cf2fdc98114a7d2b7d 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -2375,14 +2375,9 @@ EXPORT_SYMBOL_GPL(sk_free_unlock_clone);
> >
> >  static void sk_trim_gso_size(struct sock *sk)
> >  {
> > -       if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
> > +       if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE ||
> > +           sk_is_tcp(sk))
> >                 return;
> > -#if IS_ENABLED(CONFIG_IPV6)
> > -       if (sk->sk_family == AF_INET6 &&
> > -           sk_is_tcp(sk) &&
> > -           !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
> > -               return;
> > -#endif
> >         sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
> >  }
> There's a difference: an AF_INET6 TCP socket may send IPv4 packets when it
> uses v4-mapped addresses. If we don't check ipv6_addr_v4mapped(), IPv4
> GSO packets might go out with the "gso_max_size" meant for IPv6.
>

But the change you wrote in sk_setup_caps() only checked sk_family.


> I think we could use the change you wrote above, but we also need to
> use dst->ops->family instead of sk->sk_family in sk_setup_caps():
>
> +                       sk->sk_gso_max_size = dst->ops->family == AF_INET6 ?
> +                                       READ_ONCE(dst->dev->gso_max_size) :
> +                                       READ_ONCE(dst->dev->gso_ipv4_max_size);
>
> >
> >
> >
> > >         sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
> > >  }
> > >
> > > @@ -2403,8 +2405,10 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
> > >                         sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
> > >                 } else {
> > >                         sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
> > > -                       /* pairs with the WRITE_ONCE() in netif_set_gso_max_size() */
> > > -                       sk->sk_gso_max_size = READ_ONCE(dst->dev->gso_max_size);
> > > +                       /* pairs with the WRITE_ONCE() in netif_set_gso(_ipv4)_max_size() */
> > > +                       sk->sk_gso_max_size = sk->sk_family == AF_INET6 ?
> > > +                                       READ_ONCE(dst->dev->gso_max_size) :
> > > +                                               READ_ONCE(dst->dev->gso_ipv4_max_size);

Here...

So if you need ipv6_addr_v4mapped() this should be done here anyway.
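
To make the concern concrete, here is a minimal user-space sketch of the two
selection rules being compared. It is not kernel code: the model_* types, the
MODEL_AF_* constants and both helper functions are hypothetical stand-ins, and
only the field names gso_max_size / gso_ipv4_max_size come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for AF_INET / AF_INET6. */
enum model_family { MODEL_AF_INET, MODEL_AF_INET6 };

/* Models the two per-device limits discussed in the patch. */
struct model_dev {
	unsigned int gso_max_size;      /* limit used for IPv6 */
	unsigned int gso_ipv4_max_size; /* limit used for IPv4 */
};

/* Models a socket: only the socket's own address family matters here. */
struct model_sock {
	enum model_family sk_family;
};

/* Rule as posted: choose the limit from the socket family only. */
static unsigned int limit_by_sk_family(const struct model_sock *sk,
				       const struct model_dev *dev)
{
	assert(sk != NULL && dev != NULL);
	return sk->sk_family == MODEL_AF_INET6 ? dev->gso_max_size
					       : dev->gso_ipv4_max_size;
}

/* Rule proposed in the reply: choose the limit from the route's family. */
static unsigned int limit_by_dst_family(enum model_family dst_family,
					const struct model_dev *dev)
{
	assert(dev != NULL);
	return dst_family == MODEL_AF_INET6 ? dev->gso_max_size
					    : dev->gso_ipv4_max_size;
}
```

In this model, an AF_INET6 socket using a v4-mapped address has
sk_family == MODEL_AF_INET6 while its route family is MODEL_AF_INET, so the
first rule returns the IPv6 limit and the second returns the IPv4 limit.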

> > >                         sk_trim_gso_size(sk);
> > >                         sk->sk_gso_max_size -= (MAX_TCP_HEADER + 1);
> > >                         /* pairs with the WRITE_ONCE() in netif_set_gso_max_segs() */
> > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > index 6c0ec2789943..2f992a323b95 100644
> > > --- a/net/ipv4/af_inet.c
> > > +++ b/net/ipv4/af_inet.c
> > > @@ -1485,6 +1485,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
> > >         if (unlikely(ip_fast_csum((u8 *)iph, 5)))
> > >                 goto out;
> > >
> > > +       NAPI_GRO_CB(skb)->proto = proto;
> > >         id = ntohl(*(__be32 *)&iph->id);
> > >         flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
> > >         id >>= 16;
> > > @@ -1618,9 +1619,9 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
> > >
> > >  int inet_gro_complete(struct sk_buff *skb, int nhoff)
> > >  {
> > > -       __be16 newlen = htons(skb->len - nhoff);
> > >         struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
> > >         const struct net_offload *ops;
> > > +       __be16 totlen = iph->tot_len;
> > >         int proto = iph->protocol;
> > >         int err = -ENOSYS;
> > >
> > > @@ -1629,8 +1630,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
> > >                 skb_set_inner_network_header(skb, nhoff);
> > >         }
> > >
> > > -       csum_replace2(&iph->check, iph->tot_len, newlen);
> > > -       iph->tot_len = newlen;
> > > +       iph_set_totlen(iph, skb->len - nhoff);
> > > +       csum_replace2(&iph->check, totlen, iph->tot_len);
> > >
> > >         ops = rcu_dereference(inet_offloads[proto]);
> > >         if (WARN_ON(!ops || !ops->callbacks.gro_complete))
> > > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > > index e880ce77322a..0aa8c49b4e1b 100644
> > > --- a/net/ipv4/ip_input.c
> > > +++ b/net/ipv4/ip_input.c
> > > @@ -511,7 +511,7 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
> > >         if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> > >                 goto csum_error;
> > >
> > > -       len = ntohs(iph->tot_len);
> > > +       len = skb_ip_totlen(skb);
> >
> > len = iph_totlen(skb, iph);
> OK, thanks.
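
For readers following the tot_len convention discussed in this patch, the
rule can be sketched in user space roughly as below. This is an illustration
only, under stated assumptions: struct model_pkt and the model_* helpers are
hypothetical, lengths are kept in host byte order for simplicity, and
MODEL_IP_MAX_MTU is assumed to be 0xFFFF. The real helpers are
iph_set_totlen() and iph_totlen() from Patch 1 and operate on struct iphdr
and struct sk_buff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the largest length that fits in the 16-bit tot_len. */
#define MODEL_IP_MAX_MTU 0xFFFFu

/* Hypothetical model of the few packet fields the rule depends on. */
struct model_pkt {
	uint16_t tot_len;        /* IPv4 tot_len, host byte order here */
	uint32_t skb_len;        /* models skb->len */
	uint32_t network_offset; /* models the offset of the IP header */
	bool     is_gso_tcp;     /* models the skb_is_gso_tcp() check */
};

/* Store the length, or 0 when it does not fit in 16 bits. */
static void model_set_totlen(struct model_pkt *p, uint32_t len)
{
	p->tot_len = len <= MODEL_IP_MAX_MTU ? (uint16_t)len : 0;
}

/*
 * Read the length back: tot_len == 0 is only treated as "use skb->len"
 * for GSO TCP packets; otherwise tot_len is returned unchanged.
 */
static uint32_t model_totlen(const struct model_pkt *p)
{
	if (p->tot_len != 0 || !p->is_gso_tcp)
		return p->tot_len;
	return p->skb_len - p->network_offset;
}
```

The GSO TCP check is what keeps a malformed non-GSO packet carrying
tot_len == 0 from being mistaken for a BIG TCP packet in this model.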


end of thread, other threads:[~2023-01-27 18:45 UTC | newest]

Thread overview: 15+ messages
2023-01-27 15:59 [PATCHv3 net-next 00/10] net: support ipv4 big tcp Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 05/10] netfilter: " Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 08/10] packet: add TP_STATUS_GSO_TCP for tp_status Xin Long
2023-01-27 15:59 ` [PATCHv3 net-next 09/10] net: add gso_ipv4_max_size and gro_ipv4_max_size per device Xin Long
2023-01-27 17:48   ` Eric Dumazet
2023-01-27 15:59 ` [PATCHv3 net-next 10/10] net: add support for ipv4 big tcp Xin Long
2023-01-27 17:41   ` Eric Dumazet
2023-01-27 18:37     ` Xin Long
2023-01-27 18:44       ` Eric Dumazet
