All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 00/10] net: support ipv4 big tcp
@ 2023-01-14  3:31 Xin Long
  2023-01-14  3:31 ` [PATCH net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
                   ` (11 more replies)
  0 siblings, 12 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This is similar to the BIG TCP patchset added by Eric for IPv6:

  https://lwn.net/Articles/895398/

Different from IPv6, IPv4 tot_len is 16-bit long only, and IPv4 header
doesn't have exthdrs(options) for the BIG TCP packets' length. To make
it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
indicate this might be a BIG TCP packet and use skb->len as the real
IPv4 total length.

This will work safely, as all BIG TCP packets are GSO/GRO packets and
processed on the same host as they were created; There is no padding
in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4
packet total length; Also, before implementing the feature, all those
places that may get iph tot_len from BIG TCP packets are taken care
with some new APIs:

Patch 1 adds some APIs for iph tot_len setting and getting, which are
used in all these places where IPv4 BIG TCP packets may reach in Patch
2-7, and Patch 8 implements this feature and Patch 10 adds a selftest
for it. Patch 9 is a fix in netfilter length_mt6 so that the selftest
can also cover IPv6 BIG TCP.

Note that the similar change as in Patch 2-7 are also needed for IPv6
BIG TCP packets, and will be addressed in another patchset.

The similar performance test is done for IPv4 BIG TCP with 25Gbit NIC
and 1.5K MTU:

No BIG TCP:
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
168          322          337          3776.49
143          236          277          4654.67
128          258          288          4772.83
171          229          278          4645.77
175          228          243          4678.93
149          239          279          4599.86
164          234          268          4606.94
155          276          289          4235.82
180          255          268          4418.95
168          241          249          4417.82

Enable BIG TCP:
ip link set dev ens1f0np0 gro_max_size 128000 gso_max_size 128000
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
161          241          252          4821.73
174          205          217          5098.28
167          208          220          5001.43
164          228          249          4883.98
150          233          249          4914.90
180          233          244          4819.66
154          208          219          5004.92
157          209          247          4999.78
160          218          246          4842.31
174          206          217          5080.99

Xin Long (10):
  net: add a couple of helpers for iph tot_len
  bridge: use skb_ip_totlen in br netfilter
  openvswitch: use skb_ip_totlen in conntrack
  net: sched: use skb_ip_totlen and iph_totlen
  netfilter: use skb_ip_totlen and iph_totlen
  cipso_ipv4: use iph_set_totlen in skbuff_setattr
  ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
  net: add support for ipv4 big tcp
  netfilter: get ipv6 pktlen properly in length_mt6
  selftests: add a selftest for big tcp

 drivers/net/ipvlan/ipvlan_core.c           |   2 +-
 include/linux/ip.h                         |  20 +++
 include/linux/ipv6.h                       |   9 ++
 include/net/netfilter/nf_tables_ipv4.h     |   4 +-
 include/net/route.h                        |   3 -
 net/bridge/br_netfilter_hooks.c            |   2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |   4 +-
 net/core/gro.c                             |   6 +-
 net/core/sock.c                            |  11 +-
 net/ipv4/af_inet.c                         |   7 +-
 net/ipv4/cipso_ipv4.c                      |   2 +-
 net/ipv4/ip_input.c                        |   2 +-
 net/ipv4/ip_output.c                       |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c            |   2 +-
 net/netfilter/nf_log_syslog.c              |   2 +-
 net/netfilter/xt_length.c                  |   5 +-
 net/openvswitch/conntrack.c                |   2 +-
 net/sched/act_ct.c                         |   2 +-
 net/sched/sch_cake.c                       |   2 +-
 tools/testing/selftests/net/Makefile       |   1 +
 tools/testing/selftests/net/big_tcp.sh     | 157 +++++++++++++++++++++
 21 files changed, 212 insertions(+), 35 deletions(-)
 create mode 100755 tools/testing/selftests/net/big_tcp.sh

-- 
2.31.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH net-next 01/10] net: add a couple of helpers for iph tot_len
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This patch adds three APIs to replace the iph->tot_len setting
and getting in all places where IPv4 BIG TCP packets may reach,
they will be used in the following patches.

Note that iph_totlen() will be used when iph is not in linear
data of the skb.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/ip.h  | 20 ++++++++++++++++++++
 include/net/route.h |  3 ---
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/ip.h b/include/linux/ip.h
index 3d9c6750af62..53a0bf5d3f06 100644
--- a/include/linux/ip.h
+++ b/include/linux/ip.h
@@ -35,4 +35,24 @@ static inline unsigned int ip_transport_len(const struct sk_buff *skb)
 {
 	return ntohs(ip_hdr(skb)->tot_len) - skb_network_header_len(skb);
 }
+
+static inline unsigned int iph_totlen(const struct sk_buff *skb, const struct iphdr *iph)
+{
+	return ntohs(iph->tot_len) ?: (skb_is_gso_tcp(skb) ?
+				       skb->len - skb_network_offset(skb) :
+				       0);
+}
+
+static inline unsigned int skb_ip_totlen(const struct sk_buff *skb)
+{
+	return iph_totlen(skb, ip_hdr(skb));
+}
+
+/* IPv4 datagram length is stored into 16bit field (tot_len) */
+#define IP_MAX_MTU	0xFFFFU
+
+static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
+{
+	iph->tot_len = len <= IP_MAX_MTU ? htons(len) : 0;
+}
 #endif	/* _LINUX_IP_H */
diff --git a/include/net/route.h b/include/net/route.h
index 6e92dd5bcd61..fe00b0a2e475 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -35,9 +35,6 @@
 #include <linux/cache.h>
 #include <linux/security.h>
 
-/* IPv4 datagram length is stored into 16bit field (tot_len) */
-#define IP_MAX_MTU	0xFFFFU
-
 #define RTO_ONLINK	0x01
 
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 02/10] bridge: use skb_ip_totlen in br netfilter
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
  2023-01-14  3:31 ` [PATCH net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

These 3 places in bridge netfilter are called on RX path after GRO
and IPv4 TCP GSO packets may come through, so replace iph tot_len
accessing with skb_ip_totlen() in there.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/bridge/br_netfilter_hooks.c            | 2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index f20f4373ff40..b67c9c98effa 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -214,7 +214,7 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
 		goto csum_error;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < len) {
 		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index 5c5dd437f1c2..71056ee84773 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -212,7 +212,7 @@ static int nf_ct_br_ip_check(const struct sk_buff *skb)
 	    iph->version != 4)
 		return -1;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < nhoff + len ||
 	    len < (iph->ihl * 4))
                 return -1;
@@ -256,7 +256,7 @@ static unsigned int nf_ct_bridge_pre(void *priv, struct sk_buff *skb,
 		if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 			return NF_ACCEPT;
 
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		if (pskb_trim_rcsum(skb, len))
 			return NF_ACCEPT;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 03/10] openvswitch: use skb_ip_totlen in conntrack
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
  2023-01-14  3:31 ` [PATCH net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
  2023-01-14  3:31 ` [PATCH net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

IPv4 GSO packets may get processed in ovs_skb_network_trim(),
and we need to use skb_ip_totlen() to get iph totlen.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/openvswitch/conntrack.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c8b137649ca4..2172930b1f17 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1103,7 +1103,7 @@ static int ovs_skb_network_trim(struct sk_buff *skb)
 
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		break;
 	case htons(ETH_P_IPV6):
 		len = sizeof(struct ipv6hdr)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (2 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 05/10] netfilter: " Xin Long
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

There are 1 action and 1 qdisc that may process IPv4 TCP GSO packets
and access iph->tot_len, replace them with skb_ip_totlen() and
iph_totlen() accordingly.

Note that we don't need to replace the one in tcf_csum_ipv4(), as it
will return for TCP GSO packets in tcf_csum_ipv4_tcp().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sched/act_ct.c   | 2 +-
 net/sched/sch_cake.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 0ca2bb8ed026..d68bb5dbf0dc 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -707,7 +707,7 @@ static int tcf_ct_skb_network_trim(struct sk_buff *skb, int family)
 
 	switch (family) {
 	case NFPROTO_IPV4:
-		len = ntohs(ip_hdr(skb)->tot_len);
+		len = skb_ip_totlen(skb);
 		break;
 	case NFPROTO_IPV6:
 		len = sizeof(struct ipv6hdr)
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 3ed0c3342189..7970217b565a 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1209,7 +1209,7 @@ static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
 			    iph_check->daddr != iph->daddr)
 				continue;
 
-			seglen = ntohs(iph_check->tot_len) -
+			seglen = iph_totlen(skb, iph_check) -
 				       (4 * iph_check->ihl);
 		} else if (iph_check->version == 6) {
 			ipv6h = (struct ipv6hdr *)iph;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 05/10] netfilter: use skb_ip_totlen and iph_totlen
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (3 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

There are also quite some places in netfilter that may process IPv4 TCP
GSO packets, we need to replace them too.

In length_mt(), we have to use u_int32_t/int to accept skb_ip_totlen()
return value, otherwise it may overflow and mismatch. This change will
also help us add selftest for IPv4 BIG TCP in the following patch.

Note that we don't need to replace the one in tcpmss_tg4(), as it will
return if there is data after tcphdr in tcpmss_mangle_packet(). The
same in mangle_contents() in nf_nat_helper.c, it returns false when
skb->len + extra > 65535 in enlarge_skb().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/netfilter/nf_tables_ipv4.h | 4 ++--
 net/netfilter/ipvs/ip_vs_xmit.c        | 2 +-
 net/netfilter/nf_log_syslog.c          | 2 +-
 net/netfilter/xt_length.c              | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_tables_ipv4.h b/include/net/netfilter/nf_tables_ipv4.h
index 112708f7a6b4..947973623dc7 100644
--- a/include/net/netfilter/nf_tables_ipv4.h
+++ b/include/net/netfilter/nf_tables_ipv4.h
@@ -29,7 +29,7 @@ static inline int __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt)
 	if (iph->ihl < 5 || iph->version != 4)
 		return -1;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(pkt->skb, iph);
 	thoff = iph->ihl * 4;
 	if (pkt->skb->len < len)
 		return -1;
@@ -64,7 +64,7 @@ static inline int nft_set_pktinfo_ipv4_ingress(struct nft_pktinfo *pkt)
 	if (iph->ihl < 5 || iph->version != 4)
 		goto inhdr_error;
 
-	len = ntohs(iph->tot_len);
+	len = iph_totlen(pkt->skb, iph);
 	thoff = iph->ihl * 4;
 	if (pkt->skb->len < len) {
 		__IP_INC_STATS(nft_net(pkt), IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 029171379884..80448885c3d7 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -994,7 +994,7 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
 		old_dsfield = ipv4_get_dsfield(old_iph);
 		*ttl = old_iph->ttl;
 		if (payload_len)
-			*payload_len = ntohs(old_iph->tot_len);
+			*payload_len = skb_ip_totlen(skb);
 	}
 
 	/* Implement full-functionality option for ECN encapsulation */
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index cb894f0d63e9..c66689ad2b49 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -322,7 +322,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 
 	/* Max length: 46 "LEN=65535 TOS=0xFF PREC=0xFF TTL=255 ID=65535 " */
 	nf_log_buf_add(m, "LEN=%u TOS=0x%02X PREC=0x%02X TTL=%u ID=%u ",
-		       ntohs(ih->tot_len), ih->tos & IPTOS_TOS_MASK,
+		       iph_totlen(skb, ih), ih->tos & IPTOS_TOS_MASK,
 		       ih->tos & IPTOS_PREC_MASK, ih->ttl, ntohs(ih->id));
 
 	/* Max length: 6 "CE DF MF " */
diff --git a/net/netfilter/xt_length.c b/net/netfilter/xt_length.c
index 1873da3a945a..b3d623a52885 100644
--- a/net/netfilter/xt_length.c
+++ b/net/netfilter/xt_length.c
@@ -21,7 +21,7 @@ static bool
 length_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_length_info *info = par->matchinfo;
-	u_int16_t pktlen = ntohs(ip_hdr(skb)->tot_len);
+	u32 pktlen = skb_ip_totlen(skb);
 
 	return (pktlen >= info->min && pktlen <= info->max) ^ info->invert;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (4 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 05/10] netfilter: " Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14 15:38   ` Paul Moore
  2023-01-14  3:31 ` [PATCH net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
the iph->tot_len update should use iph_set_totlen().

Note that for these non GSO packets, the new iph tot_len with extra
iph option len added may become greater than 65535, the old process
will cast it and set iph->tot_len to it, which is a bug. In theory,
iph options shouldn't be added for these big packets in here, a fix
may be needed here in the future. For now this patch is only to set
iph->tot_len to 0 when it happens.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/ipv4/cipso_ipv4.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 6cd3b6c559f0..79ae7204e8ed 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -2222,7 +2222,7 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
 		memset((char *)(iph + 1) + buf_len, 0, opt_len - buf_len);
 	if (len_delta != 0) {
 		iph->ihl = 5 + (opt_len >> 2);
-		iph->tot_len = htons(skb->len);
+		iph_set_totlen(iph, skb->len);
 	}
 	ip_send_check(iph);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (5 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 08/10] net: add support for ipv4 big tcp Xin Long
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

ipvlan devices calls netif_inherit_tso_max() to get the tso_max_size/segs
from the lower device, so when lower device supports BIG TCP, the ipvlan
devices support it too. We also should consider its iph tot_len accessing.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/ipvlan/ipvlan_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index bb1c298c1e78..460b3d4f2245 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -157,7 +157,7 @@ void *ipvlan_get_L3_hdr(struct ipvl_port *port, struct sk_buff *skb, int *type)
 			return NULL;
 
 		ip4h = ip_hdr(skb);
-		pktlen = ntohs(ip4h->tot_len);
+		pktlen = skb_ip_totlen(skb);
 		if (ip4h->ihl < 5 || ip4h->version != 4)
 			return NULL;
 		if (skb->len < pktlen || pktlen < (ip4h->ihl * 4))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 08/10] net: add support for ipv4 big tcp
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (6 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-14  3:31 ` [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6 Xin Long
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.

Firstly, allow sk->sk_gso_max_size to be set to a value greater than
GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
for IPv4 TCP sockets.

Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU
in __ip_local_out() to allow to send BIG TCP packets, and this implies
that skb->len is the length of a IPv4 packet; On RX path, use skb->len
as the length of the IPv4 packet when the IP header tot_len is 0 and
skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and
skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only
need to update these APIs.

Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows
the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In
GRO complete, set IP header tot_len to 0 when the merged packet size
greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed
on RX path.

Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes
this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP
packets.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/core/gro.c       |  6 +++---
 net/core/sock.c      | 11 ++---------
 net/ipv4/af_inet.c   |  7 ++++---
 net/ipv4/ip_input.c  |  2 +-
 net/ipv4/ip_output.c |  2 +-
 5 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/net/core/gro.c b/net/core/gro.c
index 506f83d715f8..82656dc787f2 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -169,9 +169,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 		return -E2BIG;
 
 	if (unlikely(p->len + len >= GRO_LEGACY_MAX_SIZE)) {
-		if (p->protocol != htons(ETH_P_IPV6) ||
-		    skb_headroom(p) < sizeof(struct hop_jumbo_hdr) ||
-		    ipv6_hdr(p)->nexthdr != IPPROTO_TCP ||
+		if (NAPI_GRO_CB(skb)->proto != IPPROTO_TCP ||
+		    (p->protocol == htons(ETH_P_IPV6) &&
+		     skb_headroom(p) < sizeof(struct hop_jumbo_hdr)) ||
 		    p->encapsulation)
 			return -E2BIG;
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index f954d5893e79..554aa09fe504 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2375,15 +2375,8 @@ EXPORT_SYMBOL_GPL(sk_free_unlock_clone);
 
 static void sk_trim_gso_size(struct sock *sk)
 {
-	if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
-		return;
-#if IS_ENABLED(CONFIG_IPV6)
-	if (sk->sk_family == AF_INET6 &&
-	    sk_is_tcp(sk) &&
-	    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
-		return;
-#endif
-	sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
+	if (sk->sk_gso_max_size > GSO_LEGACY_MAX_SIZE && !sk_is_tcp(sk))
+		sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
 }
 
 void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6c0ec2789943..2f992a323b95 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1485,6 +1485,7 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 	if (unlikely(ip_fast_csum((u8 *)iph, 5)))
 		goto out;
 
+	NAPI_GRO_CB(skb)->proto = proto;
 	id = ntohl(*(__be32 *)&iph->id);
 	flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
 	id >>= 16;
@@ -1618,9 +1619,9 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
-	__be16 newlen = htons(skb->len - nhoff);
 	struct iphdr *iph = (struct iphdr *)(skb->data + nhoff);
 	const struct net_offload *ops;
+	__be16 totlen = iph->tot_len;
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
@@ -1629,8 +1630,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 		skb_set_inner_network_header(skb, nhoff);
 	}
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
-	iph->tot_len = newlen;
+	iph_set_totlen(iph, skb->len - nhoff);
+	csum_replace2(&iph->check, totlen, iph->tot_len);
 
 	ops = rcu_dereference(inet_offloads[proto]);
 	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index e880ce77322a..0aa8c49b4e1b 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -511,7 +511,7 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
 		goto csum_error;
 
-	len = ntohs(iph->tot_len);
+	len = skb_ip_totlen(skb);
 	if (skb->len < len) {
 		drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
 		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 922c87ef1ab5..4e4e308c3230 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -100,7 +100,7 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct iphdr *iph = ip_hdr(skb);
 
-	iph->tot_len = htons(skb->len);
+	iph_set_totlen(iph, skb->len);
 	ip_send_check(iph);
 
 	/* if egress device is enslaved to an L3 master device pass the
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (7 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 08/10] net: add support for ipv4 big tcp Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-15 15:41   ` David Ahern
  2023-01-14  3:31 ` [PATCH net-next 10/10] selftests: add a selftest for big tcp Xin Long
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

For IPv6 jumbogram packets, the packet size is bigger than 65535,
it's not right to get it from payload_len and save it to an u16
variable.

This patch only fixes it for IPv6 BIG TCP packets, so instead of
parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
gets the pktlen via 'skb->len - skb_network_offset(skb)' when
skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
BIG TCP packets.

This fix will also help us add selftest for IPv6 BIG TCP in the
following patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/linux/ipv6.h      | 9 +++++++++
 net/netfilter/xt_length.c | 3 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 37dfdcfcdd54..b8edd6c599eb 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -175,6 +175,15 @@ static inline bool inet6_is_jumbogram(const struct sk_buff *skb)
 	return !!(IP6CB(skb)->flags & IP6SKB_JUMBOGRAM);
 }
 
+static inline unsigned int skb_ipv6_totlen(const struct sk_buff *skb)
+{
+	u32 pl = ntohs(ipv6_hdr(skb)->payload_len);
+
+	return pl ? pl + sizeof(struct ipv6hdr)
+		  : (skb_is_gso_v6(skb) ? skb->len - skb_network_offset(skb)
+					: pl + sizeof(struct ipv6hdr));
+}
+
 /* can not be used in TCP layer after tcp_v6_fill_cb */
 static inline int inet6_sdif(const struct sk_buff *skb)
 {
diff --git a/net/netfilter/xt_length.c b/net/netfilter/xt_length.c
index b3d623a52885..61518ec05c6e 100644
--- a/net/netfilter/xt_length.c
+++ b/net/netfilter/xt_length.c
@@ -30,8 +30,7 @@ static bool
 length_mt6(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_length_info *info = par->matchinfo;
-	const u_int16_t pktlen = ntohs(ipv6_hdr(skb)->payload_len) +
-				 sizeof(struct ipv6hdr);
+	u32 pktlen = skb_ipv6_totlen(skb);
 
 	return (pktlen >= info->min && pktlen <= info->max) ^ info->invert;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 10/10] selftests: add a selftest for big tcp
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (8 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6 Xin Long
@ 2023-01-14  3:31 ` Xin Long
  2023-01-15 15:45 ` [PATCH net-next 00/10] net: support ipv4 " David Ahern
  2023-01-15 16:04 ` Eric Dumazet
  11 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-14  3:31 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

This test runs on the client-router-server topo, and monitors the traffic
on the RX devices of router and server while sending BIG TCP packets with
netperf from client to server. Meanwhile, it changes 'tso' on the TX devs
and 'gro' on the RX devs. Then it checks if any BIG TCP packets appears
on the RX devs with 'ip/ip6tables -m length ! --length 0:65535' for each
case.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 tools/testing/selftests/net/Makefile   |   1 +
 tools/testing/selftests/net/big_tcp.sh | 157 +++++++++++++++++++++++++
 2 files changed, 158 insertions(+)
 create mode 100755 tools/testing/selftests/net/big_tcp.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 3007e98a6d64..e7ff63df5fcc 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -75,6 +75,7 @@ TEST_GEN_PROGS += so_incoming_cpu
 TEST_PROGS += sctp_vrf.sh
 TEST_GEN_FILES += sctp_hello
 TEST_GEN_FILES += csum
+TEST_PROGS += big_tcp.sh
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/big_tcp.sh b/tools/testing/selftests/net/big_tcp.sh
new file mode 100755
index 000000000000..7d0f0549ce59
--- /dev/null
+++ b/tools/testing/selftests/net/big_tcp.sh
@@ -0,0 +1,157 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing For IPv4 and IPv6 BIG TCP.
+# TOPO: CLIENT_NS (link0)<--->(link1) ROUTER_NS (link2)<--->(link3) SERVER_NS
+
+CLIENT_NS=$(mktemp -u client-XXXXXXXX)
+CLIENT_IP4="198.51.100.1"
+CLIENT_IP6="2001:db8:1::1"
+
+SERVER_NS=$(mktemp -u client-XXXXXXXX)
+SERVER_IP4="203.0.113.1"
+SERVER_IP6="2001:db8:2::1"
+
+ROUTER_NS=$(mktemp -u router-XXXXXXXX)
+SERVER_GW4="203.0.113.2"
+CLIENT_GW4="198.51.100.2"
+SERVER_GW6="2001:db8:2::2"
+CLIENT_GW6="2001:db8:1::2"
+
+MAX_SIZE=128000
+CHK_SIZE=65535
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+setup() {
+	ip netns add $CLIENT_NS
+	ip netns add $SERVER_NS
+	ip netns add $ROUTER_NS
+	ip -net $ROUTER_NS link add link1 type veth peer name link0 netns $CLIENT_NS
+	ip -net $ROUTER_NS link add link2 type veth peer name link3 netns $SERVER_NS
+
+	ip -net $CLIENT_NS link set link0 up
+	ip -net $CLIENT_NS link set link0 mtu 1442
+	ip -net $CLIENT_NS addr add $CLIENT_IP4/24 dev link0
+	ip -net $CLIENT_NS addr add $CLIENT_IP6/64 dev link0 nodad
+	ip -net $CLIENT_NS route add $SERVER_IP4 dev link0 via $CLIENT_GW4
+	ip -net $CLIENT_NS route add $SERVER_IP6 dev link0 via $CLIENT_GW6
+	ip -net $CLIENT_NS link set dev link0 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
+	ip net exec $CLIENT_NS sysctl -wq net.ipv4.tcp_window_scaling=10
+
+	ip -net $ROUTER_NS link set link1 up
+	ip -net $ROUTER_NS link set link2 up
+	ip -net $ROUTER_NS addr add $CLIENT_GW4/24 dev link1
+	ip -net $ROUTER_NS addr add $CLIENT_GW6/64 dev link1 nodad
+	ip -net $ROUTER_NS addr add $SERVER_GW4/24 dev link2
+	ip -net $ROUTER_NS addr add $SERVER_GW6/64 dev link2 nodad
+	ip -net $ROUTER_NS link set dev link1 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
+	ip -net $ROUTER_NS link set dev link2 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
+	ip net exec $ROUTER_NS sysctl -wq net.ipv4.ip_forward=1
+	ip net exec $ROUTER_NS sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip -net $SERVER_NS link set link3 up
+	ip -net $SERVER_NS addr add $SERVER_IP4/24 dev link3
+	ip -net $SERVER_NS addr add $SERVER_IP6/64 dev link3 nodad
+	ip -net $SERVER_NS route add $CLIENT_IP4 dev link3 via $SERVER_GW4
+	ip -net $SERVER_NS route add $CLIENT_IP6 dev link3 via $SERVER_GW6
+	ip -net $SERVER_NS link set dev link3 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
+	ip net exec $SERVER_NS sysctl -wq net.ipv4.tcp_window_scaling=10
+	ip net exec $SERVER_NS netserver 2>&1 >/dev/null
+}
+
+cleanup() {
+	ip net exec $SERVER_NS pkill netserver
+	ip -net $ROUTER_NS link del link1
+	ip -net $ROUTER_NS link del link2
+	ip netns del "$CLIENT_NS"
+	ip netns del "$SERVER_NS"
+	ip netns del "$ROUTER_NS"
+}
+
+start_counter() {
+	local ipt="iptables"
+	local iface=$1
+	local netns=$2
+
+	[ "$NF" = "6" ] && ipt="ip6tables"
+	ip net exec $netns $ipt -t raw -A PREROUTING -i $iface \
+		-m length ! --length 0:$CHK_SIZE -j ACCEPT
+}
+
+check_counter() {
+	local ipt="iptables"
+	local iface=$1
+	local netns=$2
+
+	[ "$NF" = "6" ] && ipt="ip6tables"
+	test `ip net exec $netns $ipt -t raw -L -v |grep $iface | awk '{print $1}'` != "0"
+}
+
+stop_counter() {
+	local ipt="iptables"
+	local iface=$1
+	local netns=$2
+
+	[ "$NF" = "6" ] && ipt="ip6tables"
+	ip net exec $netns $ipt -t raw -D PREROUTING -i $iface \
+		-m length ! --length 0:$CHK_SIZE -j ACCEPT
+}
+
+do_netperf() {
+	local serip=$SERVER_IP4
+	local netns=$1
+
+	[ "$NF" = "6" ] && serip=$SERVER_IP6
+	ip net exec $netns netperf -$NF -t TCP_STREAM -H $serip 2>&1 >/dev/null
+}
+
+do_test() {
+	local cli_tso=$1
+	local gw_gro=$2
+	local gw_tso=$3
+	local ser_gro=$4
+	local ret="PASS"
+
+	ip net exec $CLIENT_NS ethtool -K link0 tso $cli_tso
+	ip net exec $ROUTER_NS ethtool -K link1 gro $gw_gro
+	ip net exec $ROUTER_NS ethtool -K link2 tso $gw_tso
+	ip net exec $SERVER_NS ethtool -K link3 gro $ser_gro
+
+	start_counter link1 $ROUTER_NS
+	start_counter link3 $SERVER_NS
+	do_netperf $CLIENT_NS
+
+	if check_counter link1 $ROUTER_NS; then
+		check_counter link3 $SERVER_NS || ret="FAIL_on_link3"
+	else
+		ret="FAIL_on_link1"
+	fi
+
+	stop_counter link1 $ROUTER_NS
+	stop_counter link3 $SERVER_NS
+	printf "%-9s %-8s %-8s %-8s: [%s]\n" \
+		$cli_tso $gw_gro $gw_tso $ser_gro $ret
+	test $ret = "PASS"
+}
+
+testup() {
+	echo "CLI GSO | GW GRO | GW GSO | SER GRO" && \
+	do_test "on"  "on"  "on"  "on"  && \
+	do_test "on"  "off" "on"  "off" && \
+	do_test "off" "on"  "on"  "on"  && \
+	do_test "on"  "on"  "off" "on"  && \
+	do_test "off" "on"  "off" "on"
+}
+
+if ! netperf -V &> /dev/null; then
+	echo "SKIP: Could not run test without netperf tool"
+	exit $ksft_skip
+fi
+
+trap cleanup EXIT
+setup && echo "Testing for BIG TCP:" && \
+NF=4 testup && echo "***v4 Tests Done***" && \
+NF=6 testup && echo "***v6 Tests Done***"
+exit $?
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-14  3:31 ` [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
@ 2023-01-14 15:38   ` Paul Moore
  2023-01-14 17:52     ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Moore @ 2023-01-14 15:38 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
> the iph->tot_len update should use iph_set_totlen().
>
> Note that for these non GSO packets, the new iph tot_len with extra
> iph option len added may become greater than 65535, the old process
> will cast it and set iph->tot_len to it, which is a bug. In theory,
> iph options shouldn't be added for these big packets in here, a fix
> may be needed here in the future. For now this patch is only to set
> iph->tot_len to 0 when it happens.

I'm not entirely clear on the paragraph above, but we do need to be
able to set/modify the IP options in cipso_v4_skbuff_setattr() in
order to support CIPSO labeling.  I'm open to better and/or
alternative solutions compared to what we are doing now, but I can't
support a change that is a bug waiting to bite us.  My apologies if
I'm interpreting your comments incorrectly and that isn't the case
here.

> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/ipv4/cipso_ipv4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> index 6cd3b6c559f0..79ae7204e8ed 100644
> --- a/net/ipv4/cipso_ipv4.c
> +++ b/net/ipv4/cipso_ipv4.c
> @@ -2222,7 +2222,7 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
>                 memset((char *)(iph + 1) + buf_len, 0, opt_len - buf_len);
>         if (len_delta != 0) {
>                 iph->ihl = 5 + (opt_len >> 2);
> -               iph->tot_len = htons(skb->len);
> +               iph_set_totlen(iph, skb->len);
>         }
>         ip_send_check(iph);
>
> --
> 2.31.1

--
paul-moore.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-14 15:38   ` Paul Moore
@ 2023-01-14 17:52     ` Xin Long
  2023-01-16 16:45       ` Paul Moore
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-14 17:52 UTC (permalink / raw)
  To: Paul Moore
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
> > the iph->tot_len update should use iph_set_totlen().
> >
> > Note that for these non GSO packets, the new iph tot_len with extra
> > iph option len added may become greater than 65535, the old process
> > will cast it and set iph->tot_len to it, which is a bug. In theory,
> > iph options shouldn't be added for these big packets in here, a fix
> > may be needed here in the future. For now this patch is only to set
> > iph->tot_len to 0 when it happens.
>
> I'm not entirely clear on the paragraph above, but we do need to be
> able to set/modify the IP options in cipso_v4_skbuff_setattr() in
> order to support CIPSO labeling.  I'm open to better and/or
> alternative solutions compared to what we are doing now, but I can't
> support a change that is a bug waiting to bite us.  My apologies if
> I'm interpreting your comments incorrectly and that isn't the case
> here.
setting the IP options may cause the packet size to grow (both iph->tot_len
and skb->len), for example:

before setting it, iph->tot_len=65535,
after setting it, iph->tot_len=65535 + 14 (assume the IP option len is 14)
however, tot_len is 16 bit, and can't be set to "65535 + 14".

Hope the above makes it clearer.

This problem exists with or without this patch. Not sure how it should
be fixed in cipso_v4. I knew tcpmss_tg4() setting TCP options which also
causes the packet size to grow, but it requires no data payload:

        /* There is data after the header so the option can't be added
         * without moving it, and doing so may make the SYN packet
         * itself too large. Accept the packet unmodified instead.
         */
        if (len > tcp_hdrlen)
                return 0;

and then the new iph->tot_len won't exceed 65535.

Not sure if we can skip the big packet, or segment/fragment the big packet
in cipso_v4_skbuff_setattr().

Again, this patch fixes the issue when it's an IPv4 BIG TCP packets, and
it doesn't introduce any new issues or fix any old issues, but only fix
what the IPv4 BIG TCP may introduce.

Thanks.

>
> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > ---
> >  net/ipv4/cipso_ipv4.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> > index 6cd3b6c559f0..79ae7204e8ed 100644
> > --- a/net/ipv4/cipso_ipv4.c
> > +++ b/net/ipv4/cipso_ipv4.c
> > @@ -2222,7 +2222,7 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
> >                 memset((char *)(iph + 1) + buf_len, 0, opt_len - buf_len);
> >         if (len_delta != 0) {
> >                 iph->ihl = 5 + (opt_len >> 2);
> > -               iph->tot_len = htons(skb->len);
> > +               iph_set_totlen(iph, skb->len);
> >         }
> >         ip_send_check(iph);
> >
> > --
> > 2.31.1
>
> --
> paul-moore.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-14  3:31 ` [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6 Xin Long
@ 2023-01-15 15:41   ` David Ahern
  2023-01-15 17:42     ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: David Ahern @ 2023-01-15 15:41 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/13/23 8:31 PM, Xin Long wrote:
> For IPv6 jumbogram packets, the packet size is bigger than 65535,
> it's not right to get it from payload_len and save it to an u16
> variable.
> 
> This patch only fixes it for IPv6 BIG TCP packets, so instead of
> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> BIG TCP packets.
> 
> This fix will also help us add selftest for IPv6 BIG TCP in the
> following patch.
> 

If this is a bug fix for the existing IPv6 support, send it outside of
this set for -net.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 00/10] net: support ipv4 big tcp
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (9 preceding siblings ...)
  2023-01-14  3:31 ` [PATCH net-next 10/10] selftests: add a selftest for big tcp Xin Long
@ 2023-01-15 15:45 ` David Ahern
  2023-01-15 16:04 ` Eric Dumazet
  11 siblings, 0 replies; 45+ messages in thread
From: David Ahern @ 2023-01-15 15:45 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/13/23 8:31 PM, Xin Long wrote:
> This is similar to the BIG TCP patchset added by Eric for IPv6:
> 
>   https://lwn.net/Articles/895398/
> 
> Different from IPv6, IPv4 tot_len is 16-bit long only, and IPv4 header
> doesn't have exthdrs(options) for the BIG TCP packets' length. To make
> it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
> indicate this might be a BIG TCP packet and use skb->len as the real
> IPv4 total length.
> 
> This will work safely, as all BIG TCP packets are GSO/GRO packets and
> processed on the same host as they were created; There is no padding
> in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4
> packet total length; Also, before implementing the feature, all those
> places that may get iph tot_len from BIG TCP packets are taken care
> with some new APIs:
> 
> Patch 1 adds some APIs for iph tot_len setting and getting, which are
> used in all these places where IPv4 BIG TCP packets may reach in Patch
> 2-7, and Patch 8 implements this feature and Patch 10 adds a selftest
> for it. Patch 9 is a fix in netfilter length_mt6 so that the selftest
> can also cover IPv6 BIG TCP.
> 
> Note that the similar change as in Patch 2-7 are also needed for IPv6
> BIG TCP packets, and will be addressed in another patchset.
> 
> The similar performance test is done for IPv4 BIG TCP with 25Gbit NIC
> and 1.5K MTU:
> 
> No BIG TCP:
> for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
> 168          322          337          3776.49
> 143          236          277          4654.67
> 128          258          288          4772.83
> 171          229          278          4645.77
> 175          228          243          4678.93
> 149          239          279          4599.86
> 164          234          268          4606.94
> 155          276          289          4235.82
> 180          255          268          4418.95
> 168          241          249          4417.82
> 
> Enable BIG TCP:
> ip link set dev ens1f0np0 gro_max_size 128000 gso_max_size 128000
> for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
> 161          241          252          4821.73
> 174          205          217          5098.28
> 167          208          220          5001.43
> 164          228          249          4883.98
> 150          233          249          4914.90
> 180          233          244          4819.66
> 154          208          219          5004.92
> 157          209          247          4999.78
> 160          218          246          4842.31
> 174          206          217          5080.99
> 
> Xin Long (10):
>   net: add a couple of helpers for iph tot_len
>   bridge: use skb_ip_totlen in br netfilter
>   openvswitch: use skb_ip_totlen in conntrack
>   net: sched: use skb_ip_totlen and iph_totlen
>   netfilter: use skb_ip_totlen and iph_totlen
>   cipso_ipv4: use iph_set_totlen in skbuff_setattr
>   ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
>   net: add support for ipv4 big tcp
>   netfilter: get ipv6 pktlen properly in length_mt6
>   selftests: add a selftest for big tcp
> 
>  drivers/net/ipvlan/ipvlan_core.c           |   2 +-
>  include/linux/ip.h                         |  20 +++
>  include/linux/ipv6.h                       |   9 ++
>  include/net/netfilter/nf_tables_ipv4.h     |   4 +-
>  include/net/route.h                        |   3 -
>  net/bridge/br_netfilter_hooks.c            |   2 +-
>  net/bridge/netfilter/nf_conntrack_bridge.c |   4 +-
>  net/core/gro.c                             |   6 +-
>  net/core/sock.c                            |  11 +-
>  net/ipv4/af_inet.c                         |   7 +-
>  net/ipv4/cipso_ipv4.c                      |   2 +-
>  net/ipv4/ip_input.c                        |   2 +-
>  net/ipv4/ip_output.c                       |   2 +-
>  net/netfilter/ipvs/ip_vs_xmit.c            |   2 +-
>  net/netfilter/nf_log_syslog.c              |   2 +-
>  net/netfilter/xt_length.c                  |   5 +-
>  net/openvswitch/conntrack.c                |   2 +-
>  net/sched/act_ct.c                         |   2 +-
>  net/sched/sch_cake.c                       |   2 +-
>  tools/testing/selftests/net/Makefile       |   1 +
>  tools/testing/selftests/net/big_tcp.sh     | 157 +++++++++++++++++++++
>  21 files changed, 212 insertions(+), 35 deletions(-)
>  create mode 100755 tools/testing/selftests/net/big_tcp.sh
> 

Thanks for working on this and writing the selftests.

A couple of years ago I was experimenting with a simpler version of this
change (only changed what I needed to run tests). tcpdump (as an example
of packet socket app) was confused about the packet length and reported
truncated packet errors. As I recall that is the only really tricky part
to getting large packets for IPv4.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 00/10] net: support ipv4 big tcp
  2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
                   ` (10 preceding siblings ...)
  2023-01-15 15:45 ` [PATCH net-next 00/10] net: support ipv4 " David Ahern
@ 2023-01-15 16:04 ` Eric Dumazet
  2023-01-15 17:33   ` Xin Long
  11 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-15 16:04 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sat, Jan 14, 2023 at 4:31 AM Xin Long <lucien.xin@gmail.com> wrote:
>
> This is similar to the BIG TCP patchset added by Eric for IPv6:
>
>   https://lwn.net/Articles/895398/
>
> Different from IPv6, IPv4 tot_len is 16-bit long only, and IPv4 header
> doesn't have exthdrs(options) for the BIG TCP packets' length. To make
> it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
> indicate this might be a BIG TCP packet and use skb->len as the real
> IPv4 total length.
>
> This will work safely, as all BIG TCP packets are GSO/GRO packets and
> processed on the same host as they were created; There is no padding
> in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4
> packet total length; Also, before implementing the feature, all those
> places that may get iph tot_len from BIG TCP packets are taken care
> with some new APIs:
>
> Patch 1 adds some APIs for iph tot_len setting and getting, which are
> used in all these places where IPv4 BIG TCP packets may reach in Patch
> 2-7, and Patch 8 implements this feature and Patch 10 adds a selftest
> for it. Patch 9 is a fix in netfilter length_mt6 so that the selftest
> can also cover IPv6 BIG TCP.
>
> Note that the similar change as in Patch 2-7 are also needed for IPv6
> BIG TCP packets, and will be addressed in another patchset.
>
> The similar performance test is done for IPv4 BIG TCP with 25Gbit NIC
> and 1.5K MTU:
>

This is broken, sorry.

There are reasons BIG TCP was implemented for IPv6 only, it seems you
missed a lot of them.

Networking needs observability and diagnostic tools.

Until you come back with a proper way for tcpdump to not mess things,
there is no way I can ACK these changes.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 00/10] net: support ipv4 big tcp
  2023-01-15 16:04 ` Eric Dumazet
@ 2023-01-15 17:33   ` Xin Long
  0 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-15 17:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: network dev, davem, kuba, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sun, Jan 15, 2023 at 11:05 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Jan 14, 2023 at 4:31 AM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > This is similar to the BIG TCP patchset added by Eric for IPv6:
> >
> >   https://lwn.net/Articles/895398/
> >
> > Different from IPv6, IPv4 tot_len is 16-bit long only, and IPv4 header
> > doesn't have exthdrs(options) for the BIG TCP packets' length. To make
> > it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to
> > indicate this might be a BIG TCP packet and use skb->len as the real
> > IPv4 total length.
> >
> > This will work safely, as all BIG TCP packets are GSO/GRO packets and
> > processed on the same host as they were created; There is no padding
> > in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4
> > packet total length; Also, before implementing the feature, all those
> > places that may get iph tot_len from BIG TCP packets are taken care
> > with some new APIs:
> >
> > Patch 1 adds some APIs for iph tot_len setting and getting, which are
> > used in all these places where IPv4 BIG TCP packets may reach in Patch
> > 2-7, and Patch 8 implements this feature and Patch 10 adds a selftest
> > for it. Patch 9 is a fix in netfilter length_mt6 so that the selftest
> > can also cover IPv6 BIG TCP.
> >
> > Note that the similar change as in Patch 2-7 are also needed for IPv6
> > BIG TCP packets, and will be addressed in another patchset.
> >
> > The similar performance test is done for IPv4 BIG TCP with 25Gbit NIC
> > and 1.5K MTU:
> >
>
> This is broken, sorry.
>
> There are reasons BIG TCP was implemented for IPv6 only, it seems you
> missed a lot of them.
>
> Networking needs observability and diagnostic tools.
>
> Until you come back with a proper way for tcpdump to not mess things,
> there is no way I can ACK these changes.
For the installed tcpdump, it's just parsing iph->tot_len from the raw pkt,
and I'm not sure how to make it without any change in tcpdump. But,

https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18

As this is already in tcpdump, we can build tcpdump with "-DGUESS_TSO":

# ./configure CFLAGS=-DGUESS_TSO

It seems someone had met such problems even without IPv4 BIG TCP, not
sure in Linux or other OS.
Now it's time to enable this CFLAG. :-)

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-15 15:41   ` David Ahern
@ 2023-01-15 17:42     ` Xin Long
  2023-01-15 19:40       ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-15 17:42 UTC (permalink / raw)
  To: David Ahern
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 1/13/23 8:31 PM, Xin Long wrote:
> > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > it's not right to get it from payload_len and save it to an u16
> > variable.
> >
> > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > BIG TCP packets.
> >
> > This fix will also help us add selftest for IPv6 BIG TCP in the
> > following patch.
> >
>
> If this is a bug fix for the existing IPv6 support, send it outside of
> this set for -net.
>
Sure,
I was thinking of adding it here to be able to support selftest for
IPv6 too in the next patch. But it seems to make more sense to
get it into -net first, then add this selftest after it goes to net-next.

I will post it and all other fixes I mentioned in the cover-letter for
IPv6 BIG TCP for -net.

But before that, I hope Eric can confirm it is okay to read the length
of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
instead of parsing JUMBO exthdr?

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-15 17:42     ` Xin Long
@ 2023-01-15 19:40       ` Eric Dumazet
  2023-01-15 20:14         ` Xin Long
  2023-01-15 23:58         ` David Ahern
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2023-01-15 19:40 UTC (permalink / raw)
  To: Xin Long
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 1/13/23 8:31 PM, Xin Long wrote:
> > > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > > it's not right to get it from payload_len and save it to an u16
> > > variable.
> > >
> > > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > > BIG TCP packets.
> > >
> > > This fix will also help us add selftest for IPv6 BIG TCP in the
> > > following patch.
> > >
> >
> > If this is a bug fix for the existing IPv6 support, send it outside of
> > this set for -net.
> >
> Sure,
> I was thinking of adding it here to be able to support selftest for
> IPv6 too in the next patch. But it seems to make more sense to
> get it into -net first, then add this selftest after it goes to net-next.
>
> I will post it and all other fixes I mentioned in the cover-letter for
> IPv6 BIG TCP for -net.
>
> But before that, I hope Eric can confirm it is okay to read the length
> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
> instead of parsing JUMBO exthdr?
>

I do not think it is ok, but I will leave the question to netfilter maintainers.

Guessing things in tcpdump or other tools is up to user space implementations,
trying to work around some (kernel ?) deficiencies.

Yes, IPv6 extensions headers are a pain, we all agree.

Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
or ipv6_hop_jumbo())

So skb->len is not the root of trust. Some transport mediums might add paddings.

Ipv4 has a similar logic in ip_rcv_core().

len = ntohs(iph->tot_len);
if (skb->len < len) {
     drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
     __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
    goto drop;
} else if (len < (iph->ihl*4))
     goto inhdr_error;

/* Our transport medium may have padded the buffer out. Now we know it
* is IP we can trim to the true length of the frame.
* Note this now means skb->len holds ntohs(iph->tot_len).
*/
if (pskb_trim_rcsum(skb, len)) {
      __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
      goto drop;
}

After your changes, we might accept illegal packets that were properly
dropped before.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-15 19:40       ` Eric Dumazet
@ 2023-01-15 20:14         ` Xin Long
  2023-01-15 23:57           ` David Ahern
  2023-01-16  9:24           ` Eric Dumazet
  2023-01-15 23:58         ` David Ahern
  1 sibling, 2 replies; 45+ messages in thread
From: Xin Long @ 2023-01-15 20:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 1/13/23 8:31 PM, Xin Long wrote:
> > > > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > > > it's not right to get it from payload_len and save it to an u16
> > > > variable.
> > > >
> > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > > > BIG TCP packets.
> > > >
> > > > This fix will also help us add selftest for IPv6 BIG TCP in the
> > > > following patch.
> > > >
> > >
> > > If this is a bug fix for the existing IPv6 support, send it outside of
> > > this set for -net.
> > >
> > Sure,
> > I was thinking of adding it here to be able to support selftest for
> > IPv6 too in the next patch. But it seems to make more sense to
> > get it into -net first, then add this selftest after it goes to net-next.
> >
> > I will post it and all other fixes I mentioned in the cover-letter for
> > IPv6 BIG TCP for -net.
> >
> > But before that, I hope Eric can confirm it is okay to read the length
> > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
> > instead of parsing JUMBO exthdr?
> >
>
> I do not think it is ok, but I will leave the question to netfilter maintainers.
Just note that the issue doesn't only exist in netfilter.
All the changes in Patch 2-7 from this patchset are also needed for IPv6
BIG TCP packets.

>
> Guessing things in tcpdump or other tools is up to user space implementations,
> trying to work around some (kernel ?) deficiencies.
>
> Yes, IPv6 extensions headers are a pain, we all agree.
>
> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
> or ipv6_hop_jumbo())
>
> So skb->len is not the root of trust. Some transport mediums might add paddings.
>
> Ipv4 has a similar logic in ip_rcv_core().
>
> len = ntohs(iph->tot_len);
> if (skb->len < len) {
>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>     goto drop;
> } else if (len < (iph->ihl*4))
>      goto inhdr_error;
>
> /* Our transport medium may have padded the buffer out. Now we know it
> * is IP we can trim to the true length of the frame.
> * Note this now means skb->len holds ntohs(iph->tot_len).
> */
> if (pskb_trim_rcsum(skb, len)) {
>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>       goto drop;
> }
>
> After your changes, we might accept illegal packets that were properly
> dropped before.
I think skb->len is trustable for GSO/GRO packets.
In ipv6_gro_complete/inet_gro_complete():
The new length for payload_len or iph->tot_len are all calculated from skb->len.
As I said in the cover-letter, "there is no padding in GSO/GRO packets".
Or am I missing something?

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-15 20:14         ` Xin Long
@ 2023-01-15 23:57           ` David Ahern
  2023-01-16  9:24           ` Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: David Ahern @ 2023-01-15 23:57 UTC (permalink / raw)
  To: Xin Long, Eric Dumazet
  Cc: network dev, davem, kuba, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/15/23 1:14 PM, Xin Long wrote:
>>
>> So skb->len is not the root of trust. Some transport mediums might add paddings.
>>

That padding is to meet minimum packet sizes AIUI.

Ethernet for example has a minimum packet length of 60B. Some protocols
running over ethernet can generate packets less than 60B. For example,
ARP is 52B as I recall so something in the hardware pipeline must pad
that out to 60B. IPv4 with UDP and small L4 payloads is another example.
So the trim is to remove the padding added to meet a minimum size. That
should not be relevant for large packets.

>> Ipv4 has a similar logic in ip_rcv_core().
>>
>> len = ntohs(iph->tot_len);
>> if (skb->len < len) {
>>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>>     goto drop;
>> } else if (len < (iph->ihl*4))
>>      goto inhdr_error;
>>
>> /* Our transport medium may have padded the buffer out. Now we know it
>> * is IP we can trim to the true length of the frame.
>> * Note this now means skb->len holds ntohs(iph->tot_len).
>> */
>> if (pskb_trim_rcsum(skb, len)) {
>>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>>       goto drop;
>> }
>>
>> After your changes, we might accept illegal packets that were properly
>> dropped before.
> I think skb->len is trustable for GSO/GRO packets.

That is my understanding as well.

> In ipv6_gro_complete/inet_gro_complete():
> The new length for payload_len or iph->tot_len are all calculated from skb->len.
> As I said in the cover-letter, "there is no padding in GSO/GRO packets".
> Or am I missing something?
> 
> Thanks.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-15 19:40       ` Eric Dumazet
  2023-01-15 20:14         ` Xin Long
@ 2023-01-15 23:58         ` David Ahern
  1 sibling, 0 replies; 45+ messages in thread
From: David Ahern @ 2023-01-15 23:58 UTC (permalink / raw)
  To: Eric Dumazet, Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/15/23 12:40 PM, Eric Dumazet wrote:
> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
>>
>> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>>>
>>> On 1/13/23 8:31 PM, Xin Long wrote:
>>>> For IPv6 jumbogram packets, the packet size is bigger than 65535,
>>>> it's not right to get it from payload_len and save it to an u16
>>>> variable.
>>>>
>>>> This patch only fixes it for IPv6 BIG TCP packets, so instead of
>>>> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
>>>> gets the pktlen via 'skb->len - skb_network_offset(skb)' when
>>>> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
>>>> BIG TCP packets.
>>>>
>>>> This fix will also help us add selftest for IPv6 BIG TCP in the
>>>> following patch.
>>>>
>>>
>>> If this is a bug fix for the existing IPv6 support, send it outside of
>>> this set for -net.
>>>
>> Sure,
>> I was thinking of adding it here to be able to support selftest for
>> IPv6 too in the next patch. But it seems to make more sense to
>> get it into -net first, then add this selftest after it goes to net-next.
>>
>> I will post it and all other fixes I mentioned in the cover-letter for
>> IPv6 BIG TCP for -net.
>>
>> But before that, I hope Eric can confirm it is okay to read the length
>> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
>> instead of parsing JUMBO exthdr?
>>
> 
> I do not think it is ok, but I will leave the question to netfilter maintainers.
> 
> Guessing things in tcpdump or other tools is up to user space implementations,
> trying to work around some (kernel ?) deficiencies.
> 
> Yes, IPv6 extensions headers are a pain, we all agree.
> 
> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
> or ipv6_hop_jumbo())
> 
> So skb->len is not the root of trust. Some transport mediums might add paddings.
> 
> Ipv4 has a similar logic in ip_rcv_core().
> 
> len = ntohs(iph->tot_len);
> if (skb->len < len) {
>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>     goto drop;
> } else if (len < (iph->ihl*4))
>      goto inhdr_error;
> 
> /* Our transport medium may have padded the buffer out. Now we know it
> * is IP we can trim to the true length of the frame.
> * Note this now means skb->len holds ntohs(iph->tot_len).
> */
> if (pskb_trim_rcsum(skb, len)) {
>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>       goto drop;
> }
> 
> After your changes, we might accept illegal packets that were properly
> dropped before.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-15 20:14         ` Xin Long
  2023-01-15 23:57           ` David Ahern
@ 2023-01-16  9:24           ` Eric Dumazet
  2023-01-16 15:07             ` David Ahern
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-16  9:24 UTC (permalink / raw)
  To: Xin Long
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Sun, Jan 15, 2023 at 9:15 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > > >
> > > > On 1/13/23 8:31 PM, Xin Long wrote:
> > > > > For IPv6 jumbogram packets, the packet size is bigger than 65535,
> > > > > it's not right to get it from payload_len and save it to an u16
> > > > > variable.
> > > > >
> > > > > This patch only fixes it for IPv6 BIG TCP packets, so instead of
> > > > > parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
> > > > > gets the pktlen via 'skb->len - skb_network_offset(skb)' when
> > > > > skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
> > > > > BIG TCP packets.
> > > > >
> > > > > This fix will also help us add selftest for IPv6 BIG TCP in the
> > > > > following patch.
> > > > >
> > > >
> > > > If this is a bug fix for the existing IPv6 support, send it outside of
> > > > this set for -net.
> > > >
> > > Sure,
> > > I was thinking of adding it here to be able to support selftest for
> > > IPv6 too in the next patch. But it seems to make more sense to
> > > get it into -net first, then add this selftest after it goes to net-next.
> > >
> > > I will post it and all other fixes I mentioned in the cover-letter for
> > > IPv6 BIG TCP for -net.
> > >
> > > But before that, I hope Eric can confirm it is okay to read the length
> > > of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
> > > instead of parsing JUMBO exthdr?
> > >
> >
> > I do not think it is ok, but I will leave the question to netfilter maintainers.
> Just note that the issue doesn't only exist in netfilter.
> All the changes in Patch 2-7 from this patchset are also needed for IPv6
> BIG TCP packets.
>
> >
> > Guessing things in tcpdump or other tools is up to user space implementations,
> > trying to work around some (kernel ?) deficiencies.
> >
> > Yes, IPv6 extensions headers are a pain, we all agree.
> >
> > Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
> > skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
> > or ipv6_hop_jumbo())
> >
> > So skb->len is not the root of trust. Some transport mediums might add paddings.
> >
> > Ipv4 has a similar logic in ip_rcv_core().
> >
> > len = ntohs(iph->tot_len);
> > if (skb->len < len) {
> >      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
> >      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
> >     goto drop;
> > } else if (len < (iph->ihl*4))
> >      goto inhdr_error;
> >
> > /* Our transport medium may have padded the buffer out. Now we know it
> > * is IP we can trim to the true length of the frame.
> > * Note this now means skb->len holds ntohs(iph->tot_len).
> > */
> > if (pskb_trim_rcsum(skb, len)) {
> >       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
> >       goto drop;
> > }
> >
> > After your changes, we might accept illegal packets that were properly
> > dropped before.
> I think skb->len is trustable for GSO/GRO packets.
> In ipv6_gro_complete/inet_gro_complete():
> The new length for payload_len or iph->tot_len are all calculated from skb->len.
> As I said in the cover-letter, "there is no padding in GSO/GRO packets".
> Or am I missing something?

This seems to be a contract violation with user space providing GSO packets.

In our changes we added some sanity checks, inherent to JUMBO specs.

Here, a GSO packet can now have a zero ip length, no matter if it is
BIG TCP or not.

It seems we lower the bar for consistency, and allow bugs (say
changing skb->len) to not be detected.

As you said, user space sniffing packets now have to guess what is the
intent, instead of headers carrying all the needed information
that can be fully validated by parsers.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-16  9:24           ` Eric Dumazet
@ 2023-01-16 15:07             ` David Ahern
  2023-01-16 16:02               ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: David Ahern @ 2023-01-16 15:07 UTC (permalink / raw)
  To: Eric Dumazet, Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/16/23 2:24 AM, Eric Dumazet wrote:
> On Sun, Jan 15, 2023 at 9:15 PM Xin Long <lucien.xin@gmail.com> wrote:
>>
>> On Sun, Jan 15, 2023 at 2:40 PM Eric Dumazet <edumazet@google.com> wrote:
>>>
>>> On Sun, Jan 15, 2023 at 6:43 PM Xin Long <lucien.xin@gmail.com> wrote:
>>>>
>>>> On Sun, Jan 15, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>>>>>
>>>>> On 1/13/23 8:31 PM, Xin Long wrote:
>>>>>> For IPv6 jumbogram packets, the packet size is bigger than 65535,
>>>>>> it's not right to get it from payload_len and save it to an u16
>>>>>> variable.
>>>>>>
>>>>>> This patch only fixes it for IPv6 BIG TCP packets, so instead of
>>>>>> parsing IPV6_TLV_JUMBO exthdr, which is quite some work, it only
>>>>>> gets the pktlen via 'skb->len - skb_network_offset(skb)' when
>>>>>> skb_is_gso_v6() and saves it to an u32 variable, similar to IPv4
>>>>>> BIG TCP packets.
>>>>>>
>>>>>> This fix will also help us add selftest for IPv6 BIG TCP in the
>>>>>> following patch.
>>>>>>
>>>>>
>>>>> If this is a bug fix for the existing IPv6 support, send it outside of
>>>>> this set for -net.
>>>>>
>>>> Sure,
>>>> I was thinking of adding it here to be able to support selftest for
>>>> IPv6 too in the next patch. But it seems to make more sense to
>>>> get it into -net first, then add this selftest after it goes to net-next.
>>>>
>>>> I will post it and all other fixes I mentioned in the cover-letter for
>>>> IPv6 BIG TCP for -net.
>>>>
>>>> But before that, I hope Eric can confirm it is okay to read the length
>>>> of IPv6 BIG TCP packets with skb_ipv6_totlen() defined in this patch,
>>>> instead of parsing JUMBO exthdr?
>>>>
>>>
>>> I do not think it is ok, but I will leave the question to netfilter maintainers.
>> Just note that the issue doesn't only exist in netfilter.
>> All the changes in Patch 2-7 from this patchset are also needed for IPv6
>> BIG TCP packets.
>>
>>>
>>> Guessing things in tcpdump or other tools is up to user space implementations,
>>> trying to work around some (kernel ?) deficiencies.
>>>
>>> Yes, IPv6 extensions headers are a pain, we all agree.
>>>
>>> Look at how ip6_rcv_core() properly dissects extension headers _and_ trim
>>> skb accordingly (pskb_trim_rcsum() called either from ip6_rcv_core()
>>> or ipv6_hop_jumbo())
>>>
>>> So skb->len is not the root of trust. Some transport mediums might add paddings.
>>>
>>> Ipv4 has a similar logic in ip_rcv_core().
>>>
>>> len = ntohs(iph->tot_len);
>>> if (skb->len < len) {
>>>      drop_reason = SKB_DROP_REASON_PKT_TOO_SMALL;
>>>      __IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
>>>     goto drop;
>>> } else if (len < (iph->ihl*4))
>>>      goto inhdr_error;
>>>
>>> /* Our transport medium may have padded the buffer out. Now we know it
>>> * is IP we can trim to the true length of the frame.
>>> * Note this now means skb->len holds ntohs(iph->tot_len).
>>> */
>>> if (pskb_trim_rcsum(skb, len)) {
>>>       __IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
>>>       goto drop;
>>> }
>>>
>>> After your changes, we might accept illegal packets that were properly
>>> dropped before.
>> I think skb->len is trustable for GSO/GRO packets.
>> In ipv6_gro_complete/inet_gro_complete():
>> The new length for payload_len or iph->tot_len are all calculated from skb->len.
>> As I said in the cover-letter, "there is no padding in GSO/GRO packets".
>> Or am I missing something?
> 
> This seems to be a contract violation with user space providing GSO packets.
> 
> In our changes we added some sanity checks, inherent to JUMBO specs.
> 
> Here, a GSO packet can now have a zero ip length, no matter if it is
> BIG TCP or not.

Meaning your preference is to set tot_len anytime it is <= 64kB so the
only time tot_len == 0 is for large GRO/TSO packets? That is doable.

> 
> It seems we lower the bar for consistency, and allow bugs (say
> changing skb->len) to not be detected.

not sure why you think it would not be detected. Today's model for gro
sets tot_len based on skb->len. There is an inherent trust that the
user's of the gro API set the length correctly. If it is not, the
payload to userspace would ultimately be non-sense and hence detectable.
I tend to use ssh to test changes like this for this reason - L4 payload
must make sense.

For the Tx path, there is a similar line of trust that the skb->len
passed to the L3 layer is correct. IPv4/IPv6 blindly trust what it is
told for length.


> 
> As you said, user space sniffing packets now have to guess what is the
> intent, instead of headers carrying all the needed information
> that can be fully validated by parsers.

This is a solveable problem within the packet socket API, and the entire
thing is opt-in. If a user's tcpdump / packet capture program is out of
date and does not support the new API for large packets, then that user
does not have to enable large GRO/TSO.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-16 15:07             ` David Ahern
@ 2023-01-16 16:02               ` Eric Dumazet
  2023-01-16 19:09                 ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-16 16:02 UTC (permalink / raw)
  To: David Ahern
  Cc: Xin Long, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
>

> not sure why you think it would not be detected. Today's model for gro
> sets tot_len based on skb->len. There is an inherent trust that the
> user's of the gro API set the length correctly. If it is not, the
> payload to userspace would ultimately be non-sense and hence detectable.

Only if you use some kind of upper protocol adding message integrity
verification.

> >
> > As you said, user space sniffing packets now have to guess what is the
> > intent, instead of headers carrying all the needed information
> > that can be fully validated by parsers.
>
> This is a solveable problem within the packet socket API, and the entire
> thing is opt-in. If a user's tcpdump / packet capture program is out of
> date and does not support the new API for large packets, then that user
> does not have to enable large GRO/TSO.

I do not see this being solved yet.

We have enabled BIG TCP only for IPv6, we do not want the same to
magically be enabled for ipv4
when a new kernel is deployed.

Make sure that IPV4 BIG TCP is guarded by a separate tunable.

Note that our initial patches were adding IPv6 tunables for a reason.

We agreed to change them to IFLA_GRO_MAX_SIZE, IFLA_TSO_MAX_SIZE,
as long as they would not enable unwanted/untested behavior.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-14 17:52     ` Xin Long
@ 2023-01-16 16:45       ` Paul Moore
  2023-01-16 17:36         ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Moore @ 2023-01-16 16:45 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Sat, Jan 14, 2023 at 12:54 PM Xin Long <lucien.xin@gmail.com> wrote:
> On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
> > > the iph->tot_len update should use iph_set_totlen().
> > >
> > > Note that for these non GSO packets, the new iph tot_len with extra
> > > iph option len added may become greater than 65535, the old process
> > > will cast it and set iph->tot_len to it, which is a bug. In theory,
> > > iph options shouldn't be added for these big packets in here, a fix
> > > may be needed here in the future. For now this patch is only to set
> > > iph->tot_len to 0 when it happens.
> >
> > I'm not entirely clear on the paragraph above, but we do need to be
> > able to set/modify the IP options in cipso_v4_skbuff_setattr() in
> > order to support CIPSO labeling.  I'm open to better and/or
> > alternative solutions compared to what we are doing now, but I can't
> > support a change that is a bug waiting to bite us.  My apologies if
> > I'm interpreting your comments incorrectly and that isn't the case
> > here.
> setting the IP options may cause the packet size to grow (both iph->tot_len
> and skb->len), for example:
>
> before setting it, iph->tot_len=65535,
> after setting it, iph->tot_len=65535 + 14 (assume the IP option len is 14)
> however, tot_len is 16 bit, and can't be set to "65535 + 14".
>
> Hope the above makes it clearer.

Thanks, it does.

> This problem exists with or without this patch. Not sure how it should
> be fixed in cipso_v4 ...
>
> Not sure if we can skip the big packet, or segment/fragment the big packet
> in cipso_v4_skbuff_setattr().

We can't skip the CIPSO labeling as that would be the network packet
equivalent of not assigning a owner/group/mode to a file on the
filesystem, which is a Very Bad Thing :)

I spent a little bit of time this morning looking at the problem and I
think the right approach is two-fold: first introduce a simple check
in cipso_v4_skbuff_setattr() which returns -E2BIG if the packet length
grows beyond 65535.  It's rather crude, but it's a tiny patch and
should at least ensure that the upper layers (NetLabel and SELinux)
don't send the packet with a bogus length field; it will result in
packet drops, but honestly that seems preferable to a mangled packet
which will likely be dropped at some point in the network anyway.

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 6cd3b6c559f0..f19c9beda745 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -2183,8 +2183,10 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
        * that the security label is applied to the packet - we do the same
        * thing when using the socket options and it hasn't caused a problem,
        * if we need to we can always revisit this choice later */
-
       len_delta = opt_len - opt->optlen;
+       if ((skb->len + len_delta) > 65535)
+               return -E2BIG;
+
       /* if we don't ensure enough headroom we could panic on the skb_push()
        * call below so make sure we have enough, we are also "mangling" the
        * packet so we should probably do a copy-on-write call anyway */

The second step will be to add a max-length IPv4 option filled with
IPOPT_NOOP to the local sockets in the case of
netlbl_sock_setattr()/NETLBL_NLTYPE_ADDRSELECT.  In this case we would
either end up replacing the padding with a proper CIPSO option or
removing it completely in netlbl_skbuff_setattr(); in either case the
total packet length remains the same or decreases so we should be
"safe".

The forwarded packet case is still a bit of an issue, but I think the
likelihood of someone using 64k max-size IPv4 packets on the wire
*and* passing this traffic through a Linux cross-domain router with
CIPSO (re)labeling is about as close to zero as one can possibly get.
At least given the size check present in step one (patchlet above) the
packet will be safely (?) dropped on the Linux system so an admin will
have some idea where to start looking.

I've got some basic patches written for both, but I need to at least
give them a quick sanity test before posting.

--
paul-moore.com

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-16 16:45       ` Paul Moore
@ 2023-01-16 17:36         ` Xin Long
  2023-01-16 18:12           ` Paul Moore
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-16 17:36 UTC (permalink / raw)
  To: Paul Moore
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Mon, Jan 16, 2023 at 11:46 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Sat, Jan 14, 2023 at 12:54 PM Xin Long <lucien.xin@gmail.com> wrote:
> > On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > >
> > > > It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
> > > > the iph->tot_len update should use iph_set_totlen().
> > > >
> > > > Note that for these non GSO packets, the new iph tot_len with extra
> > > > iph option len added may become greater than 65535, the old process
> > > > will cast it and set iph->tot_len to it, which is a bug. In theory,
> > > > iph options shouldn't be added for these big packets in here, a fix
> > > > may be needed here in the future. For now this patch is only to set
> > > > iph->tot_len to 0 when it happens.
> > >
> > > I'm not entirely clear on the paragraph above, but we do need to be
> > > able to set/modify the IP options in cipso_v4_skbuff_setattr() in
> > > order to support CIPSO labeling.  I'm open to better and/or
> > > alternative solutions compared to what we are doing now, but I can't
> > > support a change that is a bug waiting to bite us.  My apologies if
> > > I'm interpreting your comments incorrectly and that isn't the case
> > > here.
> > setting the IP options may cause the packet size to grow (both iph->tot_len
> > and skb->len), for example:
> >
> > before setting it, iph->tot_len=65535,
> > after setting it, iph->tot_len=65535 + 14 (assume the IP option len is 14)
> > however, tot_len is 16 bit, and can't be set to "65535 + 14".
> >
> > Hope the above makes it clearer.
>
> Thanks, it does.
>
> > This problem exists with or without this patch. Not sure how it should
> > be fixed in cipso_v4 ...
> >
> > Not sure if we can skip the big packet, or segment/fragment the big packet
> > in cipso_v4_skbuff_setattr().
>
> We can't skip the CIPSO labeling as that would be the network packet
> equivalent of not assigning a owner/group/mode to a file on the
> filesystem, which is a Very Bad Thing :)
>
> I spent a little bit of time this morning looking at the problem and I
> think the right approach is two-fold: first introduce a simple check
> in cipso_v4_skbuff_setattr() which returns -E2BIG if the packet length
> grows beyond 65535.  It's rather crude, but it's a tiny patch and
> should at least ensure that the upper layers (NetLabel and SELinux)
> don't send the packet with a bogus length field; it will result in
> packet drops, but honestly that seems preferable to a mangled packet
> which will likely be dropped at some point in the network anyway.
>
> diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> index 6cd3b6c559f0..f19c9beda745 100644
> --- a/net/ipv4/cipso_ipv4.c
> +++ b/net/ipv4/cipso_ipv4.c
> @@ -2183,8 +2183,10 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
>         * that the security label is applied to the packet - we do the same
>         * thing when using the socket options and it hasn't caused a problem,
>         * if we need to we can always revisit this choice later */
> -
>        len_delta = opt_len - opt->optlen;
> +       if ((skb->len + len_delta) > 65535)
> +               return -E2BIG;
> +
Right, looks crude. :-)
Note that for BIG TCP packets skb->len is bigger than 65535.
(I assume CIPSO labeling will drop BIG TCP packets.)

>        /* if we don't ensure enough headroom we could panic on the skb_push()
>         * call below so make sure we have enough, we are also "mangling" the
>         * packet so we should probably do a copy-on-write call anyway */
>
> The second step will be to add a max-length IPv4 option filled with
> IPOPT_NOOP to the local sockets in the case of
> netlbl_sock_setattr()/NETLBL_NLTYPE_ADDRSELECT.  In this case we would
> either end up replacing the padding with a proper CIPSO option or
> removing it completely in netlbl_skbuff_setattr(); in either case the
> total packet length remains the same or decreases so we should be
> "safe".
sounds better.

>
> The forwarded packet case is still a bit of an issue, but I think the
> likelihood of someone using 64k max-size IPv4 packets on the wire
> *and* passing this traffic through a Linux cross-domain router with
> CIPSO (re)labeling is about as close to zero as one can possibly get.
> At least given the size check present in step one (patchlet above) the
> packet will be safely (?) dropped on the Linux system so an admin will
> have some idea where to start looking.
don't know the likelihood of CIPSO (re)labeling on a Linux cross-domain router.
But 64K packets could be GRO packets merged on the interface (GRO enabled)
of the router, not directly from the wire.

>
> I've got some basic patches written for both, but I need to at least
> give them a quick sanity test before posting.
>
Good that you can take care of it from the security side.

I think a similar problem exists in calipso_skbuff_setattr() in
net/ipv6/calipso.c.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-16 17:36         ` Xin Long
@ 2023-01-16 18:12           ` Paul Moore
  2023-01-16 19:33             ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Moore @ 2023-01-16 18:12 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Mon, Jan 16, 2023 at 12:37 PM Xin Long <lucien.xin@gmail.com> wrote:
> On Mon, Jan 16, 2023 at 11:46 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Sat, Jan 14, 2023 at 12:54 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > >
> > > > > It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
> > > > > the iph->tot_len update should use iph_set_totlen().
> > > > >
> > > > > Note that for these non GSO packets, the new iph tot_len with extra
> > > > > iph option len added may become greater than 65535, the old process
> > > > > will cast it and set iph->tot_len to it, which is a bug. In theory,
> > > > > iph options shouldn't be added for these big packets in here, a fix
> > > > > may be needed here in the future. For now this patch is only to set
> > > > > iph->tot_len to 0 when it happens.
> > > >
> > > > I'm not entirely clear on the paragraph above, but we do need to be
> > > > able to set/modify the IP options in cipso_v4_skbuff_setattr() in
> > > > order to support CIPSO labeling.  I'm open to better and/or
> > > > alternative solutions compared to what we are doing now, but I can't
> > > > support a change that is a bug waiting to bite us.  My apologies if
> > > > I'm interpreting your comments incorrectly and that isn't the case
> > > > here.
> > > setting the IP options may cause the packet size to grow (both iph->tot_len
> > > and skb->len), for example:
> > >
> > > before setting it, iph->tot_len=65535,
> > > after setting it, iph->tot_len=65535 + 14 (assume the IP option len is 14)
> > > however, tot_len is 16 bit, and can't be set to "65535 + 14".
> > >
> > > Hope the above makes it clearer.
> >
> > Thanks, it does.
> >
> > > This problem exists with or without this patch. Not sure how it should
> > > be fixed in cipso_v4 ...
> > >
> > > Not sure if we can skip the big packet, or segment/fragment the big packet
> > > in cipso_v4_skbuff_setattr().
> >
> > We can't skip the CIPSO labeling as that would be the network packet
> > equivalent of not assigning a owner/group/mode to a file on the
> > filesystem, which is a Very Bad Thing :)
> >
> > I spent a little bit of time this morning looking at the problem and I
> > think the right approach is two-fold: first introduce a simple check
> > in cipso_v4_skbuff_setattr() which returns -E2BIG if the packet length
> > grows beyond 65535.  It's rather crude, but it's a tiny patch and
> > should at least ensure that the upper layers (NetLabel and SELinux)
> > don't send the packet with a bogus length field; it will result in
> > packet drops, but honestly that seems preferable to a mangled packet
> > which will likely be dropped at some point in the network anyway.
> >
> > diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> > index 6cd3b6c559f0..f19c9beda745 100644
> > --- a/net/ipv4/cipso_ipv4.c
> > +++ b/net/ipv4/cipso_ipv4.c
> > @@ -2183,8 +2183,10 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
> >         * that the security label is applied to the packet - we do the same
> >         * thing when using the socket options and it hasn't caused a problem,
> >         * if we need to we can always revisit this choice later */
> > -
> >        len_delta = opt_len - opt->optlen;
> > +       if ((skb->len + len_delta) > 65535)
> > +               return -E2BIG;
> > +
>
> Right, looks crude. :-)

Yes, but what else can we do?  There is fragmentation, but that is
rather ugly and we would still need a solution for when the don't
fragment bit is set.  I'm open to suggestions.

> Note that for BIG TCP packets skb->len is bigger than 65535.
> (I assume CIPSO labeling will drop BIG TCP packets.)

It seems like there is still ongoing discussion about even enabling
BIG TCP for IPv4, however for this discussion let's assume that BIG
TCP is merged for IPv4.

We really should have a solution that allows CIPSO for both normal and
BIG TCP, if we don't we force distros and admins to choose between the
two and that isn't good.  We should do better.  If skb->len > 64k in
the case of BIG TCP, how is the packet eventually divided/fragmented
in such a way that the total length field in the IPv4 header doesn't
overflow?  Or is that simply handled at the driver/device layer and we
simply set skb->len to whatever the size is, regardless of the 16-bit
length limit?  If that is the case, does the driver/device layer
handle copying the IPv4 options and setting the header/total-length
fields in each packet?  Or is it something else completely?

> >        /* if we don't ensure enough headroom we could panic on the skb_push()
> >         * call below so make sure we have enough, we are also "mangling" the
> >         * packet so we should probably do a copy-on-write call anyway */
> >
> > The second step will be to add a max-length IPv4 option filled with
> > IPOPT_NOOP to the local sockets in the case of
> > netlbl_sock_setattr()/NETLBL_NLTYPE_ADDRSELECT.  In this case we would
> > either end up replacing the padding with a proper CIPSO option or
> > removing it completely in netlbl_skbuff_setattr(); in either case the
> > total packet length remains the same or decreases so we should be
> > "safe".
>
> sounds better.

To be clear, it's not a one-or-the-other choice, both patches would be
necessary as the padding route described in the second step would only
apply to traffic generated locally from a socket.  We still have the
ugly problem of dealing with forwarded traffic, which it looks like we
discuss more on that below ...

> > The forwarded packet case is still a bit of an issue, but I think the
> > likelihood of someone using 64k max-size IPv4 packets on the wire
> > *and* passing this traffic through a Linux cross-domain router with
> > CIPSO (re)labeling is about as close to zero as one can possibly get.
> > At least given the size check present in step one (patchlet above) the
> > packet will be safely (?) dropped on the Linux system so an admin will
> > have some idea where to start looking.
>
> don't know the likelihood of CIPSO (re)labeling on a Linux cross-domain router.
> But 64K packets could be GRO packets merged on the interface (GRO enabled)
> of the router, not directly from the wire.

In the GRO case, is it safe to grow the packet such that skb->len is
greater than 64k?  I presume that the device/driver is going to split
the packet anyway and populate the IPv4 total length fields in the
header anyway, right?  If we can't grow the packet beyond 64k, is
there some way to signal to the driver/device at runtime that the
largest packet we can process is 64k minus 40 bytes (for the IPv4
options)?

> > I've got some basic patches written for both, but I need to at least
> > give them a quick sanity test before posting.
>
> Good that you can take care of it from the security side.

It's not yet clear to me that we can, see the above questions.

> I think a similar problem exists in calipso_skbuff_setattr() in
> net/ipv6/calipso.c.

Probably, but one mess at a time, especially since several of the
questions above apply to both IPv4 and IPv6.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-16 16:02               ` Eric Dumazet
@ 2023-01-16 19:09                 ` Xin Long
  2023-01-16 20:37                   ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-16 19:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> >
>
> > not sure why you think it would not be detected. Today's model for gro
> > sets tot_len based on skb->len. There is an inherent trust that the
> > user's of the gro API set the length correctly. If it is not, the
> > payload to userspace would ultimately be non-sense and hence detectable.
>
> Only if you use some kind of upper protocol adding message integrity
> verification.
>
> > >
> > > As you said, user space sniffing packets now have to guess what is the
> > > intent, instead of headers carrying all the needed information
> > > that can be fully validated by parsers.
> >
> > This is a solveable problem within the packet socket API, and the entire
> > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > date and does not support the new API for large packets, then that user
> > does not have to enable large GRO/TSO.
>
> I do not see this being solved yet.
I think it's common that we add a feature that is disabled by
default in the kernel if the userspace might not support it.

>
> We have enabled BIG TCP only for IPv6, we do not want the same to
> magically be enabled for ipv4
> when a new kernel is deployed.
>
> Make sure that IPV4 BIG TCP is guarded by a separate tunable.
I can understand this.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-16 18:12           ` Paul Moore
@ 2023-01-16 19:33             ` Xin Long
  2023-01-17  4:54               ` David Ahern
  2023-01-17 19:51               ` Paul Moore
  0 siblings, 2 replies; 45+ messages in thread
From: Xin Long @ 2023-01-16 19:33 UTC (permalink / raw)
  To: Paul Moore
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Mon, Jan 16, 2023 at 1:13 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Mon, Jan 16, 2023 at 12:37 PM Xin Long <lucien.xin@gmail.com> wrote:
> > On Mon, Jan 16, 2023 at 11:46 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Sat, Jan 14, 2023 at 12:54 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > > On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > > >
> > > > > > It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so
> > > > > > the iph->tot_len update should use iph_set_totlen().
> > > > > >
> > > > > > Note that for these non GSO packets, the new iph tot_len with extra
> > > > > > iph option len added may become greater than 65535, the old process
> > > > > > will cast it and set iph->tot_len to it, which is a bug. In theory,
> > > > > > iph options shouldn't be added for these big packets in here, a fix
> > > > > > may be needed here in the future. For now this patch is only to set
> > > > > > iph->tot_len to 0 when it happens.
> > > > >
> > > > > I'm not entirely clear on the paragraph above, but we do need to be
> > > > > able to set/modify the IP options in cipso_v4_skbuff_setattr() in
> > > > > order to support CIPSO labeling.  I'm open to better and/or
> > > > > alternative solutions compared to what we are doing now, but I can't
> > > > > support a change that is a bug waiting to bite us.  My apologies if
> > > > > I'm interpreting your comments incorrectly and that isn't the case
> > > > > here.
> > > > setting the IP options may cause the packet size to grow (both iph->tot_len
> > > > and skb->len), for example:
> > > >
> > > > before setting it, iph->tot_len=65535,
> > > > after setting it, iph->tot_len=65535 + 14 (assume the IP option len is 14)
> > > > however, tot_len is 16 bit, and can't be set to "65535 + 14".
> > > >
> > > > Hope the above makes it clearer.
> > >
> > > Thanks, it does.
> > >
> > > > This problem exists with or without this patch. Not sure how it should
> > > > be fixed in cipso_v4 ...
> > > >
> > > > Not sure if we can skip the big packet, or segment/fragment the big packet
> > > > in cipso_v4_skbuff_setattr().
> > >
> > > We can't skip the CIPSO labeling as that would be the network packet
> > > equivalent of not assigning a owner/group/mode to a file on the
> > > filesystem, which is a Very Bad Thing :)
> > >
> > > I spent a little bit of time this morning looking at the problem and I
> > > think the right approach is two-fold: first introduce a simple check
> > > in cipso_v4_skbuff_setattr() which returns -E2BIG if the packet length
> > > grows beyond 65535.  It's rather crude, but it's a tiny patch and
> > > should at least ensure that the upper layers (NetLabel and SELinux)
> > > don't send the packet with a bogus length field; it will result in
> > > packet drops, but honestly that seems preferable to a mangled packet
> > > which will likely be dropped at some point in the network anyway.
> > >
> > > diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> > > index 6cd3b6c559f0..f19c9beda745 100644
> > > --- a/net/ipv4/cipso_ipv4.c
> > > +++ b/net/ipv4/cipso_ipv4.c
> > > @@ -2183,8 +2183,10 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
> > >         * that the security label is applied to the packet - we do the same
> > >         * thing when using the socket options and it hasn't caused a problem,
> > >         * if we need to we can always revisit this choice later */
> > > -
> > >        len_delta = opt_len - opt->optlen;
> > > +       if ((skb->len + len_delta) > 65535)
> > > +               return -E2BIG;
> > > +
> >
> > Right, looks crude. :-)
>
> Yes, but what else can we do?  There is fragmentation, but that is
> rather ugly and we would still need a solution for when the don't
> fragment bit is set.  I'm open to suggestions.
looking at ovs_dp_upcall(), for GSO/GRO packets it goes to
queue_gso_packets() where it calls __skb_gso_segment()
to segment it into small segs/skbs, then process these segs instead.

I'm thinking you can try to do the same in cipso_v4_skbuff_setattr(),
and I don't think 64K non-GSO packets exist in the user environment,
so taking care of GSO packets should be enough.

I just don't know if the security_hook will be able to process these
smaller segs/skbs after the segment.

>
> > Note that for BIG TCP packets skb->len is bigger than 65535.
> > (I assume CIPSO labeling will drop BIG TCP packets.)
>
> It seems like there is still ongoing discussion about even enabling
> BIG TCP for IPv4, however for this discussion let's assume that BIG
> TCP is merged for IPv4.
>
> We really should have a solution that allows CIPSO for both normal and
> BIG TCP, if we don't we force distros and admins to choose between the
> two and that isn't good.  We should do better.  If skb->len > 64k in
> the case of BIG TCP, how is the packet eventually divided/fragmented
> in such a way that the total length field in the IPv4 header doesn't
> overflow?  Or is that simply handled at the driver/device layer and we
> simply set skb->len to whatever the size is, regardless of the 16-bit
Yes, for BIG TCP, 16-bit length is set to 0, and it just uses skb->len
as the IP packet length.

> length limit?  If that is the case, does the driver/device layer
> handle copying the IPv4 options and setting the header/total-length
> fields in each packet?  Or is it something else completely?
Yes, I think the driver/device layer will handle copying the IPv4 options
and setting the header/total-length, and that's how it works.

>
> > >        /* if we don't ensure enough headroom we could panic on the skb_push()
> > >         * call below so make sure we have enough, we are also "mangling" the
> > >         * packet so we should probably do a copy-on-write call anyway */
> > >
> > > The second step will be to add a max-length IPv4 option filled with
> > > IPOPT_NOOP to the local sockets in the case of
> > > netlbl_sock_setattr()/NETLBL_NLTYPE_ADDRSELECT.  In this case we would
> > > either end up replacing the padding with a proper CIPSO option or
> > > removing it completely in netlbl_skbuff_setattr(); in either case the
> > > total packet length remains the same or decreases so we should be
> > > "safe".
> >
> > sounds better.
>
> To be clear, it's not a one-or-the-other choice, both patches would be
> necessary as the padding route described in the second step would only
> apply to traffic generated locally from a socket.  We still have the
> ugly problem of dealing with forwarded traffic, which it looks like we
> discuss more on that below ...
I got that.

>
> > > The forwarded packet case is still a bit of an issue, but I think the
> > > likelihood of someone using 64k max-size IPv4 packets on the wire
> > > *and* passing this traffic through a Linux cross-domain router with
> > > CIPSO (re)labeling is about as close to zero as one can possibly get.
> > > At least given the size check present in step one (patchlet above) the
> > > packet will be safely (?) dropped on the Linux system so an admin will
> > > have some idea where to start looking.
> >
> > don't know the likelihood of CIPSO (re)labeling on a Linux cross-domain router.
> > But 64K packets could be GRO packets merged on the interface (GRO enabled)
> > of the router, not directly from the wire.
>
> In the GRO case, is it safe to grow the packet such that skb->len is
> greater than 64k?  I presume that the device/driver is going to split
> the packet anyway and populate the IPv4 total length fields in the
> header anyway, right?  If we can't grow the packet beyond 64k, is
> there some way to signal to the driver/device at runtime that the
> largest packet we can process is 64k minus 40 bytes (for the IPv4
> options)?
at runtime, not as far as I know.
It's a field of the network device that can be modified by:
# ip link set dev eth0 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE

>
> > > I've got some basic patches written for both, but I need to at least
> > > give them a quick sanity test before posting.
> >
> > Good that you can take care of it from the security side.
>
> It's not yet clear to me that we can, see the above questions.
>
> > I think a similar problem exists in calipso_skbuff_setattr() in
> > net/ipv6/calipso.c.
>
> Probably, but one mess at a time, especially since several of the
> questions above apply to both IPv4 and IPv6.
>
Just wanted you to know that IPv6 BIG TCP is something already
in the kernel, unlike IPv4 BIG TCP, we can avoid this problem in
advance before it's in the kernel.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-16 19:09                 ` Xin Long
@ 2023-01-16 20:37                   ` Eric Dumazet
  2023-01-17 15:47                     ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-16 20:37 UTC (permalink / raw)
  To: Xin Long
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> > >
> >
> > > not sure why you think it would not be detected. Today's model for gro
> > > sets tot_len based on skb->len. There is an inherent trust that the
> > > user's of the gro API set the length correctly. If it is not, the
> > > payload to userspace would ultimately be non-sense and hence detectable.
> >
> > Only if you use some kind of upper protocol adding message integrity
> > verification.
> >
> > > >
> > > > As you said, user space sniffing packets now have to guess what is the
> > > > intent, instead of headers carrying all the needed information
> > > > that can be fully validated by parsers.
> > >
> > > This is a solveable problem within the packet socket API, and the entire
> > > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > > date and does not support the new API for large packets, then that user
> > > does not have to enable large GRO/TSO.
> >
> > I do not see this being solved yet.
> I think it's common that we add a feature that is disabled by
> default in the kernel if the userspace might not support it.

One critical feature for us is the ability to restrict packet capture
to the headers only.

Privacy is a key requirement.

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436

So please make sure that packet captures truncated to headers will
still be understood by tcpdump


>
> >
> > We have enabled BIG TCP only for IPv6, we do not want the same to
> > magically be enabled for ipv4
> > when a new kernel is deployed.
> >
> > Make sure that IPV4 BIG TCP is guarded by a separate tunable.
> I can understand this.
>
> Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-16 19:33             ` Xin Long
@ 2023-01-17  4:54               ` David Ahern
  2023-01-17 19:51               ` Paul Moore
  1 sibling, 0 replies; 45+ messages in thread
From: David Ahern @ 2023-01-17  4:54 UTC (permalink / raw)
  To: Xin Long, Paul Moore
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On 1/16/23 12:33 PM, Xin Long wrote:
>> We really should have a solution that allows CIPSO for both normal and
>> BIG TCP, if we don't we force distros and admins to choose between the
>> two and that isn't good.  We should do better.  If skb->len > 64k in
>> the case of BIG TCP, how is the packet eventually divided/fragmented
>> in such a way that the total length field in the IPv4 header doesn't
>> overflow?  Or is that simply handled at the driver/device layer and we
>> simply set skb->len to whatever the size is, regardless of the 16-bit
> Yes, for BIG TCP, 16-bit length is set to 0, and it just uses skb->len
> as the IP packet length.
> 
>> length limit?  If that is the case, does the driver/device layer
>> handle copying the IPv4 options and setting the header/total-length
>> fields in each packet?  Or is it something else completely?
> Yes, I think the driver/device layer will handle copying the IPv4 options
> and setting the header/total-length, and that's how it works.

IPv4 options, like TCP options, should be part of the header that gets
replicate across GSO sliced packets by the hardware. ie., both should be
transparent to well designed hardware (and for h/w that made poor
choices standard 64kB GSO is the limit for its users).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-16 20:37                   ` Eric Dumazet
@ 2023-01-17 15:47                     ` Xin Long
  2023-01-19  1:18                       ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-17 15:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Mon, Jan 16, 2023 at 3:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> > > >
> > >
> > > > not sure why you think it would not be detected. Today's model for gro
> > > > sets tot_len based on skb->len. There is an inherent trust that the
> > > > user's of the gro API set the length correctly. If it is not, the
> > > > payload to userspace would ultimately be non-sense and hence detectable.
> > >
> > > Only if you use some kind of upper protocol adding message integrity
> > > verification.
> > >
> > > > >
> > > > > As you said, user space sniffing packets now have to guess what is the
> > > > > intent, instead of headers carrying all the needed information
> > > > > that can be fully validated by parsers.
> > > >
> > > > This is a solveable problem within the packet socket API, and the entire
> > > > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > > > date and does not support the new API for large packets, then that user
> > > > does not have to enable large GRO/TSO.
> > >
> > > I do not see this being solved yet.
> > I think it's common that we add a feature that is disabled by
> > default in the kernel if the userspace might not support it.
>
> One critical feature for us is the ability to restrict packet capture
> to the headers only.
>
> Privacy is a key requirement.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436
>
> So please make sure that packet captures truncated to headers will
> still be understood by tcpdump
IIUC we can try to adjust iph->tot_len to 65535 or at least the length
of all headers for IPv4 BIG TCP packets on the PACKET socket rcv
path? (or even truncate the ipv4 BIG TCP packets there.). This way
tcpdump will be able to parse all the headers.

But what if some applications want all the raw data via PACKET sockets?
Maybe adjust iph->tot_len only, not truncate the packet?

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-16 19:33             ` Xin Long
  2023-01-17  4:54               ` David Ahern
@ 2023-01-17 19:51               ` Paul Moore
  2023-01-17 22:46                 ` Paul Moore
  1 sibling, 1 reply; 45+ messages in thread
From: Paul Moore @ 2023-01-17 19:51 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Mon, Jan 16, 2023 at 2:35 PM Xin Long <lucien.xin@gmail.com> wrote:
> On Mon, Jan 16, 2023 at 1:13 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Mon, Jan 16, 2023 at 12:37 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > On Mon, Jan 16, 2023 at 11:46 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > On Sat, Jan 14, 2023 at 12:54 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > > On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > > > On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:

...

> > > > We can't skip the CIPSO labeling as that would be the network packet
> > > > equivalent of not assigning a owner/group/mode to a file on the
> > > > filesystem, which is a Very Bad Thing :)
> > > >
> > > > I spent a little bit of time this morning looking at the problem and I
> > > > think the right approach is two-fold: first introduce a simple check
> > > > in cipso_v4_skbuff_setattr() which returns -E2BIG if the packet length
> > > > grows beyond 65535.  It's rather crude, but it's a tiny patch and
> > > > should at least ensure that the upper layers (NetLabel and SELinux)
> > > > don't send the packet with a bogus length field; it will result in
> > > > packet drops, but honestly that seems preferable to a mangled packet
> > > > which will likely be dropped at some point in the network anyway.
> > > >
> > > > diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> > > > index 6cd3b6c559f0..f19c9beda745 100644
> > > > --- a/net/ipv4/cipso_ipv4.c
> > > > +++ b/net/ipv4/cipso_ipv4.c
> > > > @@ -2183,8 +2183,10 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
> > > >         * that the security label is applied to the packet - we do the same
> > > >         * thing when using the socket options and it hasn't caused a problem,
> > > >         * if we need to we can always revisit this choice later */
> > > > -
> > > >        len_delta = opt_len - opt->optlen;
> > > > +       if ((skb->len + len_delta) > 65535)
> > > > +               return -E2BIG;
> > > > +
> > >
> > > Right, looks crude. :-)
> >
> > Yes, but what else can we do?  There is fragmentation, but that is
> > rather ugly and we would still need a solution for when the don't
> > fragment bit is set.  I'm open to suggestions.
>
> looking at ovs_dp_upcall(), for GSO/GRO packets it goes to
> queue_gso_packets() where it calls __skb_gso_segment()
> to segment it into small segs/skbs, then process these segs instead.
>
> I'm thinking you can try to do the same in cipso_v4_skbuff_setattr(),
> and I don't think 64K non-GSO packets exist in the user environment,
> so taking care of GSO packets should be enough.

Thanks, I'll take a look.

> I just don't know if the security_hook will be able to process these
> smaller segs/skbs after the segment.

As long as the smaller, segmented packets have the IPv4 options
preserved/copied on each smaller packet it should be okay.

> > It seems like there is still ongoing discussion about even enabling
> > BIG TCP for IPv4, however for this discussion let's assume that BIG
> > TCP is merged for IPv4.
> >
> > We really should have a solution that allows CIPSO for both normal and
> > BIG TCP, if we don't we force distros and admins to choose between the
> > two and that isn't good.  We should do better.  If skb->len > 64k in
> > the case of BIG TCP, how is the packet eventually divided/fragmented
> > in such a way that the total length field in the IPv4 header doesn't
> > overflow?  Or is that simply handled at the driver/device layer and we
> > simply set skb->len to whatever the size is, regardless of the 16-bit
>
> Yes, for BIG TCP, 16-bit length is set to 0, and it just uses skb->len
> as the IP packet length.

In the BIG TCP case, when is the IPv4 header zero'd out?  Currently
cipso_v4_skbuff_setattr() is called in the NF_INET_LOCAL_OUT and
NF_INET_FORWARD chains, is there an easy way to distinguish between a
traditional segmentation offload mechanism, e.g. GSO, and BIG TCP?  If
BIG TCP allows for arbitrarily large packets we can just grow the
skb->len value as needed and leave the total length field in the IPv4
header untouched/zero, but we would need to be able to distinguish
between a segmentation offload and BIG TCP.

> > In the GRO case, is it safe to grow the packet such that skb->len is
> > greater than 64k?  I presume that the device/driver is going to split
> > the packet anyway and populate the IPv4 total length fields in the
> > header anyway, right?  If we can't grow the packet beyond 64k, is
> > there some way to signal to the driver/device at runtime that the
> > largest packet we can process is 64k minus 40 bytes (for the IPv4
> > options)?
>
> at runtime, not as far as I know.
> It's a field of the network device that can be modified by:
> # ip link set dev eth0 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE

I need to look at the OVS case above, but one possibility would be to
have the kernel adjust the GSO size down by 40 bytes when
CONFIG_NETLABEL is enabled, but that isn't a great option, and not
something I consider a first (or second) choice.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-17 19:51               ` Paul Moore
@ 2023-01-17 22:46                 ` Paul Moore
  2023-01-18  2:47                   ` David Ahern
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Moore @ 2023-01-17 22:46 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, David Ahern,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Tue, Jan 17, 2023 at 2:51 PM Paul Moore <paul@paul-moore.com> wrote:
> On Mon, Jan 16, 2023 at 2:35 PM Xin Long <lucien.xin@gmail.com> wrote:
> > On Mon, Jan 16, 2023 at 1:13 PM Paul Moore <paul@paul-moore.com> wrote:
> > > On Mon, Jan 16, 2023 at 12:37 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > On Mon, Jan 16, 2023 at 11:46 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > > On Sat, Jan 14, 2023 at 12:54 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > > > On Sat, Jan 14, 2023 at 10:39 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > > > > On Fri, Jan 13, 2023 at 10:31 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> ...
>
> > > > > We can't skip the CIPSO labeling as that would be the network packet
> > > > > equivalent of not assigning a owner/group/mode to a file on the
> > > > > filesystem, which is a Very Bad Thing :)
> > > > >
> > > > > I spent a little bit of time this morning looking at the problem and I
> > > > > think the right approach is two-fold: first introduce a simple check
> > > > > in cipso_v4_skbuff_setattr() which returns -E2BIG if the packet length
> > > > > grows beyond 65535.  It's rather crude, but it's a tiny patch and
> > > > > should at least ensure that the upper layers (NetLabel and SELinux)
> > > > > don't send the packet with a bogus length field; it will result in
> > > > > packet drops, but honestly that seems preferable to a mangled packet
> > > > > which will likely be dropped at some point in the network anyway.
> > > > >
> > > > > diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> > > > > index 6cd3b6c559f0..f19c9beda745 100644
> > > > > --- a/net/ipv4/cipso_ipv4.c
> > > > > +++ b/net/ipv4/cipso_ipv4.c
> > > > > @@ -2183,8 +2183,10 @@ int cipso_v4_skbuff_setattr(struct sk_buff *skb,
> > > > >         * that the security label is applied to the packet - we do the same
> > > > >         * thing when using the socket options and it hasn't caused a problem,
> > > > >         * if we need to we can always revisit this choice later */
> > > > > -
> > > > >        len_delta = opt_len - opt->optlen;
> > > > > +       if ((skb->len + len_delta) > 65535)
> > > > > +               return -E2BIG;
> > > > > +
> > > >
> > > > Right, looks crude. :-)
> > >
> > > Yes, but what else can we do?  There is fragmentation, but that is
> > > rather ugly and we would still need a solution for when the don't
> > > fragment bit is set.  I'm open to suggestions.
> >
> > looking at ovs_dp_upcall(), for GSO/GRO packets it goes to
> > queue_gso_packets() where it calls __skb_gso_segment()
> > to segment it into small segs/skbs, then process these segs instead.
> >
> > I'm thinking you can try to do the same in cipso_v4_skbuff_setattr(),
> > and I don't think 64K non-GSO packets exist in the user environment,
> > so taking care of GSO packets should be enough.
>
> Thanks, I'll take a look.

Unfortunately I don't think the ovs_dp_upcall() approach will work as
that is an endpoint in the kernel which sends the GSO'd packet up to
userspace in segements.  In the case of cipso_v4_skbuff_setattr() we
are setting an IPv4 option on a packet in either the NF_INET_LOCAL_OUT
or NF_INET_FORWARD output path.  I believe we can resolve the
LOCAL_OUT case with the padding approach I mentioned previously, but
the FORWARD path remains a challenge; I simply don't see a way to
handle growing the packet beyond 64k in the forward path.  I'm also
realizing that we should be sending a ICMP_FRAG_NEEDED in the forward
case when we have to drop the packet due to size issues, as the normal
MTU/size check happens prior to the NF_INET_FORWARD hooks (and hence
cipso_v4_skbuff_setattr()).

> > > It seems like there is still ongoing discussion about even enabling
> > > BIG TCP for IPv4, however for this discussion let's assume that BIG
> > > TCP is merged for IPv4.
> > >
> > > We really should have a solution that allows CIPSO for both normal and
> > > BIG TCP, if we don't we force distros and admins to choose between the
> > > two and that isn't good.  We should do better.  If skb->len > 64k in
> > > the case of BIG TCP, how is the packet eventually divided/fragmented
> > > in such a way that the total length field in the IPv4 header doesn't
> > > overflow?  Or is that simply handled at the driver/device layer and we
> > > simply set skb->len to whatever the size is, regardless of the 16-bit
> >
> > Yes, for BIG TCP, 16-bit length is set to 0, and it just uses skb->len
> > as the IP packet length.
>
> In the BIG TCP case, when is the IPv4 header zero'd out?  Currently
> cipso_v4_skbuff_setattr() is called in the NF_INET_LOCAL_OUT and
> NF_INET_FORWARD chains, is there an easy way to distinguish between a
> traditional segmentation offload mechanism, e.g. GSO, and BIG TCP?  If
> BIG TCP allows for arbitrarily large packets we can just grow the
> skb->len value as needed and leave the total length field in the IPv4
> header untouched/zero, but we would need to be able to distinguish
> between a segmentation offload and BIG TCP.

Keeping the above questions as they still apply, rather I could still
use some help understanding what a BIG TCP packet would look like
during LOCAL_OUT and FORWARD.

> > > In the GRO case, is it safe to grow the packet such that skb->len is
> > > greater than 64k?  I presume that the device/driver is going to split
> > > the packet anyway and populate the IPv4 total length fields in the
> > > header anyway, right?  If we can't grow the packet beyond 64k, is
> > > there some way to signal to the driver/device at runtime that the
> > > largest packet we can process is 64k minus 40 bytes (for the IPv4
> > > options)?
> >
> > at runtime, not as far as I know.
> > It's a field of the network device that can be modified by:
> > # ip link set dev eth0 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
>
> I need to look at the OVS case above, but one possibility would be to
> have the kernel adjust the GSO size down by 40 bytes when
> CONFIG_NETLABEL is enabled, but that isn't a great option, and not
> something I consider a first (or second) choice.

Looking more at the GSO related code, this isn't likely to work.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-17 22:46                 ` Paul Moore
@ 2023-01-18  2:47                   ` David Ahern
  2023-01-18 19:18                     ` Paul Moore
  0 siblings, 1 reply; 45+ messages in thread
From: David Ahern @ 2023-01-18  2:47 UTC (permalink / raw)
  To: Paul Moore, Xin Long
  Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On 1/17/23 3:46 PM, Paul Moore wrote:
>>
>> In the BIG TCP case, when is the IPv4 header zero'd out?  Currently
>> cipso_v4_skbuff_setattr() is called in the NF_INET_LOCAL_OUT and
>> NF_INET_FORWARD chains, is there an easy way to distinguish between a
>> traditional segmentation offload mechanism, e.g. GSO, and BIG TCP?  If
>> BIG TCP allows for arbitrarily large packets we can just grow the
>> skb->len value as needed and leave the total length field in the IPv4
>> header untouched/zero, but we would need to be able to distinguish
>> between a segmentation offload and BIG TCP.
> 
> Keeping the above questions as they still apply, rather I could still
> use some help understanding what a BIG TCP packet would look like
> during LOCAL_OUT and FORWARD.

skb->len > 64kb. you don't typically look at the IP / IPv6 header and
its total length field and I thought the first patch in the series added
a handler for doing that.

> 
>>>> In the GRO case, is it safe to grow the packet such that skb->len is
>>>> greater than 64k?  I presume that the device/driver is going to split
>>>> the packet anyway and populate the IPv4 total length fields in the
>>>> header anyway, right?  If we can't grow the packet beyond 64k, is
>>>> there some way to signal to the driver/device at runtime that the
>>>> largest packet we can process is 64k minus 40 bytes (for the IPv4
>>>> options)?
>>>
>>> at runtime, not as far as I know.
>>> It's a field of the network device that can be modified by:
>>> # ip link set dev eth0 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
>>
>> I need to look at the OVS case above, but one possibility would be to
>> have the kernel adjust the GSO size down by 40 bytes when
>> CONFIG_NETLABEL is enabled, but that isn't a great option, and not
>> something I consider a first (or second) choice.
> 
> Looking more at the GSO related code, this isn't likely to work.
> 

icsk_ext_hdr_len is adjusted by cipso for its options. Does that not
cover what is needed?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr
  2023-01-18  2:47                   ` David Ahern
@ 2023-01-18 19:18                     ` Paul Moore
  0 siblings, 0 replies; 45+ messages in thread
From: Paul Moore @ 2023-01-18 19:18 UTC (permalink / raw)
  To: David Ahern
  Cc: Xin Long, network dev, davem, kuba, Eric Dumazet, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar,
	Guillaume Nault, linux-security-module, selinux

On Tue, Jan 17, 2023 at 9:47 PM David Ahern <dsahern@gmail.com> wrote:
> On 1/17/23 3:46 PM, Paul Moore wrote:
> >>
> >> In the BIG TCP case, when is the IPv4 header zero'd out?  Currently
> >> cipso_v4_skbuff_setattr() is called in the NF_INET_LOCAL_OUT and
> >> NF_INET_FORWARD chains, is there an easy way to distinguish between a
> >> traditional segmentation offload mechanism, e.g. GSO, and BIG TCP?  If
> >> BIG TCP allows for arbitrarily large packets we can just grow the
> >> skb->len value as needed and leave the total length field in the IPv4
> >> header untouched/zero, but we would need to be able to distinguish
> >> between a segmentation offload and BIG TCP.
> >
> > Keeping the above questions as they still apply, rather I could still
> > use some help understanding what a BIG TCP packet would look like
> > during LOCAL_OUT and FORWARD.
>
> skb->len > 64kb. you don't typically look at the IP / IPv6 header and
> its total length field and I thought the first patch in the series added
> a handler for doing that.

Thanks, I was just curious if there was some other mechanism but that works.

As of this moment, the patchset I'm working on is still independent of
the BIG TCP patches, and I want to make sure I'm not doing anything
that will make the BIG TCP patches any more challenging.

> >>>> In the GRO case, is it safe to grow the packet such that skb->len is
> >>>> greater than 64k?  I presume that the device/driver is going to split
> >>>> the packet anyway and populate the IPv4 total length fields in the
> >>>> header anyway, right?  If we can't grow the packet beyond 64k, is
> >>>> there some way to signal to the driver/device at runtime that the
> >>>> largest packet we can process is 64k minus 40 bytes (for the IPv4
> >>>> options)?
> >>>
> >>> at runtime, not as far as I know.
> >>> It's a field of the network device that can be modified by:
> >>> # ip link set dev eth0 gro_max_size $MAX_SIZE gso_max_size $MAX_SIZE
> >>
> >> I need to look at the OVS case above, but one possibility would be to
> >> have the kernel adjust the GSO size down by 40 bytes when
> >> CONFIG_NETLABEL is enabled, but that isn't a great option, and not
> >> something I consider a first (or second) choice.
> >
> > Looking more at the GSO related code, this isn't likely to work.
>
> icsk_ext_hdr_len is adjusted by cipso for its options. Does that not
> cover what is needed?

Adjusting the icsk_ext_hdr_len only applies to CIPSO labels that are
attached via the associated local sock, traffic that is labeled by
cipso_v4_skbuff_setattr() in the LOCAL_OUT or FORWARD netfilter hooks
does not have the icsk_ext_hdr_len adjustment.

Although as I mentioned earlier, I am adding a patch which would pad
out the IPv4 option header in the LOCAL_OUT labeling scenario so
icsk_ext_hdr_len will be adjusted for all locally generated
TCP/connected/is_icsk traffic.  Forwarded traffic still remains an
issue; but I think the only thing we can do is drop it and send an
icmp message back to the sender with an adjusted MTU value.

--
paul-moore.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-17 15:47                     ` Xin Long
@ 2023-01-19  1:18                       ` Xin Long
  2023-01-19  3:13                         ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-19  1:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Tue, Jan 17, 2023 at 10:47 AM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Mon, Jan 16, 2023 at 3:37 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jan 16, 2023 at 8:10 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:02 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 4:08 PM David Ahern <dsahern@gmail.com> wrote:
> > > > >
> > > >
> > > > > not sure why you think it would not be detected. Today's model for gro
> > > > > sets tot_len based on skb->len. There is an inherent trust that the
> > > > > user's of the gro API set the length correctly. If it is not, the
> > > > > payload to userspace would ultimately be non-sense and hence detectable.
> > > >
> > > > Only if you use some kind of upper protocol adding message integrity
> > > > verification.
> > > >
> > > > > >
> > > > > > As you said, user space sniffing packets now have to guess what is the
> > > > > > intent, instead of headers carrying all the needed information
> > > > > > that can be fully validated by parsers.
> > > > >
> > > > > This is a solveable problem within the packet socket API, and the entire
> > > > > thing is opt-in. If a user's tcpdump / packet capture program is out of
> > > > > date and does not support the new API for large packets, then that user
> > > > > does not have to enable large GRO/TSO.
> > > >
> > > > I do not see this being solved yet.
> > > I think it's common that we add a feature that is disabled by
> > > default in the kernel if the userspace might not support it.
> >
> > One critical feature for us is the ability to restrict packet capture
> > to the headers only.
> >
> > Privacy is a key requirement.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3e5289d5e3f98b7b5b8cac32e9e5a7004c067436
> >
> > So please make sure that packet captures truncated to headers will
> > still be understood by tcpdump
> IIUC we can try to adjust iph->tot_len to 65535 or at least the length
> of all headers for IPv4 BIG TCP packets on the PACKET socket rcv
> path? (or even truncate the ipv4 BIG TCP packets there.). This way
> tcpdump will be able to parse all the headers.
>
> But what if some applications want all the raw data via PACKET sockets?
> Maybe adjust iph->tot_len only, not truncate the packet?
>
I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
my env (RHEL-8), and it breaks too:

19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]

it doesn't show what we want from the TCP header either.

For the latest tcpdump on upstream, it can display headers well for
IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
that supports HBH parsing.

For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
and 'tshark' even supports it by default.

I think we should NOT go with "adjust tot_len" or "truncate packets" way,
and it makes more sense to make it supported in "tcpdump" by default,
just like in "tshark". I believe commit [1] was added for some problems
they've met, we should enable it for both.

[1] https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19  1:18                       ` Xin Long
@ 2023-01-19  3:13                         ` Eric Dumazet
  2023-01-19 15:41                           ` David Ahern
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-19  3:13 UTC (permalink / raw)
  To: Xin Long
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:

> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> my env (RHEL-8), and it breaks too:
>
> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>

Please make sure to use a not too old tcpdump.

> it doesn't show what we want from the TCP header either.
>
> For the latest tcpdump on upstream, it can display headers well for
> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> that supports HBH parsing.

User error. If an admin wants to diagnose TCP potential issues, it should use
a correct version.

>
> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> and 'tshark' even supports it by default.

Not with privacy _requirements_, where only the headers are captured.

I am keeping a NACK, until you make sure you do not break this
important feature.

>
> I think we should NOT go with "adjust tot_len" or "truncate packets" way,
> and it makes more sense to make it supported in "tcpdump" by default,
> just like in "tshark". I believe commit [1] was added for some problems
> they've met, we should enable it for both.
>
> [1] https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18
>
> Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19  3:13                         ` Eric Dumazet
@ 2023-01-19 15:41                           ` David Ahern
  2023-01-19 16:49                             ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: David Ahern @ 2023-01-19 15:41 UTC (permalink / raw)
  To: Eric Dumazet, Xin Long
  Cc: network dev, davem, kuba, Paolo Abeni, Hideaki YOSHIFUJI,
	Pravin B Shelar, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Florian Westphal, Marcelo Ricardo Leitner,
	Ilya Maximets, Aaron Conole, Roopa Prabhu, Nikolay Aleksandrov,
	Mahesh Bandewar, Paul Moore, Guillaume Nault

On 1/18/23 8:13 PM, Eric Dumazet wrote:
> On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> 
>> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
>> my env (RHEL-8), and it breaks too:
>>
>> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
>>
> 
> Please make sure to use a not too old tcpdump.
> 
>> it doesn't show what we want from the TCP header either.
>>
>> For the latest tcpdump on upstream, it can display headers well for
>> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
>> that supports HBH parsing.
> 
> User error. If an admin wants to diagnose TCP potential issues, it should use
> a correct version.

Both of those just fall under "if you want a new feature, update your
tools."


> 
>>
>> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
>> and 'tshark' even supports it by default.
> 
> Not with privacy _requirements_, where only the headers are captured.
> 
> I am keeping a NACK, until you make sure you do not break this
> important feature.

I think the request here is to keep the snaplen in place (e.g., to make
only headers visible to userspace) while also returning the >64kB packet
length as meta data.

My last pass on the packet socket code suggests this is possible;
someone (Xin) needs to work through the details.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19 15:41                           ` David Ahern
@ 2023-01-19 16:49                             ` Xin Long
  2023-01-19 18:10                               ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-19 16:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Eric Dumazet, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> >
> >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> >> my env (RHEL-8), and it breaks too:
> >>
> >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> >>
> >
> > Please make sure to use a not too old tcpdump.
> >
> >> it doesn't show what we want from the TCP header either.
> >>
> >> For the latest tcpdump on upstream, it can display headers well for
> >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> >> that supports HBH parsing.
> >
> > User error. If an admin wants to diagnose TCP potential issues, it should use
> > a correct version.
>
> Both of those just fall under "if you want a new feature, update your
> tools."
>
>
> >
> >>
> >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> >> and 'tshark' even supports it by default.
> >
> > Not with privacy _requirements_, where only the headers are captured.
> >
> > I am keeping a NACK, until you make sure you do not break this
> > important feature.
>
> I think the request here is to keep the snaplen in place (e.g., to make
> only headers visible to userspace) while also returning the >64kB packet
> length as meta data.
>
> My last pass on the packet socket code suggests this is possible;
> someone (Xin) needs to work through the details.
>
To be honest, I don't really like such a change in a packet socket,
I tried, and the code doesn't look nice.

I'm thinking since skb->len is trustable, why don't we use
IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
namely, only change these 2 helpers to:

static inline unsigned int iph_totlen(const struct sk_buff *skb, const
struct iphdr *iph)
{
        u16 len = ntohs(iph->tot_len);

        return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
                skb->len - skb_network_offset(skb);
}

static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
{
        iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
}

What do you think?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19 16:49                             ` Xin Long
@ 2023-01-19 18:10                               ` Eric Dumazet
  2023-01-19 18:57                                 ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-19 18:10 UTC (permalink / raw)
  To: Xin Long
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > >> my env (RHEL-8), and it breaks too:
> > >>
> > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > >>
> > >
> > > Please make sure to use a not too old tcpdump.
> > >
> > >> it doesn't show what we want from the TCP header either.
> > >>
> > >> For the latest tcpdump on upstream, it can display headers well for
> > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > >> that supports HBH parsing.
> > >
> > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > a correct version.
> >
> > Both of those just fall under "if you want a new feature, update your
> > tools."
> >
> >
> > >
> > >>
> > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > >> and 'tshark' even supports it by default.
> > >
> > > Not with privacy _requirements_, where only the headers are captured.
> > >
> > > I am keeping a NACK, until you make sure you do not break this
> > > important feature.
> >
> > I think the request here is to keep the snaplen in place (e.g., to make
> > only headers visible to userspace) while also returning the >64kB packet
> > length as meta data.
> >
> > My last pass on the packet socket code suggests this is possible;
> > someone (Xin) needs to work through the details.
> >
> To be honest, I don't really like such a change in a packet socket,
> I tried, and the code doesn't look nice.
>
> I'm thinking since skb->len is trustable, why don't we use
> IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> namely, only change these 2 helpers to:
>
> static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> struct iphdr *iph)
> {
>         u16 len = ntohs(iph->tot_len);
>
>         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
>                 skb->len - skb_network_offset(skb);
> }
>
> static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> {
>         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> }
>
> What do you think?

I think this is a no go for me.

I think I stated clearly what was the problem.
If you care about TCP diagnostics, you want the truth, not truncated
sequence ranges,
making it impossible to know if a packet was sent.

Without headers describing precisely payload length (solution taken in
IPv6 BI TCP),
you have to augment AF_PACKET to provide this information in
additional meta-data.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19 18:10                               ` Eric Dumazet
@ 2023-01-19 18:57                                 ` Xin Long
  2023-01-19 19:17                                   ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Xin Long @ 2023-01-19 18:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > > >
> > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > > >> my env (RHEL-8), and it breaks too:
> > > >>
> > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > >>
> > > >
> > > > Please make sure to use a not too old tcpdump.
> > > >
> > > >> it doesn't show what we want from the TCP header either.
> > > >>
> > > >> For the latest tcpdump on upstream, it can display headers well for
> > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > > >> that supports HBH parsing.
> > > >
> > > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > > a correct version.
> > >
> > > Both of those just fall under "if you want a new feature, update your
> > > tools."
> > >
> > >
> > > >
> > > >>
> > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > > >> and 'tshark' even supports it by default.
> > > >
> > > > Not with privacy _requirements_, where only the headers are captured.
> > > >
> > > > I am keeping a NACK, until you make sure you do not break this
> > > > important feature.
> > >
> > > I think the request here is to keep the snaplen in place (e.g., to make
> > > only headers visible to userspace) while also returning the >64kB packet
> > > length as meta data.
> > >
> > > My last pass on the packet socket code suggests this is possible;
> > > someone (Xin) needs to work through the details.
> > >
> > To be honest, I don't really like such a change in a packet socket,
> > I tried, and the code doesn't look nice.
> >
> > I'm thinking since skb->len is trustable, why don't we use
> > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> > namely, only change these 2 helpers to:
> >
> > static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> > struct iphdr *iph)
> > {
> >         u16 len = ntohs(iph->tot_len);
> >
> >         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
> >                 skb->len - skb_network_offset(skb);
> > }
> >
> > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> > {
> >         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> > }
> >
> > What do you think?
>
> I think this is a no go for me.
>
> I think I stated clearly what was the problem.
> If you care about TCP diagnostics, you want the truth, not truncated
> sequence ranges,
> making it impossible to know if a packet was sent.
Sorry Eric if I didn't get you well.

With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535),
all TCP headers will display well, no truncated sequence ranges:

#  ip net exec router tcpdump -i link1
13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483
13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483
13:36:46.675542 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1532895495:1532960978, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483
13:36:46.675550 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
seq 1533021985:1533087468, ack 1, win 504, options [nop,nop,TS val
2975547125 ecr 2379476018], length 65483

I just don't want to modify the iph tot_len in IPv4 header from the
raw data in the packet socket.
We're trying to avoid iph->tot_len too small for IPv4 BIG TCP to
display tcphdr in tcpdump, aren't we?
That's why I think using IP_MAX_MTU will avoid this.

>
> Without headers describing precisely payload length (solution taken in
> IPv6 BI TCP),
> you have to augment AF_PACKET to provide this information in
> additional meta-data.
For this, I agree to provide the >64kB packet into additional meta-data.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19 18:57                                 ` Xin Long
@ 2023-01-19 19:17                                   ` Eric Dumazet
  2023-01-19 19:30                                     ` Xin Long
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2023-01-19 19:17 UTC (permalink / raw)
  To: Xin Long
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Thu, Jan 19, 2023 at 7:59 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > > >
> > > > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > > > >
> > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > > > >> my env (RHEL-8), and it breaks too:
> > > > >>
> > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > >>
> > > > >
> > > > > Please make sure to use a not too old tcpdump.
> > > > >
> > > > >> it doesn't show what we want from the TCP header either.
> > > > >>
> > > > >> For the latest tcpdump on upstream, it can display headers well for
> > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > > > >> that supports HBH parsing.
> > > > >
> > > > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > > > a correct version.
> > > >
> > > > Both of those just fall under "if you want a new feature, update your
> > > > tools."
> > > >
> > > >
> > > > >
> > > > >>
> > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > > > >> and 'tshark' even supports it by default.
> > > > >
> > > > > Not with privacy _requirements_, where only the headers are captured.
> > > > >
> > > > > I am keeping a NACK, until you make sure you do not break this
> > > > > important feature.
> > > >
> > > > I think the request here is to keep the snaplen in place (e.g., to make
> > > > only headers visible to userspace) while also returning the >64kB packet
> > > > length as meta data.
> > > >
> > > > My last pass on the packet socket code suggests this is possible;
> > > > someone (Xin) needs to work through the details.
> > > >
> > > To be honest, I don't really like such a change in a packet socket,
> > > I tried, and the code doesn't look nice.
> > >
> > > I'm thinking since skb->len is trustable, why don't we use
> > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> > > namely, only change these 2 helpers to:
> > >
> > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> > > struct iphdr *iph)
> > > {
> > >         u16 len = ntohs(iph->tot_len);
> > >
> > >         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
> > >                 skb->len - skb_network_offset(skb);
> > > }
> > >
> > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> > > {
> > >         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> > > }
> > >
> > > What do you think?
> >
> > I think this is a no go for me.
> >
> > I think I stated clearly what was the problem.
> > If you care about TCP diagnostics, you want the truth, not truncated
> > sequence ranges,
> > making it impossible to know if a packet was sent.
> Sorry Eric if I didn't get you well.
>
> With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535),
> all TCP headers will display well, no truncated sequence ranges:
>
> #  ip net exec router tcpdump -i link1
> 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483
> 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483

This is completely truncated, don't you see this ?

According to tcpdump, we sent sequences 1532642515:1532707998 and
1532769005:1532834488

And payload was of  65483 bytes per packet (this is not true)

What happened for 1532707998 -> 1532769005 ???

How network engineers will know "oh wait, data was sent/received after all",
and not dropped somewhere in the network or in netfilter or ... in a kernel bug.

> 13:36:46.675542 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1532895495:1532960978, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483
> 13:36:46.675550 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> seq 1533021985:1533087468, ack 1, win 504, options [nop,nop,TS val
> 2975547125 ecr 2379476018], length 65483

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6
  2023-01-19 19:17                                   ` Eric Dumazet
@ 2023-01-19 19:30                                     ` Xin Long
  0 siblings, 0 replies; 45+ messages in thread
From: Xin Long @ 2023-01-19 19:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, network dev, davem, kuba, Paolo Abeni,
	Hideaki YOSHIFUJI, Pravin B Shelar, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Pablo Neira Ayuso, Florian Westphal,
	Marcelo Ricardo Leitner, Ilya Maximets, Aaron Conole,
	Roopa Prabhu, Nikolay Aleksandrov, Mahesh Bandewar, Paul Moore,
	Guillaume Nault

On Thu, Jan 19, 2023 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jan 19, 2023 at 7:59 PM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Jan 19, 2023 at 5:51 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 19, 2023 at 10:41 AM David Ahern <dsahern@gmail.com> wrote:
> > > > >
> > > > > On 1/18/23 8:13 PM, Eric Dumazet wrote:
> > > > > > On Thu, Jan 19, 2023 at 2:19 AM Xin Long <lucien.xin@gmail.com> wrote:
> > > > > >
> > > > > >> I think that IPv6 BIG TCP has a similar problem, below is the tcpdump in
> > > > > >> my env (RHEL-8), and it breaks too:
> > > > > >>
> > > > > >> 19:43:59.964272 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964282 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964292 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964300 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >> 19:43:59.964308 IP6 2001:db8:1::1 > 2001:db8:2::1: [|HBH]
> > > > > >>
> > > > > >
> > > > > > Please make sure to use a not too old tcpdump.
> > > > > >
> > > > > >> it doesn't show what we want from the TCP header either.
> > > > > >>
> > > > > >> For the latest tcpdump on upstream, it can display headers well for
> > > > > >> IPv6 BIG TCP. But we can't expect all systems to use the tcpdump
> > > > > >> that supports HBH parsing.
> > > > > >
> > > > > > User error. If an admin wants to diagnose TCP potential issues, it should use
> > > > > > a correct version.
> > > > >
> > > > > Both of those just fall under "if you want a new feature, update your
> > > > > tools."
> > > > >
> > > > >
> > > > > >
> > > > > >>
> > > > > >> For IPv4 BIG TCP, it's just a CFLAGS change to support it in "tcpdump,"
> > > > > >> and 'tshark' even supports it by default.
> > > > > >
> > > > > > Not with privacy _requirements_, where only the headers are captured.
> > > > > >
> > > > > > I am keeping a NACK, until you make sure you do not break this
> > > > > > important feature.
> > > > >
> > > > > I think the request here is to keep the snaplen in place (e.g., to make
> > > > > only headers visible to userspace) while also returning the >64kB packet
> > > > > length as meta data.
> > > > >
> > > > > My last pass on the packet socket code suggests this is possible;
> > > > > someone (Xin) needs to work through the details.
> > > > >
> > > > To be honest, I don't really like such a change in a packet socket,
> > > > I tried, and the code doesn't look nice.
> > > >
> > > > I'm thinking since skb->len is trustable, why don't we use
> > > > IP_MAX_MTU(0xFFFF) as iph->tot_len for IPv4 BIG TCP?
> > > > namely, only change these 2 helpers to:
> > > >
> > > > static inline unsigned int iph_totlen(const struct sk_buff *skb, const
> > > > struct iphdr *iph)
> > > > {
> > > >         u16 len = ntohs(iph->tot_len);
> > > >
> > > >         return (len < IP_MAX_MTU || !skb_is_gso_tcp(skb)) ? len :
> > > >                 skb->len - skb_network_offset(skb);
> > > > }
> > > >
> > > > static inline void iph_set_totlen(struct iphdr *iph, unsigned int len)
> > > > {
> > > >         iph->tot_len = len < IP_MAX_MTU ? htons(len) : htons(IP_MAX_MTU);
> > > > }
> > > >
> > > > What do you think?
> > >
> > > I think this is a no go for me.
> > >
> > > I think I stated clearly what was the problem.
> > > If you care about TCP diagnostics, you want the truth, not truncated
> > > sequence ranges,
> > > making it impossible to know if a packet was sent.
> > Sorry Eric if I didn't get you well.
> >
> > With new helpers, the iph->tot_len will be set to IP_MAX_MTU(65535),
> > all TCP headers will display well, no truncated sequence ranges:
> >
> > #  ip net exec router tcpdump -i link1
> > 13:36:46.675522 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> > seq 1532642515:1532707998, ack 1, win 504, options [nop,nop,TS val
> > 2975547125 ecr 2379476018], length 65483
> > 13:36:46.675534 IP 198.51.100.1.42289 > 203.0.113.1.45103: Flags [P.],
> > seq 1532769005:1532834488, ack 1, win 504, options [nop,nop,TS val
> > 2975547125 ecr 2379476018], length 65483
>
> This is completely truncated, don't you see this ?
OK, got you now.
Thanks for the explanation.

>
> According to tcpdump, we sent sequences 1532642515:1532707998 and
> 1532769005:1532834488
>
> And payload was of  65483 bytes per packet (this is not true)
>
> What happened for 1532707998 -> 1532769005 ???
>
> How network engineers will know "oh wait, data was sent/received after all",
> and not dropped somewhere in the network or in netfilter or ... in a kernel bug.
>
I will work on providing the >64kB packet length in meta-data instead.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2023-01-19 19:32 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-14  3:31 [PATCH net-next 00/10] net: support ipv4 big tcp Xin Long
2023-01-14  3:31 ` [PATCH net-next 01/10] net: add a couple of helpers for iph tot_len Xin Long
2023-01-14  3:31 ` [PATCH net-next 02/10] bridge: use skb_ip_totlen in br netfilter Xin Long
2023-01-14  3:31 ` [PATCH net-next 03/10] openvswitch: use skb_ip_totlen in conntrack Xin Long
2023-01-14  3:31 ` [PATCH net-next 04/10] net: sched: use skb_ip_totlen and iph_totlen Xin Long
2023-01-14  3:31 ` [PATCH net-next 05/10] netfilter: " Xin Long
2023-01-14  3:31 ` [PATCH net-next 06/10] cipso_ipv4: use iph_set_totlen in skbuff_setattr Xin Long
2023-01-14 15:38   ` Paul Moore
2023-01-14 17:52     ` Xin Long
2023-01-16 16:45       ` Paul Moore
2023-01-16 17:36         ` Xin Long
2023-01-16 18:12           ` Paul Moore
2023-01-16 19:33             ` Xin Long
2023-01-17  4:54               ` David Ahern
2023-01-17 19:51               ` Paul Moore
2023-01-17 22:46                 ` Paul Moore
2023-01-18  2:47                   ` David Ahern
2023-01-18 19:18                     ` Paul Moore
2023-01-14  3:31 ` [PATCH net-next 07/10] ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr Xin Long
2023-01-14  3:31 ` [PATCH net-next 08/10] net: add support for ipv4 big tcp Xin Long
2023-01-14  3:31 ` [PATCH net-next 09/10] netfilter: get ipv6 pktlen properly in length_mt6 Xin Long
2023-01-15 15:41   ` David Ahern
2023-01-15 17:42     ` Xin Long
2023-01-15 19:40       ` Eric Dumazet
2023-01-15 20:14         ` Xin Long
2023-01-15 23:57           ` David Ahern
2023-01-16  9:24           ` Eric Dumazet
2023-01-16 15:07             ` David Ahern
2023-01-16 16:02               ` Eric Dumazet
2023-01-16 19:09                 ` Xin Long
2023-01-16 20:37                   ` Eric Dumazet
2023-01-17 15:47                     ` Xin Long
2023-01-19  1:18                       ` Xin Long
2023-01-19  3:13                         ` Eric Dumazet
2023-01-19 15:41                           ` David Ahern
2023-01-19 16:49                             ` Xin Long
2023-01-19 18:10                               ` Eric Dumazet
2023-01-19 18:57                                 ` Xin Long
2023-01-19 19:17                                   ` Eric Dumazet
2023-01-19 19:30                                     ` Xin Long
2023-01-15 23:58         ` David Ahern
2023-01-14  3:31 ` [PATCH net-next 10/10] selftests: add a selftest for big tcp Xin Long
2023-01-15 15:45 ` [PATCH net-next 00/10] net: support ipv4 " David Ahern
2023-01-15 16:04 ` Eric Dumazet
2023-01-15 17:33   ` Xin Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.