Netdev Archive on lore.kernel.org
* [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling
@ 2021-01-12 17:45 Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

This patchset drops all the MTU checks in TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU on a
netdev (via ifindex) and fib route lookup level. This means some
BPF-helpers are added and extended to make it possible to do MTU checks
in the BPF-code.

If the BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped at some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks. Specifically, doing TC-redirect from the ingress step
(sch_handle_ingress) into the egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kinds of
MTU violations.

One advantage of this approach is that an ingress BPF-prog can send
information to an egress BPF-prog via packet data. With the MTU checks
removed in the helpers, and also not done in the skb_do_redirect() call,
this works as long as the egress BPF-prog removes this data prior to
transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also work for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest
V7: Fix logic using tot_len and add another selftest
V8: Add better selftests for BPF-helper bpf_check_mtu
V9: Remove patch that use skb_set_redirected
V10: Fix selftests and 'tot_len' MTU check like XDP
V11: Fix nitpicks in selftests

---

Jesper Dangaard Brouer (7):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
      bpf/selftests: tests using bpf_check_mtu BPF-helper


 include/linux/netdevice.h                          |   31 +++
 include/uapi/linux/bpf.h                           |   78 +++++++
 net/core/dev.c                                     |   19 --
 net/core/filter.c                                  |  183 +++++++++++++++--
 tools/include/uapi/linux/bpf.h                     |   78 +++++++
 tools/testing/selftests/bpf/prog_tests/check_mtu.c |  216 ++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_check_mtu.c |  198 ++++++++++++++++++
 .../selftests/bpf/progs/test_cls_redirect.c        |    7 +
 8 files changed, 766 insertions(+), 44 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

--



* [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-14  8:18   ` John Fastabend
  2021-01-12 17:45 ` [PATCH bpf-next V11 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx Jesper Dangaard Brouer
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

Multiple BPF-helpers that can manipulate/increase the size of the SKB use
__bpf_skb_max_len() as the max-length. This function limits the size
against the current net_device MTU (skb->dev->mtu).

When a BPF-prog grows the packet size, it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this packet
should be allowed to increase the size. Furthermore, the current MTU check
in __bpf_skb_max_len uses the MTU from the ingress/current net_device,
which in case of redirects is the wrong net_device.

This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
is elsewhere in the system. Jesper's testing[1] showed it was not possible
to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting
factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the
SLUB-allocator (CONFIG_SLUB) when PAGE_SIZE is 4096. This define is in
effect because this is called from softirq context; see the code in
__gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
that frames above 16KiB can cause NICs to reset (but not crash). Keep this
sanity limit at this level, as the memory layer can differ based on kernel
config.
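
As a rough illustration of the change in limit (not the kernel code), the
following userspace sketch compares the old MTU-derived cap with the new
fixed cap. All model_* names and MODEL_SKB_MAX_ALLOC are hypothetical; the
16KiB value is taken from the description above, and the grow check is a
simplification of what the helpers do.

```c
#include <stdint.h>

/* Hypothetical userspace model; names are illustrative, not kernel code. */
#define MODEL_SKB_MAX_ALLOC (16u * 1024u) /* 16KiB, as stated above */

/* Old limit: tied to the MTU of the device the skb currently sits on. */
uint32_t model_old_max_len(uint32_t dev_mtu, uint32_t hard_header_len)
{
	return dev_mtu + hard_header_len;
}

/* New limit: a fixed sanity cap, independent of any net_device. */
uint32_t model_new_max_len(void)
{
	return MODEL_SKB_MAX_ALLOC;
}

/* Returns 1 if growing cur_len by grow bytes stays within max_len. */
int model_grow_allowed(uint32_t cur_len, uint32_t grow, uint32_t max_len)
{
	return cur_len + grow <= max_len;
}
```

With a 1500-byte MTU and a 14-byte L2 header, growing a 1500-byte packet by
50 bytes is refused under the old limit and accepted under the new one.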

[1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests

V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 255aeee72402..f8f198252ff2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 	return 0;
 }
 
-static u32 __bpf_skb_max_len(const struct sk_buff *skb)
-{
-	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
-			  SKB_MAX_ALLOC;
-}
+#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC
 
 BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 	   u32, mode, u64, flags)
@@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 {
 	u32 len_cur, len_diff_abs = abs(len_diff);
 	u32 len_min = bpf_skb_net_base_len(skb);
-	u32 len_max = __bpf_skb_max_len(skb);
+	u32 len_max = BPF_SKB_MAX_LEN;
 	__be16 proto = skb->protocol;
 	bool shrink = len_diff < 0;
 	u32 off;
@@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 min_len = __bpf_skb_min_len(skb);
 	int ret;
 
@@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = {
 static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 new_len = skb->len + head_room;
 	int ret;
 




* [PATCH bpf-next V11 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

A BPF end-user on the Cilium Slack channel (Carlo Carraro) wants to use
bpf_fib_lookup for doing an MTU check, but *prior* to extending the packet
size, by adjusting fib_params 'tot_len' with the packet length plus the
expected encap size (just like the bpf_check_mtu helper supports). He
discovered that for SKB ctx the param->tot_len was not used; instead
skb->len was used (via the MTU check in is_skb_forwardable() that checks
against the netdev MTU).

Fix this by using fib_params 'tot_len' for the MTU check. If not provided
(i.e. zero) then keep the existing TC behaviour intact. Notice that the
'tot_len' MTU check is done like the XDP code-path, which checks against
the FIB-dst MTU.
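
The selection between the two checks can be sketched in userspace as
below. This is only a model under stated assumptions: the function and enum
names are hypothetical, the return values are model-only (they are not the
BPF_FIB_LKUP_RET_* values), and the fallback branch ignores the extra
allowances that is_skb_forwardable() makes (e.g. for GSO packets).

```c
#include <stdint.h>

/* Model-only return codes; not the kernel's BPF_FIB_LKUP_RET_* values. */
enum model_ret { MODEL_SUCCESS = 0, MODEL_FRAG_NEEDED = 1 };

/*
 * Hypothetical model of the MTU decision in the SKB variant of the lookup.
 * tot_len != 0: check the caller-supplied L3 length against the route MTU.
 * tot_len == 0: keep the old behaviour, current skb length vs. netdev MTU.
 */
int model_skb_fib_mtu_check(uint16_t tot_len, uint32_t route_mtu,
			    uint32_t skb_len, uint32_t dev_mtu,
			    uint32_t hard_header_len)
{
	if (tot_len)
		return tot_len > route_mtu ? MODEL_FRAG_NEEDED : MODEL_SUCCESS;

	/* Simplified stand-in for is_skb_forwardable() */
	return skb_len > dev_mtu + hard_header_len ?
		MODEL_FRAG_NEEDED : MODEL_SUCCESS;
}
```

Note how a non-zero tot_len makes the current skb length irrelevant, which
is what allows checking a planned encap size before growing the packet.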

V10:
- Use same method as XDP for 'tot_len' MTU check

Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro <colrack@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index f8f198252ff2..c1e460193bae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5548,6 +5548,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 {
 	struct net *net = dev_net(skb->dev);
 	int rc = -EAFNOSUPPORT;
+	bool check_mtu = false;
 
 	if (plen < sizeof(*params))
 		return -EINVAL;
@@ -5555,22 +5556,28 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT))
 		return -EINVAL;
 
+	if (params->tot_len)
+		check_mtu = true;
+
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu);
 		break;
 #endif
 	}
 
-	if (!rc) {
+	if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) {
 		struct net_device *dev;
 
+		/* When tot_len isn't provided by user,
+		 * check skb against net_device MTU
+		 */
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;




* [PATCH bpf-next V11 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The
BPF-prog doesn't know the MTU value that caused this rejection.

If the BPF-prog wants to implement PMTU (Path MTU Discovery, RFC 1191) it
needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain
this MTU value as output via a union with 'tot_len', as this is the value
used for the MTU lookup.
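
A small userspace sketch of the union aliasing may help when reading the
diff below. The struct and function here are hypothetical stand-ins (only
the member names tot_len and mtu_result come from this patch), and the
return value 1 merely represents the FRAG_NEEDED case.

```c
#include <stdint.h>

/* Hypothetical cut-down stand-in for struct bpf_fib_lookup. */
struct model_fib_params {
	union {
		uint16_t tot_len;    /* input: L3 length to check */
		uint16_t mtu_result; /* output: MTU found by the lookup */
	};
};

/*
 * Model of an MTU-checking lookup: the MTU is written back through the
 * union, so it overwrites the tot_len input in both outcomes.
 * Returns 1 when the length exceeds the MTU, else 0.
 */
int model_fib_mtu_lookup(struct model_fib_params *p, uint32_t route_mtu)
{
	int too_big = p->tot_len > route_mtu;

	p->mtu_result = (uint16_t)route_mtu; /* union with tot_len */
	return too_big;
}
```

Since the two members share storage, a caller that reuses the params struct
for another lookup would need to set tot_len again first.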

V5:
 - Fixed uninit value spotted by Dan Carpenter.
 - Name struct output member mtu_result

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   11 +++++++++--
 net/core/filter.c              |   22 +++++++++++++++-------
 tools/include/uapi/linux/bpf.h |   11 +++++++++--
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77d7c1bb2923..649586d656b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2225,6 +2225,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and output params->mtu_result contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* L3 length from network hdr (iph->tot_len) */
 
+		/* output: MTU value */
+		__u16	mtu_result;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index c1e460193bae..db59ab55572c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5268,12 +5268,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	if (mtu)
+		params->mtu_result = mtu; /* union with tot_len */
 
 	return 0;
 }
@@ -5289,8 +5291,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct net_device *dev;
 	struct fib_result res;
 	struct flowi4 fl4;
+	u32 mtu = 0;
 	int err;
-	u32 mtu;
 
 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -5357,8 +5359,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5392,7 +5396,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5409,7 +5413,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif, err;
-	u32 mtu;
+	u32 mtu = 0;
 
 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -5484,8 +5488,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5505,7 +5511,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5581,6 +5587,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
+
+		params->mtu_result = dev->mtu; /* union with tot_len */
 	}
 
 	return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 77d7c1bb2923..649586d656b6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2225,6 +2225,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and output params->mtu_result contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4975,9 +4978,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* L3 length from network hdr (iph->tot_len) */
 
+		/* output: MTU value */
+		__u16	mtu_result;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */




* [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2021-01-12 17:45 ` [PATCH bpf-next V11 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-12 19:23   ` Andrii Nakryiko
  2021-01-13 23:07   ` Daniel Borkmann
  2021-01-12 17:45 ` [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The SKB object is complex and the skb->len value (accessible from
BPF-prog) also includes the length of any extra GRO/GSO segments, but
without taking into account that these GRO/GSO segments get transport (L4)
and network (L3) headers added before being transmitted. Thus, this
BPF-helper is created such that the BPF-programmer doesn't need to handle
these details in the BPF-prog.

The API is designed to help the BPF-programmer who wants to do packet
context size changes, which involve other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse arguments needed
by these other helpers, and perform the MTU check prior to doing any actual
size adjustment of the packet context.

It is on purpose that we allow the len adjustment to become a negative
result, which will pass the MTU check. This might seem weird, but it's not
this helper's responsibility to "catch" wrong len_diff adjustments. Other
helpers will take care of these checks, if the BPF-programmer chooses to do
the actual size adjustment.
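
The basic length comparison (without the GSO handling) can be modelled in
userspace roughly as follows. The function and enum names are hypothetical
and the return codes are model-only; the comparison against MTU plus L2
header length and the always-populated mtu_len mirror what the helper in
this patch does for the non-GSO case.

```c
#include <stdint.h>

/* Model-only return codes, standing in for BPF_MTU_CHK_RET_*. */
enum { MODEL_MTU_OK = 0, MODEL_MTU_FRAG_NEEDED = 1 };

/*
 * Hypothetical model of the non-GSO check: the planned length
 * (pkt_len + len_diff) is compared against MTU + L2 header length.
 * A negative planned length passes, as described above.
 */
int model_check_mtu(int pkt_len, int32_t len_diff, uint32_t mtu,
		    uint32_t hard_header_len, uint32_t *mtu_len)
{
	int dev_len = (int)(mtu + hard_header_len); /* MTU is an L3 size */
	int new_len = pkt_len + len_diff;

	*mtu_len = mtu; /* reported on success and on violation */

	return new_len <= dev_len ? MODEL_MTU_OK : MODEL_MTU_FRAG_NEEDED;
}
```

For example, with MTU 1500 and a 14-byte L2 header, a 1514-byte frame
passes with len_diff 0 but is flagged with len_diff 20, while a len_diff
that drives the planned length negative still passes.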

V9:
- Use dev->hard_header_len (instead of ETH_HLEN)
- Annotate with unlikely req from Daniel
- Fix logic error using skb_gso_validate_network_len from Daniel

V6:
- Took John's advice and dropped BPF_MTU_CHK_RELAX
- Returned MTU is kept at L3-level (like fib_lookup)

V4: Lot of changes
 - ifindex 0 now use current netdev for MTU lookup
 - rename helper from bpf_mtu_check to bpf_check_mtu
 - fix bug for GSO pkt length (as skb->len is total len)
 - remove __bpf_len_adj_positive, simply allow negative len adj

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
 net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
 3 files changed, 256 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 649586d656b6..fa2e99351758 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3833,6 +3833,61 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features).  This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device.  Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_len* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3998,6 +4053,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum  bpf_check_mtu_flags {
+	BPF_MTU_CHK_SEGS  = (1U << 0),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index db59ab55572c..3f2e593244ca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5604,6 +5604,124 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
+					    u32 ifindex)
+{
+	struct net *netns = dev_net(dev_curr);
+
+	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
+	if (ifindex == 0)
+		return dev_curr;
+
+	return dev_get_by_index_rcu(netns, ifindex);
+}
+
+BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
+	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
+{
+	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+	struct net_device *dev = skb->dev;
+	int skb_len, dev_len;
+	int mtu;
+
+	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
+		return -EINVAL;
+
+	dev = __dev_via_ifindex(dev, ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	mtu = READ_ONCE(dev->mtu);
+
+	dev_len = mtu + dev->hard_header_len;
+	skb_len = skb->len + len_diff; /* minus result pass check */
+	if (skb_len <= dev_len) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+		goto out;
+	}
+	/* At this point, skb->len exceed MTU, but as it include length of all
+	 * segments, it can still be below MTU.  The SKB can possibly get
+	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
+	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
+	 * checked against MTU.
+	 */
+	if (skb_is_gso(skb)) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+
+		if (!(flags & BPF_MTU_CHK_SEGS))
+			goto out;
+
+		if (!skb_gso_validate_network_len(skb, mtu)) {
+			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
+			goto out;
+		}
+
+		skb_len = skb_headlen(skb) + len_diff;
+		if (skb_len > dev_len) {
+			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+			goto out;
+		}
+	}
+out:
+	/* BPF verifier guarantees valid pointer */
+	*mtu_len = mtu;
+
+	return ret;
+}
+
+BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp,
+	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
+{
+	struct net_device *dev = xdp->rxq->dev;
+	int xdp_len = xdp->data_end - xdp->data;
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	int mtu, dev_len;
+
+	/* XDP variant doesn't support multi-buffer segment check (yet) */
+	if (unlikely(flags))
+		return -EINVAL;
+
+	dev = __dev_via_ifindex(dev, ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	mtu = READ_ONCE(dev->mtu);
+
+	/* Add L2-header as dev MTU is L3 size */
+	dev_len = mtu + dev->hard_header_len;
+
+	xdp_len += len_diff; /* minus result pass check */
+	if (xdp_len > dev_len)
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+
+	/* BPF verifier guarantees valid pointer */
+	*mtu_len = mtu;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_check_mtu_proto = {
+	.func		= bpf_skb_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_INT,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_check_mtu_proto = {
+	.func		= bpf_xdp_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_INT,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
 {
@@ -7173,6 +7291,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_skb_check_mtu_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
 	case BPF_FUNC_sk_storage_get:
@@ -7242,6 +7362,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_xdp_check_mtu_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 649586d656b6..fa2e99351758 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3833,6 +3833,61 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is harder to
+ *			check at this point, because in transmit path it is
+ *			possible for the skb packet to get re-segmented
+ *			(depending on net device features).  This could still be
+ *			a MTU violation, so this flag enables performing MTU
+ *			check against segments, with a different violation
+ *			return code to tell it apart. Check cannot use len_diff.
+ *
+ *		On return *mtu_len* pointer contains the MTU value of the net
+ *		device.  Remember the net device configured MTU is the L3 size,
+ *		which is returned here and XDP and TX length operate at L2.
+ *		Helper take this into account for you, but remember when using
+ *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
+ *		pointer and be initialized (to zero), else verifier will reject
+ *		BPF program.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_len* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_len* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_len* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3998,6 +4053,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum  bpf_check_mtu_flags {
+	BPF_MTU_CHK_SEGS  = (1U << 0),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */




* [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2021-01-12 17:45 ` [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-14  9:03   ` John Fastabend
  2021-01-12 17:45 ` [PATCH bpf-next V11 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper Jesper Dangaard Brouer
  6 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

The use-case for dropping the MTU check when TC-BPF redirects to ingress
is described by Eyal Birger in email[0]. In summary, it gives the ability
to increase the packet size (e.g. with IPv6 headers for NAT64), redirect
the packet to ingress, and let the normal netstack fragment it as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
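
As a rough illustration of the NAT64 case above, the following userspace
sketch models the forwardability decision with and without the MTU check.
It is not kernel code: the model_* names and the scalar parameters are
assumptions made for this example. Only the order of the checks (device up,
optional MTU check with VLAN headroom, GSO exemption) follows
__is_skb_forwardable() as introduced in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_VLAN_HLEN 4u /* same headroom the patch allows for a VLAN tag */

/* Userspace model (hypothetical) of the decision in __is_skb_forwardable(). */
static bool model_is_forwardable(bool dev_up, uint32_t mtu,
				 uint32_t hard_header_len, uint32_t pkt_len,
				 bool is_gso, bool check_mtu)
{
	if (!dev_up)
		return false;	/* IFF_UP is required in every case */

	if (!check_mtu)
		return true;	/* ingress redirect: MTU is left to the netstack */

	if (pkt_len <= mtu + hard_header_len + MODEL_VLAN_HLEN)
		return true;

	return is_gso;		/* GSO packets may be segmented later */
}
```

For example, an IPv4 packet of 1514 bytes at L2 that is translated to IPv6
grows by 20 bytes (40-byte IPv6 header instead of a 20-byte IPv4 header
without options). With mtu 1500 and hard_header_len 14 the limit is 1518, so
the model rejects the 1534-byte packet when check_mtu is true and accepts it
when check_mtu is false.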

V9:
 - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect

V4:
 - Keep net_device "up" (IFF_UP) check.
 - Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
 net/core/dev.c            |   19 ++-----------------
 net/core/filter.c         |   14 +++++++++++---
 3 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1ec3ac5d5bbf..438943f98a9c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3930,11 +3930,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
+						 const struct sk_buff *skb,
+						 const bool check_mtu)
+{
+	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+	unsigned int len;
+
+	if (!(dev->flags & IFF_UP))
+		return false;
+
+	if (!check_mtu)
+		return true;
+
+	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+	if (skb->len <= len)
+		return true;
+
+	/* if TSO is enabled, we don't care about the length as the packet
+	 * could be forwarded without being segmented before
+	 */
+	if (skb_is_gso(skb))
+		return true;
+
+	return false;
+}
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool check_mtu)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 55499b017a42..51bc7f079958 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2194,28 +2194,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 
 bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 {
-	unsigned int len;
-
-	if (!(dev->flags & IFF_UP))
-		return false;
-
-	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
-	if (skb->len <= len)
-		return true;
-
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
-	 */
-	if (skb_is_gso(skb))
-		return true;
-
-	return false;
+	return __is_skb_forwardable(dev, skb, true);
 }
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index 3f2e593244ca..1908800b671c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;
@@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			goto out_drop;
 		dev = ops->ndo_get_peer_dev(dev);
 		if (unlikely(!dev ||
-			     !is_skb_forwardable(dev, skb) ||
+			     !(dev->flags & IFF_UP) ||
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
 		skb->dev = dev;




* [PATCH bpf-next V11 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2021-01-12 17:45 ` [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-12 17:45 ` [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper Jesper Dangaard Brouer
  6 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

This demonstrates how the bpf_check_mtu() helper can easily be used
together with the bpf_skb_adjust_room() helper, prior to doing the size
adjustment, as the delta argument is already set up.
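
The check-then-adjust pattern can be sketched in plain userspace C. This is
a model, not the helper itself: model_check_mtu() and model_grow_if_fits()
are hypothetical names, and only the non-GSO length comparison (packet
length plus len_diff against MTU plus hard_header_len) is taken from
patch 4/7 of this series.

```c
#include <stdint.h>

enum model_ret {
	MODEL_RET_SUCCESS,	/* mirrors BPF_MTU_CHK_RET_SUCCESS */
	MODEL_RET_FRAG_NEEDED,	/* mirrors BPF_MTU_CHK_RET_FRAG_NEEDED */
};

/* Model of the non-GSO path: compare the planned L2 length with the limit. */
static int model_check_mtu(uint32_t pkt_len, uint32_t mtu,
			   uint32_t hard_header_len, int32_t len_diff,
			   uint32_t *mtu_len)
{
	int64_t planned = (int64_t)pkt_len + len_diff; /* may go negative */
	int64_t limit = (int64_t)mtu + hard_header_len;

	*mtu_len = mtu;	/* reported on success and on violation */
	return planned <= limit ? MODEL_RET_SUCCESS : MODEL_RET_FRAG_NEEDED;
}

/* Check first, then adjust, reusing the same delta for both steps. */
static int model_grow_if_fits(uint32_t *pkt_len, uint32_t mtu,
			      uint32_t hard_header_len, int32_t delta)
{
	uint32_t mtu_len = 0;
	int ret = model_check_mtu(*pkt_len, mtu, hard_header_len, delta,
				  &mtu_len);

	if (ret != MODEL_RET_SUCCESS)
		return ret;	/* the selftest returns TC_ACT_SHOT here */

	*pkt_len += delta;	/* stands in for bpf_skb_adjust_room() */
	return MODEL_RET_SUCCESS;
}
```

In this model a 1500-byte packet on a device with mtu 1500 and
hard_header_len 14 can grow by 14 bytes but not by 24, and a rejected
adjustment leaves the length untouched.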

Hint: This specific test can be selected like this:
 ./test_progs -t cls_redirect

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 .../selftests/bpf/progs/test_cls_redirect.c        |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
index c9f8464996ea..3c1e042962e6 100644
--- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c
@@ -70,6 +70,7 @@ typedef struct {
 	uint64_t errors_total_encap_adjust_failed;
 	uint64_t errors_total_encap_buffer_too_small;
 	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
 } metrics_t;
 
 typedef enum {
@@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
 	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
 	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;
 
 	/* Loop protection: the inner packet's TTL is decremented as a safeguard
 	 * against any forwarding loop. As the only interesting field is the TTL
@@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e
 		}
 	}
 
+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
 	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
 				BPF_F_ADJ_ROOM_FIXED_GSO |
 				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||




* [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
  2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (5 preceding siblings ...)
  2021-01-12 17:45 ` [PATCH bpf-next V11 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect Jesper Dangaard Brouer
@ 2021-01-12 17:45 ` Jesper Dangaard Brouer
  2021-01-12 19:29   ` Andrii Nakryiko
  6 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-12 17:45 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

Add selftest for the BPF-helper bpf_check_mtu(), making sure
it can be used from both XDP and TC.

V11:
 - Address nitpicks from Andrii Nakryiko

V10:
 - Remove errno non-zero test in CHECK_ATTR()
 - Address comments from Andrii Nakryiko

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 tools/testing/selftests/bpf/prog_tests/check_mtu.c |  216 ++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_check_mtu.c |  198 ++++++++++++++++++
 2 files changed, 414 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c
new file mode 100644
index 000000000000..9e2fd01b7c65
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Jesper Dangaard Brouer */
+
+#include <linux/if_link.h> /* before test_progs.h, avoid bpf_util.h redefines */
+#include <test_progs.h>
+#include "test_check_mtu.skel.h"
+#include "network_helpers.h"
+
+#include <stdlib.h>
+#include <inttypes.h>
+
+#define IFINDEX_LO 1
+
+static __u32 duration; /* Hint: needed for CHECK macro */
+
+static int read_mtu_device_lo(void)
+{
+	const char *filename = "/sys/class/net/lo/mtu";
+	char buf[11] = {};
+	int value, n, fd;
+
+	fd = open(filename, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	n = read(fd, buf, sizeof(buf) - 1); /* keep the terminating NUL */
+	close(fd);
+
+	if (n == -1)
+		return -2;
+
+	value = strtoimax(buf, NULL, 10);
+	if (errno == ERANGE)
+		return -3;
+
+	return value;
+}
+
+static void test_check_mtu_xdp_attach(void)
+{
+	struct bpf_link_info link_info;
+	__u32 link_info_len = sizeof(link_info);
+	struct test_check_mtu *skel;
+	struct bpf_program *prog;
+	struct bpf_link *link;
+	int err = 0;
+	int fd;
+
+	skel = test_check_mtu__open_and_load();
+	if (CHECK(!skel, "open and load skel", "failed"))
+		return; /* Exit if e.g. helper unknown to kernel */
+
+	prog = skel->progs.xdp_use_helper_basic;
+
+	link = bpf_program__attach_xdp(prog, IFINDEX_LO);
+	if (CHECK(IS_ERR(link), "link_attach", "failed: %ld\n", PTR_ERR(link)))
+		goto out;
+	skel->links.xdp_use_helper_basic = link;
+
+	memset(&link_info, 0, sizeof(link_info));
+	fd = bpf_link__fd(link);
+	err = bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
+	if (CHECK(err, "link_info", "failed: %d\n", err))
+		goto out;
+
+	CHECK(link_info.type != BPF_LINK_TYPE_XDP, "link_type",
+	      "got %u != exp %u\n", link_info.type, BPF_LINK_TYPE_XDP);
+	CHECK(link_info.xdp.ifindex != IFINDEX_LO, "link_ifindex",
+	      "got %u != exp %u\n", link_info.xdp.ifindex, IFINDEX_LO);
+
+	err = bpf_link__detach(link);
+	CHECK(err, "link_detach", "failed %d\n", err);
+
+out:
+	test_check_mtu__destroy(skel);
+}
+
+static void test_check_mtu_run_xdp(struct test_check_mtu *skel,
+				   struct bpf_program *prog,
+				   __u32 mtu_expect)
+{
+	const char *prog_name = bpf_program__name(prog);
+	int retval_expect = XDP_PASS;
+	__u32 mtu_result = 0;
+	char buf[256] = {};
+	int err;
+	struct bpf_prog_test_run_attr tattr = {
+		.repeat = 1,
+		.data_in = &pkt_v4,
+		.data_size_in = sizeof(pkt_v4),
+		.data_out = buf,
+		.data_size_out = sizeof(buf),
+		.prog_fd = bpf_program__fd(prog),
+	};
+
+	err = bpf_prog_test_run_xattr(&tattr);
+	CHECK_ATTR(err != 0, "bpf_prog_test_run",
+		   "prog_name:%s (err %d errno %d retval %d)\n",
+		   prog_name, err, errno, tattr.retval);
+
+	CHECK(tattr.retval != retval_expect, "retval",
+	      "progname:%s unexpected retval=%d expected=%d\n",
+	      prog_name, tattr.retval, retval_expect);
+
+	/* Extract MTU that BPF-prog got */
+	mtu_result = skel->bss->global_bpf_mtu_xdp;
+	ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user");
+}
+
+
+static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex)
+{
+	struct test_check_mtu *skel;
+	int err;
+
+	skel = test_check_mtu__open();
+	if (CHECK(!skel, "skel_open", "failed"))
+		return;
+
+	/* Update "constants" in BPF-prog *BEFORE* libbpf load */
+	skel->rodata->GLOBAL_USER_MTU = mtu;
+	skel->rodata->GLOBAL_USER_IFINDEX = ifindex;
+
+	err = test_check_mtu__load(skel);
+	if (CHECK(err, "skel_load", "failed: %d\n", err))
+		goto cleanup;
+
+	test_check_mtu_run_xdp(skel, skel->progs.xdp_use_helper, mtu);
+	test_check_mtu_run_xdp(skel, skel->progs.xdp_exceed_mtu, mtu);
+	test_check_mtu_run_xdp(skel, skel->progs.xdp_minus_delta, mtu);
+
+cleanup:
+	test_check_mtu__destroy(skel);
+}
+
+static void test_check_mtu_run_tc(struct test_check_mtu *skel,
+				  struct bpf_program *prog,
+				  __u32 mtu_expect)
+{
+	const char *prog_name = bpf_program__name(prog);
+	int retval_expect = BPF_OK;
+	__u32 mtu_result = 0;
+	char buf[256] = {};
+	int err;
+	struct bpf_prog_test_run_attr tattr = {
+		.repeat = 1,
+		.data_in = &pkt_v4,
+		.data_size_in = sizeof(pkt_v4),
+		.data_out = buf,
+		.data_size_out = sizeof(buf),
+		.prog_fd = bpf_program__fd(prog),
+	};
+
+	err = bpf_prog_test_run_xattr(&tattr);
+	CHECK_ATTR(err != 0, "bpf_prog_test_run",
+		   "prog_name:%s (err %d errno %d retval %d)\n",
+		   prog_name, err, errno, tattr.retval);
+
+	CHECK(tattr.retval != retval_expect, "retval",
+	      "progname:%s unexpected retval=%d expected=%d\n",
+	      prog_name, tattr.retval, retval_expect);
+
+	/* Extract MTU that BPF-prog got */
+	mtu_result = skel->bss->global_bpf_mtu_tc;
+	ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user");
+}
+
+
+static void test_check_mtu_tc(__u32 mtu, __u32 ifindex)
+{
+	struct test_check_mtu *skel;
+	int err;
+
+	skel = test_check_mtu__open();
+	if (CHECK(!skel, "skel_open", "failed"))
+		return;
+
+	/* Update "constants" in BPF-prog *BEFORE* libbpf load */
+	skel->rodata->GLOBAL_USER_MTU = mtu;
+	skel->rodata->GLOBAL_USER_IFINDEX = ifindex;
+
+	err = test_check_mtu__load(skel);
+	if (CHECK(err, "skel_load", "failed: %d\n", err))
+		goto cleanup;
+
+	test_check_mtu_run_tc(skel, skel->progs.tc_use_helper, mtu);
+	test_check_mtu_run_tc(skel, skel->progs.tc_exceed_mtu, mtu);
+	test_check_mtu_run_tc(skel, skel->progs.tc_exceed_mtu_da, mtu);
+	test_check_mtu_run_tc(skel, skel->progs.tc_minus_delta, mtu);
+cleanup:
+	test_check_mtu__destroy(skel);
+}
+
+void test_check_mtu(void)
+{
+	int mtu_lo; /* signed, so the error check below can trigger */
+
+	if (test__start_subtest("bpf_check_mtu XDP-attach"))
+		test_check_mtu_xdp_attach();
+
+	mtu_lo = read_mtu_device_lo();
+	if (CHECK(mtu_lo < 0, "reading MTU value", "failed (err:%d)", mtu_lo))
+		return;
+
+	if (test__start_subtest("bpf_check_mtu XDP-run"))
+		test_check_mtu_xdp(mtu_lo, 0);
+
+	if (test__start_subtest("bpf_check_mtu XDP-run ifindex-lookup"))
+		test_check_mtu_xdp(mtu_lo, IFINDEX_LO);
+
+	if (test__start_subtest("bpf_check_mtu TC-run"))
+		test_check_mtu_tc(mtu_lo, 0);
+
+	if (test__start_subtest("bpf_check_mtu TC-run ifindex-lookup"))
+		test_check_mtu_tc(mtu_lo, IFINDEX_LO);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_check_mtu.c b/tools/testing/selftests/bpf/progs/test_check_mtu.c
new file mode 100644
index 000000000000..1b31d5ceb3c7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_check_mtu.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Jesper Dangaard Brouer */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/if_ether.h>
+
+#include <stddef.h>
+#include <stdint.h>
+
+char _license[] SEC("license") = "GPL";
+
+/* Userspace will update with MTU it can see on device */
+static volatile const int GLOBAL_USER_MTU;
+static volatile const __u32 GLOBAL_USER_IFINDEX;
+
+/* BPF-prog will update these with MTU values it can see */
+__u32 global_bpf_mtu_xdp = 0;
+__u32 global_bpf_mtu_tc  = 0;
+
+SEC("xdp")
+int xdp_use_helper_basic(struct xdp_md *ctx)
+{
+	__u32 mtu_len = 0;
+
+	if (bpf_check_mtu(ctx, 0, &mtu_len, 0, 0))
+		return XDP_ABORTED;
+
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_use_helper(struct xdp_md *ctx)
+{
+	int retval = XDP_PASS; /* Expected retval on successful test */
+	__u32 mtu_len = 0;
+	__u32 ifindex = 0;
+	int delta = 0;
+
+	/* When ifindex is zero, save net_device lookup and use ctx netdev */
+	if (GLOBAL_USER_IFINDEX > 0)
+		ifindex = GLOBAL_USER_IFINDEX;
+
+	if (bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0)) {
+		/* mtu_len is also valid when the check fails */
+		retval = XDP_ABORTED;
+		goto out;
+	}
+
+	if (mtu_len != GLOBAL_USER_MTU)
+		retval = XDP_DROP;
+
+out:
+	global_bpf_mtu_xdp = mtu_len;
+	return retval;
+}
+
+SEC("xdp")
+int xdp_exceed_mtu(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	__u32 ifindex = GLOBAL_USER_IFINDEX;
+	__u32 data_len = data_end - data;
+	int retval = XDP_ABORTED; /* Fail */
+	__u32 mtu_len = 0;
+	int delta;
+	int err;
+
+	/* Exceed MTU by 1 via delta adjust */
+	delta = GLOBAL_USER_MTU - (data_len - ETH_HLEN) + 1;
+
+	err = bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0);
+	if (err) {
+		retval = XDP_PASS; /* Success in exceeding MTU check */
+		if (err != BPF_MTU_CHK_RET_FRAG_NEEDED)
+			retval = XDP_DROP;
+	}
+
+	global_bpf_mtu_xdp = mtu_len;
+	return retval;
+}
+
+SEC("xdp")
+int xdp_minus_delta(struct xdp_md *ctx)
+{
+	int retval = XDP_PASS; /* Expected retval on successful test */
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	__u32 ifindex = GLOBAL_USER_IFINDEX;
+	__u32 data_len = data_end - data;
+	__u32 mtu_len = 0;
+	int delta;
+
+	/* Borderline test case: Minus delta exceeding packet length allowed */
+	delta = -((data_len - ETH_HLEN) + 1);
+
+	/* Negative length (adjusted via delta) still passes MTU check; other
+	 * helpers are responsible for catching this on actual size adjust.
+	 */
+	if (bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0))
+		retval = XDP_ABORTED;
+
+	global_bpf_mtu_xdp = mtu_len;
+	return retval;
+}
+
+SEC("classifier")
+int tc_use_helper(struct __sk_buff *ctx)
+{
+	int retval = BPF_OK; /* Expected retval on successful test */
+	__u32 mtu_len = 0;
+	int delta = 0;
+
+	if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) {
+		retval = BPF_DROP;
+		goto out;
+	}
+
+	if (mtu_len != GLOBAL_USER_MTU)
+		retval = BPF_REDIRECT;
+out:
+	global_bpf_mtu_tc = mtu_len;
+	return retval;
+}
+
+SEC("classifier")
+int tc_exceed_mtu(struct __sk_buff *ctx)
+{
+	__u32 ifindex = GLOBAL_USER_IFINDEX;
+	int retval = BPF_DROP; /* Fail */
+	__u32 skb_len = ctx->len;
+	__u32 mtu_len = 0;
+	int delta;
+	int err;
+
+	/* Exceed MTU by 1 via delta adjust */
+	delta = GLOBAL_USER_MTU - (skb_len - ETH_HLEN) + 1;
+
+	err = bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0);
+	if (err) {
+		retval = BPF_OK; /* Success in exceeding MTU check */
+		if (err != BPF_MTU_CHK_RET_FRAG_NEEDED)
+			retval = BPF_DROP;
+	}
+
+	global_bpf_mtu_tc = mtu_len;
+	return retval;
+}
+
+SEC("classifier")
+int tc_exceed_mtu_da(struct __sk_buff *ctx)
+{
+	/* SKB Direct-Access variant */
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	__u32 ifindex = GLOBAL_USER_IFINDEX;
+	__u32 data_len = data_end - data;
+	int retval = BPF_DROP; /* Fail */
+	__u32 mtu_len = 0;
+	int delta;
+	int err;
+
+	/* Exceed MTU by 1 via delta adjust */
+	delta = GLOBAL_USER_MTU - (data_len - ETH_HLEN) + 1;
+
+	err = bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0);
+	if (err) {
+		retval = BPF_OK; /* Success in exceeding MTU check */
+		if (err != BPF_MTU_CHK_RET_FRAG_NEEDED)
+			retval = BPF_DROP;
+	}
+
+	global_bpf_mtu_tc = mtu_len;
+	return retval;
+}
+
+SEC("classifier")
+int tc_minus_delta(struct __sk_buff *ctx)
+{
+	int retval = BPF_OK; /* Expected retval on successful test */
+	__u32 ifindex = GLOBAL_USER_IFINDEX;
+	__u32 skb_len = ctx->len;
+	__u32 mtu_len = 0;
+	int delta;
+
+	/* Borderline test case: Minus delta exceeding packet length allowed */
+	delta = -((skb_len - ETH_HLEN) + 1);
+
+	/* Negative length (adjusted via delta) still passes MTU check; other
+	 * helpers are responsible for catching this on actual size adjust.
+	 */
+	if (bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0))
+		retval = BPF_DROP;
+
+	global_bpf_mtu_tc = mtu_len;
+	return retval;
+}
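
The delta arithmetic used by the exceed-MTU and minus-delta programs above
can be checked in isolation. The sketch below is a userspace model with
hypothetical helper names. It assumes an Ethernet device, so that
hard_header_len equals ETH_HLEN (14), and it uses only the non-GSO
comparison from patch 4/7.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ETH_HLEN 14 /* assumed L2 header length */

/* Delta that makes the L3 length exceed the MTU by exactly one byte. */
static int32_t model_exceed_delta(int32_t mtu, int32_t data_len)
{
	return mtu - (data_len - MODEL_ETH_HLEN) + 1;
}

/* Negative delta that removes one byte more than the L3 length. */
static int32_t model_minus_delta(int32_t data_len)
{
	return -((data_len - MODEL_ETH_HLEN) + 1);
}

/* True when the planned L2 length fits within MTU plus L2 header. */
static bool model_fits(int32_t mtu, int32_t data_len, int32_t delta)
{
	return data_len + delta <= mtu + MODEL_ETH_HLEN;
}
```

With a 100-byte frame and an MTU of 1500, the exceed delta is
1500 - 86 + 1 = 1415, giving a planned length of 1515 against a limit of
1514, so the check fails by one byte. The minus delta is -87, giving a
planned length of 13, which still fits.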




* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-12 17:45 ` [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2021-01-12 19:23   ` Andrii Nakryiko
  2021-01-14 14:52     ` Jesper Dangaard Brouer
  2021-01-13 23:07   ` Daniel Borkmann
  1 sibling, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2021-01-12 19:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Maciej Żenczykowski, Lorenz Bauer, shaun, Lorenzo Bianconi,
	Marek Majkowski, John Fastabend, Jakub Kicinski, eyal.birger,
	colrack

On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
>
> The SKB object is complex and the skb->len value (accessible from
> BPF-prog) also include the length of any extra GRO/GSO segments, but
> without taking into account that these GRO/GSO segments get added
> transport (L4) and network (L3) headers before being transmitted. Thus,
> this BPF-helper is created such that the BPF-programmer don't need to
> handle these details in the BPF-prog.
>
> The API is designed to help the BPF-programmer, that want to do packet
> context size changes, which involves other helpers. These other helpers
> usually does a delta size adjustment. This helper also support a delta
> size (len_diff), which allow BPF-programmer to reuse arguments needed by
> these other helpers, and perform the MTU check prior to doing any actual
> size adjustment of the packet context.
>
> It is on purpose, that we allow the len adjustment to become a negative
> result, that will pass the MTU check. This might seem weird, but it's not
> this helpers responsibility to "catch" wrong len_diff adjustments. Other
> helpers will take care of these checks, if BPF-programmer chooses to do
> actual size adjustment.
>
> V9:
> - Use dev->hard_header_len (instead of ETH_HLEN)
> - Annotate with unlikely req from Daniel
> - Fix logic error using skb_gso_validate_network_len from Daniel
>
> V6:
> - Took John's advice and dropped BPF_MTU_CHK_RELAX
> - Returned MTU is kept at L3-level (like fib_lookup)
>
> V4: Lot of changes
>  - ifindex 0 now use current netdev for MTU lookup
>  - rename helper from bpf_mtu_check to bpf_check_mtu
>  - fix bug for GSO pkt length (as skb->len is total len)
>  - remove __bpf_len_adj_positive, simply allow negative len adj
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
>  net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
>  3 files changed, 256 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 649586d656b6..fa2e99351758 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3833,6 +3833,61 @@ union bpf_attr {
>   *     Return
>   *             A pointer to a struct socket on success or NULL if the file is
>   *             not a socket.
> + *
> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)

should return long, same as most other helpers

> + *     Description
> + *             Check ctx packet size against MTU of net device (based on
> + *             *ifindex*).  This helper will likely be used in combination with
> + *             helpers that adjust/change the packet size.  The argument
> + *             *len_diff* can be used for querying with a planned size
> + *             change. This allows to check MTU prior to changing packet ctx.
> + *

[...]


* Re: [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper
  2021-01-12 17:45 ` [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper Jesper Dangaard Brouer
@ 2021-01-12 19:29   ` Andrii Nakryiko
  0 siblings, 0 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2021-01-12 19:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Maciej Żenczykowski, Lorenz Bauer, shaun, Lorenzo Bianconi,
	Marek Majkowski, John Fastabend, Jakub Kicinski, eyal.birger,
	colrack

On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> Adding selftest for BPF-helper bpf_check_mtu(). Making sure
> it can be used from both XDP and TC.
>
> V11:
>  - Addresse nitpicks from Andrii Nakryiko
>
> V10:
>  - Remove errno non-zero test in CHECK_ATTR()
>  - Addresse comments from Andrii Nakryiko
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Looks good from the generic BPF selftest perspective.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>  tools/testing/selftests/bpf/prog_tests/check_mtu.c |  216 ++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_check_mtu.c |  198 ++++++++++++++++++
>  2 files changed, 414 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c
>

[...]


* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-12 17:45 ` [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
  2021-01-12 19:23   ` Andrii Nakryiko
@ 2021-01-13 23:07   ` Daniel Borkmann
  2021-01-14 14:36     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2021-01-13 23:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, colrack

On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote:
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
[...]
> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
> + *	Description
> + *		Check ctx packet size against MTU of net device (based on
> + *		*ifindex*).  This helper will likely be used in combination with
> + *		helpers that adjust/change the packet size.  The argument
> + *		*len_diff* can be used for querying with a planned size
> + *		change. This allows to check MTU prior to changing packet ctx.
> + *
> + *		Specifying *ifindex* zero means the MTU check is performed
> + *		against the current net device.  This is practical if this isn't
> + *		used prior to redirect.
> + *
> + *		The Linux kernel route table can configure MTUs on a more
> + *		specific per route level, which is not provided by this helper.
> + *		For route level MTU checks use the **bpf_fib_lookup**\ ()
> + *		helper.
> + *
> + *		*ctx* is either **struct xdp_md** for XDP programs or
> + *		**struct sk_buff** for tc cls_act programs.
> + *
> + *		The *flags* argument can be a combination of one or more of the
> + *		following values:
> + *
> + *		**BPF_MTU_CHK_SEGS**
> + *			This flag will only works for *ctx* **struct sk_buff**.
> + *			If packet context contains extra packet segment buffers
> + *			(often knows as GSO skb), then MTU check is harder to
> + *			check at this point, because in transmit path it is
> + *			possible for the skb packet to get re-segmented
> + *			(depending on net device features).  This could still be
> + *			a MTU violation, so this flag enables performing MTU
> + *			check against segments, with a different violation
> + *			return code to tell it apart. Check cannot use len_diff.
> + *
> + *		On return *mtu_len* pointer contains the MTU value of the net
> + *		device.  Remember the net device configured MTU is the L3 size,
> + *		which is returned here and XDP and TX length operate at L2.
> + *		Helper take this into account for you, but remember when using
> + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
> + *		pointer and be initialized (to zero), else verifier will reject
> + *		BPF program.
> + *
> + *	Return
> + *		* 0 on success, and populate MTU value in *mtu_len* pointer.
> + *
> + *		* < 0 if any input argument is invalid (*mtu_len* not updated)
> + *
> + *		MTU violations return positive values, but also populate MTU
> + *		value in *mtu_len* pointer, as this can be needed for
> + *		implementing PMTU handing:
> + *
> + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> + *
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -3998,6 +4053,7 @@ union bpf_attr {
>   	FN(ktime_get_coarse_ns),	\
>   	FN(ima_inode_hash),		\
>   	FN(sock_from_file),		\
> +	FN(check_mtu),			\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
>   	};
>   };
>   
> +/* bpf_check_mtu flags*/
> +enum  bpf_check_mtu_flags {
> +	BPF_MTU_CHK_SEGS  = (1U << 0),
> +};
> +
> +enum bpf_check_mtu_ret {
> +	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> +	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> +	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
> +};
> +
>   enum bpf_task_fd_type {
>   	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
>   	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index db59ab55572c..3f2e593244ca 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
>   	.arg4_type	= ARG_ANYTHING,
>   };
>   
> +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
> +					    u32 ifindex)
> +{
> +	struct net *netns = dev_net(dev_curr);
> +
> +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> +	if (ifindex == 0)
> +		return dev_curr;
> +
> +	return dev_get_by_index_rcu(netns, ifindex);
> +}
> +
> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> +{
> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> +	struct net_device *dev = skb->dev;
> +	int skb_len, dev_len;
> +	int mtu;
> +
> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> +		return -EINVAL;
> +
> +	dev = __dev_via_ifindex(dev, ifindex);
> +	if (unlikely(!dev))
> +		return -ENODEV;
> +
> +	mtu = READ_ONCE(dev->mtu);
> +
> +	dev_len = mtu + dev->hard_header_len;
> +	skb_len = skb->len + len_diff; /* minus result pass check */
> +	if (skb_len <= dev_len) {
> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> +		goto out;
> +	}
> +	/* At this point, skb->len exceed MTU, but as it include length of all
> +	 * segments, it can still be below MTU.  The SKB can possibly get
> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> +	 * checked against MTU.
> +	 */
> +	if (skb_is_gso(skb)) {
> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> +
> +		if (!(flags & BPF_MTU_CHK_SEGS))
> +			goto out;
> +
> +		if (!skb_gso_validate_network_len(skb, mtu)) {
> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> +			goto out;
> +		}
> +
> +		skb_len = skb_headlen(skb) + len_diff;
> +		if (skb_len > dev_len) {

This is still not universally correct given drivers could cook up non-linear
skbs (e.g. page frags) on rx. So the result from BPF_MTU_CHK_SEGS flag cannot
be relied on. Do you have a particular use case for the BPF_MTU_CHK_SEGS? I
also don't see the flag being used anywhere in your selftests, so I presume
not as otherwise you would have added an example there? I would just drop the
flag altogether for the tc helper..
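
To make the concern concrete, here is a minimal userspace sketch (not kernel code). The struct toy_skb and the toy_* functions are hypothetical names invented for illustration; the only part taken from the kernel is the relationship that skb_headlen() is skb->len minus skb->data_len.

```c
#include <stdbool.h>

/* Toy stand-in for the two sk_buff length fields that matter here.
 * Hypothetical type; the real struct sk_buff has many more members.
 */
struct toy_skb {
	unsigned int len;      /* total bytes: linear area plus page frags */
	unsigned int data_len; /* bytes that live in page frags only */
};

/* Same arithmetic as the kernel's skb_headlen(): bytes in the linear area. */
static unsigned int toy_headlen(const struct toy_skb *skb)
{
	return skb->len - skb->data_len;
}

/* The head-only check under discussion: true means "exceeds dev_len". */
static bool toy_head_exceeds(const struct toy_skb *skb, int len_diff,
			     int dev_len)
{
	return (int)toy_headlen(skb) + len_diff > dev_len;
}
```

With dev_len = 1514, a fully linear 3000-byte packet trips this check, while the same 3000 bytes with 2872 of them placed in page frags (headlen 128) does not. The outcome therefore depends on how the driver laid out the buffer rather than on the packet size.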

> +			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> +			goto out;
> +		}
> +	}
> +out:
> +	/* BPF verifier guarantees valid pointer */
> +	*mtu_len = mtu;
> +
> +	return ret;
> +}


* RE: [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len
  2021-01-12 17:45 ` [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2021-01-14  8:18   ` John Fastabend
  0 siblings, 0 replies; 19+ messages in thread
From: John Fastabend @ 2021-01-14  8:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

Jesper Dangaard Brouer wrote:
> Multiple BPF-helpers that can manipulate/increase the size of the SKB uses
> __bpf_skb_max_len() as the max-length. This function limit size against
> the current net_device MTU (skb->dev->mtu).
> 
> When a BPF-prog grow the packet size, then it should not be limited to the
> MTU. The MTU is a transmit limitation, and software receiving this packet
> should be allowed to increase the size. Further more, current MTU check in
> __bpf_skb_max_len uses the MTU from ingress/current net_device, which in
> case of redirects uses the wrong net_device.
> 
> This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
> is elsewhere in the system. Jesper's testing[1] showed it was not possible
> to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting
> factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for
> SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is
> in-effect due to this being called from softirq context see code
> __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
> that frames above 16KiB can cause NICs to reset (but not crash). Keep this
> sanity limit at this level as memory layer can differ based on kernel
> config.
> 
> [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
> 
> V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>
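
The 8 KiB figure quoted above can be reproduced with a small arithmetic sketch. It assumes 4 KiB pages and the SLUB definition KMALLOC_SHIFT_HIGH = PAGE_SHIFT + 1 used by kernels of this era. All TOY_* names are illustrative, and the sketch ignores skb overhead such as struct skb_shared_info.

```c
/* Assumed values for a system with 4 KiB pages; TOY_* names are illustrative. */
#define TOY_PAGE_SHIFT			12
#define TOY_PAGE_SIZE			(1UL << TOY_PAGE_SHIFT)

/* SLUB serves kmalloc requests from slab caches up to an order-1 page. */
#define TOY_KMALLOC_SHIFT_HIGH		(TOY_PAGE_SHIFT + 1)
#define TOY_KMALLOC_MAX_CACHE_SIZE	(1UL << TOY_KMALLOC_SHIFT_HIGH)

/* Sanity limit kept by the patch, as stated in the commit message. */
#define TOY_SKB_MAX_ALLOC		(16 * 1024UL)

/* Returns non-zero if a request of this size fits a kmalloc slab cache. */
static int toy_fits_slab_cache(unsigned long size)
{
	return size <= TOY_KMALLOC_MAX_CACHE_SIZE;
}
```

Under these assumptions the cache limit is 2 * PAGE_SIZE = 8192 bytes, so a request at the 16 KiB sanity limit falls outside it, which is consistent with the testing result described in the commit message.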


* RE: [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2021-01-12 17:45 ` [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2021-01-14  9:03   ` John Fastabend
  2021-01-14 16:14     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2021-01-14  9:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger, colrack

Jesper Dangaard Brouer wrote:
> The use-case for dropping the MTU check when TC-BPF does redirect to
> ingress, is described by Eyal Birger in email[0]. The summary is the
> ability to increase packet size (e.g. with IPv6 headers for NAT64) and
> ingress redirect packet and let normal netstack fragment packet as needed.
> 
> [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
> 
> V9:
>  - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect
> 
> V4:
>  - Keep net_device "up" (IFF_UP) check.
>  - Adjustment to handle bpf_redirect_peer() helper
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
>  net/core/dev.c            |   19 ++-----------------
>  net/core/filter.c         |   14 +++++++++++---
>  3 files changed, 42 insertions(+), 22 deletions(-)
> 

[...]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3f2e593244ca..1908800b671c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
>  
>  static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
>  {
> -	return dev_forward_skb(dev, skb);

> +	int ret = ____dev_forward_skb(dev, skb, false);
> +
> +	if (likely(!ret)) {
> +		skb->protocol = eth_type_trans(skb, dev);
> +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> +		ret = netif_rx(skb);
> +	}
> +
> +	return ret;

How about putting the above block into a dev.c routine and calling it

 dev_forward_skb_nomtu(...)

or something like that. Then we keep this code next to its pair
with mtu check, dev_forward_skb().

dev_forward_skb() also uses netif_rx_internal(), so maybe we should
just do the same here?

Thanks,
John


* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-13 23:07   ` Daniel Borkmann
@ 2021-01-14 14:36     ` Jesper Dangaard Brouer
  2021-01-14 22:28       ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-14 14:36 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, colrack, brouer

On Thu, 14 Jan 2021 00:07:14 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote:
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.  
> [...]
> > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
> > + *	Description
> > + *		Check ctx packet size against MTU of net device (based on
> > + *		*ifindex*).  This helper will likely be used in combination with
> > + *		helpers that adjust/change the packet size.  The argument
> > + *		*len_diff* can be used for querying with a planned size
> > + *		change. This allows to check MTU prior to changing packet ctx.
> > + *
> > + *		Specifying *ifindex* zero means the MTU check is performed
> > + *		against the current net device.  This is practical if this isn't
> > + *		used prior to redirect.
> > + *
> > + *		The Linux kernel route table can configure MTUs on a more
> > + *		specific per route level, which is not provided by this helper.
> > + *		For route level MTU checks use the **bpf_fib_lookup**\ ()
> > + *		helper.
> > + *
> > + *		*ctx* is either **struct xdp_md** for XDP programs or
> > + *		**struct sk_buff** for tc cls_act programs.
> > + *
> > + *		The *flags* argument can be a combination of one or more of the
> > + *		following values:
> > + *
> > + *		**BPF_MTU_CHK_SEGS**
> > + *			This flag will only works for *ctx* **struct sk_buff**.
> > + *			If packet context contains extra packet segment buffers
> > + *			(often knows as GSO skb), then MTU check is harder to
> > + *			check at this point, because in transmit path it is
> > + *			possible for the skb packet to get re-segmented
> > + *			(depending on net device features).  This could still be
> > + *			a MTU violation, so this flag enables performing MTU
> > + *			check against segments, with a different violation
> > + *			return code to tell it apart. Check cannot use len_diff.
> > + *
> > + *		On return *mtu_len* pointer contains the MTU value of the net
> > + *		device.  Remember the net device configured MTU is the L3 size,
> > + *		which is returned here and XDP and TX length operate at L2.
> > + *		Helper take this into account for you, but remember when using
> > + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
> > + *		pointer and be initialized (to zero), else verifier will reject
> > + *		BPF program.
> > + *
> > + *	Return
> > + *		* 0 on success, and populate MTU value in *mtu_len* pointer.
> > + *
> > + *		* < 0 if any input argument is invalid (*mtu_len* not updated)
> > + *
> > + *		MTU violations return positive values, but also populate MTU
> > + *		value in *mtu_len* pointer, as this can be needed for
> > + *		implementing PMTU handing:
> > + *
> > + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> > + *
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)		\
> >   	FN(unspec),			\
> > @@ -3998,6 +4053,7 @@ union bpf_attr {
> >   	FN(ktime_get_coarse_ns),	\
> >   	FN(ima_inode_hash),		\
> >   	FN(sock_from_file),		\
> > +	FN(check_mtu),			\
> >   	/* */
> >   
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
> >   	};
> >   };
> >   
> > +/* bpf_check_mtu flags*/
> > +enum  bpf_check_mtu_flags {
> > +	BPF_MTU_CHK_SEGS  = (1U << 0),
> > +};
> > +
> > +enum bpf_check_mtu_ret {
> > +	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> > +	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> > +	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
> > +};
> > +
> >   enum bpf_task_fd_type {
> >   	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
> >   	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index db59ab55572c..3f2e593244ca 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
> >   	.arg4_type	= ARG_ANYTHING,
> >   };
> >   
> > +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
> > +					    u32 ifindex)
> > +{
> > +	struct net *netns = dev_net(dev_curr);
> > +
> > +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> > +	if (ifindex == 0)
> > +		return dev_curr;
> > +
> > +	return dev_get_by_index_rcu(netns, ifindex);
> > +}
> > +
> > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> > +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> > +{
> > +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +	struct net_device *dev = skb->dev;
> > +	int skb_len, dev_len;
> > +	int mtu;
> > +
> > +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> > +		return -EINVAL;
> > +
> > +	dev = __dev_via_ifindex(dev, ifindex);
> > +	if (unlikely(!dev))
> > +		return -ENODEV;
> > +
> > +	mtu = READ_ONCE(dev->mtu);
> > +
> > +	dev_len = mtu + dev->hard_header_len;
> > +	skb_len = skb->len + len_diff; /* minus result pass check */
> > +	if (skb_len <= dev_len) {
> > +		ret = BPF_MTU_CHK_RET_SUCCESS;
> > +		goto out;
> > +	}
> > +	/* At this point, skb->len exceed MTU, but as it include length of all
> > +	 * segments, it can still be below MTU.  The SKB can possibly get
> > +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> > +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> > +	 * checked against MTU.
> > +	 */
> > +	if (skb_is_gso(skb)) {
> > +		ret = BPF_MTU_CHK_RET_SUCCESS;
> > +
> > +		if (!(flags & BPF_MTU_CHK_SEGS))
> > +			goto out;
> > +
> > +		if (!skb_gso_validate_network_len(skb, mtu)) {
> > +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> > +			goto out;
> > +		}
> > +
> > +		skb_len = skb_headlen(skb) + len_diff;
> > +		if (skb_len > dev_len) {

Maybe I'm misunderstanding you below?  Do you just want the above two
lines removed from the patch? (Sure, I can do that, as it is just an
extra check of the "head"/first segment of the packet, and is only done if
BPF_MTU_CHK_SEGS is set.)

> 
> This is still not universally correct given drivers could cook up non-linear
> skbs (e.g. page frags) on rx. So the result from BPF_MTU_CHK_SEGS flag cannot
> be relied on. 

That is why it is a flag that needs to be explicitly set.

> Do you have a particular use case for the BPF_MTU_CHK_SEGS?

The complaint from Maze (and others) was that when skb_is_gso then all
the MTU checks are bypassed.  This flag enables checking the GSO part
via skb_gso_validate_network_len().  We cannot enable it by default
because, as you say, it is not universally correct in all cases.

> I also don't see the flag being used anywhere in your selftests, so I presume
> not as otherwise you would have added an example there?

I'm using the flag in the bpf-examples code[1], this is how I've tested
the code path.

I've not found a way to generate GSO packet via the selftests
infrastructure via bpf_prog_test_run_xattr().  I'm 

[1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c


> I would just drop the flag altogether for the tc helper..

As explained, I cannot drop the flag altogether; I would also have to
remove the code then.  Sorry, but I don't 100% understand the change
you are requesting.


> > +			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +			goto out;
> > +		}
> > +	}
> > +out:
> > +	/* BPF verifier guarantees valid pointer */
> > +	*mtu_len = mtu;
> > +
> > +	return ret;
> > +}  
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-12 19:23   ` Andrii Nakryiko
@ 2021-01-14 14:52     ` Jesper Dangaard Brouer
  2021-01-14 15:33       ` Yonghong Song
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-14 14:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Maciej Żenczykowski, Lorenz Bauer, shaun, Lorenzo Bianconi,
	Marek Majkowski, John Fastabend, Jakub Kicinski, eyal.birger,
	colrack, brouer

On Tue, 12 Jan 2021 11:23:33 -0800
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
> >
> > The SKB object is complex and the skb->len value (accessible from
> > BPF-prog) also include the length of any extra GRO/GSO segments, but
> > without taking into account that these GRO/GSO segments get added
> > transport (L4) and network (L3) headers before being transmitted. Thus,
> > this BPF-helper is created such that the BPF-programmer don't need to
> > handle these details in the BPF-prog.
> >
> > The API is designed to help the BPF-programmer, that want to do packet
> > context size changes, which involves other helpers. These other helpers
> > usually does a delta size adjustment. This helper also support a delta
> > size (len_diff), which allow BPF-programmer to reuse arguments needed by
> > these other helpers, and perform the MTU check prior to doing any actual
> > size adjustment of the packet context.
> >
> > It is on purpose, that we allow the len adjustment to become a negative
> > result, that will pass the MTU check. This might seem weird, but it's not
> > this helpers responsibility to "catch" wrong len_diff adjustments. Other
> > helpers will take care of these checks, if BPF-programmer chooses to do
> > actual size adjustment.
> >
> > V9:
> > - Use dev->hard_header_len (instead of ETH_HLEN)
> > - Annotate with unlikely req from Daniel
> > - Fix logic error using skb_gso_validate_network_len from Daniel
> >
> > V6:
> > - Took John's advice and dropped BPF_MTU_CHK_RELAX
> > - Returned MTU is kept at L3-level (like fib_lookup)
> >
> > V4: Lot of changes
> >  - ifindex 0 now use current netdev for MTU lookup
> >  - rename helper from bpf_mtu_check to bpf_check_mtu
> >  - fix bug for GSO pkt length (as skb->len is total len)
> >  - remove __bpf_len_adj_positive, simply allow negative len adj
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
> >  net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
> >  3 files changed, 256 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 649586d656b6..fa2e99351758 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3833,6 +3833,61 @@ union bpf_attr {
> >   *     Return
> >   *             A pointer to a struct socket on success or NULL if the file is
> >   *             not a socket.
> > + *
> > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)  
> 
> should return long, same as most other helpers

Is it enough to change it here?
(as this will be used for generating the helpers header file,
via ./scripts/bpf_helpers_doc.py --header)

Or do I also need to change bpf_func_proto.ret_type ?

> > + *     Description
> > + *             Check ctx packet size against MTU of net device (based on
> > + *             *ifindex*).  This helper will likely be used in combination with
> > + *             helpers that adjust/change the packet size.  The argument
> > + *             *len_diff* can be used for querying with a planned size
> > + *             change. This allows to check MTU prior to changing packet ctx.
> > + *  
> 
> [...]
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-14 14:52     ` Jesper Dangaard Brouer
@ 2021-01-14 15:33       ` Yonghong Song
  0 siblings, 0 replies; 19+ messages in thread
From: Yonghong Song @ 2021-01-14 15:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Andrii Nakryiko
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Maciej Żenczykowski, Lorenz Bauer, shaun, Lorenzo Bianconi,
	Marek Majkowski, John Fastabend, Jakub Kicinski, eyal.birger,
	colrack



On 1/14/21 6:52 AM, Jesper Dangaard Brouer wrote:
> On Tue, 12 Jan 2021 11:23:33 -0800
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
>> On Tue, Jan 12, 2021 at 9:49 AM Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>>
>>> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
>>>
>>> The SKB object is complex and the skb->len value (accessible from
>>> BPF-prog) also include the length of any extra GRO/GSO segments, but
>>> without taking into account that these GRO/GSO segments get added
>>> transport (L4) and network (L3) headers before being transmitted. Thus,
>>> this BPF-helper is created such that the BPF-programmer don't need to
>>> handle these details in the BPF-prog.
>>>
>>> The API is designed to help the BPF-programmer, that want to do packet
>>> context size changes, which involves other helpers. These other helpers
>>> usually does a delta size adjustment. This helper also support a delta
>>> size (len_diff), which allow BPF-programmer to reuse arguments needed by
>>> these other helpers, and perform the MTU check prior to doing any actual
>>> size adjustment of the packet context.
>>>
>>> It is on purpose, that we allow the len adjustment to become a negative
>>> result, that will pass the MTU check. This might seem weird, but it's not
>>> this helpers responsibility to "catch" wrong len_diff adjustments. Other
>>> helpers will take care of these checks, if BPF-programmer chooses to do
>>> actual size adjustment.
>>>
>>> V9:
>>> - Use dev->hard_header_len (instead of ETH_HLEN)
>>> - Annotate with unlikely req from Daniel
>>> - Fix logic error using skb_gso_validate_network_len from Daniel
>>>
>>> V6:
>>> - Took John's advice and dropped BPF_MTU_CHK_RELAX
>>> - Returned MTU is kept at L3-level (like fib_lookup)
>>>
>>> V4: Lot of changes
>>>   - ifindex 0 now use current netdev for MTU lookup
>>>   - rename helper from bpf_mtu_check to bpf_check_mtu
>>>   - fix bug for GSO pkt length (as skb->len is total len)
>>>   - remove __bpf_len_adj_positive, simply allow negative len adj
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>   include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++
>>>   net/core/filter.c              |  122 ++++++++++++++++++++++++++++++++++++++++
>>>   tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++
>>>   3 files changed, 256 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 649586d656b6..fa2e99351758 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -3833,6 +3833,61 @@ union bpf_attr {
>>>    *     Return
>>>    *             A pointer to a struct socket on success or NULL if the file is
>>>    *             not a socket.
>>> + *
>>> + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
>>
>> should return long, same as most other helpers
> 
> Is it enough to change it here?
> (as this will be used for generating the helpers header file,
> via ./scripts/bpf_helpers_doc.py --header)

Just changing it here is enough.

> 
> Or do I also need to change bpf_func_proto.ret_type ?

There is no need to change bpf_func_proto.ret_type.
RET_INTEGER already implies a 64-bit scalar.

> 
>>> + *     Description
>>> + *             Check ctx packet size against MTU of net device (based on
>>> + *             *ifindex*).  This helper will likely be used in combination with
>>> + *             helpers that adjust/change the packet size.  The argument
>>> + *             *len_diff* can be used for querying with a planned size
>>> + *             change. This allows to check MTU prior to changing packet ctx.
>>> + *
>>
>> [...]
>>
> 
> 
> 


* Re: [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2021-01-14  9:03   ` John Fastabend
@ 2021-01-14 16:14     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-14 16:14 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	colrack, brouer

On Thu, 14 Jan 2021 01:03:33 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> Jesper Dangaard Brouer wrote:
> > The use-case for dropping the MTU check when TC-BPF does redirect to
> > ingress, is described by Eyal Birger in email[0]. The summary is the
> > ability to increase packet size (e.g. with IPv6 headers for NAT64) and
> > ingress redirect packet and let normal netstack fragment packet as needed.
> > 
> > [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
> > 
> > V9:
> >  - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect
> > 
> > V4:
> >  - Keep net_device "up" (IFF_UP) check.
> >  - Adjustment to handle bpf_redirect_peer() helper
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
> >  net/core/dev.c            |   19 ++-----------------
> >  net/core/filter.c         |   14 +++++++++++---
> >  3 files changed, 42 insertions(+), 22 deletions(-)
> >   
> 
> [...]
> 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 3f2e593244ca..1908800b671c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
> >  
> >  static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
> >  {
> > -	return dev_forward_skb(dev, skb);  
> 
> > +	int ret = ____dev_forward_skb(dev, skb, false);
> > +
> > +	if (likely(!ret)) {
> > +		skb->protocol = eth_type_trans(skb, dev);
> > +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > +		ret = netif_rx(skb);
> > +	}
> > +
> > +	return ret;  
> 
> How about putting above block into a dev.c routine call it
> 
>  dev_forward_skb_nomtu(...)
> 
> or something like that. Then we keep this code next to its pair
> with mtu check, dev_forward_skb().
> 
> dev_forward_skb() also uses netif_rx_internal() looks like maybe we should
> just do the same here?

I love the idea.  I'm coding it up and it looks much nicer.  And yes, we
obviously can use netif_rx_internal() once the code is in core/dev.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-14 14:36     ` Jesper Dangaard Brouer
@ 2021-01-14 22:28       ` Daniel Borkmann
  2021-01-18 11:04         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2021-01-14 22:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, colrack

On 1/14/21 3:36 PM, Jesper Dangaard Brouer wrote:
[...]
>>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
>>> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
>>> +{
>>> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
>>> +	struct net_device *dev = skb->dev;
>>> +	int skb_len, dev_len;
>>> +	int mtu;
>>> +
>>> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
>>> +		return -EINVAL;
>>> +
>>> +	dev = __dev_via_ifindex(dev, ifindex);
>>> +	if (unlikely(!dev))
>>> +		return -ENODEV;
>>> +
>>> +	mtu = READ_ONCE(dev->mtu);
>>> +
>>> +	dev_len = mtu + dev->hard_header_len;
>>> +	skb_len = skb->len + len_diff; /* minus result pass check */
>>> +	if (skb_len <= dev_len) {
>>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
>>> +		goto out;
>>> +	}
>>> +	/* At this point, skb->len exceed MTU, but as it include length of all
>>> +	 * segments, it can still be below MTU.  The SKB can possibly get
>>> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
>>> +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
>>> +	 * checked against MTU.
>>> +	 */
>>> +	if (skb_is_gso(skb)) {
>>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
>>> +
>>> +		if (!(flags & BPF_MTU_CHK_SEGS))
>>> +			goto out;
>>> +
>>> +		if (!skb_gso_validate_network_len(skb, mtu)) {
>>> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
>>> +			goto out;
>>> +		}
>>> +
>>> +		skb_len = skb_headlen(skb) + len_diff;
>>> +		if (skb_len > dev_len) {
[...]
>> Do you have a particular use case for the BPF_MTU_CHK_SEGS?
> 
> The complaint from Maze (and others) were that when skb_is_gso then all
> the MTU checks are bypassed.  This flag enables checking the GSO part
> via skb_gso_validate_network_len().  We cannot enable it per default,
> as you say, it is not universally correct in all cases.

If there is a desire to have access to the skb_gso_validate_network_len(), I'd
keep that behind the flag then, but would drop the skb_headlen(skb) + len_diff
case given the mentioned case on rx where it would yield misleading results to
users that might be unintuitive & hard to debug.

>> I also don't see the flag being used anywhere in your selftests, so I presume
>> not as otherwise you would have added an example there?
> 
> I'm using the flag in the bpf-examples code[1], this is how I've tested
> the code path.
> 
> I've not found a way to generate GSO packet via the selftests
> infrastructure via bpf_prog_test_run_xattr().  I'm
> 
> [1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c

I haven't checked, but something like prog_tests/skb_ctx.c is likely not sufficient
to pass it into the helper. For a real case you might need a netns + veth setup like
some of the other tests are doing, and then generate a TCP stream from one end to the
other.


* Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
  2021-01-14 22:28       ` Daniel Borkmann
@ 2021-01-18 11:04         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-18 11:04 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, colrack, brouer

On Thu, 14 Jan 2021 23:28:57 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 1/14/21 3:36 PM, Jesper Dangaard Brouer wrote:
> [...]
> >>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> >>> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> >>> +{
> >>> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> >>> +	struct net_device *dev = skb->dev;
> >>> +	int skb_len, dev_len;
> >>> +	int mtu;
> >>> +
> >>> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> >>> +		return -EINVAL;
> >>> +
> >>> +	dev = __dev_via_ifindex(dev, ifindex);
> >>> +	if (unlikely(!dev))
> >>> +		return -ENODEV;
> >>> +
> >>> +	mtu = READ_ONCE(dev->mtu);
> >>> +
> >>> +	dev_len = mtu + dev->hard_header_len;
> >>> +	skb_len = skb->len + len_diff; /* minus result pass check */
> >>> +	if (skb_len <= dev_len) {
> >>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> >>> +		goto out;
> >>> +	}
> >>> +	/* At this point, skb->len exceed MTU, but as it include length of all
> >>> +	 * segments, it can still be below MTU.  The SKB can possibly get
> >>> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> >>> +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> >>> +	 * checked against MTU.
> >>> +	 */
> >>> +	if (skb_is_gso(skb)) {
> >>> +		ret = BPF_MTU_CHK_RET_SUCCESS;
> >>> +
> >>> +		if (!(flags & BPF_MTU_CHK_SEGS))
> >>> +			goto out;
> >>> +
> >>> +		if (!skb_gso_validate_network_len(skb, mtu)) {
> >>> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> >>> +			goto out;
> >>> +		}
> >>> +
> >>> +		skb_len = skb_headlen(skb) + len_diff;
> >>> +		if (skb_len > dev_len) {  
> [...]
> >> Do you have a particular use case for the BPF_MTU_CHK_SEGS?  
> > 
> > The complaint from Maze (and others) were that when skb_is_gso then all
> > the MTU checks are bypassed.  This flag enables checking the GSO part
> > via skb_gso_validate_network_len().  We cannot enable it per default,
> > as you say, it is not universally correct in all cases.
> 
> If there is a desire to have access to the skb_gso_validate_network_len(), I'd
> keep that behind the flag then, but would drop the skb_headlen(skb) + len_diff
> case given the mentioned case on rx where it would yield misleading results to
> users that might be unintuitive & hard to debug.

Okay, I will update the patch, and drop those lines.
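
As a rough illustration of what the check reduces to once the head-length lines are dropped, here is a userspace model of the remaining control flow. It is a sketch under stated assumptions, not the updated patch: the toy_* names and struct toy_pkt are hypothetical, the flag validation and ifindex lookup are left out, and seg_l3_len stands in for what skb_gso_validate_network_len() evaluates per segment.

```c
#include <stdbool.h>

enum toy_mtu_ret {		/* same order as enum bpf_check_mtu_ret */
	TOY_RET_SUCCESS,
	TOY_RET_FRAG_NEEDED,
	TOY_RET_SEGS_TOOBIG,
};

#define TOY_CHK_SEGS	(1U << 0)	/* models BPF_MTU_CHK_SEGS */

/* Hypothetical packet summary; not a real kernel structure. */
struct toy_pkt {
	int len;	/* total length, all GSO segments included */
	bool is_gso;
	int seg_l3_len;	/* L3 length of the largest segment after segmentation */
};

static int toy_check_mtu(const struct toy_pkt *pkt, int mtu,
			 int hard_header_len, int len_diff,
			 unsigned int flags)
{
	/* MTU is an L3 size; packet lengths here are L2, so add the header. */
	int dev_len = mtu + hard_header_len;

	/* A negative len_diff result passes on purpose. */
	if (pkt->len + len_diff <= dev_len)
		return TOY_RET_SUCCESS;

	if (!pkt->is_gso)
		return TOY_RET_FRAG_NEEDED;

	/* GSO packets pass unless the caller asks for a per-segment check. */
	if (!(flags & TOY_CHK_SEGS))
		return TOY_RET_SUCCESS;

	if (pkt->seg_l3_len > mtu)
		return TOY_RET_SEGS_TOOBIG;

	return TOY_RET_SUCCESS;
}
```

In this model a GSO packet can only fail with the SEGS_TOOBIG code, and only when the flag is set, so the result no longer depends on how much of the packet sits in the linear area.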

> >> I also don't see the flag being used anywhere in your selftests, so I presume
> >> not as otherwise you would have added an example there?  
> > 
> > I'm using the flag in the bpf-examples code[1], this is how I've tested
> > the code path.
> > 
> > I've not found a way to generate GSO packet via the selftests
> > infrastructure via bpf_prog_test_run_xattr().  I'm
> > 
> > [1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c  
> 
> Haven't checked but likely something as prog_tests/skb_ctx.c might not be sufficient
> to pass it into the helper. For real case you might need a netns + veth setup like
> some of the other tests are doing and then generating TCP stream from one end to the
> other.

I have looked at prog_tests/skb_ctx.c and (as you say yourself) this is
not sufficient.  I can look into creating a netns+veth setup, but I
would appreciate it if we could merge this patchset to make forward
progress, as I'm sure the netns+veth setup will require its own round of
nitpicking.

I have created netns+veth test scripts before (see test_xdp_vlan.sh),
but my experience is that people/maintainers forget to run (or simply
don't run) these separate shell scripts.  Thus, if I create a netns+veth
test, I would prefer to integrate it into "test_progs", as I know
that will be run by people/maintainers.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Thread overview: 19+ messages
2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2021-01-14  8:18   ` John Fastabend
2021-01-12 17:45 ` [PATCH bpf-next V11 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2021-01-12 19:23   ` Andrii Nakryiko
2021-01-14 14:52     ` Jesper Dangaard Brouer
2021-01-14 15:33       ` Yonghong Song
2021-01-13 23:07   ` Daniel Borkmann
2021-01-14 14:36     ` Jesper Dangaard Brouer
2021-01-14 22:28       ` Daniel Borkmann
2021-01-18 11:04         ` Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2021-01-14  9:03   ` John Fastabend
2021-01-14 16:14     ` Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper Jesper Dangaard Brouer
2021-01-12 19:29   ` Andrii Nakryiko
