bpf.vger.kernel.org archive mirror
* [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
@ 2020-10-08 14:08 Jesper Dangaard Brouer
  2020-10-08 14:09 ` [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:08 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This patchset drops all the MTU checks in TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU at the
netdev level (via ifindex) and at the FIB route lookup level. To this end,
some BPF-helpers are added and others extended, making it possible to do
MTU checks in the BPF-code.

If a BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped at some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks, specifically when doing TC-redirect from the ingress step
(sch_handle_ingress) into the egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kinds of
MTU violations.

One advantage of this approach is that an ingress BPF-prog can send
information to an egress BPF-prog via packet data. With the MTU checks
removed in the helpers, and also not done in the skb_do_redirect() call,
the packet may temporarily exceed the MTU, as long as the egress BPF-prog
removes the extra data prior to transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also work for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check
V3: Drop enforcement of MTU in net-core, leave it to drivers

---

Jesper Dangaard Brouer (6):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: make it possible to identify BPF redirected SKBs
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      net: inline and splitup is_skb_forwardable


 include/linux/netdevice.h      |   32 +++++++-
 include/uapi/linux/bpf.h       |   74 +++++++++++++++++-
 net/core/dev.c                 |   25 +-----
 net/core/filter.c              |  166 ++++++++++++++++++++++++++++++++++++----
 net/sched/Kconfig              |    1 
 tools/include/uapi/linux/bpf.h |   74 +++++++++++++++++-
 6 files changed, 326 insertions(+), 46 deletions(-)




* [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
@ 2020-10-08 14:09 ` Jesper Dangaard Brouer
  2020-10-09 16:12   ` Daniel Borkmann
  2020-10-08 14:09 ` [PATCH bpf-next V3 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:09 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Multiple BPF-helpers that can manipulate/increase the size of the SKB use
__bpf_skb_max_len() as the max-length. This function limits the size
against the current net_device MTU (skb->dev->mtu).

When a BPF-prog grows the packet size, it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this packet
should be allowed to increase the size. Furthermore, the current MTU check
in __bpf_skb_max_len uses the MTU from the ingress/current net_device,
which in case of redirects is the wrong net_device.

Keep a sanity max limit of IP6_MAX_MTU (under CONFIG_IPV6) which is 64KiB
plus 40 bytes IPv6 header size. If compiled without IPv6 use IP_MAX_MTU.
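
To illustrate the change in limit, here is a small userspace sketch (not
kernel code). The MODEL_* names and functions are hypothetical, and the
constant values are assumptions: 0xFFFF for the IPv4 limit and a 40-byte
IPv6 header on top of that for the IPv6 limit.

```c
#include <stdint.h>

/* Assumed values, for illustration only */
#define MODEL_IP_MAX_MTU   0xFFFFU
#define MODEL_IPV6_HDR_LEN 40U
#define MODEL_IP6_MAX_MTU  (MODEL_IP_MAX_MTU + MODEL_IPV6_HDR_LEN)

/* Old limit: tied to the MTU of the device the SKB currently sits on */
static uint32_t model_old_max_len(uint32_t dev_mtu, uint32_t hard_header_len)
{
	return dev_mtu + hard_header_len;
}

/* New limit: a fixed sanity cap, independent of any net_device */
static uint32_t model_new_max_len(int have_ipv6)
{
	return have_ipv6 ? MODEL_IP6_MAX_MTU : MODEL_IP_MAX_MTU;
}
```

With a 1500 byte MTU and a 14 byte Ethernet header the old limit is 1514
bytes, while the new cap stays the same regardless of which device the
packet is later redirected to.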

V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 05df73780dd3..ddc1f9ba89d1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3474,11 +3474,11 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 	return 0;
 }
 
-static u32 __bpf_skb_max_len(const struct sk_buff *skb)
-{
-	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
-			  SKB_MAX_ALLOC;
-}
+#ifdef IP6_MAX_MTU /* Depend on CONFIG_IPV6 */
+#define BPF_SKB_MAX_LEN IP6_MAX_MTU
+#else
+#define BPF_SKB_MAX_LEN IP_MAX_MTU
+#endif
 
 BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 	   u32, mode, u64, flags)
@@ -3527,7 +3527,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 {
 	u32 len_cur, len_diff_abs = abs(len_diff);
 	u32 len_min = bpf_skb_net_base_len(skb);
-	u32 len_max = __bpf_skb_max_len(skb);
+	u32 len_max = BPF_SKB_MAX_LEN;
 	__be16 proto = skb->protocol;
 	bool shrink = len_diff < 0;
 	u32 off;
@@ -3610,7 +3610,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 min_len = __bpf_skb_min_len(skb);
 	int ret;
 
@@ -3686,7 +3686,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = {
 static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 new_len = skb->len + head_room;
 	int ret;
 




* [PATCH bpf-next V3 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2020-10-08 14:09 ` [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-08 14:09 ` Jesper Dangaard Brouer
  2020-10-09  4:05   ` David Ahern
  2020-10-08 14:09 ` [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:09 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The
BPF-prog doesn't know the MTU value that caused this rejection.

If the BPF-prog wants to implement PMTUD (Path MTU Discovery) (RFC 1191) it
needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain
this MTU value as output, via a union with 'tot_len', as this is the value
used for the MTU lookup.
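
A consequence of the union is that the output overwrites the input. The
following userspace sketch models this; struct model_fib_params, the
MODEL_RET_* values and model_fib_mtu_check() are hypothetical stand-ins,
not the kernel definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct bpf_fib_lookup */
struct model_fib_params {
	union {
		uint16_t tot_len; /* input: packet length from network header */
		uint16_t mtu;     /* output: MTU found by the lookup */
	};
};

/* Placeholder return codes, not the real BPF_FIB_LKUP_RET_* values */
enum { MODEL_RET_SUCCESS = 0, MODEL_RET_FRAG_NEEDED = 1 };

static int model_fib_mtu_check(struct model_fib_params *params,
			       uint32_t route_mtu)
{
	int ret = MODEL_RET_SUCCESS;

	if (params->tot_len > route_mtu)
		ret = MODEL_RET_FRAG_NEEDED;

	/* Same storage as tot_len: the input value is gone after this */
	params->mtu = (uint16_t)route_mtu;
	return ret;
}
```

A caller that still needs the original tot_len after the lookup would
have to keep its own copy before calling the helper.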

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   11 +++++++++--
 net/core/filter.c              |   17 ++++++++++++-----
 tools/include/uapi/linux/bpf.h |   11 +++++++++--
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d83561e8cd2c..4a46a1de6d16 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2216,6 +2216,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4844,9 +4847,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index ddc1f9ba89d1..da74d6ddc4d7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5186,13 +5186,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
 	params->ifindex = dev->ifindex;
+	params->mtu = mtu;
 
 	return 0;
 }
@@ -5276,8 +5277,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5310,7 +5313,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5402,8 +5405,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5422,7 +5427,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5491,6 +5496,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
+
+		params->mtu = dev->mtu; /* union with tot_len */
 	}
 
 	return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d83561e8cd2c..4a46a1de6d16 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2216,6 +2216,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4844,9 +4847,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */




* [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2020-10-08 14:09 ` [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
  2020-10-08 14:09 ` [PATCH bpf-next V3 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-08 14:09 ` Jesper Dangaard Brouer
  2020-10-09 23:29   ` Maciej Żenczykowski
  2020-10-12 15:54   ` Lorenz Bauer
  2020-10-08 14:09 ` [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:09 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This BPF-helper bpf_mtu_check() works for both XDP and TC-BPF programs.

The API is designed to help the BPF-programmer who wants to change the
packet context size, which involves other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse the arguments
needed by these other helpers, and perform the MTU check prior to doing
any actual size adjustment of the packet context.
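
The check-before-adjust idea can be sketched in userspace as below. This
is a simplified model of the logic in this patch, not the helper itself:
all model_* and MODEL_* names are hypothetical, the device is reduced to
two integers, and the GSO handling is left out.

```c
#include <stdint.h>

#define MODEL_VLAN_HLEN 4

enum model_mtu_flags { MODEL_MTU_CHK_RELAX = 1U << 0 };
enum model_mtu_ret { MODEL_MTU_OK = 0, MODEL_MTU_FRAG_NEEDED = 1 };

/* Apply a (possibly negative) delta, clamping the result at zero */
static unsigned int model_len_adjust(unsigned int len, int len_diff)
{
	long long len_new = (long long)len + len_diff;

	return len_new > 0 ? (unsigned int)len_new : 0;
}

/* Check the planned L2 length against the device MTU plus L2 header */
static int model_mtu_check(unsigned int pkt_len, unsigned int dev_mtu,
			   unsigned int hard_header_len, int len_diff,
			   unsigned int flags, unsigned int *mtu_result)
{
	unsigned int mtu = dev_mtu + hard_header_len;

	if (flags & MODEL_MTU_CHK_RELAX)
		mtu += MODEL_VLAN_HLEN; /* room for one VLAN header */

	if (mtu_result)
		*mtu_result = mtu; /* reported on success and on violation */

	if (model_len_adjust(pkt_len, len_diff) > mtu)
		return MODEL_MTU_FRAG_NEEDED;

	return MODEL_MTU_OK;
}
```

The same len_diff that would later be handed to a size-adjusting helper
is passed here first, so a violation is detected before the packet is
modified.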

V3: Take L2/ETH_HLEN header size into account and document it.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   63 +++++++++++++++++++++
 net/core/filter.c              |  119 ++++++++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   63 +++++++++++++++++++++
 3 files changed, 245 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4a46a1de6d16..1dcf5d8195f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3718,6 +3718,56 @@ union bpf_attr {
  *		never return NULL.
  *	Return
  *		A pointer pointing to the kernel percpu variable on this cpu.
+ *
+ * int bpf_mtu_check(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows checking the MTU prior to changing packet ctx.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *              **BPF_MTU_CHK_RELAX**
+ *			This flag relaxes (increases) the MTU with room for one
+ *			VLAN header (4 bytes) and takes into account net device
+ *			hard_header_len.  This relaxation is also used by the
+ *			kernel's own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_GSO**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often known as frags), then those are also checked
+ *			against the MTU size.
+ *
+ *		The *mtu_result* pointer contains the MTU value of the net
+ *		device including the L2 header size (usually 14 bytes Ethernet
+ *		header). The net device configured MTU is the L3 size, but as
+ *		XDP and TX length operate at L2, this helper includes L2 header
+ *		size in reported MTU.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_result* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_result* pointer, as this can be needed for
+ *		implementing PMTU handling:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_GSO_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3875,6 +3925,7 @@ union bpf_attr {
 	FN(redirect_neigh),		\
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
+	FN(mtu_check),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4889,6 +4940,18 @@ struct bpf_fib_lookup {
 	__u8	dmac[6];     /* ETH_ALEN */
 };
 
+/* bpf_mtu_check flags*/
+enum  bpf_mtu_check_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_GSO   = (1U << 1),
+};
+
+enum bpf_mtu_check_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_GSO_TOOBIG,   /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index da74d6ddc4d7..5986156e700e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5513,6 +5513,121 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static int bpf_mtu_lookup(struct net *netns, u32 ifindex, u64 flags)
+{
+	struct net_device *dev;
+	int mtu;
+
+	dev = dev_get_by_index_rcu(netns, ifindex);
+	if (!dev)
+		return -ENODEV;
+
+	/* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
+	mtu = dev->mtu + dev->hard_header_len;
+
+	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
+	if (flags & BPF_MTU_CHK_RELAX)
+		mtu += VLAN_HLEN;
+
+	return mtu;
+}
+
+static unsigned int __bpf_len_adjust_positive(unsigned int len, int len_diff)
+{
+	int len_new = len + len_diff; /* notice len_diff can be negative */
+
+	if (len_new > 0)
+		return len_new;
+
+	return 0;
+}
+
+BPF_CALL_5(bpf_skb_mtu_check, struct sk_buff *, skb,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	struct net *netns = dev_net(skb->dev);
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	unsigned int len = skb->len;
+	int mtu;
+
+	if (flags & ~(BPF_MTU_CHK_RELAX | BPF_MTU_CHK_GSO))
+		return -EINVAL;
+
+	mtu = bpf_mtu_lookup(netns, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len = __bpf_len_adjust_positive(len, len_diff);
+	if (len > mtu) {
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+		goto out;
+	}
+
+	if (flags & BPF_MTU_CHK_GSO &&
+	    skb_is_gso(skb) &&
+	    skb_gso_validate_network_len(skb, mtu)) {
+		ret = BPF_MTU_CHK_RET_GSO_TOOBIG;
+		goto out;
+	}
+
+out:
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+BPF_CALL_5(bpf_xdp_mtu_check, struct xdp_buff *, xdp,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	unsigned int len = xdp->data_end - xdp->data;
+	struct net_device *dev = xdp->rxq->dev;
+	struct net *netns = dev_net(dev);
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	int mtu;
+
+	/* XDP variant doesn't support multi-buffer segment check (yet) */
+	if (flags & ~BPF_MTU_CHK_RELAX)
+		return -EINVAL;
+
+	mtu = bpf_mtu_lookup(netns, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len = __bpf_len_adjust_positive(len, len_diff);
+	if (len > mtu) {
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+		goto out;
+	}
+out:
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_mtu_check_proto = {
+	.func		= bpf_skb_mtu_check,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_mtu_check_proto = {
+	.func		= bpf_xdp_mtu_check,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
 {
@@ -7076,6 +7191,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_mtu_check:
+		return &bpf_skb_mtu_check_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
 	case BPF_FUNC_sk_storage_get:
@@ -7145,6 +7262,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
+	case BPF_FUNC_mtu_check:
+		return &bpf_xdp_mtu_check_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a46a1de6d16..1dcf5d8195f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3718,6 +3718,56 @@ union bpf_attr {
  *		never return NULL.
  *	Return
  *		A pointer pointing to the kernel percpu variable on this cpu.
+ *
+ * int bpf_mtu_check(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows checking the MTU prior to changing packet ctx.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *              **BPF_MTU_CHK_RELAX**
+ *			This flag relaxes (increases) the MTU with room for one
+ *			VLAN header (4 bytes) and takes into account net device
+ *			hard_header_len.  This relaxation is also used by the
+ *			kernel's own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_GSO**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often known as frags), then those are also checked
+ *			against the MTU size.
+ *
+ *		The *mtu_result* pointer contains the MTU value of the net
+ *		device including the L2 header size (usually 14 bytes Ethernet
+ *		header). The net device configured MTU is the L3 size, but as
+ *		XDP and TX length operate at L2, this helper includes L2 header
+ *		size in reported MTU.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_result* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_result* pointer, as this can be needed for
+ *		implementing PMTU handling:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_GSO_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3875,6 +3925,7 @@ union bpf_attr {
 	FN(redirect_neigh),		\
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
+	FN(mtu_check),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4889,6 +4940,18 @@ struct bpf_fib_lookup {
 	__u8	dmac[6];     /* ETH_ALEN */
 };
 
+/* bpf_mtu_check flags*/
+enum  bpf_mtu_check_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_GSO   = (1U << 1),
+};
+
+enum bpf_mtu_check_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_GSO_TOOBIG,   /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */




* [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2020-10-08 14:09 ` [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-08 14:09 ` Jesper Dangaard Brouer
  2020-10-09 16:47   ` Daniel Borkmann
  2020-10-08 14:09 ` [PATCH bpf-next V3 5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:09 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This change makes it possible to identify SKBs that have been redirected
by TC-BPF (cls_act). This is needed for a number of cases.

(1) For collaborating with driver ifb net_devices.
(2) For avoiding starting generic-XDP prog on TC ingress redirect.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/dev.c    |    2 ++
 net/sched/Kconfig |    1 +
 2 files changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9d55bf5d1a65..b433098896b2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3885,6 +3885,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		return NULL;
 	case TC_ACT_REDIRECT:
 		/* No need to push/pop skb's mac_header here on egress! */
+		skb_set_redirected(skb, false);
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
@@ -4974,6 +4975,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		 * redirecting to another netdev
 		 */
 		__skb_push(skb, skb->mac_len);
+		skb_set_redirected(skb, true);
 		skb_do_redirect(skb);
 		return NULL;
 	case TC_ACT_CONSUMED:
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..a1bbaa8fd054 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -384,6 +384,7 @@ config NET_SCH_INGRESS
 	depends on NET_CLS_ACT
 	select NET_INGRESS
 	select NET_EGRESS
+	select NET_REDIRECT
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,




* [PATCH bpf-next V3 5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2020-10-08 14:09 ` [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
@ 2020-10-08 14:09 ` Jesper Dangaard Brouer
  2020-10-09 23:17   ` Maciej Żenczykowski
  2020-10-08 14:09 ` [PATCH bpf-next V3 6/6] net: inline and splitup is_skb_forwardable Jesper Dangaard Brouer
  2020-10-09 16:33 ` [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jakub Kicinski
  6 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:09 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The use-case for dropping the MTU check when TC-BPF redirects to
ingress is described by Eyal Birger in email[0]. In summary, it is the
ability to increase the packet size (e.g. with IPv6 headers for NAT64),
redirect the packet to ingress, and let the normal netstack fragment the
packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |    5 +++--
 net/core/dev.c            |    2 +-
 net/core/filter.c         |   12 ++++++++++--
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 28cfa53daf72..58fb7b4869ba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3866,10 +3866,11 @@ bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool mtu_check)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    (mtu_check && unlikely(!is_skb_forwardable(dev, skb)))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index b433098896b2..96b455f15872 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2209,7 +2209,7 @@ EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index 5986156e700e..a8e24092e4f5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;




* [PATCH bpf-next V3 6/6] net: inline and splitup is_skb_forwardable
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2020-10-08 14:09 ` [PATCH bpf-next V3 5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2020-10-08 14:09 ` Jesper Dangaard Brouer
  2020-10-09 16:33 ` [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jakub Kicinski
  6 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:09 UTC
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The BPF-helper bpf_skb_fib_lookup() uses is_skb_forwardable(), which
also checks if the net_device is "up"; this is unnecessary for this
helper. This patch splits the size check out of is_skb_forwardable() into
is_skb_fwd_size_ok() such that the helper can use this instead.

This change also causes is_skb_forwardable() to be inlined in the
existing call sites, most importantly in dev_forward_skb().
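
The effect of the split can be modelled in userspace as below. The
structs and model_* functions are hypothetical reductions of net_device
and sk_buff to the few fields the check reads, and MODEL_IFF_UP is a
placeholder flag value.

```c
#include <stdbool.h>

#define MODEL_IFF_UP    0x1
#define MODEL_VLAN_HLEN 4

/* Reduced stand-ins for struct net_device and struct sk_buff */
struct model_dev {
	unsigned int flags;
	unsigned int mtu;
	unsigned int hard_header_len;
};

struct model_skb {
	unsigned int len; /* length at L2 */
	bool is_gso;
};

/* Size-only check: what the FIB lookup helper needs */
static bool model_fwd_size_ok(const struct model_dev *dev,
			      const struct model_skb *skb)
{
	unsigned int max = dev->mtu + dev->hard_header_len + MODEL_VLAN_HLEN;

	if (skb->len <= max)
		return true;

	/* GSO packets may be segmented later, so length is not checked */
	return skb->is_gso;
}

/* Full check: device must be up, then the size check applies */
static bool model_forwardable(const struct model_dev *dev,
			      const struct model_skb *skb)
{
	if (!(dev->flags & MODEL_IFF_UP))
		return false;

	return model_fwd_size_ok(dev, skb);
}
```

For a device that is down, the size-only check can still pass while the
full check fails, which is the distinction the helper relies on.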

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |   27 +++++++++++++++++++++++++--
 net/core/dev.c            |   21 ---------------------
 net/core/filter.c         |    2 +-
 3 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 58fb7b4869ba..4857c54590b5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3862,8 +3862,31 @@ int xdp_umem_query(struct net_device *dev, u16 queue_id);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
-bool is_skb_forwardable(const struct net_device *dev,
-			const struct sk_buff *skb);
+
+static __always_inline bool is_skb_fwd_size_ok(const struct net_device *dev,
+					       const struct sk_buff *skb)
+{
+	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+	unsigned int mtu = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+
+	/* Assumes SKB length at L2 */
+	if (likely(skb->len <= mtu))
+		return true;
+
+	/* If TSO is enabled, we don't care about the length as the packet
+	 * could be forwarded without being segmented before.
+	 */
+	return skb_is_gso(skb);
+}
+
+static __always_inline bool is_skb_forwardable(const struct net_device *dev,
+					       const struct sk_buff *skb)
+{
+	if (unlikely(!(dev->flags & IFF_UP)))
+		return false;
+
+	return is_skb_fwd_size_ok(dev, skb);
+}
 
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
 					       struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 96b455f15872..21b62bda0ef9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2186,27 +2186,6 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 			__net_timestamp(SKB);			\
 	}							\
 
-bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
-{
-	unsigned int len;
-
-	if (!(dev->flags & IFF_UP))
-		return false;
-
-	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
-	if (skb->len <= len)
-		return true;
-
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
-	 */
-	if (skb_is_gso(skb))
-		return true;
-
-	return false;
-}
-EXPORT_SYMBOL_GPL(is_skb_forwardable);
-
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
 	int ret = ____dev_forward_skb(dev, skb, true);
diff --git a/net/core/filter.c b/net/core/filter.c
index a8e24092e4f5..14e6b93757d4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5502,7 +5502,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 		struct net_device *dev;
 
 		dev = dev_get_by_index_rcu(net, params->ifindex);
-		if (!is_skb_forwardable(dev, skb))
+		if (!is_skb_fwd_size_ok(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
 
 		params->mtu = dev->mtu; /* union with tot_len */
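
The split in the diff above can be modelled in user space. This is only a sketch: struct model_dev, struct model_skb and the is_gso field are hypothetical stand-ins for the few net_device and sk_buff fields the checks read, and is_gso models skb_is_gso(). The logic mirrors the two inline functions added to netdevice.h.

```c
#include <stdbool.h>

#define IFF_UP    0x1  /* interface-up flag bit */
#define VLAN_HLEN 4    /* room for one VLAN tag, as in the patch */

/* Hypothetical stand-in for the net_device fields the checks read. */
struct model_dev {
	unsigned int flags;
	unsigned int mtu;             /* L3 MTU */
	unsigned int hard_header_len; /* L2 header length */
};

/* Hypothetical stand-in for the sk_buff fields the checks read. */
struct model_skb {
	unsigned int len; /* length at L2 */
	bool is_gso;      /* models skb_is_gso() */
};

/* Size-only check: does not look at the device state. */
static bool is_skb_fwd_size_ok(const struct model_dev *dev,
			       const struct model_skb *skb)
{
	unsigned int max_len = dev->mtu + dev->hard_header_len + VLAN_HLEN;

	if (skb->len <= max_len)
		return true;

	/* GSO packets may exceed the limit; they get segmented later. */
	return skb->is_gso;
}

/* Full check: device must be up, then the size check applies. */
static bool is_skb_forwardable(const struct model_dev *dev,
			       const struct model_skb *skb)
{
	if (!(dev->flags & IFF_UP))
		return false;

	return is_skb_fwd_size_ok(dev, skb);
}
```

With mtu 1500 and hard_header_len 14 the size limit is 1518 bytes. A device that is down still passes is_skb_fwd_size_ok() but fails is_skb_forwardable(), which is the difference bpf_skb_fib_lookup() relies on after this patch.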




* Re: [PATCH bpf-next V3 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-08 14:09 ` [PATCH bpf-next V3 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-09  4:05   ` David Ahern
  0 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2020-10-09  4:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

On 10/8/20 7:09 AM, Jesper Dangaard Brouer wrote:
> The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The BPF-prog
> don't know the MTU value that caused this rejection.
> 
> If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> need to know this MTU value for the ICMP packet.
> 
> Patch change lookup and result struct bpf_fib_lookup, to contain this MTU
> value as output via a union with 'tot_len' as this is the value used for
> the MTU lookup.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   11 +++++++++--
>  net/core/filter.c              |   17 ++++++++++++-----
>  tools/include/uapi/linux/bpf.h |   11 +++++++++--
>  3 files changed, 30 insertions(+), 9 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>



* Re: [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-08 14:09 ` [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-09 16:12   ` Daniel Borkmann
  2020-10-09 18:26     ` Maciej Żenczykowski
  2020-10-10 10:25     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Borkmann @ 2020-10-09 16:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, willemdebruijn.kernel

On 10/8/20 4:09 PM, Jesper Dangaard Brouer wrote:
> Multiple BPF-helpers that can manipulate/increase the size of the SKB uses
> __bpf_skb_max_len() as the max-length. This function limit size against
> the current net_device MTU (skb->dev->mtu).
> 
> When a BPF-prog grow the packet size, then it should not be limited to the
> MTU. The MTU is a transmit limitation, and software receiving this packet
> should be allowed to increase the size. Further more, current MTU check in
> __bpf_skb_max_len uses the MTU from ingress/current net_device, which in
> case of redirects uses the wrong net_device.
> 
> Keep a sanity max limit of IP6_MAX_MTU (under CONFIG_IPV6) which is 64KiB
> plus 40 bytes IPv6 header size. If compiled without IPv6 use IP_MAX_MTU.
> 
> V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   net/core/filter.c |   16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 05df73780dd3..ddc1f9ba89d1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3474,11 +3474,11 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
>   	return 0;
>   }
>   
> -static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> -{
> -	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> -			  SKB_MAX_ALLOC;
> -}
> +#ifdef IP6_MAX_MTU /* Depend on CONFIG_IPV6 */
> +#define BPF_SKB_MAX_LEN IP6_MAX_MTU
> +#else
> +#define BPF_SKB_MAX_LEN IP_MAX_MTU
> +#endif

Shouldn't that check on skb->protocol? The way I understand it is that a number of devices
including virtual ones use ETH_MAX_MTU as their dev->max_mtu, so the mtu must be in the range
of dev->min_mtu(=ETH_MIN_MTU), dev->max_mtu(=ETH_MAX_MTU). __dev_set_mtu() then sets the user
value to dev->mtu in the core if within this range. That means in your case skb->dev->hard_header_len
for example is left out, meaning if we go for some constant, that would need to be higher.
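
The arithmetic behind this concern can be sketched in user space. The constants below are written out by hand and are assumptions to check against the kernel headers: ETH_MAX_MTU and IP_MAX_MTU as 0xFFFF, and IP6_MAX_MTU as 0xFFFF plus the 40-byte IPv6 header, as the commit message describes. The helper names max_l2_len() and cap_covers() are hypothetical.

```c
#define ETH_HLEN     14       /* Ethernet header length */
#define ETH_MAX_MTU  0xFFFFU  /* upper bound many devices use for max_mtu */
#define IP_MAX_MTU   0xFFFFU
#define IPV6_HDR_LEN 40       /* size of the fixed IPv6 header */
#define IP6_MAX_MTU  (0xFFFFU + IPV6_HDR_LEN)

/* Largest L2 frame length for a device with the given MTU. */
static unsigned int max_l2_len(unsigned int mtu, unsigned int hard_header_len)
{
	return mtu + hard_header_len;
}

/* Returns 1 if a constant cap still admits the device's largest frame. */
static int cap_covers(unsigned int cap, unsigned int mtu,
		      unsigned int hard_header_len)
{
	return max_l2_len(mtu, hard_header_len) <= cap;
}
```

Under these assumptions, a device at ETH_MAX_MTU with an Ethernet header needs 65549 bytes. IP_MAX_MTU (65535) does not cover that, while IP6_MAX_MTU (65575) does, but only as long as hard_header_len stays at or below 40 bytes.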

>   BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
>   	   u32, mode, u64, flags)
> @@ -3527,7 +3527,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
>   {
>   	u32 len_cur, len_diff_abs = abs(len_diff);
>   	u32 len_min = bpf_skb_net_base_len(skb);
> -	u32 len_max = __bpf_skb_max_len(skb);
> +	u32 len_max = BPF_SKB_MAX_LEN;
>   	__be16 proto = skb->protocol;
>   	bool shrink = len_diff < 0;
>   	u32 off;
> @@ -3610,7 +3610,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
>   static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len,
>   					u64 flags)
>   {
> -	u32 max_len = __bpf_skb_max_len(skb);
> +	u32 max_len = BPF_SKB_MAX_LEN;
>   	u32 min_len = __bpf_skb_min_len(skb);
>   	int ret;
>   
> @@ -3686,7 +3686,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = {
>   static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room,
>   					u64 flags)
>   {
> -	u32 max_len = __bpf_skb_max_len(skb);
> +	u32 max_len = BPF_SKB_MAX_LEN;
>   	u32 new_len = skb->len + head_room;
>   	int ret;
>   
> 
> 



* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (5 preceding siblings ...)
  2020-10-08 14:09 ` [PATCH bpf-next V3 6/6] net: inline and splitup is_skb_forwardable Jesper Dangaard Brouer
@ 2020-10-09 16:33 ` Jakub Kicinski
  2020-10-09 20:49   ` John Fastabend
  6 siblings, 1 reply; 32+ messages in thread
From: Jakub Kicinski @ 2020-10-09 16:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, eyal.birger

On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> V3: Drop enforcement of MTU in net-core, leave it to drivers

Sorry for being late to the discussion.

I absolutely disagree. We had cases in the past where HW would lock up
if it was sent a frame with bad geometry.

We will not be sprinkling validation checks across the drivers because
some reconfiguration path may occasionally yield a bad packet, or it's
hard to do something right with BPF.


* Re: [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs
  2020-10-08 14:09 ` [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
@ 2020-10-09 16:47   ` Daniel Borkmann
  2020-10-09 18:33     ` Maciej Żenczykowski
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2020-10-09 16:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

On 10/8/20 4:09 PM, Jesper Dangaard Brouer wrote:
> This change makes it possible to identify SKBs that have been redirected
> by TC-BPF (cls_act). This is needed for a number of cases.
> 
> (1) For collaborating with driver ifb net_devices.
> (2) For avoiding starting generic-XDP prog on TC ingress redirect.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Not sure if anyone actually cares about ifb devices, but my worry is that the
generic XDP vs tc interaction has been as-is for quite some time so this change
in behavior could break in the wild.


* Re: [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-09 16:12   ` Daniel Borkmann
@ 2020-10-09 18:26     ` Maciej Żenczykowski
  2020-10-10 10:25     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-09 18:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jesper Dangaard Brouer, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, John Fastabend,
	Jakub Kicinski, Eyal Birger, willemdebruijn.kernel

> > Multiple BPF-helpers that can manipulate/increase the size of the SKB uses
> > __bpf_skb_max_len() as the max-length. This function limit size against
> > the current net_device MTU (skb->dev->mtu).
> >
> > When a BPF-prog grow the packet size, then it should not be limited to the
> > MTU. The MTU is a transmit limitation, and software receiving this packet
> > should be allowed to increase the size. Further more, current MTU check in
> > __bpf_skb_max_len uses the MTU from ingress/current net_device, which in
> > case of redirects uses the wrong net_device.
> >
> > Keep a sanity max limit of IP6_MAX_MTU (under CONFIG_IPV6) which is 64KiB
> > plus 40 bytes IPv6 header size. If compiled without IPv6 use IP_MAX_MTU.
> >
> > V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   net/core/filter.c |   16 ++++++++--------
> >   1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 05df73780dd3..ddc1f9ba89d1 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3474,11 +3474,11 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
> >       return 0;
> >   }
> >
> > -static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> > -{
> > -     return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> > -                       SKB_MAX_ALLOC;
> > -}
> > +#ifdef IP6_MAX_MTU /* Depend on CONFIG_IPV6 */
> > +#define BPF_SKB_MAX_LEN IP6_MAX_MTU
> > +#else
> > +#define BPF_SKB_MAX_LEN IP_MAX_MTU
> > +#endif
>
> Shouldn't that check on skb->protocol? The way I understand it is that a number of devices
> including virtual ones use ETH_MAX_MTU as their dev->max_mtu, so the mtu must be in the range
> of dev->min_mtu(=ETH_MIN_MTU), dev->max_mtu(=ETH_MAX_MTU). __dev_set_mtu() then sets the user
> value to dev->mtu in the core if within this range. That means in your case skb->dev->hard_header_len
> for example is left out, meaning if we go for some constant, that would need to be higher.

I think in the past skb->protocol was not guaranteed to be correct -
could be zero...
(with [misconfigured] raw sockets - maybe that's fixed now?)


* Re: [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs
  2020-10-09 16:47   ` Daniel Borkmann
@ 2020-10-09 18:33     ` Maciej Żenczykowski
  2020-10-10 11:09       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-09 18:33 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jesper Dangaard Brouer, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, John Fastabend,
	Jakub Kicinski, Eyal Birger

> > This change makes it possible to identify SKBs that have been redirected
> > by TC-BPF (cls_act). This is needed for a number of cases.
> >
> > (1) For collaborating with driver ifb net_devices.
> > (2) For avoiding starting generic-XDP prog on TC ingress redirect.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> Not sure if anyone actually cares about ifb devices, but my worry is that the
> generic XDP vs tc interaction has been as-is for quite some time so this change
> in behavior could break in the wild.

I'm not at all sure of the interactions/implications here.
But I do have a request to enable ifb on Android for ingress rate
limiting and separately we're trying to make XDP work...
So we might at some point end up with cellular interfaces with xdp
ebpf (redirect for forwarding/nat/tethering) + ifb + tc ebpf (for
device local stuff).
But this is still all very vague and 'ideas only' level.
(and in general I think I'd like to get rid of the redirect in tc
ebpf, and leave only xlat64 translation for to-the-device traffic in
there, so maybe there's no problem anyway??)


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-09 16:33 ` [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jakub Kicinski
@ 2020-10-09 20:49   ` John Fastabend
  2020-10-09 21:07     ` Alexei Starovoitov
  2020-10-09 23:00     ` Jakub Kicinski
  0 siblings, 2 replies; 32+ messages in thread
From: John Fastabend @ 2020-10-09 20:49 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, eyal.birger

Jakub Kicinski wrote:
> On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> > V3: Drop enforcement of MTU in net-core, leave it to drivers
> 
> Sorry for being late to the discussion.
> 
> I absolutely disagree. We had cases in the past where HW would lock up
> if it was sent a frame with bad geometry.
> 
> We will not be sprinkling validation checks across the drivers because
> some reconfiguration path may occasionally yield a bad packet, or it's
> hard to do something right with BPF.

This is a driver bug then. As it stands today drivers may get hit with
an skb larger than the set MTU, as best I can tell. Generally I expect
drivers to use the MTU to configure RX buffers; I'm not sure how it is
going to be used on the TX side. Any examples? I just poked around the
driver source, and it seems to confirm it's primarily for RX-side
configuration, with some drivers throwing the event down to the firmware
for something that I can't see in the code.

I'm not suggesting sprinkling validation checks across the drivers.
I'm suggesting if the drivers hang we fix them.


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-09 20:49   ` John Fastabend
@ 2020-10-09 21:07     ` Alexei Starovoitov
  2020-10-09 21:57       ` Maciej Żenczykowski
  2020-10-09 23:00     ` Jakub Kicinski
  1 sibling, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2020-10-09 21:07 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jakub Kicinski, Jesper Dangaard Brouer, bpf, netdev,
	Daniel Borkmann, maze, lmb, shaun, Lorenzo Bianconi, marek,
	eyal.birger

On Fri, Oct 09, 2020 at 01:49:14PM -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> > > V3: Drop enforcement of MTU in net-core, leave it to drivers
> > 
> > Sorry for being late to the discussion.
> > 
> > I absolutely disagree. We had cases in the past where HW would lock up
> > if it was sent a frame with bad geometry.
> > 
> > We will not be sprinkling validation checks across the drivers because
> > some reconfiguration path may occasionally yield a bad packet, or it's
> > hard to do something right with BPF.
> 
> This is a driver bug then. As it stands today drivers may get hit with
> skb with MTU greater than set MTU as best I can tell. Generally I
> expect drivers use MTU to configure RX buffers not sure how it is going
> to be used on TX side? Any examples? I just poked around through the
> driver source to see and seems to confirm its primarily for RX side
> configuration with some drivers throwing the event down to the firmware
> for something that I can't see in the code?
> 
> I'm not suggesting sprinkling validation checks across the drivers.
> I'm suggesting if the drivers hang we fix them.

+1

I've seen HW that hangs on certain packet sizes.
Like < 68 byte TX where size is one specific constant.
I don't think it's a job of the stack or the driver to deal with that.
It's a firmware/hw bug.


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-09 21:07     ` Alexei Starovoitov
@ 2020-10-09 21:57       ` Maciej Żenczykowski
  0 siblings, 0 replies; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-09 21:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Jakub Kicinski, Jesper Dangaard Brouer, bpf,
	Linux NetDev, Daniel Borkmann, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, Eyal Birger

On Fri, Oct 9, 2020 at 2:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Oct 09, 2020 at 01:49:14PM -0700, John Fastabend wrote:
> > Jakub Kicinski wrote:
> > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> > > > V3: Drop enforcement of MTU in net-core, leave it to drivers
> > >
> > > Sorry for being late to the discussion.
> > >
> > > I absolutely disagree. We had cases in the past where HW would lock up
> > > if it was sent a frame with bad geometry.
> > >
> > > We will not be sprinkling validation checks across the drivers because
> > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > hard to do something right with BPF.
> >
> > This is a driver bug then. As it stands today drivers may get hit with
> > skb with MTU greater than set MTU as best I can tell. Generally I
> > expect drivers use MTU to configure RX buffers not sure how it is going
> > to be used on TX side? Any examples? I just poked around through the
> > driver source to see and seems to confirm its primarily for RX side
> > configuration with some drivers throwing the event down to the firmware
> > for something that I can't see in the code?
> >
> > I'm not suggesting sprinkling validation checks across the drivers.
> > I'm suggesting if the drivers hang we fix them.
>
> +1
>
> I've seen HW that hangs when certain sizes of the packet.
> Like < 68 byte TX where size is one specific constant.
> I don't think it's a job of the stack or the driver to deal with that.
> It's firmware/hw bug.

+1
It's not the job of the core stack, but it *is* the job of the driver
to deal with firmware/hw bugs like this.
Sure fix in hw when you can (next rev), in fw if you can't (and have
fw, can release it, rev it, distribute it), but ultimately that's why
drivers have quirks for various revisions of hw/fw... so you very much
fix bugs like this in driver if needed.


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-09 20:49   ` John Fastabend
  2020-10-09 21:07     ` Alexei Starovoitov
@ 2020-10-09 23:00     ` Jakub Kicinski
  2020-10-10 10:44       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 32+ messages in thread
From: Jakub Kicinski @ 2020-10-09 23:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	eyal.birger

On Fri, 09 Oct 2020 13:49:14 -0700 John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:  
> > > V3: Drop enforcement of MTU in net-core, leave it to drivers  
> > 
> > Sorry for being late to the discussion.
> > 
> > I absolutely disagree. We had cases in the past where HW would lock up
> > if it was sent a frame with bad geometry.
> > 
> > We will not be sprinkling validation checks across the drivers because
> > some reconfiguration path may occasionally yield a bad packet, or it's
> > hard to do something right with BPF.  
> 
> This is a driver bug then. As it stands today drivers may get hit with
> skb with MTU greater than set MTU as best I can tell.

You're talking about taking it from "maybe this can happen, but will
still be at most jumbo" to "it's going to be very easy to trigger and
length may be > MAX_U16".

> Generally I expect drivers use MTU to configure RX buffers not sure
> how it is going to be used on TX side? Any examples? I just poked
> around through the driver source to see and seems to confirm its
> primarily for RX side configuration with some drivers throwing the
> event down to the firmware for something that I can't see in the code?

Right, but that could just be because nobody expects to get oversized
frames from the stack.

We actively encourage drivers to remove paranoid checks. It's really
not going to be a great experience for driver authors where they need
to consult a list of things they should and shouldn't check.

If we want to do this, the driver interface must most definitely say 
MRU and not MTU.

> I'm not suggesting sprinkling validation checks across the drivers.
> I'm suggesting if the drivers hang we fix them.

We both know the level of testing drivers get, it's unlikely this will
be validated. It's a packet of death waiting to happen.

And all this for what? Saving 2 cycles on a branch that will almost
never be taken?


* Re: [PATCH bpf-next V3 5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-08 14:09 ` [PATCH bpf-next V3 5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2020-10-09 23:17   ` Maciej Żenczykowski
  0 siblings, 0 replies; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-09 23:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Linux NetDev, Daniel Borkmann, Alexei Starovoitov,
	Lorenz Bauer, Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	John Fastabend, Jakub Kicinski, Eyal Birger

On Thu, Oct 8, 2020 at 7:09 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> The use-case for dropping the MTU check when TC-BPF does redirect to
> ingress, is described by Eyal Birger in email[0]. The summary is the
> ability to increase packet size (e.g. with IPv6 headers for NAT64) and
> ingress redirect packet and let normal netstack fragment packet as needed.
>
> [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/netdevice.h |    5 +++--
>  net/core/dev.c            |    2 +-
>  net/core/filter.c         |   12 ++++++++++--
>  3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 28cfa53daf72..58fb7b4869ba 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3866,10 +3866,11 @@ bool is_skb_forwardable(const struct net_device *dev,
>                         const struct sk_buff *skb);
>
>  static __always_inline int ____dev_forward_skb(struct net_device *dev,
> -                                              struct sk_buff *skb)
> +                                              struct sk_buff *skb,
> +                                              const bool mtu_check)

check_mtu might be a better arg name than 'mtu_check'

>  {
>         if (skb_orphan_frags(skb, GFP_ATOMIC) ||
> -           unlikely(!is_skb_forwardable(dev, skb))) {
> +           (mtu_check && unlikely(!is_skb_forwardable(dev, skb)))) {
>                 atomic_long_inc(&dev->rx_dropped);
>                 kfree_skb(skb);
>                 return NET_RX_DROP;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b433098896b2..96b455f15872 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2209,7 +2209,7 @@ EXPORT_SYMBOL_GPL(is_skb_forwardable);
>
>  int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
>  {
> -       int ret = ____dev_forward_skb(dev, skb);
> +       int ret = ____dev_forward_skb(dev, skb, true);
>
>         if (likely(!ret)) {
>                 skb->protocol = eth_type_trans(skb, dev);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5986156e700e..a8e24092e4f5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
>
>  static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
>  {
> -       return dev_forward_skb(dev, skb);
> +       int ret = ____dev_forward_skb(dev, skb, false);
> +
> +       if (likely(!ret)) {
> +               skb->protocol = eth_type_trans(skb, dev);
> +               skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);

this blindly assumes the eth header size in a function that does not
(by name) seem ethernet specific...
could this use dev->hard_header_len?  or change the func name to be
__bpf_ethernet_rx_skb or something

> +               ret = netif_rx(skb);
> +       }
> +
> +       return ret;
>  }
>
>  static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
>                                       struct sk_buff *skb)
>  {
> -       int ret = ____dev_forward_skb(dev, skb);
> +       int ret = ____dev_forward_skb(dev, skb, false);
>
>         if (likely(!ret)) {
>                 skb->dev = dev;


* Re: [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking
  2020-10-08 14:09 ` [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-09 23:29   ` Maciej Żenczykowski
  2020-10-21 11:32     ` Jesper Dangaard Brouer
  2020-10-12 15:54   ` Lorenz Bauer
  1 sibling, 1 reply; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-09 23:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Linux NetDev, Daniel Borkmann, Alexei Starovoitov,
	Lorenz Bauer, Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	John Fastabend, Jakub Kicinski, Eyal Birger

On Thu, Oct 8, 2020 at 7:09 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> This BPF-helper bpf_mtu_check() works for both XDP and TC-BPF programs.

bpf_check_mtu() seems a better name.

>
> The API is designed to help the BPF-programmer, that want to do packet
> context size changes, which involves other helpers. These other helpers
> usually does a delta size adjustment. This helper also support a delta
> size (len_diff), which allow BPF-programmer to reuse arguments needed by
> these other helpers, and perform the MTU check prior to doing any actual
> size adjustment of the packet context.
>
> V3: Take L2/ETH_HLEN header size into account and document it.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   63 +++++++++++++++++++++
>  net/core/filter.c              |  119 ++++++++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   63 +++++++++++++++++++++
>  3 files changed, 245 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4a46a1de6d16..1dcf5d8195f4 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3718,6 +3718,56 @@ union bpf_attr {
>   *             never return NULL.
>   *     Return
>   *             A pointer pointing to the kernel percpu variable on this cpu.
> + *
> + * int bpf_mtu_check(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
> + *     Description
> + *             Check ctx packet size against MTU of net device (based on
> + *             *ifindex*).  This helper will likely be used in combination with
> + *             helpers that adjust/change the packet size.  The argument
> + *             *len_diff* can be used for querying with a planned size
> + *             change. This allows to check MTU prior to changing packet ctx.
> + *
> + *             The Linux kernel route table can configure MTUs on a more
> + *             specific per route level, which is not provided by this helper.
> + *             For route level MTU checks use the **bpf_fib_lookup**\ ()
> + *             helper.
> + *
> + *             *ctx* is either **struct xdp_md** for XDP programs or
> + *             **struct sk_buff** for tc cls_act programs.
> + *
> + *             The *flags* argument can be a combination of one or more of the
> + *             following values:
> + *
> + *              **BPF_MTU_CHK_RELAX**
> + *                     This flag relax or increase the MTU with room for one
> + *                     VLAN header (4 bytes) and take into account net device
> + *                     hard_header_len.  This relaxation is also used by the
> + *                     kernels own forwarding MTU checks.
> + *
> + *             **BPF_MTU_CHK_GSO**
> + *                     This flag will only works for *ctx* **struct sk_buff**.
> + *                     If packet context contains extra packet segment buffers
> + *                     (often knows as frags), then those are also checked
> + *                     against the MTU size.

naming is weird... what does GSO have to do with frags?
Aren't these orthogonal things?

> + *
> + *             The *mtu_result* pointer contains the MTU value of the net
> + *             device including the L2 header size (usually 14 bytes Ethernet
> + *             header). The net device configured MTU is the L3 size, but as
> + *             XDP and TX length operate at L2 this helper include L2 header
> + *             size in reported MTU.
> + *
> + *     Return
> + *             * 0 on success, and populate MTU value in *mtu_result* pointer.
> + *
> + *             * < 0 if any input argument is invalid (*mtu_result* not updated)

not -EINVAL?

> + *
> + *             MTU violations return positive values, but also populate MTU
> + *             value in *mtu_result* pointer, as this can be needed for
> + *             implemeting PMTU handing:
implementing

> + *
> + *             * **BPF_MTU_CHK_RET_FRAG_NEEDED**
> + *             * **BPF_MTU_CHK_RET_GSO_TOOBIG**
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -3875,6 +3925,7 @@ union bpf_attr {
>         FN(redirect_neigh),             \
>         FN(bpf_per_cpu_ptr),            \
>         FN(bpf_this_cpu_ptr),           \
> +       FN(mtu_check),                  \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -4889,6 +4940,18 @@ struct bpf_fib_lookup {
>         __u8    dmac[6];     /* ETH_ALEN */
>  };
>
> +/* bpf_mtu_check flags*/
> +enum  bpf_mtu_check_flags {
> +       BPF_MTU_CHK_RELAX = (1U << 0),
> +       BPF_MTU_CHK_GSO   = (1U << 1),
> +};
> +
> +enum bpf_mtu_check_ret {
> +       BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> +       BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> +       BPF_MTU_CHK_RET_GSO_TOOBIG,   /* GSO re-segmentation needed to fwd */
> +};
> +
>  enum bpf_task_fd_type {
>         BPF_FD_TYPE_RAW_TRACEPOINT,     /* tp name */
>         BPF_FD_TYPE_TRACEPOINT,         /* tp name */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index da74d6ddc4d7..5986156e700e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5513,6 +5513,121 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
>         .arg4_type      = ARG_ANYTHING,
>  };
>
> +static int bpf_mtu_lookup(struct net *netns, u32 ifindex, u64 flags)

bpf_lookup_mtu() ???

> +{
> +       struct net_device *dev;
> +       int mtu;
> +
> +       dev = dev_get_by_index_rcu(netns, ifindex);

My understanding is that this lookup is a bit of a perf hit. Maybe ifindex 0
could mean "use skb->dev"?
Or have a bpf_lookup_mtu(skb) function as well?

> +       if (!dev)
> +               return -ENODEV;
> +
> +       /* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
> +       mtu = dev->mtu + dev->hard_header_len;
> +
> +       /*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> +       if (flags & BPF_MTU_CHK_RELAX)

could this check device vlan tx offload state instead?

> +               mtu += VLAN_HLEN;
> +
> +       return mtu;
> +}
> +
> +static unsigned int __bpf_len_adjust_positive(unsigned int len, int len_diff)
> +{
> +       int len_new = len + len_diff; /* notice len_diff can be negative */
> +
> +       if (len_new > 0)
> +               return len_new;
> +
> +       return 0;

not return len ?

oh I see the function doesn't do what the name implies...
Not sure this func is helpful... why not simply compute
int len_new = (int)len + (int)len_diff; directly down below and check < 0 there?
An skb->len above 2GB is meaningless anyway.
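
A minimal userspace sketch of the simplification suggested here, for illustration only. The function name len_exceeds_mtu and its return convention are hypothetical and not part of the patch. It uses a 64-bit intermediate instead of the plain (int) casts, so the addition cannot overflow in portable C:

```c
#include <stdint.h>

/* Hypothetical userspace model, not kernel code.
 * Returns  1 if the adjusted length exceeds mtu,
 *          0 if it fits,
 *         -1 if len_diff would shrink the packet below zero bytes.
 */
int len_exceeds_mtu(unsigned int len, int len_diff, int mtu)
{
	/* 64-bit intermediate: no wraparound for any 32-bit inputs. */
	int64_t len_new = (int64_t)len + len_diff;

	if (len_new < 0)
		return -1;

	return len_new > mtu ? 1 : 0;
}
```

With this shape the caller decides what a negative result means, instead of having it silently clamped to 0.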

> +}
> +
> +BPF_CALL_5(bpf_skb_mtu_check, struct sk_buff *, skb,
> +          u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
> +{
> +       struct net *netns = dev_net(skb->dev);
> +       int ret = BPF_MTU_CHK_RET_SUCCESS;
> +       unsigned int len = skb->len;
> +       int mtu;
> +
> +       if (flags & ~(BPF_MTU_CHK_RELAX | BPF_MTU_CHK_GSO))
> +               return -EINVAL;
> +
> +       mtu = bpf_mtu_lookup(netns, ifindex, flags);
> +       if (unlikely(mtu < 0))
> +               return mtu; /* errno */
> +
> +       len = __bpf_len_adjust_positive(len, len_diff);
> +       if (len > mtu) {
> +               ret = BPF_MTU_CHK_RET_FRAG_NEEDED;

Can't this fail if skb->len includes the entire packet, and yet gso is
on, and the packet is greater than the mtu, yet the gso size is smaller?

Think of a 200 byte gso packet with two 100 byte segs, and a 150 byte mtu.
Does gso actually require frags?  [As you can tell I don't have a good
handle on gso vs frags vs skb->len, maybe what I'm asking is bogus]
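
To make the 200/100/150 example concrete, here is a rough userspace model. It is a sketch under simplifying assumptions: struct pkt_model and both functions are hypothetical, headers are ignored, and gso_size is treated as the full per-segment length, which is not exactly how the kernel accounts for it:

```c
#include <stdbool.h>

/* Hypothetical, simplified packet model: not the kernel's struct sk_buff. */
struct pkt_model {
	unsigned int len;      /* total length, all segments combined */
	unsigned int gso_size; /* per-segment length, 0 when the packet is not GSO */
};

/* Plain check: compares the aggregate length against the MTU. */
bool total_len_fits(const struct pkt_model *p, unsigned int mtu)
{
	return p->len <= mtu;
}

/* Segment-aware check: a GSO packet fits if each segment fits. */
bool segments_fit(const struct pkt_model *p, unsigned int mtu)
{
	if (p->gso_size)
		return p->gso_size <= mtu;

	return p->len <= mtu;
}
```

In this model the aggregate check rejects the packet while the segment-aware check accepts it, which is the mismatch the question is about.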


> +               goto out;
> +       }
> +
> +       if (flags & BPF_MTU_CHK_GSO &&
> +           skb_is_gso(skb) &&
> +           skb_gso_validate_network_len(skb, mtu)) {
> +               ret = BPF_MTU_CHK_RET_GSO_TOOBIG;
> +               goto out;
> +       }
> +
> +out:
> +       if (mtu_result)
> +               *mtu_result = mtu;
> +
> +       return ret;
> +}
> +
> +BPF_CALL_5(bpf_xdp_mtu_check, struct xdp_buff *, xdp,
> +          u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
> +{
> +       unsigned int len = xdp->data_end - xdp->data;
> +       struct net_device *dev = xdp->rxq->dev;
> +       struct net *netns = dev_net(dev);
> +       int ret = BPF_MTU_CHK_RET_SUCCESS;
> +       int mtu;
> +
> +       /* XDP variant doesn't support multi-buffer segment check (yet) */
> +       if (flags & ~BPF_MTU_CHK_RELAX)
> +               return -EINVAL;
> +
> +       mtu = bpf_mtu_lookup(netns, ifindex, flags);
> +       if (unlikely(mtu < 0))
> +               return mtu; /* errno */
> +
> +       len = __bpf_len_adjust_positive(len, len_diff);
> +       if (len > mtu) {
> +               ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> +               goto out;
> +       }
> +out:
> +       if (mtu_result)
> +               *mtu_result = mtu;
> +
> +       return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_skb_mtu_check_proto = {
> +       .func           = bpf_skb_mtu_check,
> +       .gpl_only       = true,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_ANYTHING,
> +       .arg3_type      = ARG_PTR_TO_MEM,
> +       .arg4_type      = ARG_ANYTHING,
> +       .arg5_type      = ARG_ANYTHING,
> +};
> +
> +static const struct bpf_func_proto bpf_xdp_mtu_check_proto = {
> +       .func           = bpf_xdp_mtu_check,
> +       .gpl_only       = true,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_ANYTHING,
> +       .arg3_type      = ARG_PTR_TO_MEM,
> +       .arg4_type      = ARG_ANYTHING,
> +       .arg5_type      = ARG_ANYTHING,
> +};
> +
>  #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
>  static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
>  {
> @@ -7076,6 +7191,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_get_socket_uid_proto;
>         case BPF_FUNC_fib_lookup:
>                 return &bpf_skb_fib_lookup_proto;
> +       case BPF_FUNC_mtu_check:
> +               return &bpf_skb_mtu_check_proto;
>         case BPF_FUNC_sk_fullsock:
>                 return &bpf_sk_fullsock_proto;
>         case BPF_FUNC_sk_storage_get:
> @@ -7145,6 +7262,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_xdp_adjust_tail_proto;
>         case BPF_FUNC_fib_lookup:
>                 return &bpf_xdp_fib_lookup_proto;
> +       case BPF_FUNC_mtu_check:
> +               return &bpf_xdp_mtu_check_proto;
>  #ifdef CONFIG_INET
>         case BPF_FUNC_sk_lookup_udp:
>                 return &bpf_xdp_sk_lookup_udp_proto;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4a46a1de6d16..1dcf5d8195f4 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3718,6 +3718,56 @@ union bpf_attr {
>   *             never return NULL.
>   *     Return
>   *             A pointer pointing to the kernel percpu variable on this cpu.
> + *
> + * int bpf_mtu_check(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
> + *     Description
> + *             Check ctx packet size against MTU of net device (based on
> + *             *ifindex*).  This helper will likely be used in combination with
> + *             helpers that adjust/change the packet size.  The argument
> + *             *len_diff* can be used for querying with a planned size
> + *             change. This allows to check MTU prior to changing packet ctx.
> + *
> + *             The Linux kernel route table can configure MTUs on a more
> + *             specific per route level, which is not provided by this helper.
> + *             For route level MTU checks use the **bpf_fib_lookup**\ ()
> + *             helper.
> + *
> + *             *ctx* is either **struct xdp_md** for XDP programs or
> + *             **struct sk_buff** for tc cls_act programs.
> + *
> + *             The *flags* argument can be a combination of one or more of the
> + *             following values:
> + *
> + *              **BPF_MTU_CHK_RELAX**
> + *                     This flag relax or increase the MTU with room for one
> + *                     VLAN header (4 bytes) and take into account net device
> + *                     hard_header_len.  This relaxation is also used by the
> + *                     kernels own forwarding MTU checks.
> + *
> + *             **BPF_MTU_CHK_GSO**
> + *                     This flag will only works for *ctx* **struct sk_buff**.
> + *                     If packet context contains extra packet segment buffers
> + *                     (often knows as frags), then those are also checked
> + *                     against the MTU size.
> + *
> + *             The *mtu_result* pointer contains the MTU value of the net
> + *             device including the L2 header size (usually 14 bytes Ethernet
> + *             header). The net device configured MTU is the L3 size, but as
> + *             XDP and TX length operate at L2 this helper include L2 header
> + *             size in reported MTU.
> + *
> + *     Return
> + *             * 0 on success, and populate MTU value in *mtu_result* pointer.
> + *
> + *             * < 0 if any input argument is invalid (*mtu_result* not updated)
> + *
> + *             MTU violations return positive values, but also populate MTU
> + *             value in *mtu_result* pointer, as this can be needed for
> + *             implemeting PMTU handing:
> + *
> + *             * **BPF_MTU_CHK_RET_FRAG_NEEDED**
> + *             * **BPF_MTU_CHK_RET_GSO_TOOBIG**
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -3875,6 +3925,7 @@ union bpf_attr {
>         FN(redirect_neigh),             \
>         FN(bpf_per_cpu_ptr),            \
>         FN(bpf_this_cpu_ptr),           \
> +       FN(mtu_check),                  \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -4889,6 +4940,18 @@ struct bpf_fib_lookup {
>         __u8    dmac[6];     /* ETH_ALEN */
>  };
>
> +/* bpf_mtu_check flags*/
> +enum  bpf_mtu_check_flags {
> +       BPF_MTU_CHK_RELAX = (1U << 0),
> +       BPF_MTU_CHK_GSO   = (1U << 1),
> +};
> +
> +enum bpf_mtu_check_ret {
> +       BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> +       BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> +       BPF_MTU_CHK_RET_GSO_TOOBIG,   /* GSO re-segmentation needed to fwd */
> +};
> +
>  enum bpf_task_fd_type {
>         BPF_FD_TYPE_RAW_TRACEPOINT,     /* tp name */
>         BPF_FD_TYPE_TRACEPOINT,         /* tp name */

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-09 16:12   ` Daniel Borkmann
  2020-10-09 18:26     ` Maciej Żenczykowski
@ 2020-10-10 10:25     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-10 10:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, willemdebruijn.kernel, brouer

On Fri, 9 Oct 2020 18:12:20 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/8/20 4:09 PM, Jesper Dangaard Brouer wrote:
> > Multiple BPF-helpers that can manipulate/increase the size of the SKB uses
> > __bpf_skb_max_len() as the max-length. This function limit size against
> > the current net_device MTU (skb->dev->mtu).
> > 
> > When a BPF-prog grow the packet size, then it should not be limited to the
> > MTU. The MTU is a transmit limitation, and software receiving this packet
> > should be allowed to increase the size. Further more, current MTU check in
> > __bpf_skb_max_len uses the MTU from ingress/current net_device, which in
> > case of redirects uses the wrong net_device.
> > 
> > Keep a sanity max limit of IP6_MAX_MTU (under CONFIG_IPV6) which is 64KiB
> > plus 40 bytes IPv6 header size. If compiled without IPv6 use IP_MAX_MTU.
> > 
> > V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   net/core/filter.c |   16 ++++++++--------
> >   1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 05df73780dd3..ddc1f9ba89d1 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3474,11 +3474,11 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
> >   	return 0;
> >   }
> >   
> > -static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> > -{
> > -	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> > -			  SKB_MAX_ALLOC;
> > -}
> > +#ifdef IP6_MAX_MTU /* Depend on CONFIG_IPV6 */
> > +#define BPF_SKB_MAX_LEN IP6_MAX_MTU
> > +#else
> > +#define BPF_SKB_MAX_LEN IP_MAX_MTU
> > +#endif  
> 
> Shouldn't that check on skb->protocol? The way I understand it is
> that a number of devices including virtual ones use ETH_MAX_MTU as
> their dev->max_mtu, so the mtu must be in the range of
> dev->min_mtu(=ETH_MIN_MTU), dev->max_mtu(=ETH_MAX_MTU).
> __dev_set_mtu() then sets the user value to dev->mtu in the core if
> within this range. That means in your case skb->dev->hard_header_len
> for example is left out, meaning if we go for some constant, that
> would need to be higher.

Sorry, but I think you have missed the point.  This BPF_SKB_MAX_LEN is
just a sanity max limit.  We are removing the limit for BPF-progs to
change the size of the packet (regardless of MTU).

This will allow BPF-ingress to increase packet size (up to this sanity
limit) and then BPF-egress can decrease packet size again, before
sending it to the actual dev.  It is up to the BPF-programmer what to
use this for, but I think this adds good flexibility, instead of being
limited to the *transmit* size (MTU) of the dev.  This is software, so
why have this MTU limit?
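
As a rough illustration of "sanity limit instead of dev MTU", here is a small userspace sketch. The MODEL_* names and grow_allowed() are hypothetical; the constant values mirror what I believe IP_MAX_MTU and IP6_MAX_MTU expand to, so treat them as assumptions:

```c
#include <stdbool.h>

/* Assumed values, modelled on the kernel's IP_MAX_MTU / IP6_MAX_MTU. */
#define MODEL_IP_MAX_MTU   0xFFFFu
#define MODEL_IPV6_HDR_LEN 40u
#define MODEL_IP6_MAX_MTU  (0xFFFFu + MODEL_IPV6_HDR_LEN)

/* Hypothetical model: may a packet of cur_len bytes grow by len_diff bytes?
 * Only the sanity limit max_len is consulted, never a device MTU.
 */
bool grow_allowed(unsigned int cur_len, unsigned int len_diff,
		  unsigned int max_len)
{
	/* Written without cur_len + len_diff so the sum cannot wrap. */
	return len_diff <= max_len && cur_len <= max_len - len_diff;
}
```

A 1500 byte packet growing by 100 bytes passes here even though it would exceed a 1500 byte dev MTU; only growth past the sanity limit is refused.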

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-09 23:00     ` Jakub Kicinski
@ 2020-10-10 10:44       ` Jesper Dangaard Brouer
  2020-10-10 16:32         ` Jakub Kicinski
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-10 10:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: John Fastabend, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	maze, lmb, shaun, Lorenzo Bianconi, marek, eyal.birger, brouer

On Fri, 9 Oct 2020 16:00:10 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Fri, 09 Oct 2020 13:49:14 -0700 John Fastabend wrote:
> > Jakub Kicinski wrote:  
> > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:    
> > > > V3: Drop enforcement of MTU in net-core, leave it to drivers    
> > > 
> > > Sorry for being late to the discussion.
> > > 
> > > I absolutely disagree. We had cases in the past where HW would lock up
> > > if it was sent a frame with bad geometry.

I agree with Jakub here.  I do find it risky not to do these MTU checks
in net-core.

> > > We will not be sprinkling validation checks across the drivers because
> > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > hard to do something right with BPF.    
> > 
> > This is a driver bug then. As it stands today drivers may get hit with
> > skb with MTU greater than set MTU as best I can tell.  
> 
> You're talking about taking it from "maybe this can happen, but will
> still be at most jumbo" to "it's going to be very easy to trigger and
> length may be > MAX_U16".

It is interesting that a misbehaving BPF program can easily trigger this.
Next week, I will look into writing such a BPF-prog and then test it on
the hardware I have available in my testlab.


> > Generally I expect drivers use MTU to configure RX buffers not sure
> > how it is going to be used on TX side? Any examples? I just poked
> > around through the driver source to see and seems to confirm its
> > primarily for RX side configuration with some drivers throwing the
> > event down to the firmware for something that I can't see in the code?  
> 
> Right, but that could just be because nobody expects to get over sized
> frames from the stack.
> 
> We actively encourage drivers to remove paranoid checks. It's really
> not going to be a great experience for driver authors where they need
> to consult a list of things they should and shouldn't check.
> 
> If we want to do this, the driver interface must most definitely say 
> MRU and not MTU.

What is MRU?

 
> > I'm not suggestiong sprinkling validation checks across the drivers.
> > I'm suggesting if the drivers hang we fix them.  
> 
> We both know the level of testing drivers get, it's unlikely this will
> be validated. It's packet of death waiting to happen. 
> 
> And all this for what? Saving 2 cycles on a branch that will almost
> never be taken?

I do think it is risky not to do this simple MTU check in net-core.  I
also believe the overhead is very very low.  Hint, I'm basically just
moving the MTU check from one place to another.  (And the last patch in
the patchset is an optimization that inlines and saves cycles when doing
these kinds of MTU checks).
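
For reference, the kind of check being discussed is roughly the following. This is a userspace sketch only: struct dev_model and frame_fits_dev() are hypothetical names, and the limit follows the relaxed rule described for BPF_MTU_CHK_RELAX (L3 MTU plus hard_header_len plus room for one VLAN header):

```c
#include <stdbool.h>

#define MODEL_VLAN_HLEN 4u /* one VLAN header, 4 bytes */

/* Hypothetical stand-in for the two net_device fields the check needs. */
struct dev_model {
	unsigned int mtu;             /* configured L3 MTU */
	unsigned int hard_header_len; /* L2 header size, 14 for Ethernet */
};

/* True if an L2 frame of frame_len bytes passes the relaxed MTU check. */
bool frame_fits_dev(const struct dev_model *dev, unsigned int frame_len)
{
	unsigned int limit = dev->mtu + dev->hard_header_len + MODEL_VLAN_HLEN;

	return frame_len <= limit;
}
```

It is one addition chain and one compare per packet, which is why I expect the overhead to be hard to measure.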



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs
  2020-10-09 18:33     ` Maciej Żenczykowski
@ 2020-10-10 11:09       ` Jesper Dangaard Brouer
  2020-10-12 21:04         ` Maciej Żenczykowski
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-10 11:09 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Daniel Borkmann, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, John Fastabend,
	Jakub Kicinski, Eyal Birger, brouer

On Fri, 9 Oct 2020 11:33:33 -0700
Maciej Żenczykowski <maze@google.com> wrote:

> > > This change makes it possible to identify SKBs that have been redirected
> > > by TC-BPF (cls_act). This is needed for a number of cases.
> > >
> > > (1) For collaborating with driver ifb net_devices.
> > > (2) For avoiding starting generic-XDP prog on TC ingress redirect.
> > >
> > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> >
> > Not sure if anyone actually cares about ifb devices, but my worry is that the
> > generic XDP vs tc interaction has been as-is for quite some time so this change
> > in behavior could break in the wild.  

No, I believe this happened as recently as kernel v5.2, when Stephen
Hemminger changed this in commit 458bf2f224f0 ("net: core: support XDP
generic on stacked devices.").  And for the record I think that
patch/change was a mistake, as people should not use generic-XDP for
these kinds of stacked devices (they should really use TC-BPF as that is
the right tool for the job).


> I'm not at all sure of the interactions/implications here.
> But I do have a request to enable ifb on Android for ingress rate
> limiting and separately we're trying to make XDP work...
> So we might at some point end up with cellular interfaces with xdp
> ebpf (redirect for forwarding/nat/tethering) + ifb + tc ebpf (for
> device local stuff).

I was very surprised when I discovered tc-redirect didn't work
with the ifb driver.  And it sounds like you have an actual use-case for
this on Android.

> But this is still all very vague and 'ideas only' level.
> (and in general I think I'd like to get rid of the redirect in tc
> ebpf, and leave only xlat64 translation for to-the-device traffic in
> there, so maybe there's no problem anyway??)

I know it sounds strange coming from me "Mr.XDP", but I actually think
that in many cases you will be better off with using TC-BPF.
Especially on Android, as it will be very hard to get native-XDP
implemented in all these different drivers. (And you don't want to use
generic-XDP, because there is a high chance it causes a reallocation of
the SKB, which is a huge performance hit).



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-10 10:44       ` Jesper Dangaard Brouer
@ 2020-10-10 16:32         ` Jakub Kicinski
  2020-10-10 23:52           ` John Fastabend
  2020-10-13 20:40           ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 32+ messages in thread
From: Jakub Kicinski @ 2020-10-10 16:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	maze, lmb, shaun, Lorenzo Bianconi, marek, eyal.birger

On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote:
> > > > We will not be sprinkling validation checks across the drivers because
> > > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > > hard to do something right with BPF.      
> > > 
> > > This is a driver bug then. As it stands today drivers may get hit with
> > > skb with MTU greater than set MTU as best I can tell.    
> > 
> > You're talking about taking it from "maybe this can happen, but will
> > still be at most jumbo" to "it's going to be very easy to trigger and
> > length may be > MAX_U16".  
> 
> It is interesting that a misbehaving BPF program can easily trigger this.
> Next week, I will looking writing such a BPF-prog and then test it on
> the hardware I have avail in my testlab.

FWIW I took a quick swing at testing it with the HW I have and it did
exactly what hardware should do. The TX unit entered an error state 
and then the driver detected that and reset it a few seconds later.

Hardware is almost always designed to behave like that. If some NIC
actually cleanly drops over sized TX frames, I'd bet it's done in FW,
or some other software piece.

There was also a statement earlier in the thread that we can put a large
frame on the wire and "let the switch drop it". I don't believe
that's possible either (as I mentioned previously BPF could generate
frames above jumbo size). My phy knowledge is very rudimentary and
rusty but from what I heard Ethernet PHYs have a hard design limit on
the length of a frame they can put on a wire (or pull from it), for
reasons of symbol encoding, electrical charges on the wire etc. There
needs to be a bunch of idle symbols every now and then. And obviously
if one actually manages to get a longer frame to the PHY it will fault,
see above.

> > > Generally I expect drivers use MTU to configure RX buffers not sure
> > > how it is going to be used on TX side? Any examples? I just poked
> > > around through the driver source to see and seems to confirm its
> > > primarily for RX side configuration with some drivers throwing the
> > > event down to the firmware for something that I can't see in the code?    
> > 
> > Right, but that could just be because nobody expects to get over sized
> > frames from the stack.
> > 
> > We actively encourage drivers to remove paranoid checks. It's really
> > not going to be a great experience for driver authors where they need
> > to consult a list of things they should and shouldn't check.
> > 
> > If we want to do this, the driver interface must most definitely say 
> > MRU and not MTU.  
> 
> What is MRU?

Max Receive Unit, Jesse and others have been talking about how we 
should separate the TX config from RX config for drivers. Right now
drivers configure RX filters based on the max transmission unit, 
which is weird, and nobody is sure whether that's actually desired.

> > > I'm not suggestiong sprinkling validation checks across the drivers.
> > > I'm suggesting if the drivers hang we fix them.    
> > 
> > We both know the level of testing drivers get, it's unlikely this will
> > be validated. It's packet of death waiting to happen. 
> > 
> > And all this for what? Saving 2 cycles on a branch that will almost
> > never be taken?  
> 
> I do think it is risky not to do this simple MTU check in net-core.  I
> also believe the overhead is very very low.  Hint, I'm basically just
> moving the MTU check from one place to another.  (And last patch in
> patchset is an optimization that inlines and save cycles when doing
> these kind of MTU checks).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-10 16:32         ` Jakub Kicinski
@ 2020-10-10 23:52           ` John Fastabend
  2020-10-11 23:30             ` Jakub Kicinski
  2020-10-13 20:40           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 32+ messages in thread
From: John Fastabend @ 2020-10-10 23:52 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: John Fastabend, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	maze, lmb, shaun, Lorenzo Bianconi, marek, eyal.birger

Jakub Kicinski wrote:
> On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote:
> > > > > We will not be sprinkling validation checks across the drivers because
> > > > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > > > hard to do something right with BPF.      
> > > > 
> > > > This is a driver bug then. As it stands today drivers may get hit with
> > > > skb with MTU greater than set MTU as best I can tell.    
> > > 
> > > You're talking about taking it from "maybe this can happen, but will
> > > still be at most jumbo" to "it's going to be very easy to trigger and
> > > length may be > MAX_U16".  
> > 
> > It is interesting that a misbehaving BPF program can easily trigger this.
> > Next week, I will looking writing such a BPF-prog and then test it on
> > the hardware I have avail in my testlab.
> 
> FWIW I took a quick swing at testing it with the HW I have and it did
> exactly what hardware should do. The TX unit entered an error state 
> and then the driver detected that and reset it a few seconds later.

This seems like the right thing to do in my opinion. If the
stack gives the NIC garbage, entering an error state and resetting
sounds expected. Thanks for actually trying it by the way.

We might have come to different conclusions though. From my side
the conclusion is: good, nothing horrible happened, no MTU check needed.
If the user spews garbage at the nic from the BPF program, great, it
gets dropped and causes the driver/nic to punish you a bit by staying
hung. Fix your BPF program.

Now if the nic hangs and doesn't ever come back I would care. But,
we have watchdog logic for this.

I don't really feel like we need to guard bad BPF programs from
doing dumb things, setting MTU in this case, but other things might
be nested vlans that won't fly, overwriting checksums, corrupting
mac headers, etc.

> 
> Hardware is almost always designed to behave like that. If some NIC
> actually cleanly drops over sized TX frames, I'd bet it's done in FW,
> or some other software piece.

Agree.

> 
> There was also a statement earlier in the thread that we can put a large
> frame on the wire and "let the switch drop it". I don't believe
> that's possible either (as I mentioned previously BPF could generate
> frames above jumbo size). My phy knowledge is very rudimentary and

I think that was something I said. What I meant is that if the hardware
sent a jumbo frame to a switch with a 1500 MRU set, I would expect
the receiver to drop it. On the hardware side I would guess the
error is that it doesn't fit in the receive buffer. I think if you sent
a very large frame, something much larger than 9k (without TSO), the
sender itself will hang or abort and reset just like above.

From what I've seen mostly the maximum receive frame size mirrors
the MTU because no one has an explicit MRU to configure.

> rusty but from what I heard Ethernet PHYs have a hard design limit on
> the length of a frame they can put of a wire (or pull from it), because
> of symbol encoding, electrical charges on the wire etc. reasons. There
> needs to be a bunch of idle symbols every now and then. And obviously
> if one actually manages to get a longer frame to the PHY it will fault,
> see above.

Yes, I've seen this before on some hardware.

> 
> > > > Generally I expect drivers use MTU to configure RX buffers not sure
> > > > how it is going to be used on TX side? Any examples? I just poked
> > > > around through the driver source to see and seems to confirm its
> > > > primarily for RX side configuration with some drivers throwing the
> > > > event down to the firmware for something that I can't see in the code?    
> > > 
> > > Right, but that could just be because nobody expects to get over sized
> > > frames from the stack.
> > > 
> > > We actively encourage drivers to remove paranoid checks. It's really
> > > not going to be a great experience for driver authors where they need
> > > to consult a list of things they should and shouldn't check.
> > > 
> > > If we want to do this, the driver interface must most definitely say 
> > > MRU and not MTU.  
> > 
> > What is MRU?
> 
> Max Receive Unit, Jesse and others have been talking about how we 
> should separate the TX config from RX config for drivers. Right now
> drivers configure RX filters based on the max transmission unit, 
> which is weird, and nobody is sure whether that's actually desired.

Agree. But, it's a reasonable default I think. An explicit MRU would
be a nice addition.

> 
> > > > I'm not suggestiong sprinkling validation checks across the drivers.
> > > > I'm suggesting if the drivers hang we fix them.    
> > > 
> > > We both know the level of testing drivers get, it's unlikely this will
> > > be validated. It's packet of death waiting to happen. 
> > > 

We could write some selftests for driver writers to run? I think any
selftests we could provide would be welcome.

 ./test_bpf_driver eth0
  Test large frame
  Test small frame
  Test corrupted checksum
  ...

> > > And all this for what? Saving 2 cycles on a branch that will almost
> > > never be taken?  

2 cycles here and 2 cycles there .... plus complexity to think about
it. Eventually it all adds up. At the risk of entering bike shedding
territory maybe.

> > 
> > I do think it is risky not to do this simple MTU check in net-core.  I
> > also believe the overhead is very very low.  Hint, I'm basically just
> > moving the MTU check from one place to another.  (And last patch in
> > patchset is an optimization that inlines and save cycles when doing
> > these kind of MTU checks).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-10 23:52           ` John Fastabend
@ 2020-10-11 23:30             ` Jakub Kicinski
  0 siblings, 0 replies; 32+ messages in thread
From: Jakub Kicinski @ 2020-10-11 23:30 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, bpf, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	eyal.birger

On Sat, 10 Oct 2020 16:52:48 -0700 John Fastabend wrote:
> Jakub Kicinski wrote:
> > FWIW I took a quick swing at testing it with the HW I have and it did
> > exactly what hardware should do. The TX unit entered an error state 
> > and then the driver detected that and reset it a few seconds later.  
> 
> This seems like the right thing to do in my opinion. If the
> stack gives the NIC garbage entering error state and reset
> sounds expected. Thanks for actually trying it by the way.
> 
> We might have come to different conclusions though from my side
> the conclusion is, good nothing horrible happened no MTU check needed.
> If the user spews garbage at the nic from the BPF program great it
> gets dropped and causes the driver/nic to punish you a bit by staying
> hung. Fix your BPF program.

Right, probably a difference of perspective. I understand that from
Cilium's POV you can probably feel pretty confident about the BPF
programs that are running. I bet Maciej is even more confident with
Android!

But in principle BPF was supposed to make the kernel end-user
programmable. We have to ensure it's safe.

> > > > And all this for what? Saving 2 cycles on a branch that will almost
> > > > never be taken?    
> 
> 2 cycles here and 2 cycles there .... plus complexity to think about
> it. Eventually it all adds up. At the risk of entering bike shedding
> territory maybe.

Not sure it's bikeshedding territory, but I doubt you want to be
making either the complexity or the performance argument to a fellow
TLS maintainer.. cough cough.. ;)


* Re: [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking
  2020-10-08 14:09 ` [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
  2020-10-09 23:29   ` Maciej Żenczykowski
@ 2020-10-12 15:54   ` Lorenz Bauer
  1 sibling, 0 replies; 32+ messages in thread
From: Lorenz Bauer @ 2020-10-12 15:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Networking, Daniel Borkmann, Alexei Starovoitov,
	Maciej Żenczykowski, Shaun Crampton, Lorenzo Bianconi,
	Marek Majkowski, John Fastabend, Jakub Kicinski, eyal.birger

On Thu, 8 Oct 2020 at 16:09, Jesper Dangaard Brouer <brouer@redhat.com> wrote:

...

> + *             The *flags* argument can be a combination of one or more of the
> + *             following values:
> + *
> + *              **BPF_MTU_CHK_RELAX**
> + *                     This flag relax or increase the MTU with room for one
> + *                     VLAN header (4 bytes) and take into account net device
> + *                     hard_header_len.  This relaxation is also used by the
> + *                     kernels own forwarding MTU checks.
> + *
> + *             **BPF_MTU_CHK_GSO**
> + *                     This flag will only works for *ctx* **struct sk_buff**.
> + *                     If packet context contains extra packet segment buffers
> + *                     (often knows as frags), then those are also checked
> + *                     against the MTU size.

Maybe this is a documentation issue, but how / when am I expected to
use these flags? I'm really ignorant when it comes to GSO, but could
BPF_MTU_CHK_GSO be implied when the skb is using GSO?

> + *
> + *             The *mtu_result* pointer contains the MTU value of the net
> + *             device including the L2 header size (usually 14 bytes Ethernet
> + *             header). The net device configured MTU is the L3 size, but as
> + *             XDP and TX length operate at L2 this helper include L2 header
> + *             size in reported MTU.

What does mtu_result represent in the GSO case? I can imagine there
being some funky interactions between skb->len and the return value,
depending on how this is defined.
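
For the plain (non-GSO) case, the value described in the quoted text can be
worked out by hand. Below is a minimal userspace sketch in C, assuming the
lookup simply adds the device hard_header_len to the configured L3 MTU and,
with the relax flag, room for one VLAN header. The SKETCH_* names and
sketch_l2_mtu() are made up for illustration and are not part of the patch;
the GSO case asked about above is not modelled here.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Illustrative stand-ins, not kernel definitions. */
#define SKETCH_ETH_HLEN       14u        /* Ethernet header size in bytes */
#define SKETCH_VLAN_HLEN       4u        /* one VLAN tag in bytes */
#define SKETCH_MTU_CHK_RELAX  (1u << 0)  /* stand-in for BPF_MTU_CHK_RELAX */

/* Return the L2-level MTU: the configured L3 MTU plus the L2 header,
 * optionally relaxed by room for one VLAN header.
 */
static unsigned int sketch_l2_mtu(unsigned int dev_mtu,
				  unsigned int hard_header_len,
				  uint64_t flags)
{
	unsigned int mtu;

	/* Guard against unsigned overflow in this sketch. */
	assert(dev_mtu <= UINT_MAX - hard_header_len - SKETCH_VLAN_HLEN);

	mtu = dev_mtu + hard_header_len;
	if (flags & SKETCH_MTU_CHK_RELAX)
		mtu += SKETCH_VLAN_HLEN;

	return mtu;
}
```

So for an Ethernet device configured with MTU 1500, the reported value under
these assumptions would be 1514, or 1518 with the relax flag set.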

> + *
> + *     Return
> + *             * 0 on success, and populate MTU value in *mtu_result* pointer.
> + *
> + *             * < 0 if any input argument is invalid (*mtu_result* not updated)
> + *
> + *             MTU violations return positive values, but also populate MTU
> + *             value in *mtu_result* pointer, as this can be needed for
> + *             implemeting PMTU handing:
> + *
> + *             * **BPF_MTU_CHK_RET_FRAG_NEEDED**
> + *             * **BPF_MTU_CHK_RET_GSO_TOOBIG**
> + *

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs
  2020-10-10 11:09       ` Jesper Dangaard Brouer
@ 2020-10-12 21:04         ` Maciej Żenczykowski
  0 siblings, 0 replies; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-12 21:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, John Fastabend,
	Jakub Kicinski, Eyal Birger

On Sat, Oct 10, 2020 at 4:09 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Fri, 9 Oct 2020 11:33:33 -0700
> Maciej Żenczykowski <maze@google.com> wrote:
>
> > > > This change makes it possible to identify SKBs that have been redirected
> > > > by TC-BPF (cls_act). This is needed for a number of cases.
> > > >
> > > > (1) For collaborating with driver ifb net_devices.
> > > > (2) For avoiding starting generic-XDP prog on TC ingress redirect.
> > > >
> > > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > >
> > > Not sure if anyone actually cares about ifb devices, but my worry is that the
> > > generic XDP vs tc interaction has been as-is for quite some time so this change
> > > in behavior could break in the wild.
>
> No, I believe this happened as recent at kernel v5.2, when Stephen
> Hemminger changed this in commit 458bf2f224f0 ("net: core: support XDP
> generic on stacked devices.").  And for the record I think that
> patch/change was a mistake, as people should not use generic-XDP for
> these kind of stacked devices (they should really use TC-BPF as that is
> the right tool for the job).
>
>
> > I'm not at all sure of the interactions/implications here.
> > But I do have a request to enable ifb on Android for ingress rate
> > limiting and separately we're trying to make XDP work...
> > So we might at some point end up with cellular interfaces with xdp
> > ebpf (redirect for forwarding/nat/tethering) + ifb + tc ebpf (for
> > device local stuff).
>
> To me I was very surprised when I discovered tc-redirect didn't work
> with ifb driver.  And it sounds like you have an actual use-case for
> this on Android.
>
> > But this is still all very vague and 'ideas only' level.
> > (and in general I think I'd like to get rid of the redirect in tc
> > ebpf, and leave only xlat64 translation for to-the-device traffic in
> > there, so maybe there's no problem anyway??)
>
> I know it sounds strange coming from me "Mr.XDP", but I actually think
> that in many cases you will be better off with using TC-BPF.
> Especially on Android, as it will be very hard to get native-XDP
> implemented in all these different drivers. (And you don't want to use
> generic-XDP, because there is a high chance it causes a reallocation of
> the SKB, which is a huge performance hit).

We want the benefits of not allocating/zeroing skb metadata.
We probably can't (always) do zerocopy...

But let's list what we have on at least 1 sample device:
(a) cellular interface receives, no LRO, into skb, no build_skb
so each packet is <= mtu and requires meta alloc, meta zero, payload alloc
on some devices, payload is copied because nic does not receive into
all of system RAM, just SWMMIO style into a small ~60MB buffer.
(b) GRO happens
(c) TC BPF with redirect or routing/forwarding/iptables
(d) GSO happens, because there is no TSO in the NCM usb driver
(e) NCM driver copies payload, discards skb.
[and it allocates around 1 more skb per 16KB]

So we basically have at least 2 allocs and 2 payload copies per <=1500 byte
packet (and cellular MTUs are likely closer to 1280 than 1500).

There is lots of room for improvement: GRO/GSO are probably a net loss
(unclear), and then there are all those allocations/copies.
If I can get XDP to eliminate the skb meta allocation and the fast
path payload copy in the cellular driver (so we only have the copy from
xdp frame into skb), then it's already a huge win: we're down to 1 copy,
in the NCM driver.
NCM could technically not require a copy with USB controller SG, but
the current demo patches for that are not a win.
(Most likely the usb controller is crappy, but lots of work left.)
If forwarding/tethering goes through XDP redirect, then I also win due
to no GRO/GSO on that path
(at least I think so).


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-10 16:32         ` Jakub Kicinski
  2020-10-10 23:52           ` John Fastabend
@ 2020-10-13 20:40           ` Jesper Dangaard Brouer
  2020-10-13 23:07             ` Jakub Kicinski
  1 sibling, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-13 20:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: John Fastabend, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	maze, lmb, shaun, Lorenzo Bianconi, marek, eyal.birger, brouer

On Sat, 10 Oct 2020 09:32:12 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote:
> > > > > We will not be sprinkling validation checks across the drivers because
> > > > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > > > hard to do something right with BPF.        
> > > > 
> > > > This is a driver bug then. As it stands today drivers may get hit with
> > > > skb with MTU greater than set MTU as best I can tell.      
> > > 
> > > You're talking about taking it from "maybe this can happen, but will
> > > still be at most jumbo" to "it's going to be very easy to trigger and
> > > length may be > MAX_U16".    
> > 
> > It is interesting that a misbehaving BPF program can easily trigger this.
> > Next week, I will looking writing such a BPF-prog and then test it on
> > the hardware I have avail in my testlab.  

I've tested sending different packet sizes that exceed the MTU on
different hardware. They all silently drop the transmitted packet. mlx5
and i40e, configured with an (L3) MTU of 1500, will let through up to 1504,
while ixgbe will drop size 1504.

Packets can be observed locally with tcpdump, but the other end doesn't
receive the packet. I didn't find any counters (including ethtool -S)
indicating these packets were dropped at the hardware/firmware level, which
is a little concerning for later troubleshooting.

Another observation is that size increases (with bpf_skb_adjust_room)
above 4096 + e.g. 128 will likely fail, even though I have the 64K limit in
this kernel.
 
> FWIW I took a quick swing at testing it with the HW I have and it did
> exactly what hardware should do. The TX unit entered an error state 
> and then the driver detected that and reset it a few seconds later.

The drivers (i40e, mlx5, ixgbe) I tested with didn't enter an error
state when getting packets exceeding the MTU.  I didn't go much above
4K, so maybe I didn't trigger those cases.
 
> Hardware is almost always designed to behave like that. If some NIC
> actually cleanly drops over sized TX frames, I'd bet it's done in FW,
> or some other software piece.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-13 20:40           ` Jesper Dangaard Brouer
@ 2020-10-13 23:07             ` Jakub Kicinski
  2020-10-13 23:37               ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Jakub Kicinski @ 2020-10-13 23:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, bpf, netdev, Daniel Borkmann, Alexei Starovoitov,
	maze, lmb, shaun, Lorenzo Bianconi, marek, eyal.birger

On Tue, 13 Oct 2020 22:40:09 +0200 Jesper Dangaard Brouer wrote:
> > FWIW I took a quick swing at testing it with the HW I have and it did
> > exactly what hardware should do. The TX unit entered an error state 
> > and then the driver detected that and reset it a few seconds later.  
> 
> The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error
> state, when getting packets exceeding the MTU.  I didn't go much above
> 4K, so maybe I didn't trigger those cases.

You probably need to go above 16k to get out of the acceptable jumbo
frame size. I tested ixgbe by converting TSO frames to large TCP frames,
at low probability.


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-13 23:07             ` Jakub Kicinski
@ 2020-10-13 23:37               ` Alexei Starovoitov
  2020-10-13 23:54                 ` Maciej Żenczykowski
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2020-10-13 23:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, John Fastabend, bpf, netdev,
	Daniel Borkmann, maze, lmb, shaun, Lorenzo Bianconi, marek,
	eyal.birger

On Tue, Oct 13, 2020 at 04:07:26PM -0700, Jakub Kicinski wrote:
> On Tue, 13 Oct 2020 22:40:09 +0200 Jesper Dangaard Brouer wrote:
> > > FWIW I took a quick swing at testing it with the HW I have and it did
> > > exactly what hardware should do. The TX unit entered an error state 
> > > and then the driver detected that and reset it a few seconds later.  
> > 
> > The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error
> > state, when getting packets exceeding the MTU.  I didn't go much above
> > 4K, so maybe I didn't trigger those cases.
> 
> You probably need to go above 16k to get out of the acceptable jumbo
> frame size. I tested ixgbe by converting TSO frames to large TCP frames,
> at low probability.

how about we set __bpf_skb_max_len() to jumbo like 8k and be done with it.

I guess some badly written driver/fw may still hang with <= 8k skb
that bpf redirected from one netdev with mtu=jumbo to another
netdev with mtu=1500, but then it's really a job of the driver/fw
to deal with it cleanly.

I think checking skb->tx_dev->mtu for every xmited packet is not great.
For a typical load balancer it would be good to have MRU 1500 and MTU 15xx,
especially if it's internet facing, just to drop all known big
packets in hw via the MRU check.
But the stack doesn't have MRU vs MTU distinction and XDP_TX doesn't
adhere to MTU. xdp_data_hard_end is the limit.
So xdp already allows growing the packet beyond MTU.
I think upgrading artificial limit in __bpf_skb_max_len() to 8k will
keep it safe enough for all practical cases and will avoid unnecessary
checks and complexity in xmit path.


* Re: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling
  2020-10-13 23:37               ` Alexei Starovoitov
@ 2020-10-13 23:54                 ` Maciej Żenczykowski
  0 siblings, 0 replies; 32+ messages in thread
From: Maciej Żenczykowski @ 2020-10-13 23:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend, bpf,
	Linux NetDev, Daniel Borkmann, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, Eyal Birger

> how about we set __bpf_skb_max_len() to jumbo like 8k and be done with it.

8k is still far too small.  A lot of places do 9K or 16K jumbo frames.
You'd need at least a full 16K for it to be real jumbo compatible.

That said, if we're ever willing to ignore device mtu, then I see no
reason why an 8K or 16K or 32K limit is any better than 64K.
(which is at least max IP packet size compatible [let's ignore ipv6
jumbograms as not realistic])

If something in the firmware/driver fails at 64K it'll probably fail
at 8K as well.
Since the 'bad' hardware is most likely old and only ~1500 (or 1
pagesize) capable anyway...

In practice driver limitations may be more about the number of pages
or sg sections than about the max packet size anyway...
so failures may depend on individual skb layout...

And as a reminder there are interfaces (like lo) that default to 64K mtu.
(and I have veth setups with 64K mtu as well)

btw. our GCE folks tell us they occasionally see (and now discard)
>mtu packets from Linux VMs (using the virtio-net driver),
we've not had time to debug this (the VMs in question have some pretty
funky routing and for privacy reason I've not been able to get actual
dumps of the problematic frames), but gut feeling is >mtu packets
occasionally leak into the drivers (probably from the tcp stack).

> I guess some badly written driver/fw may still hang with <= 8k skb
> that bpf redirected from one netdev with mtu=jumbo to another
> netdev with mtu=1500, but then it's really a job of the driver/fw
> to deal with it cleanly.
>
> I think checking skb->tx_dev->mtu for every xmited packet is not great.
> For typical load balancer it would be good to have MRU 1500 and MTU 15xx.
> Especially if it's internet facing. Just to drop all known big
> packets in hw via MRU check.
> But the stack doesn't have MRU vs MTU distinction and XDP_TX doesn't
> adhere to MTU. xdp_data_hard_end is the limit.
> So xdp already allows growing the packet beyond MTU.
> I think upgrading artificial limit in __bpf_skb_max_len() to 8k will
> keep it safe enough for all practical cases and will avoid unnecessary
> checks and complexity in xmit path.


* Re: [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking
  2020-10-09 23:29   ` Maciej Żenczykowski
@ 2020-10-21 11:32     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-21 11:32 UTC (permalink / raw)
  To: Maciej Żenczykowski, Eyal Birger, brouer
  Cc: bpf, Linux NetDev, Daniel Borkmann, Alexei Starovoitov,
	Lorenz Bauer, Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	John Fastabend, Jakub Kicinski

On Fri, 9 Oct 2020 16:29:46 -0700
Maciej Żenczykowski <maze@google.com> wrote:

> On Thu, Oct 8, 2020 at 7:09 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > This BPF-helper bpf_mtu_check() works for both XDP and TC-BPF programs.  
> 
> bpf_check_mtu() seems a better name.

Okay, we can rename it. I will go through the patch and change the name
of all the functions (so it resembles the helper name).


> >
> > The API is designed to help the BPF-programmer, that want to do packet
> > context size changes, which involves other helpers. These other helpers
> > usually does a delta size adjustment. This helper also support a delta
> > size (len_diff), which allow BPF-programmer to reuse arguments needed by
> > these other helpers, and perform the MTU check prior to doing any actual
> > size adjustment of the packet context.
> >
> > V3: Take L2/ETH_HLEN header size into account and document it.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/uapi/linux/bpf.h       |   63 +++++++++++++++++++++
> >  net/core/filter.c              |  119 ++++++++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |   63 +++++++++++++++++++++
> >  3 files changed, 245 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4a46a1de6d16..1dcf5d8195f4 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3718,6 +3718,56 @@ union bpf_attr {
> >   *             never return NULL.
> >   *     Return
> >   *             A pointer pointing to the kernel percpu variable on this cpu.
> > + *
> > + * int bpf_mtu_check(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
> > + *     Description
> > + *             Check ctx packet size against MTU of net device (based on
> > + *             *ifindex*).  This helper will likely be used in combination with
> > + *             helpers that adjust/change the packet size.  The argument
> > + *             *len_diff* can be used for querying with a planned size
> > + *             change. This allows to check MTU prior to changing packet ctx.
> > + *
> > + *             The Linux kernel route table can configure MTUs on a more
> > + *             specific per route level, which is not provided by this helper.
> > + *             For route level MTU checks use the **bpf_fib_lookup**\ ()
> > + *             helper.
> > + *
> > + *             *ctx* is either **struct xdp_md** for XDP programs or
> > + *             **struct sk_buff** for tc cls_act programs.
> > + *
> > + *             The *flags* argument can be a combination of one or more of the
> > + *             following values:
> > + *
> > + *              **BPF_MTU_CHK_RELAX**
> > + *                     This flag relax or increase the MTU with room for one
> > + *                     VLAN header (4 bytes) and take into account net device
> > + *                     hard_header_len.  This relaxation is also used by the
> > + *                     kernels own forwarding MTU checks.
> > + *
> > + *             **BPF_MTU_CHK_GSO**
> > + *                     This flag will only works for *ctx* **struct sk_buff**.
> > + *                     If packet context contains extra packet segment buffers
> > + *                     (often knows as frags), then those are also checked
> > + *                     against the MTU size.  
> 
> naming is weird... what does GSO have to do with frags?
> Aren't these orthogonal things?

They are connected implementation-wise. The name "frags" comes from the
implementation detail that GSO segments use "frags", but looking at the
implementation details, it does seem like GSO segments actually use the
member 'frag_list' (in struct skb_shared_info).  I actually hate the
name/term "frags" as it is very confusing to talk/write about, and
people usually talk past each other (e.g. frags vs frag_list, and the
general concept of packet fragments).

I think I will rename BPF_MTU_CHK_GSO to BPF_MTU_CHK_SEGMENTS.  I want
a more general flag name, as I also want Lorenzo to use this for
checking XDP multi-buffer segments.


> > + *
> > + *             The *mtu_result* pointer contains the MTU value of the net
> > + *             device including the L2 header size (usually 14 bytes Ethernet
> > + *             header). The net device configured MTU is the L3 size, but as
> > + *             XDP and TX length operate at L2 this helper include L2 header
> > + *             size in reported MTU.
> > + *
> > + *     Return
> > + *             * 0 on success, and populate MTU value in *mtu_result* pointer.
> > + *
> > + *             * < 0 if any input argument is invalid (*mtu_result* not updated)  
> 
> not -EINVAL?

Yes, also -EINVAL.

> > + *
> > + *             MTU violations return positive values, but also populate MTU
> > + *             value in *mtu_result* pointer, as this can be needed for
> > + *             implemeting PMTU handing:  
> implementing

Fixed

> > + *
> > + *             * **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > + *             * **BPF_MTU_CHK_RET_GSO_TOOBIG**
> > + *
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -3875,6 +3925,7 @@ union bpf_attr {
> >         FN(redirect_neigh),             \
> >         FN(bpf_per_cpu_ptr),            \
> >         FN(bpf_this_cpu_ptr),           \
> > +       FN(mtu_check),                  \
> >         /* */
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > @@ -4889,6 +4940,18 @@ struct bpf_fib_lookup {
> >         __u8    dmac[6];     /* ETH_ALEN */
> >  };
> >
> > +/* bpf_mtu_check flags*/
> > +enum  bpf_mtu_check_flags {
> > +       BPF_MTU_CHK_RELAX = (1U << 0),
> > +       BPF_MTU_CHK_GSO   = (1U << 1),
> > +};
> > +
> > +enum bpf_mtu_check_ret {
> > +       BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> > +       BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> > +       BPF_MTU_CHK_RET_GSO_TOOBIG,   /* GSO re-segmentation needed to fwd */
> > +};
> > +
> >  enum bpf_task_fd_type {
> >         BPF_FD_TYPE_RAW_TRACEPOINT,     /* tp name */
> >         BPF_FD_TYPE_TRACEPOINT,         /* tp name */
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index da74d6ddc4d7..5986156e700e 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5513,6 +5513,121 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
> >         .arg4_type      = ARG_ANYTHING,
> >  };
> >
> > +static int bpf_mtu_lookup(struct net *netns, u32 ifindex, u64 flags)  
> 
> bpf_lookup_mtu() ???

Sure, I can rename this but this is a helper function (not exported).
 
> > +{
> > +       struct net_device *dev;
> > +       int mtu;
> > +
> > +       dev = dev_get_by_index_rcu(netns, ifindex);  
> 
> my understanding is this is a bit of a perf hit, maybe ifindex 0 means
> use skb->dev ???

Might be a good idea.

> or have bpf_lookup_mtu(skb) function as well?

No, you can easily give parameters to bpf_check_mtu() that give you
lookup functionality; there is no need to create a second helper call.

> 
> > +       if (!dev)
> > +               return -ENODEV;
> > +
> > +       /* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
> > +       mtu = dev->mtu + dev->hard_header_len;
> > +
> > +       /*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> > +       if (flags & BPF_MTU_CHK_RELAX)  
> 
> could this check device vlan tx offload state instead?
> 
> > +               mtu += VLAN_HLEN;
> > +
> > +       return mtu;
> > +}
> > +
> > +static unsigned int __bpf_len_adjust_positive(unsigned int len, int len_diff)
> > +{
> > +       int len_new = len + len_diff; /* notice len_diff can be negative */
> > +
> > +       if (len_new > 0)
> > +               return len_new;
> > +
> > +       return 0;  
> 
> not return len ?

I prefer returning 0 here, but return len would also be okay for the
borderline case/error that I want to handle.

> 
> oh I see the function doesn't do what the name implies...

Okay, suggestions for a better name?

> nor sure this func is helpful... why not simply
> int len_new = (int)len + (int)len_diff; 

(you do write int len_new, but I assume we want unsigned int len_new)

I don't like this approach, as a shrink that causes a negative value will
be turned into a very large value, which will fail the MTU check.

I'm actually trying to anticipate/help the BPF-programmer.  I can easily
imagine a BPF-prog that pops a VXLAN header, so the programmer always calls
bpf_check_mtu with len_diff and drops packets that exceed the MTU, but
a small packet that goes negative suddenly gets dropped with your
approach.  Thus, we force the BPF-prog to do more checks before using our
BPF-helper, which I would like to avoid.
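
To make the difference concrete, here is a minimal userspace sketch in C of
the two behaviours, assuming a 32-bit unsigned int. The function names are
made up for illustration; the clamped variant follows the idea of
__bpf_len_adjust_positive() from the patch, but does the arithmetic in a
wider signed type, which is my own choice for the sketch.

```c
#include <assert.h>
#include <limits.h>

/* Clamped adjustment: a shrink larger than the packet yields 0. */
static unsigned int sketch_len_adjust_clamped(unsigned int len, int len_diff)
{
	long long len_new = (long long)len + len_diff; /* len_diff may be < 0 */

	/* The grown length must still fit in an unsigned int. */
	assert(len_new <= (long long)UINT_MAX);

	return len_new > 0 ? (unsigned int)len_new : 0;
}

/* Plain unsigned arithmetic: a "negative" result wraps to a huge value. */
static unsigned int sketch_len_adjust_wrapping(unsigned int len, int len_diff)
{
	return len + (unsigned int)len_diff;
}

/* The length comparison both helper variants in the patch perform. */
static int sketch_exceeds_mtu(unsigned int len, unsigned int mtu)
{
	return len > mtu;
}
```

With len = 40, len_diff = -50 and an MTU of 1514, the clamped variant gives 0
and passes the check, while the wrapping variant gives a value close to
UINT_MAX and the packet would be reported as exceeding the MTU.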

> directly down below and check < 0 there?

Because I use this helper function in two functions below.


> >2GB skb->len is meaningless anyway  
> 
> > +}
> > +
> > +BPF_CALL_5(bpf_skb_mtu_check, struct sk_buff *, skb,
> > +          u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
> > +{
> > +       struct net *netns = dev_net(skb->dev);
> > +       int ret = BPF_MTU_CHK_RET_SUCCESS;
> > +       unsigned int len = skb->len;
> > +       int mtu;
> > +
> > +       if (flags & ~(BPF_MTU_CHK_RELAX | BPF_MTU_CHK_GSO))
> > +               return -EINVAL;
> > +
> > +       mtu = bpf_mtu_lookup(netns, ifindex, flags);
> > +       if (unlikely(mtu < 0))
> > +               return mtu; /* errno */
> > +
> > +       len = __bpf_len_adjust_positive(len, len_diff);
> > +       if (len > mtu) {
> > +               ret = BPF_MTU_CHK_RET_FRAG_NEEDED;  
> 
> Can't this fail if skb->len includes the entire packet, and yet gso is
> on, and packet is greater then mtu, yet gso size is smaller?
>
> Think 200 byte gso packet with 2 100 byte segs, and a 150 byte mtu.
> Does gso actually require frags?  [As you can tell I don't have a good
> handle on gso vs frags vs skb->len, maybe what I"m asking is bogus]

Oh oh, does skb->len include the size of GSO segments (the individual
packet segments)? ... argh, yes it does!  So, this *is* a bug, I will
fix.  Thanks for spotting it!

Looking at the code, this is clear, and it also makes more sense now why
people are complaining that an skb can bypass these MTU checks as long
as skb_is_gso(skb) is true.

I could calculate the "first"/"head" packet length by subtracting
skb->data_len (which should contain the len of fragments). Well, I'll
figure out how to solve it in the code.
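
Your 200/100/150 example can be written down as a small userspace model in C.
This is only a sketch under simplifying assumptions: headers are ignored, all
segments have the same size, and struct sketch_pkt plus the sketch_* names are
hypothetical. It is not the fix itself, just one possible shape of a
GSO-aware check to illustrate the bug.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_chk_ret {
	SKETCH_CHK_OK = 0,
	SKETCH_CHK_FRAG_NEEDED,
	SKETCH_CHK_GSO_TOOBIG,
};

/* Simplified packet model; not struct sk_buff. */
struct sketch_pkt {
	unsigned int len;     /* total length, like skb->len: all segments */
	unsigned int seg_len; /* size of each on-wire segment, 0 if not GSO */
};

static bool sketch_is_gso(const struct sketch_pkt *p)
{
	return p->seg_len != 0;
}

/* Check as in the V3 patch: total length is compared first. */
static enum sketch_chk_ret sketch_check_naive(const struct sketch_pkt *p,
					      unsigned int mtu)
{
	return p->len > mtu ? SKETCH_CHK_FRAG_NEEDED : SKETCH_CHK_OK;
}

/* GSO-aware variant: for GSO packets only the segment size matters. */
static enum sketch_chk_ret sketch_check_gso_aware(const struct sketch_pkt *p,
						  unsigned int mtu)
{
	if (sketch_is_gso(p)) {
		/* A segment cannot be larger than the whole packet. */
		assert(p->seg_len <= p->len);
		return p->seg_len > mtu ? SKETCH_CHK_GSO_TOOBIG : SKETCH_CHK_OK;
	}

	return p->len > mtu ? SKETCH_CHK_FRAG_NEEDED : SKETCH_CHK_OK;
}
```

With { .len = 200, .seg_len = 100 } and an MTU of 150, the naive check
reports that fragmentation is needed, even though each 100 byte segment fits.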


> 
> > +               goto out;
> > +       }
> > +
> > +       if (flags & BPF_MTU_CHK_GSO &&
> > +           skb_is_gso(skb) &&
> > +           skb_gso_validate_network_len(skb, mtu)) {
> > +               ret = BPF_MTU_CHK_RET_GSO_TOOBIG;
> > +               goto out;
> > +       }
> > +
> > +out:
> > +       if (mtu_result)
> > +               *mtu_result = mtu;
> > +
> > +       return ret;
> > +}
> > +
> > +BPF_CALL_5(bpf_xdp_mtu_check, struct xdp_buff *, xdp,
> > +          u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
> > +{
> > +       unsigned int len = xdp->data_end - xdp->data;
> > +       struct net_device *dev = xdp->rxq->dev;
> > +       struct net *netns = dev_net(dev);
> > +       int ret = BPF_MTU_CHK_RET_SUCCESS;
> > +       int mtu;
> > +
> > +       /* XDP variant doesn't support multi-buffer segment check (yet) */
> > +       if (flags & ~BPF_MTU_CHK_RELAX)
> > +               return -EINVAL;
> > +
> > +       mtu = bpf_mtu_lookup(netns, ifindex, flags);
> > +       if (unlikely(mtu < 0))
> > +               return mtu; /* errno */
> > +
> > +       len = __bpf_len_adjust_positive(len, len_diff);
> > +       if (len > mtu) {
> > +               ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +               goto out;
> > +       }
> > +out:
> > +       if (mtu_result)
> > +               *mtu_result = mtu;
> > +
> > +       return ret;
> > +}

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



end of thread, other threads:[~2020-10-21 11:32 UTC | newest]

Thread overview: 32+ messages
2020-10-08 14:08 [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2020-10-08 14:09 ` [PATCH bpf-next V3 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2020-10-09 16:12   ` Daniel Borkmann
2020-10-09 18:26     ` Maciej Żenczykowski
2020-10-10 10:25     ` Jesper Dangaard Brouer
2020-10-08 14:09 ` [PATCH bpf-next V3 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2020-10-09  4:05   ` David Ahern
2020-10-08 14:09 ` [PATCH bpf-next V3 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2020-10-09 23:29   ` Maciej Żenczykowski
2020-10-21 11:32     ` Jesper Dangaard Brouer
2020-10-12 15:54   ` Lorenz Bauer
2020-10-08 14:09 ` [PATCH bpf-next V3 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
2020-10-09 16:47   ` Daniel Borkmann
2020-10-09 18:33     ` Maciej Żenczykowski
2020-10-10 11:09       ` Jesper Dangaard Brouer
2020-10-12 21:04         ` Maciej Żenczykowski
2020-10-08 14:09 ` [PATCH bpf-next V3 5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2020-10-09 23:17   ` Maciej Żenczykowski
2020-10-08 14:09 ` [PATCH bpf-next V3 6/6] net: inline and splitup is_skb_forwardable Jesper Dangaard Brouer
2020-10-09 16:33 ` [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling Jakub Kicinski
2020-10-09 20:49   ` John Fastabend
2020-10-09 21:07     ` Alexei Starovoitov
2020-10-09 21:57       ` Maciej Żenczykowski
2020-10-09 23:00     ` Jakub Kicinski
2020-10-10 10:44       ` Jesper Dangaard Brouer
2020-10-10 16:32         ` Jakub Kicinski
2020-10-10 23:52           ` John Fastabend
2020-10-11 23:30             ` Jakub Kicinski
2020-10-13 20:40           ` Jesper Dangaard Brouer
2020-10-13 23:07             ` Jakub Kicinski
2020-10-13 23:37               ` Alexei Starovoitov
2020-10-13 23:54                 ` Maciej Żenczykowski
