bpf.vger.kernel.org archive mirror
* [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling
@ 2020-10-30 16:50 Jesper Dangaard Brouer
  2020-10-30 16:50 ` [PATCH bpf-next V5 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 16:50 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This patchset drops all the MTU checks in TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being
done against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU of a
netdev (via ifindex) and at the FIB route lookup level. This means some
BPF-helpers are added and extended to make it possible to do the MTU
checks in the BPF code itself.

If the BPF-prog doesn't comply with the MTU, then the packet will
eventually get dropped at some other layer. In some cases the existing
kernel MTU checks will drop the packet, but there are also cases where
BPF can bypass these checks. Specifically, doing a TC-redirect from the
ingress step (sch_handle_ingress) into the egress code path (basically
calling dev_queue_xmit()). It is left up to driver code to handle these
kinds of MTU violations.

One advantage of this approach is that an ingress-to-egress BPF-prog
can send information via packet data. With the MTU checks removed in
the helpers, and also not done in the skb_do_redirect() call, an
ingress BPF-prog can communicate with an egress BPF-prog via packet
data, as long as the egress BPF-prog removes this data prior to
transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that
the MTU BPF-helpers also work for XDP BPF-programs.
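
To illustrate the intended usage, below is a rough sketch of a TC-BPF
program that does its own MTU check before redirecting. This assumes
the bpf_check_mtu() helper added in patch 3 is available to the
program; 'egress_ifindex' is only a placeholder for this example.

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  /* Placeholder: planned egress device, expected to be set by loader */
  const volatile int egress_ifindex = 0;

  SEC("classifier")
  int tc_mtu_check(struct __sk_buff *skb)
  {
  	__u32 mtu_len = 0;

  	/* 0 == ok, positive == MTU violation, negative == bad input.
  	 * On violation mtu_len holds the (L2) MTU of the device, which
  	 * can be used for an ICMP "frag needed" reply.
  	 */
  	if (bpf_check_mtu(skb, egress_ifindex, &mtu_len, 0, 0))
  		return TC_ACT_SHOT;

  	return bpf_redirect(egress_ifindex, 0);
  }

  char _license[] SEC("license") = "GPL";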

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
V5: Fix uninit variable + name struct output member mtu_result.

---

Jesper Dangaard Brouer (5):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      bpf: make it possible to identify BPF redirected SKBs


 include/linux/netdevice.h      |   31 +++++++
 include/uapi/linux/bpf.h       |   81 +++++++++++++++++++
 net/core/dev.c                 |   21 +----
 net/core/filter.c              |  168 ++++++++++++++++++++++++++++++++++++----
 net/sched/Kconfig              |    1 
 tools/include/uapi/linux/bpf.h |   81 +++++++++++++++++++
 6 files changed, 342 insertions(+), 41 deletions(-)

--



* [PATCH bpf-next V5 1/5] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
@ 2020-10-30 16:50 ` Jesper Dangaard Brouer
  2020-10-30 16:50 ` [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 16:50 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Multiple BPF-helpers that can manipulate/increase the size of the SKB
use __bpf_skb_max_len() as the max-length. This function limits the
size against the current net_device MTU (skb->dev->mtu).

When a BPF-prog grows the packet size, it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this
packet should be allowed to increase the size. Furthermore, the current
MTU check in __bpf_skb_max_len uses the MTU from the ingress/current
net_device, which in case of redirects is the wrong net_device.

Patch V4 keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real
limit is elsewhere in the system. Jesper's testing[1] showed it was not
possible to exceed 8KiB when expanding the SKB size via BPF-helper. The
limiting factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for
the SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define
is in effect because this code is called from softirq context; see
__gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing
showed that frames above 16KiB can cause NICs to reset (but not crash).
Keep the sanity limit at this level, as the memory layer limit can
differ based on kernel config.

[1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests

V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..1ee97fdeea64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 	return 0;
 }
 
-static u32 __bpf_skb_max_len(const struct sk_buff *skb)
-{
-	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
-			  SKB_MAX_ALLOC;
-}
+#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC
 
 BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 	   u32, mode, u64, flags)
@@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 {
 	u32 len_cur, len_diff_abs = abs(len_diff);
 	u32 len_min = bpf_skb_net_base_len(skb);
-	u32 len_max = __bpf_skb_max_len(skb);
+	u32 len_max = BPF_SKB_MAX_LEN;
 	__be16 proto = skb->protocol;
 	bool shrink = len_diff < 0;
 	u32 off;
@@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 min_len = __bpf_skb_min_len(skb);
 	int ret;
 
@@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = {
 static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 new_len = skb->len + head_room;
 	int ret;
 




* [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2020-10-30 16:50 ` [PATCH bpf-next V5 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-30 16:50 ` Jesper Dangaard Brouer
  2020-10-30 19:40   ` John Fastabend
  2020-10-31 15:52   ` David Ahern
  2020-10-30 16:51 ` [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 16:50 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and
bpf_skb_fib_lookup) can perform an MTU check and return
BPF_FIB_LKUP_RET_FRAG_NEEDED.  The BPF-prog doesn't know the MTU value
that caused this rejection.

If the BPF-prog wants to implement PMTU (Path MTU Discovery, RFC 1191)
it needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to
contain this MTU value as output, via a union with 'tot_len', as this
is the value used for the MTU check.
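
A rough sketch of how a BPF-prog could consume this on the XDP side
(assuming the usual <linux/bpf.h>/<bpf/bpf_helpers.h> includes plus an
AF_INET definition, and that src/dst/pkt_len have already been parsed
from the packet):

  static __always_inline int fib_lookup_get_mtu(struct xdp_md *ctx,
  						__be32 src, __be32 dst,
  						__u16 pkt_len, __u32 *mtu)
  {
  	struct bpf_fib_lookup fib = {};
  	int rc;

  	fib.family   = AF_INET;
  	fib.ifindex  = ctx->ingress_ifindex;
  	fib.tot_len  = pkt_len;		/* input side of the union */
  	fib.ipv4_src = src;
  	fib.ipv4_dst = dst;

  	rc = bpf_fib_lookup(ctx, &fib, sizeof(fib), 0);
  	if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED)
  		*mtu = fib.mtu_result;	/* output side of the union */

  	return rc;
  }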

V5:
 - Fixed uninit value spotted by Dan Carpenter.
 - Name struct output member mtu_result

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   11 +++++++++--
 net/core/filter.c              |   22 +++++++++++++++-------
 tools/include/uapi/linux/bpf.h |   11 +++++++++--
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e6ceac3f7d62..01b2b17c645a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2219,6 +2219,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu_result;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index 1ee97fdeea64..edb543c477b6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5265,12 +5265,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	if (mtu)
+		params->mtu_result = mtu; /* union with tot_len */
 
 	return 0;
 }
@@ -5286,8 +5288,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct net_device *dev;
 	struct fib_result res;
 	struct flowi4 fl4;
+	u32 mtu = 0;
 	int err;
-	u32 mtu;
 
 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -5354,8 +5356,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5389,7 +5393,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5406,7 +5410,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif, err;
-	u32 mtu;
+	u32 mtu = 0;
 
 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -5481,8 +5485,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu_result = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5502,7 +5508,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5571,6 +5577,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
+
+		params->mtu_result = dev->mtu; /* union with tot_len */
 	}
 
 	return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e6ceac3f7d62..01b2b17c645a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2219,6 +2219,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu_result;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */




* [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
  2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2020-10-30 16:50 ` [PATCH bpf-next V5 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
  2020-10-30 16:50 ` [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-30 16:51 ` Jesper Dangaard Brouer
  2020-10-30 20:23   ` John Fastabend
  2020-10-30 16:51 ` [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
  2020-10-30 16:51 ` [PATCH bpf-next V5 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
  4 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 16:51 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The API is designed to help the BPF-programmer that wants to do packet
context size changes, which involves other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse the arguments
needed by these other helpers, and to perform the MTU check prior to
doing any actual size adjustment of the packet context.

It is on purpose that we allow the len adjustment to become a negative
result that will pass the MTU check. This might seem weird, but it's
not this helper's responsibility to "catch" wrong len_diff adjustments.
Other helpers will take care of these checks, if the BPF-programmer
chooses to do the actual size adjustment.
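
A rough sketch of the intended pattern inside a TC-BPF program
(ENCAP_LEN and egress_ifindex are placeholders for this example):

  	__u32 mtu_len = 0;
  	__s32 len_diff = ENCAP_LEN;	/* planned header growth */

  	/* Check MTU of planned egress device *before* growing the skb */
  	if (bpf_check_mtu(skb, egress_ifindex, &mtu_len, len_diff, 0))
  		return TC_ACT_SHOT;	/* mtu_len holds the violated MTU */

  	/* The same len_diff is then reused by the size-adjusting helper */
  	if (bpf_skb_adjust_room(skb, len_diff, BPF_ADJ_ROOM_MAC, 0))
  		return TC_ACT_SHOT;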

V4: Lot of changes
 - ifindex 0 now use current netdev for MTU lookup
 - rename helper from bpf_mtu_check to bpf_check_mtu
 - fix bug for GSO pkt length (as skb->len is total len)
 - remove __bpf_len_adj_positive, simply allow negative len adj

V3: Take L2/ETH_HLEN header size into account and document it.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   70 +++++++++++++++++++++++
 net/core/filter.c              |  120 ++++++++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   70 +++++++++++++++++++++++
 3 files changed, 260 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 01b2b17c645a..8f0fee2df3a6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3745,6 +3745,63 @@ union bpf_attr {
  * 	Return
  * 		The helper returns **TC_ACT_REDIRECT** on success or
  * 		**TC_ACT_SHOT** on error.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *              **BPF_MTU_CHK_RELAX**
+ *			This flag relax or increase the MTU with room for one
+ *			VLAN header (4 bytes). This relaxation is also used by
+ *			the kernels own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is partly
+ *			skipped, because in transmit path it is possible for the
+ *			skb packet to get re-segmented (depending on net device
+ *			features).  This could still be a MTU violation, so this
+ *			flag enables performing MTU check against segments, with
+ *			a different violation return code to tell it apart.
+ *
+ *		The *mtu_result* pointer contains the MTU value of the net
+ *		device including the L2 header size (usually 14 bytes Ethernet
+ *		header). The net device configured MTU is the L3 size, but as
+ *		XDP and TX length operate at L2 this helper include L2 header
+ *		size in reported MTU.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_result* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_result* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3903,6 +3960,7 @@ union bpf_attr {
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
 	FN(redirect_peer),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4927,6 +4985,18 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum  bpf_check_mtu_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_SEGS  = (1U << 1),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index edb543c477b6..bd4a416bd9ad 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5594,6 +5594,122 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
+{
+	struct net *netns = dev_net(dev_curr);
+	struct net_device *dev;
+	int mtu;
+
+	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
+	if (ifindex == 0)
+		dev = dev_curr;
+	else
+		dev = dev_get_by_index_rcu(netns, ifindex);
+
+	if (!dev)
+		return -ENODEV;
+
+	/* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
+	mtu = dev->mtu + dev->hard_header_len;
+
+	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
+	if (flags & BPF_MTU_CHK_RELAX)
+		mtu += VLAN_HLEN;
+
+	return mtu;
+}
+
+BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+	struct net_device *dev = skb->dev;
+	int len = skb->len;
+	int mtu;
+
+	if (flags & ~(BPF_MTU_CHK_RELAX | BPF_MTU_CHK_SEGS))
+		return -EINVAL;
+
+	mtu = __bpf_lookup_mtu(dev, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len += len_diff; /* len_diff can be negative, minus result pass check */
+	if (len <= mtu) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+		goto out;
+	}
+	/* At this point, skb->len exceed MTU, but as it include length of all
+	 * segments, and SKB can get re-segmented in transmit path (see
+	 * validate_xmit_skb), we cannot reject MTU-check for GSO packets.
+	 */
+	if (skb_is_gso(skb)) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+
+		/* SKB could get dropped later due to segs > MTU or lacking
+		 * features, thus allow BPF-prog to validate segs length here.
+		 */
+		if (flags & BPF_MTU_CHK_SEGS &&
+		    skb_gso_validate_network_len(skb, mtu)) {
+			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
+			goto out;
+		}
+	}
+out:
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	struct net_device *dev = xdp->rxq->dev;
+	int len = xdp->data_end - xdp->data;
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	int mtu;
+
+	/* XDP variant doesn't support multi-buffer segment check (yet) */
+	if (flags & ~BPF_MTU_CHK_RELAX)
+		return -EINVAL;
+
+	mtu = __bpf_lookup_mtu(dev, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len += len_diff; /* len_diff can be negative, minus result pass check */
+	if (len > mtu)
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_check_mtu_proto = {
+	.func		= bpf_skb_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_check_mtu_proto = {
+	.func		= bpf_xdp_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
 {
@@ -7159,6 +7275,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_skb_check_mtu_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
 	case BPF_FUNC_sk_storage_get:
@@ -7228,6 +7346,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_xdp_check_mtu_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 01b2b17c645a..8f0fee2df3a6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3745,6 +3745,63 @@ union bpf_attr {
  * 	Return
  * 		The helper returns **TC_ACT_REDIRECT** on success or
  * 		**TC_ACT_SHOT** on error.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows to check MTU prior to changing packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if this isn't
+ *		used prior to redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *              **BPF_MTU_CHK_RELAX**
+ *			This flag relax or increase the MTU with room for one
+ *			VLAN header (4 bytes). This relaxation is also used by
+ *			the kernels own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag will only works for *ctx* **struct sk_buff**.
+ *			If packet context contains extra packet segment buffers
+ *			(often knows as GSO skb), then MTU check is partly
+ *			skipped, because in transmit path it is possible for the
+ *			skb packet to get re-segmented (depending on net device
+ *			features).  This could still be a MTU violation, so this
+ *			flag enables performing MTU check against segments, with
+ *			a different violation return code to tell it apart.
+ *
+ *		The *mtu_result* pointer contains the MTU value of the net
+ *		device including the L2 header size (usually 14 bytes Ethernet
+ *		header). The net device configured MTU is the L3 size, but as
+ *		XDP and TX length operate at L2 this helper include L2 header
+ *		size in reported MTU.
+ *
+ *	Return
+ *		* 0 on success, and populate MTU value in *mtu_result* pointer.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate MTU
+ *		value in *mtu_result* pointer, as this can be needed for
+ *		implementing PMTU handing:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3903,6 +3960,7 @@ union bpf_attr {
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
 	FN(redirect_peer),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4927,6 +4985,18 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags*/
+enum  bpf_check_mtu_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_SEGS  = (1U << 1),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */




* [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2020-10-30 16:51 ` [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-30 16:51 ` Jesper Dangaard Brouer
  2020-10-30 20:36   ` John Fastabend
  2020-10-30 16:51 ` [PATCH bpf-next V5 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
  4 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 16:51 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The use-case for dropping the MTU check when TC-BPF does redirect to
ingress is described by Eyal Birger in email[0]. The summary is the
ability to increase the packet size (e.g. with IPv6 headers for NAT64),
redirect the packet to ingress, and let the normal netstack fragment
the packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
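
A rough sketch of that use-case from the TC-BPF side (dst_ifindex is a
placeholder and the actual header rewriting is omitted):

  	/* Grow the packet; e.g. IPv4->IPv6 translation adds 20 bytes */
  	if (bpf_skb_adjust_room(skb, 20, BPF_ADJ_ROOM_MAC, 0))
  		return TC_ACT_SHOT;

  	/* ... rewrite L3 headers for NAT64 ... */

  	/* Hand the packet back to the stack on the target device; the
  	 * netstack can then fragment it if it exceeds that device's MTU.
  	 */
  	return bpf_redirect(dst_ifindex, BPF_F_INGRESS);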

V4:
 - Keep net_device "up" (IFF_UP) check.
 - Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
 net/core/dev.c            |   19 ++-----------------
 net/core/filter.c         |   14 +++++++++++---
 3 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 964b494b0e8d..bd02ddab8dfe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3891,11 +3891,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
+						 const struct sk_buff *skb,
+						 const bool check_mtu)
+{
+	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+	unsigned int len;
+
+	if (!(dev->flags & IFF_UP))
+		return false;
+
+	if (!check_mtu)
+		return true;
+
+	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+	if (skb->len <= len)
+		return true;
+
+	/* if TSO is enabled, we don't care about the length as the packet
+	 * could be forwarded without being segmented before
+	 */
+	if (skb_is_gso(skb))
+		return true;
+
+	return false;
+}
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool check_mtu)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9499a414d67e..445ccf92c149 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2188,28 +2188,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 
 bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 {
-	unsigned int len;
-
-	if (!(dev->flags & IFF_UP))
-		return false;
-
-	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
-	if (skb->len <= len)
-		return true;
-
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
-	 */
-	if (skb_is_gso(skb))
-		return true;
-
-	return false;
+	return __is_skb_forwardable(dev, skb, true);
 }
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index bd4a416bd9ad..71b78b8d443c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;
@@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			goto out_drop;
 		dev = ops->ndo_get_peer_dev(dev);
 		if (unlikely(!dev ||
-			     !is_skb_forwardable(dev, skb) ||
+			     !__is_skb_forwardable(dev, skb, false) ||
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
 		skb->dev = dev;




* [PATCH bpf-next V5 5/5] bpf: make it possible to identify BPF redirected SKBs
  2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2020-10-30 16:51 ` [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2020-10-30 16:51 ` Jesper Dangaard Brouer
  4 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 16:51 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This change makes it possible to identify SKBs that have been redirected
by TC-BPF (cls_act). This is needed for a number of cases.

(1) For collaborating with driver ifb net_devices.
(2) For avoiding starting generic-XDP prog on TC ingress redirect.

It is most important to fix the XDP case (2), because this can break
userspace when a driver gets support for native-XDP. Imagine userspace
loads an XDP prog on eth0, which falls back to generic-XDP, and it
processes TC-redirected packets. When the kernel is updated with
native-XDP support for eth0, the program no longer sees the
TC-redirected packets. Therefore it is important to keep the order
intact: XDP runs before TC-BPF.
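
As a sketch (not part of this patch, and assuming the existing
skb_is_redirected() accessor), a follow-up could then let generic-XDP
skip such packets along these lines:

  	/* In the generic-XDP entry point: keep the XDP-before-TC
  	 * ordering by not running the XDP prog on TC-redirected SKBs.
  	 */
  	if (skb_is_redirected(skb))
  		return XDP_PASS;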

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/dev.c    |    2 ++
 net/sched/Kconfig |    1 +
 2 files changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 445ccf92c149..930c165a607e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		return NULL;
 	case TC_ACT_REDIRECT:
 		/* No need to push/pop skb's mac_header here on egress! */
+		skb_set_redirected(skb, false);
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
@@ -4959,6 +4960,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		 * redirecting to another netdev
 		 */
 		__skb_push(skb, skb->mac_len);
+		skb_set_redirected(skb, true);
 		if (skb_do_redirect(skb) == -EAGAIN) {
 			__skb_pull(skb, skb->mac_len);
 			*another = true;
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..a1bbaa8fd054 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -384,6 +384,7 @@ config NET_SCH_INGRESS
 	depends on NET_CLS_ACT
 	select NET_INGRESS
 	select NET_EGRESS
+	select NET_REDIRECT
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,




* RE: [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-30 16:50 ` [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-30 19:40   ` John Fastabend
  2020-11-02  9:28     ` Jesper Dangaard Brouer
  2020-10-31 15:52   ` David Ahern
  1 sibling, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-10-30 19:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Jesper Dangaard Brouer wrote:
> The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The BPF-prog
> don't know the MTU value that caused this rejection.
> 
> If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> need to know this MTU value for the ICMP packet.
> 
> Patch change lookup and result struct bpf_fib_lookup, to contain this MTU
> value as output via a union with 'tot_len' as this is the value used for
> the MTU lookup.
> 
> V5:
>  - Fixed uninit value spotted by Dan Carpenter.
>  - Name struct output member mtu_result
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   11 +++++++++--
>  net/core/filter.c              |   22 +++++++++++++++-------
>  tools/include/uapi/linux/bpf.h |   11 +++++++++--
>  3 files changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e6ceac3f7d62..01b2b17c645a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2219,6 +2219,9 @@ union bpf_attr {
>   *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
>   *		  packet is not forwarded or needs assist from full stack
>   *
> + *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
> + *		was exceeded and result params->mtu contains the MTU.
> + *

Do we need to hide this behind a flag? It seems otherwise you might confuse
users. I imagine on error we could reuse the params arg, but now we changed
the tot_len value underneath them?

>   * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
>   *	Description
>   *		Add an entry to, or update a sockhash *map* referencing sockets.
> @@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
>  	__be16	sport;
>  	__be16	dport;


* RE: [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
  2020-10-30 16:51 ` [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-30 20:23   ` John Fastabend
  2020-11-02 11:15     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-10-30 20:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Jesper Dangaard Brouer wrote:
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
> 
> The API is designed to help the BPF-programmer, that want to do packet
> context size changes, which involves other helpers. These other helpers
> usually does a delta size adjustment. This helper also support a delta
> size (len_diff), which allow BPF-programmer to reuse arguments needed by
> these other helpers, and perform the MTU check prior to doing any actual
> size adjustment of the packet context.
> 
> It is on purpose, that we allow the len adjustment to become a negative
> result, that will pass the MTU check. This might seem weird, but it's not
> this helpers responsibility to "catch" wrong len_diff adjustments. Other
> helpers will take care of these checks, if BPF-programmer chooses to do
> actual size adjustment.
> 
> V4: Lot of changes
>  - ifindex 0 now use current netdev for MTU lookup
>  - rename helper from bpf_mtu_check to bpf_check_mtu
>  - fix bug for GSO pkt length (as skb->len is total len)
>  - remove __bpf_len_adj_positive, simply allow negative len adj
> 
> V3: Take L2/ETH_HLEN header size into account and document it.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Sorry for the late feedback here.

This seems like a lot of functionality baked into the helper. Can you
say something about why the simpler and, at least to me, more intuitive
helper that simply returns the ifindex MTU is not ideal?

Rough pseudo code being,

 my_sender(struct __sk_buff *skb, int fwd_ifindex)
 {
   mtu = bpf_get_ifindex_mtu(fwd_ifindex, 0);
   if (skb->len + HDR_SIZE < mtu)
       return send_with_hdrs(skb);
   return -EMSGSIZE
 }


>  include/uapi/linux/bpf.h       |   70 +++++++++++++++++++++++
>  net/core/filter.c              |  120 ++++++++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   70 +++++++++++++++++++++++
>  3 files changed, 260 insertions(+)
> 

[...]

> + *              **BPF_MTU_CHK_RELAX**
> + *			This flag relax or increase the MTU with room for one
> + *			VLAN header (4 bytes). This relaxation is also used by
> + *			the kernels own forwarding MTU checks.

I noted below as well, but I'm not sure why this is needed. It seems
that if the user knows to add a flag because they want a VLAN header,
we can just as easily expect the BPF program to account for it. Also,
it only works for VLAN headers; any other header data won't be
accounted for, so it seems only useful in one specific case.

> + *
> + *		**BPF_MTU_CHK_SEGS**
> + *			This flag will only works for *ctx* **struct sk_buff**.
> + *			If packet context contains extra packet segment buffers
> + *			(often knows as GSO skb), then MTU check is partly
> + *			skipped, because in transmit path it is possible for the
> + *			skb packet to get re-segmented (depending on net device
> + *			features).  This could still be a MTU violation, so this
> + *			flag enables performing MTU check against segments, with
> + *			a different violation return code to tell it apart.
> + *
> + *		The *mtu_result* pointer contains the MTU value of the net
> + *		device including the L2 header size (usually 14 bytes Ethernet
> + *		header). The net device configured MTU is the L3 size, but as
> + *		XDP and TX length operate at L2 this helper include L2 header
> + *		size in reported MTU.
> + *
> + *	Return
> + *		* 0 on success, and populate MTU value in *mtu_result* pointer.
> + *
> + *		* < 0 if any input argument is invalid (*mtu_result* not updated)
> + *
> + *		MTU violations return positive values, but also populate MTU
> + *		value in *mtu_result* pointer, as this can be needed for
> + *		implementing PMTU handing:
> + *
> + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> + *
>   */

[...]

> +static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
> +{
> +	struct net *netns = dev_net(dev_curr);
> +	struct net_device *dev;
> +	int mtu;
> +
> +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> +	if (ifindex == 0)
> +		dev = dev_curr;
> +	else
> +		dev = dev_get_by_index_rcu(netns, ifindex);
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	/* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
> +	mtu = dev->mtu + dev->hard_header_len;

READ_ONCE() on dev->mtu and hard_header_len as well? We don't have
any locks.
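
Something along these lines (sketch):

  	mtu = READ_ONCE(dev->mtu) + READ_ONCE(dev->hard_header_len);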

> +
> +	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> +	if (flags & BPF_MTU_CHK_RELAX)
> +		mtu += VLAN_HLEN;

I'm trying to think about the use case where this might be used,
compared to just adjusting the MTU on the BPF program side as needed
for packet encapsulation/headers/etc.

> +
> +	return mtu;
> +}
> +


* RE: [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-30 16:51 ` [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2020-10-30 20:36   ` John Fastabend
  2020-11-02 12:46     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-10-30 20:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Jesper Dangaard Brouer wrote:
> The use-case for dropping the MTU check when TC-BPF does redirect to
> ingress, is described by Eyal Birger in email[0]. The summary is the
> ability to increase packet size (e.g. with IPv6 headers for NAT64) and
> ingress redirect packet and let normal netstack fragment packet as needed.
> 
> [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
> 
> V4:
>  - Keep net_device "up" (IFF_UP) check.
>  - Adjustment to handle bpf_redirect_peer() helper
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
>  net/core/dev.c            |   19 ++-----------------
>  net/core/filter.c         |   14 +++++++++++---
>  3 files changed, 42 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 964b494b0e8d..bd02ddab8dfe 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3891,11 +3891,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
>  bool is_skb_forwardable(const struct net_device *dev,
>  			const struct sk_buff *skb);
>  
> +static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
> +						 const struct sk_buff *skb,
> +						 const bool check_mtu)

It looks like if check_mtu=false then this is just an interface up check.
Can we leave is_skb_forwardable logic alone and just change the spots where
this is called with false to something with a name that describes the check,
such as is_dev_up(dev). I think it will make this change smaller and the
code easier to read. Did I miss something?

> +{
> +	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
> +	unsigned int len;
> +
> +	if (!(dev->flags & IFF_UP))
> +		return false;
> +
> +	if (!check_mtu)
> +		return true;
> +
> +	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
> +	if (skb->len <= len)
> +		return true;
> +
> +	/* if TSO is enabled, we don't care about the length as the packet
> +	 * could be forwarded without being segmented before
> +	 */
> +	if (skb_is_gso(skb))
> +		return true;
> +
> +	return false;
> +}
> +
>  static __always_inline int ____dev_forward_skb(struct net_device *dev,
> -					       struct sk_buff *skb)
> +					       struct sk_buff *skb,
> +					       const bool check_mtu)
>  {

I guess you will get some duplication here if you have a dev_forward_skb()
and a dev_forward_skb_nocheck() or something. Take it or leave it. I know
I've added my share of bool swivel bits like this, but better to avoid
it if possible IMO.

Other than style aspects it looks correct to me.

>  	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
> -	    unlikely(!is_skb_forwardable(dev, skb))) {
> +	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
>  		atomic_long_inc(&dev->rx_dropped);
>  		kfree_skb(skb);
>  		return NET_RX_DROP;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9499a414d67e..445ccf92c149 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2188,28 +2188,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
>  


* Re: [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-30 16:50 ` [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
  2020-10-30 19:40   ` John Fastabend
@ 2020-10-31 15:52   ` David Ahern
  1 sibling, 0 replies; 19+ messages in thread
From: David Ahern @ 2020-10-31 15:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

On 10/30/20 10:50 AM, Jesper Dangaard Brouer wrote:
> The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The BPF-prog
> don't know the MTU value that caused this rejection.
> 
> If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> need to know this MTU value for the ICMP packet.
> 
> Patch change lookup and result struct bpf_fib_lookup, to contain this MTU
> value as output via a union with 'tot_len' as this is the value used for
> the MTU lookup.
> 
> V5:
>  - Fixed uninit value spotted by Dan Carpenter.
>  - Name struct output member mtu_result
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   11 +++++++++--
>  net/core/filter.c              |   22 +++++++++++++++-------
>  tools/include/uapi/linux/bpf.h |   11 +++++++++--
>  3 files changed, 33 insertions(+), 11 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@kernel.org>


* Re: [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-30 19:40   ` John Fastabend
@ 2020-11-02  9:28     ` Jesper Dangaard Brouer
  2020-11-02 15:59       ` David Ahern
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-02  9:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

On Fri, 30 Oct 2020 12:40:21 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Jesper Dangaard Brouer wrote:
> > The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> > can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The BPF-prog
> > don't know the MTU value that caused this rejection.
> > 
> > If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> > need to know this MTU value for the ICMP packet.
> > 
> > Patch change lookup and result struct bpf_fib_lookup, to contain this MTU
> > value as output via a union with 'tot_len' as this is the value used for
> > the MTU lookup.
> > 
> > V5:
> >  - Fixed uninit value spotted by Dan Carpenter.
> >  - Name struct output member mtu_result
> > 
> > Reported-by: kernel test robot <lkp@intel.com>
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/uapi/linux/bpf.h       |   11 +++++++++--
> >  net/core/filter.c              |   22 +++++++++++++++-------
> >  tools/include/uapi/linux/bpf.h |   11 +++++++++--
> >  3 files changed, 33 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e6ceac3f7d62..01b2b17c645a 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2219,6 +2219,9 @@ union bpf_attr {
> >   *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
> >   *		  packet is not forwarded or needs assist from full stack
> >   *
> > + *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
> > + *		was exceeded and result params->mtu contains the MTU.
> > + *  
> 
> Do we need to hide this behind a flag? It seems otherwise you might confuse
> users. I imagine on error we could reuse the params arg, but now we changed
> the tot_len value underneath them?

The principle behind this bpf_fib_lookup helper is that params (struct
bpf_fib_lookup) is used for both input and output (results). Almost
every field is changed after the lookup. (For performance reasons the
struct is kept at 64 bytes, one cache-line.)  Thus, users of this
helper already expect/know that the contents of params change.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


struct bpf_fib_lookup {
	/* input:  network family for lookup (AF_INET, AF_INET6)
	 * output: network family of egress nexthop
	 */
	__u8	family;

	/* set if lookup is to consider L4 data - e.g., FIB rules */
	__u8	l4_protocol;
	__be16	sport;
	__be16	dport;

	union {	/* used for MTU check */
		/* input to lookup */
		__u16	tot_len; /* total length of packet from network hdr */

		/* output: MTU value (if requested check_mtu) */
		__u16	mtu_result;
	};
	/* input: L3 device index for lookup
	 * output: device index from FIB lookup
	 */
	__u32	ifindex;

	union {
		/* inputs to lookup */
		__u8	tos;		/* AF_INET  */
		__be32	flowinfo;	/* AF_INET6, flow_label + priority */

		/* output: metric of fib result (IPv4/IPv6 only) */
		__u32	rt_metric;
	};

	union {
		__be32		ipv4_src;
		__u32		ipv6_src[4];  /* in6_addr; network order */
	};

	/* input to bpf_fib_lookup, ipv{4,6}_dst is destination address in
	 * network header. output: bpf_fib_lookup sets to gateway address
	 * if FIB lookup returns gateway route
	 */
	union {
		__be32		ipv4_dst;
		__u32		ipv6_dst[4];  /* in6_addr; network order */
	};

	/* output */
	__be16	h_vlan_proto;
	__be16	h_vlan_TCI;
	__u8	smac[6];     /* ETH_ALEN */
	__u8	dmac[6];     /* ETH_ALEN */
};



* Re: [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
  2020-10-30 20:23   ` John Fastabend
@ 2020-11-02 11:15     ` Jesper Dangaard Brouer
  2020-11-02 18:04       ` John Fastabend
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-02 11:15 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

On Fri, 30 Oct 2020 13:23:43 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Jesper Dangaard Brouer wrote:
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
> > 
> > The API is designed to help the BPF-programmer, that want to do packet
> > context size changes, which involves other helpers. These other helpers
> > usually does a delta size adjustment. This helper also support a delta
> > size (len_diff), which allow BPF-programmer to reuse arguments needed by
> > these other helpers, and perform the MTU check prior to doing any actual
> > size adjustment of the packet context.
> > 
> > It is on purpose, that we allow the len adjustment to become a negative
> > result, that will pass the MTU check. This might seem weird, but it's not
> > this helpers responsibility to "catch" wrong len_diff adjustments. Other
> > helpers will take care of these checks, if BPF-programmer chooses to do
> > actual size adjustment.
> > 
> > V4: Lot of changes
> >  - ifindex 0 now use current netdev for MTU lookup
> >  - rename helper from bpf_mtu_check to bpf_check_mtu
> >  - fix bug for GSO pkt length (as skb->len is total len)
> >  - remove __bpf_len_adj_positive, simply allow negative len adj
> > 
> > V3: Take L2/ETH_HLEN header size into account and document it.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---  
> 
> Sorry for the late feedback here.
> 
> This seems like a lot of baked in functionality into the helper? Can you
> say something about why the simpler and, at least to me, more intuitive
> helper to simply return the ifindex mtu is not ideal?

I tried to explain this in the patch description.  This is for easier
collaboration with other helpers that also take a len_diff parameter.
This API allows checking the MTU *prior* to doing the size adjustment.

Let me explain what is not in the patch desc:

In the first patchset, I started with the simple implementation of
returning the MTU.  Then I realized that this puts more work into the
BPF program (thus increasing the BPF code instructions), as the
BPF-prog then needs to extract the packet length to compare against the
returned MTU size. Looking at other programs that do the ctx/packet
size adjustment, they don't extract the packet length, as the ctx is
about to change, and the MTU value itself isn't needed in the BPF-prog
(unless the check fails).


> Rough pseudo code being,
> 
>  my_sender(struct __sk_buff *skb, int fwd_ifindex)
>  {
>    mtu = bpf_get_ifindex_mtu(fwd_ifindex, 0);
>    if (skb->len + HDR_SIZE < mtu)
>        return send_with_hdrs(skb);
>    return -EMSGSIZE
>  }
> 
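
With bpf_check_mtu() the same sketch (reusing your placeholders) would
look roughly like:

  my_sender(struct __sk_buff *skb, int fwd_ifindex)
  {
    __u32 mtu = 0;

    if (bpf_check_mtu(skb, fwd_ifindex, &mtu, HDR_SIZE, 0) == 0)
        return send_with_hdrs(skb);
    return -EMSGSIZE;
  }

i.e. no need to read out skb->len and compare against the MTU in the
BPF-prog itself.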
> 
> >  include/uapi/linux/bpf.h       |   70 +++++++++++++++++++++++
> >  net/core/filter.c              |  120 ++++++++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |   70 +++++++++++++++++++++++
> >  3 files changed, 260 insertions(+)
> >   
> 
> [...]
> 
> > + *              **BPF_MTU_CHK_RELAX**
> > + *			This flag relax or increase the MTU with room for one
> > + *			VLAN header (4 bytes). This relaxation is also used by
> > + *			the kernels own forwarding MTU checks.  
> 
> I noted below as well, but not sure why this is needed. Seems if user
> knows to add a flag because they want a vlan header we can just as
> easily expect BPF program to do it. Also it only works for VLAN headers
> any other header data wont be accounted for so it seems only useful
> in one specific case.

This was added because the kernel's own forwarding has this relaxation
built in.  Thus, I thought that I should add a flag to be compatible
with the kernel's forwarding checks.

> > + *
> > + *		**BPF_MTU_CHK_SEGS**
> > + *			This flag will only works for *ctx* **struct sk_buff**.
> > + *			If packet context contains extra packet segment buffers
> > + *			(often knows as GSO skb), then MTU check is partly
> > + *			skipped, because in transmit path it is possible for the
> > + *			skb packet to get re-segmented (depending on net device
> > + *			features).  This could still be a MTU violation, so this
> > + *			flag enables performing MTU check against segments, with
> > + *			a different violation return code to tell it apart.
> > + *
> > + *		The *mtu_result* pointer contains the MTU value of the net
> > + *		device including the L2 header size (usually 14 bytes Ethernet
> > + *		header). The net device configured MTU is the L3 size, but as
> > + *		XDP and TX length operate at L2 this helper include L2 header
> > + *		size in reported MTU.
> > + *
> > + *	Return
> > + *		* 0 on success, and populate MTU value in *mtu_result* pointer.
> > + *
> > + *		* < 0 if any input argument is invalid (*mtu_result* not updated)
> > + *
> > + *		MTU violations return positive values, but also populate MTU
> > + *		value in *mtu_result* pointer, as this can be needed for
> > + *		implementing PMTU handing:
> > + *
> > + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> > + *
> >   */  
> 
> [...]
> 
> > +static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
> > +{
> > +	struct net *netns = dev_net(dev_curr);
> > +	struct net_device *dev;
> > +	int mtu;
> > +
> > +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> > +	if (ifindex == 0)
> > +		dev = dev_curr;
> > +	else
> > +		dev = dev_get_by_index_rcu(netns, ifindex);
> > +
> > +	if (!dev)
> > +		return -ENODEV;
> > +
> > +	/* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
> > +	mtu = dev->mtu + dev->hard_header_len;  
> 
> READ_ONCE() on dev->mtu and hard_header_len as well? We don't have
> any locks.

This is based on similar checks done in the same execution context,
which don't have these READ_ONCE() macros.  I'm not introducing reading
these, I'm simply moving when they are read.  If this is really needed,
then I think we need separate fix patches for stable backporting.

While doing this work, I've realized that mtu and hard_header_len are
located on two different cache lines, which is unfortunate, but I will
look at fixing this in followup patches.


> > +
> > +	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> > +	if (flags & BPF_MTU_CHK_RELAX)
> > +		mtu += VLAN_HLEN;  
> 
> I'm trying to think about the use case where this might be used?
> Compared to just adjusting MTU in BPF program side as needed for
> packet encapsulation/headers/etc.

As I wrote above, this was added because the kernel's own forwarding
has this relaxation in its checks (in is_skb_forwardable()).  I even
tried to dig through the history: it was introduced in [1] and
copy-pasted in [2].  It seems to be a workaround that has become
standard, and it still has practical implications.

My practical experiments showed that e.g. the ixgbe driver with MTU=1500
(L3-size) will allow and fully send packets of 1504 bytes (L3-size),
but i40e will not, and drops the packet in the hardware/firmware step.
So, what is the correct action, strict or relaxed?

My own conclusion is that we should invert the flag.  Meaning: add this
VLAN_HLEN (4 bytes) relaxation by default, and have a flag to do a
stricter check, e.g. BPF_MTU_CHK_STRICT, as for historical reasons we
should act like the kernel's version of the MTU check. Unless you
object, I will do this in V6.
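
To put concrete numbers on it: with a netdev MTU of 1500 (L3) and a
14-byte Ethernet header, the strict check allows 1514 bytes at L2, while
the relaxed check allows 1514 + VLAN_HLEN = 1518 bytes at L2, i.e. the
1504 L3-size from the ixgbe/i40e observation above.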

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://git.kernel.org/torvalds/c/57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.") (Author: Ben Greear)
 
[2] https://git.kernel.org/torvalds/c/79b569f0ec53 ("netdev: fix mtu check when TSO is enabled") (Author: Daniel Lezcano)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-30 20:36   ` John Fastabend
@ 2020-11-02 12:46     ` Jesper Dangaard Brouer
  2020-11-02 16:23       ` John Fastabend
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-02 12:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

On Fri, 30 Oct 2020 13:36:05 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Jesper Dangaard Brouer wrote:
> > The use-case for dropping the MTU check when TC-BPF does redirect to
> > ingress, is described by Eyal Birger in email[0]. The summary is the
> > ability to increase packet size (e.g. with IPv6 headers for NAT64) and
> > ingress redirect packet and let normal netstack fragment packet as needed.
> > 
> > [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
> > 
> > V4:
> >  - Keep net_device "up" (IFF_UP) check.
> >  - Adjustment to handle bpf_redirect_peer() helper
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
> >  net/core/dev.c            |   19 ++-----------------
> >  net/core/filter.c         |   14 +++++++++++---
> >  3 files changed, 42 insertions(+), 22 deletions(-)
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 964b494b0e8d..bd02ddab8dfe 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3891,11 +3891,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
> >  bool is_skb_forwardable(const struct net_device *dev,
> >  			const struct sk_buff *skb);
> >  
> > +static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
> > +						 const struct sk_buff *skb,
> > +						 const bool check_mtu)  
> 
> It looks like if check_mtu=false then this is just an interface up check.
> Can we leave is_skb_forwardable logic alone and just change the spots where
> this is called with false to something with a name that describes the check,
> such as is_dev_up(dev). I think it will make this change smaller and the
> code easier to read. Did I miss something?

People should realize that this is constructed such that the compiler
will remove the actual argument (the const bool check_mtu) at compile
time.  And this propagates also to ____dev_forward_skb(), where the
call sites are also inlined.

Yes, this (check_mtu=false) is basically an interface up check, but the
only place it is used directly is in the ndo_get_peer_dev() case, and
reading the code I find it more readable that it says
__is_skb_forwardable, because this is used as part of a forwarding
step, and is_dev_up() doesn't convey the intent in this use-case.


> > +{
> > +	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
> > +	unsigned int len;
> > +
> > +	if (!(dev->flags & IFF_UP))
> > +		return false;
> > +
> > +	if (!check_mtu)
> > +		return true;
> > +
> > +	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
> > +	if (skb->len <= len)
> > +		return true;
> > +
> > +	/* if TSO is enabled, we don't care about the length as the packet
> > +	 * could be forwarded without being segmented before
> > +	 */
> > +	if (skb_is_gso(skb))
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  static __always_inline int ____dev_forward_skb(struct net_device *dev,
> > -					       struct sk_buff *skb)
> > +					       struct sk_buff *skb,
> > +					       const bool check_mtu)
> >  {  
> 
> I guess you will get some duplication here if you have a dev_forward_skb()
> and a dev_forward_skb_nocheck() or something. Take it or leave it. I know
> I've added my share of bool swivel bits like this, but better to avoid
> it if possible IMO.

As I wrote, the bool will actually get removed at compile time, so I
don't see that as problematic.  And I avoided replicating the code in
more places.

The problematic part (which you didn't comment on) is this:

On Fri, 30 Oct 2020 17:51:07 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> diff --git a/net/core/filter.c b/net/core/filter.c
> index bd4a416bd9ad..71b78b8d443c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
>  
>  static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
>  {
> -	return dev_forward_skb(dev, skb);
> +	int ret = ____dev_forward_skb(dev, skb, false);
> +
> +	if (likely(!ret)) {
> +		skb->protocol = eth_type_trans(skb, dev);
> +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> +		ret = netif_rx(skb);
> +	}
> +
> +	return ret;
>  }

I'm replicating two lines from dev_forward_skb(), but I couldn't find a
way to avoid this without causing larger code changes (and slower code).



> Other than style aspects it looks correct to me.
> 
> >  	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
> > -	    unlikely(!is_skb_forwardable(dev, skb))) {
> > +	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
> >  		atomic_long_inc(&dev->rx_dropped);
> >  		kfree_skb(skb);
> >  		return NET_RX_DROP;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 9499a414d67e..445ccf92c149 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -2188,28 +2188,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
> >    
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-11-02  9:28     ` Jesper Dangaard Brouer
@ 2020-11-02 15:59       ` David Ahern
  2020-11-02 16:18         ` John Fastabend
  0 siblings, 1 reply; 19+ messages in thread
From: David Ahern @ 2020-11-02 15:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger

On 11/2/20 2:28 AM, Jesper Dangaard Brouer wrote:
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index e6ceac3f7d62..01b2b17c645a 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2219,6 +2219,9 @@ union bpf_attr {
>>>   *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
>>>   *		  packet is not forwarded or needs assist from full stack
>>>   *
>>> + *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
>>> + *		was exceeded and result params->mtu contains the MTU.
>>> + *  
>>
>> Do we need to hide this behind a flag? It seems otherwise you might confuse
>> users. I imagine on error we could reuse the params arg, but now we changed
>> the tot_len value underneath them?
> 
> The principle behind this bpf_fib_lookup helper, is that params (struct
> bpf_fib_lookup) is used for both input and output (results). Almost
> every field is change after the lookup. (For performance reasons this
> is kept at 64 bytes (cache-line))  Thus, users of this helper already
> expect/knows the contents of params have changed.
> 

yes, that was done on purpose.

Jesper: you should remove the '(if requested check_mtu)' comment in the
documentation. That is an internal flag only -- xdp is true, tc is false.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-11-02 15:59       ` David Ahern
@ 2020-11-02 16:18         ` John Fastabend
  0 siblings, 0 replies; 19+ messages in thread
From: John Fastabend @ 2020-11-02 16:18 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer, John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger

David Ahern wrote:
> On 11/2/20 2:28 AM, Jesper Dangaard Brouer wrote:
> >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>> index e6ceac3f7d62..01b2b17c645a 100644
> >>> --- a/include/uapi/linux/bpf.h
> >>> +++ b/include/uapi/linux/bpf.h
> >>> @@ -2219,6 +2219,9 @@ union bpf_attr {
> >>>   *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
> >>>   *		  packet is not forwarded or needs assist from full stack
> >>>   *
> >>> + *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
> >>> + *		was exceeded and result params->mtu contains the MTU.
> >>> + *  
> >>
> >> Do we need to hide this behind a flag? It seems otherwise you might confuse
> >> users. I imagine on error we could reuse the params arg, but now we changed
> >> the tot_len value underneath them?
> > 
> > The principle behind this bpf_fib_lookup helper, is that params (struct
> > bpf_fib_lookup) is used for both input and output (results). Almost
> > every field is change after the lookup. (For performance reasons this
> > is kept at 64 bytes (cache-line))  Thus, users of this helper already
> > expect/knows the contents of params have changed.
> > 
> 
> yes, that was done on purpose.

OK sounds good then. Thanks.

> 
> Jesper: you should remove the '(if requested check_mtu)' comment in the
> documentation. That is an internal flag only -- xdp is true, tc is false.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-11-02 12:46     ` Jesper Dangaard Brouer
@ 2020-11-02 16:23       ` John Fastabend
  0 siblings, 0 replies; 19+ messages in thread
From: John Fastabend @ 2020-11-02 16:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

Jesper Dangaard Brouer wrote:
> On Fri, 30 Oct 2020 13:36:05 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > Jesper Dangaard Brouer wrote:
> > > The use-case for dropping the MTU check when TC-BPF does redirect to
> > > ingress, is described by Eyal Birger in email[0]. The summary is the
> > > ability to increase packet size (e.g. with IPv6 headers for NAT64) and
> > > ingress redirect packet and let normal netstack fragment packet as needed.
> > > 
> > > [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
> > > 
> > > V4:
> > >  - Keep net_device "up" (IFF_UP) check.
> > >  - Adjustment to handle bpf_redirect_peer() helper
> > > 
> > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > > ---
> > >  include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
> > >  net/core/dev.c            |   19 ++-----------------
> > >  net/core/filter.c         |   14 +++++++++++---
> > >  3 files changed, 42 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 964b494b0e8d..bd02ddab8dfe 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -3891,11 +3891,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
> > >  bool is_skb_forwardable(const struct net_device *dev,
> > >  			const struct sk_buff *skb);
> > >  
> > > +static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
> > > +						 const struct sk_buff *skb,
> > > +						 const bool check_mtu)  
> > 
> > It looks like if check_mtu=false then this is just an interface up check.
> > Can we leave is_skb_forwardable logic alone and just change the spots where
> > this is called with false to something with a name that describes the check,
> > such as is_dev_up(dev). I think it will make this change smaller and the
> > code easier to read. Did I miss something?
> 
> People should realized that this is constructed such, the compiler will
> compile-time remove the actual argument (the const bool check_mtu).
> And this propagates also to ____dev_forward_skb() where the call places
> are also inlined.

The comment was about a human reading the code, not what gets generated
by the compiler.

> 
> Yes, this (check_mtu=false) is basically an interface up check, but the
> only place it is used directly is in the ndo_get_peer_dev() case, and
> reading the code I find it more readable that is says
> __is_skb_forwardable because this is used as part of a forwarding step,
> and is_dev_up() doesn't convey the intent in this use-case.

OK.

[...]

> 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index bd4a416bd9ad..71b78b8d443c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
> >  
> >  static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
> >  {
> > -	return dev_forward_skb(dev, skb);
> > +	int ret = ____dev_forward_skb(dev, skb, false);
> > +
> > +	if (likely(!ret)) {
> > +		skb->protocol = eth_type_trans(skb, dev);
> > +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > +		ret = netif_rx(skb);
> > +	}
> > +
> > +	return ret;
> >  }
> 
> I'm replicating two lines from dev_forward_skb(), but I couldn't find a
> way to avoid this, without causing larger code changes (and slower code).
> 

OK looks good to me then.

> 
> 
> > Other than style aspects it looks correct to me.
> > 
> > >  	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
> > > -	    unlikely(!is_skb_forwardable(dev, skb))) {
> > > +	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
> > >  		atomic_long_inc(&dev->rx_dropped);
> > >  		kfree_skb(skb);
> > >  		return NET_RX_DROP;
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 9499a414d67e..445ccf92c149 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -2188,28 +2188,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
> > >    
> > 
> 
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
  2020-11-02 11:15     ` Jesper Dangaard Brouer
@ 2020-11-02 18:04       ` John Fastabend
  2020-11-02 20:10         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-11-02 18:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

Jesper Dangaard Brouer wrote:
> On Fri, 30 Oct 2020 13:23:43 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > Jesper Dangaard Brouer wrote:
> > > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
> > > 
> > > The API is designed to help the BPF-programmer, that want to do packet
> > > context size changes, which involves other helpers. These other helpers
> > > usually does a delta size adjustment. This helper also support a delta
> > > size (len_diff), which allow BPF-programmer to reuse arguments needed by
> > > these other helpers, and perform the MTU check prior to doing any actual
> > > size adjustment of the packet context.
> > > 
> > > It is on purpose, that we allow the len adjustment to become a negative
> > > result, that will pass the MTU check. This might seem weird, but it's not
> > > this helpers responsibility to "catch" wrong len_diff adjustments. Other
> > > helpers will take care of these checks, if BPF-programmer chooses to do
> > > actual size adjustment.
> > > 
> > > V4: Lot of changes
> > >  - ifindex 0 now use current netdev for MTU lookup
> > >  - rename helper from bpf_mtu_check to bpf_check_mtu
> > >  - fix bug for GSO pkt length (as skb->len is total len)
> > >  - remove __bpf_len_adj_positive, simply allow negative len adj
> > > 
> > > V3: Take L2/ETH_HLEN header size into account and document it.
> > > 
> > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > > ---  
> > 
> > Sorry for the late feedback here.
> > 
> > This seems like a lot of baked in functionality into the helper? Can you
> > say something about why the simpler and, at least to me, more intuitive
> > helper to simply return the ifindex mtu is not ideal?
> 
> I tried to explain this in the patch description.  This is for easier
> collaboration with other helpers, that also have the len_diff parameter.
> This API allow to check the MTU *prior* to doing the size adjustment.
> 
> Let me explain what is not in the patch desc:

OK, the extra details help.

> 
> In the first patchset, I started with the simply implementation of
> returning the MTU.  Then I realized that this puts more work into the
> BPF program (thus increasing BPF code instructions).  As we in BPF-prog
> need to extract the packet length to compare against the returned MTU
> size. Looking at other programs that does the ctx/packet size adjust, we
> don't extract the packet length as ctx is about to change, and we don't
> need the MTU variable in the BPF prog (unless it fails).

On recent kernels instruction counts should not be a problem. So, it
looks like the argument is to push the skb->len lookup from the BPF
program into the helper. I'm not sure it matters much if the insn is run
in the helper or in the BPF program. I have a preference for the simpler
"give me the MTU and I'll figure out what to do with it". Real programs
will have to handle the failing case and will need to deal with the MTU
anyway. We could do it as a BPF-implemented helper in one of the BPF
headers, so users could just call the BPF "helper" and not know parts of
it are implemented in BPF.
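
Rough sketch of what such a header-side wrapper could look like, purely
illustrative (bpf_get_ifindex_mtu() is the hypothetical helper from my
pseudo code below, not an existing API):

        /* Hypothetical wrapper implemented in a BPF header, not an
         * existing API; bpf_get_ifindex_mtu() is made up for this sketch.
         */
        static __always_inline int check_mtu_from_hdr(struct __sk_buff *skb,
                                                      int fwd_ifindex,
                                                      int len_diff)
        {
                int mtu = bpf_get_ifindex_mtu(fwd_ifindex, 0);
                int pkt_len = skb->len + len_diff;

                if (mtu < 0)
                        return mtu;             /* e.g. -ENODEV */
                if (pkt_len > mtu)
                        return -EMSGSIZE;
                return 0;
        }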

> 
> 
> > Rough pseudo code being,
> > 
> >  my_sender(struct __sk_buff *skb, int fwd_ifindex)
> >  {
> >    mtu = bpf_get_ifindex_mtu(fwd_ifindex, 0);
> >    if (skb->len + HDR_SIZE < mtu)
> >        return send_with_hdrs(skb);
> >    return -EMSGSIZE
> >  }
> > 
> > 
> > >  include/uapi/linux/bpf.h       |   70 +++++++++++++++++++++++
> > >  net/core/filter.c              |  120 ++++++++++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |   70 +++++++++++++++++++++++
> > >  3 files changed, 260 insertions(+)
> > >   
> > 
> > [...]
> > 
> > > + *              **BPF_MTU_CHK_RELAX**
> > > + *			This flag relax or increase the MTU with room for one
> > > + *			VLAN header (4 bytes). This relaxation is also used by
> > > + *			the kernels own forwarding MTU checks.  
> > 
> > I noted below as well, but not sure why this is needed. Seems if user
> > knows to add a flag because they want a vlan header we can just as
> > easily expect BPF program to do it. Also it only works for VLAN headers
> > any other header data wont be accounted for so it seems only useful
> > in one specific case.
> 
> This was added because the kernels own forwarding have this relaxation
> build in.  Thus, I though that I should add flag to compatible with the
> kernels forwarding checks.

OK, I guess it doesn't hurt.

> 
> > > + *
> > > + *		**BPF_MTU_CHK_SEGS**
> > > + *			This flag will only works for *ctx* **struct sk_buff**.
> > > + *			If packet context contains extra packet segment buffers
> > > + *			(often knows as GSO skb), then MTU check is partly
> > > + *			skipped, because in transmit path it is possible for the
> > > + *			skb packet to get re-segmented (depending on net device
> > > + *			features).  This could still be a MTU violation, so this
> > > + *			flag enables performing MTU check against segments, with
> > > + *			a different violation return code to tell it apart.
> > > + *
> > > + *		The *mtu_result* pointer contains the MTU value of the net
> > > + *		device including the L2 header size (usually 14 bytes Ethernet
> > > + *		header). The net device configured MTU is the L3 size, but as
> > > + *		XDP and TX length operate at L2 this helper include L2 header
> > > + *		size in reported MTU.
> > > + *
> > > + *	Return
> > > + *		* 0 on success, and populate MTU value in *mtu_result* pointer.
> > > + *
> > > + *		* < 0 if any input argument is invalid (*mtu_result* not updated)
> > > + *
> > > + *		MTU violations return positive values, but also populate MTU
> > > + *		value in *mtu_result* pointer, as this can be needed for
> > > + *		implementing PMTU handing:
> > > + *
> > > + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > > + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> > > + *
> > >   */  
> > 
> > [...]
> > 
> > > +static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
> > > +{
> > > +	struct net *netns = dev_net(dev_curr);
> > > +	struct net_device *dev;
> > > +	int mtu;
> > > +
> > > +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> > > +	if (ifindex == 0)
> > > +		dev = dev_curr;
> > > +	else
> > > +		dev = dev_get_by_index_rcu(netns, ifindex);
> > > +
> > > +	if (!dev)
> > > +		return -ENODEV;
> > > +
> > > +	/* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
> > > +	mtu = dev->mtu + dev->hard_header_len;  
> > 
> > READ_ONCE() on dev->mtu and hard_header_len as well? We don't have
> > any locks.
> 
> This is based on similar checks done in the same execution context,
> which don't have these READ_ONCE() macros.  I'm not introducing reading
> these, I'm simply moving when they are read.  If this is really needed,
> then I think we need separate fixes patches, for stable backporting.
> 
> While doing this work, I've realized that mtu + hard_header_len is
> located on two different cache-lines, which is unfortunate, but I will
> look at fixing this in followup patches.

Looks like a follow-up then. But it would be best to add the READ_ONCE()
here. The netdevice.h header has this comment,

	/* Note : dev->mtu is often read without holding a lock.
	 * Writers usually hold RTNL.
	 * It is recommended to use READ_ONCE() to annotate the reads,
	 * and to use WRITE_ONCE() to annotate the writes.
	 */
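
In the helper that would look something like this (sketch only, also
covering hard_header_len as asked above):

	/* annotated reads in __bpf_lookup_mtu(), per the comment above */
	mtu = READ_ONCE(dev->mtu) + READ_ONCE(dev->hard_header_len);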

> 
> 
> > > +
> > > +	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> > > +	if (flags & BPF_MTU_CHK_RELAX)
> > > +		mtu += VLAN_HLEN;  
> > 
> > I'm trying to think about the use case where this might be used?
> > Compared to just adjusting MTU in BPF program side as needed for
> > packet encapsulation/headers/etc.
> 
> As I wrote above, this were added because the kernels own forwarding
> have this relaxation in it's checks (in is_skb_forwardable()).  I even
> tried to dig through the history, introduced in [1] and copy-pasted
> in[2].  And this seems to be a workaround, that have become standard,
> that still have practical implications.
> 
> My practical experiments showed, that e.g. ixgbe driver with MTU=1500
> (L3-size) will allow and fully send packets with 1504 (L3-size). But
> i40e will not, and drops the packet in hardware/firmware step.  So,
> what is the correct action, strict or relaxed?
> 
> My own conclusion is that we should inverse the flag.  Meaning to
> default add this VLAN_HLEN (4 bytes) relaxation, and have a flag to do
> more strict check,  e.g. BPF_MTU_CHK_STRICT. As for historical reasons
> we must act like kernels version of MTU check. Unless you object, I will
> do this in V6.

I'm fine with it either way, as long as it's documented in the helper
description so I have a chance of remembering this discussion in 6 months.
But, if you make it the default, won't this break the XDP cases? I assume
the XDP use case doesn't include the VLAN 4 bytes. Would you need to
prevent the flag from being used from XDP?

Thanks,
John

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
  2020-11-02 18:04       ` John Fastabend
@ 2020-11-02 20:10         ` Jesper Dangaard Brouer
  2020-11-12 12:58           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-02 20:10 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

On Mon, 02 Nov 2020 10:04:44 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> > > > +
> > > > +	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> > > > +	if (flags & BPF_MTU_CHK_RELAX)
> > > > +		mtu += VLAN_HLEN;    
> > > 
> > > I'm trying to think about the use case where this might be used?
> > > Compared to just adjusting MTU in BPF program side as needed for
> > > packet encapsulation/headers/etc.  
> > 
> > As I wrote above, this were added because the kernels own forwarding
> > have this relaxation in it's checks (in is_skb_forwardable()).  I even
> > tried to dig through the history, introduced in [1] and copy-pasted
> > in[2].  And this seems to be a workaround, that have become standard,
> > that still have practical implications.
> > 
> > My practical experiments showed, that e.g. ixgbe driver with MTU=1500
> > (L3-size) will allow and fully send packets with 1504 (L3-size). But
> > i40e will not, and drops the packet in hardware/firmware step.  So,
> > what is the correct action, strict or relaxed?
> > 
> > My own conclusion is that we should inverse the flag.  Meaning to
> > default add this VLAN_HLEN (4 bytes) relaxation, and have a flag to do
> > more strict check,  e.g. BPF_MTU_CHK_STRICT. As for historical reasons
> > we must act like kernels version of MTU check. Unless you object, I will
> > do this in V6.  
> 
> I'm fine with it either way as long as its documented in the helper
> description so I have a chance of remembering this discussion in 6 months.
> But, if you make it default won't this break for XDP cases? I assume the
> XDP use case doesn't include the VLAN 4-bytes. Would you need to prevent
> the flag from being used from XDP?

XDP actually does include the VLAN_HLEN 4 bytes, see xdp_ok_fwd_dev(). I
was so certain that you, John, added this code, but looking through git
blame it pointed back to myself.  Going 5 levels of git history deep and
3+ years back, it does seem like I moved/reused some of John's code
containing VLAN_HLEN in the MTU check, introduced for xdp-generic
(6103aa96ec077), which I acked.  Thus, I guess I cannot push this away
and have to take the blame myself ;-)

I conclude that we need to include this VLAN_HLEN by default, else the
XDP bpf_check_mtu could deny a packet that would have passed the check
in xdp_ok_fwd_dev().  As i40e will drop 1504 bytes at the HW/FW level, I
still see a need for a BPF_MTU_CHK_STRICT flag for programs that want to
catch this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
  2020-11-02 20:10         ` Jesper Dangaard Brouer
@ 2020-11-12 12:58           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-11-12 12:58 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, Jakub Kicinski, eyal.birger,
	brouer

On Mon, 2 Nov 2020 21:10:34 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Mon, 02 Nov 2020 10:04:44 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > > > > +
> > > > > +	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> > > > > +	if (flags & BPF_MTU_CHK_RELAX)
> > > > > +		mtu += VLAN_HLEN;      
> > > > 
> > > > I'm trying to think about the use case where this might be used?
> > > > Compared to just adjusting MTU in BPF program side as needed for
> > > > packet encapsulation/headers/etc.    
> > > 
> > > As I wrote above, this were added because the kernels own forwarding
> > > have this relaxation in it's checks (in is_skb_forwardable()).  I even
> > > tried to dig through the history, introduced in [1] and copy-pasted
> > > in[2].  And this seems to be a workaround, that have become standard,
> > > that still have practical implications.
> > > 
> > > My practical experiments showed, that e.g. ixgbe driver with MTU=1500
> > > (L3-size) will allow and fully send packets with 1504 (L3-size). But
> > > i40e will not, and drops the packet in hardware/firmware step.  So,
> > > what is the correct action, strict or relaxed?
> > > 
> > > My own conclusion is that we should inverse the flag.  Meaning to
> > > default add this VLAN_HLEN (4 bytes) relaxation, and have a flag to do
> > > more strict check,  e.g. BPF_MTU_CHK_STRICT. As for historical reasons
> > > we must act like kernels version of MTU check. Unless you object, I will
> > > do this in V6.    
> > 
> > I'm fine with it either way as long as its documented in the helper
> > description so I have a chance of remembering this discussion in 6 months.
> > But, if you make it default won't this break for XDP cases? I assume the
> > XDP use case doesn't include the VLAN 4-bytes. Would you need to prevent
> > the flag from being used from XDP?  
> 
> XDP actually do include the VLAN_HLEN 4-bytes, see xdp_ok_fwd_dev(). I
> was so certain that you John added this code, but looking through git
> blame it pointed back to myself.  Going 5 levels git history deep and
> 3+ years, does seem like I move/reused some of Johns code containing
> VLAN_HLEN in the MTU check, introduced for xdp-generic (6103aa96ec077)
> which I acked.  Thus, I guess I cannot push this away and have to take
> blame myself ;-)
> 
> I conclude that we default need to include this VLAN_HLEN, else the XDP
> bpf_check_mtu could say deny, while it would have passed the check in
> xdp_ok_fwd_dev().  As i40e will drop 1504 this at HW/FW level, I still
> see a need for a BPF_MTU_CHK_STRICT flag for programs that want to
> catch this.

Disagreeing with myself... I want to keep BPF_MTU_CHK_RELAX, and let
the MTU check use the actual MTU value (adjusted to L2 of course).

The argument is that because some drivers with MTU 1500 will actually
drop 1504-byte (L3-size, +14 byte eth_hdr) frames, it is wrong to
"approve" this size in the check.  A BPF program that knows it is
playing with VLAN headers can choose to exceed the MTU check by 4 bytes,
while BPF programs using other types of encap headers would be confused
by an MTU check that gives them 4 bytes extra, which if used will get
the packet dropped on a subset of drivers.
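
E.g. roughly, with the helper/flag names as proposed in this series
(they may still change):

        /* Rough sketch: a program that knows it only pushes one VLAN tag
         * (len_diff = VLAN_HLEN = 4) opts in to the 4-byte overshoot;
         * programs adding other encap headers get the plain L2 MTU check.
         */
        static bool vlan_push_fits_mtu(struct __sk_buff *skb, int fwd_ifindex)
        {
                __u32 mtu_len = 0;

                return bpf_check_mtu(skb, fwd_ifindex, &mtu_len,
                                     4 /* VLAN_HLEN */,
                                     BPF_MTU_CHK_RELAX) == 0;
        }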

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-11-12 12:59 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2020-10-30 16:50 ` [PATCH bpf-next V5 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2020-10-30 16:50 ` [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2020-10-30 19:40   ` John Fastabend
2020-11-02  9:28     ` Jesper Dangaard Brouer
2020-11-02 15:59       ` David Ahern
2020-11-02 16:18         ` John Fastabend
2020-10-31 15:52   ` David Ahern
2020-10-30 16:51 ` [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2020-10-30 20:23   ` John Fastabend
2020-11-02 11:15     ` Jesper Dangaard Brouer
2020-11-02 18:04       ` John Fastabend
2020-11-02 20:10         ` Jesper Dangaard Brouer
2020-11-12 12:58           ` Jesper Dangaard Brouer
2020-10-30 16:51 ` [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2020-10-30 20:36   ` John Fastabend
2020-11-02 12:46     ` Jesper Dangaard Brouer
2020-11-02 16:23       ` John Fastabend
2020-10-30 16:51 ` [PATCH bpf-next V5 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).