bpf.vger.kernel.org archive mirror
* [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling
@ 2020-10-27 16:26 Jesper Dangaard Brouer
  2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:26 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This patchset drops all the MTU checks in TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU of a
netdev (via ifindex) and at the fib route lookup level. This means some
BPF-helpers are added and extended to make it possible to do MTU checks in
the BPF-code.

If a BPF-prog doesn't comply with the MTU, then the packet will eventually
get dropped at some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks; specifically when doing a TC-redirect from the ingress step
(sch_handle_ingress) into the egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kinds of
MTU violations.

One advantage of this approach is that an ingress-to-egress BPF-prog can
send information via packet data. With the MTU checks removed in the
helpers, and also not done in the skb_do_redirect() call, an ingress
BPF-prog can communicate with an egress BPF-prog via packet data, as long
as the egress BPF-prog removes this data prior to transmitting the packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also work for XDP BPF-programs.
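
To illustrate the intended usage (this is only a sketch, not code from the
patchset; EGRESS_IFINDEX and ENCAP_LEN are hypothetical placeholders), a
TC-BPF program that wants to grow a packet and redirect it could check the
MTU of the target device first, using the helper added in patch 3:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  #define EGRESS_IFINDEX  42      /* hypothetical egress device */
  #define ENCAP_LEN       20      /* hypothetical extra header bytes */

  SEC("classifier")
  int tc_encap_redirect(struct __sk_buff *skb)
  {
          __u32 mtu = 0;

          /* Check the planned size change against the egress device MTU
           * before actually growing the packet.
           */
          if (bpf_check_mtu(skb, EGRESS_IFINDEX, &mtu, ENCAP_LEN, 0))
                  return TC_ACT_SHOT;

          if (bpf_skb_adjust_room(skb, ENCAP_LEN, BPF_ADJ_ROOM_MAC, 0))
                  return TC_ACT_SHOT;

          /* ... fill in the new header bytes ... */

          return bpf_redirect(EGRESS_IFINDEX, 0);
  }

  char _license[] SEC("license") = "GPL";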

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.

---

Jesper Dangaard Brouer (5):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      bpf: make it possible to identify BPF redirected SKBs


 include/linux/netdevice.h      |   31 +++++++-
 include/uapi/linux/bpf.h       |   81 +++++++++++++++++++-
 net/core/dev.c                 |   21 +----
 net/core/filter.c              |  163 ++++++++++++++++++++++++++++++++++++----
 net/sched/Kconfig              |    1 
 tools/include/uapi/linux/bpf.h |   81 +++++++++++++++++++-
 6 files changed, 339 insertions(+), 39 deletions(-)

--



* [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
@ 2020-10-27 16:26 ` Jesper Dangaard Brouer
  2020-10-30 19:24   ` John Fastabend
  2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:26 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Multiple BPF-helpers that can manipulate/increase the size of the SKB use
__bpf_skb_max_len() as the max-length. This function limits the size
against the current net_device MTU (skb->dev->mtu).

When a BPF-prog grows the packet size, it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this packet
should be allowed to increase the size. Furthermore, the current MTU check
in __bpf_skb_max_len uses the MTU from the ingress/current net_device,
which in case of redirects is the wrong net_device.

Patch V4 keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
is enforced elsewhere in the system. Jesper's testing[1] showed it was not
possible to exceed 8KiB when expanding the SKB size via a BPF-helper. The
limiting factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the
SLUB allocator (CONFIG_SLUB) when PAGE_SIZE is 4096. This define is in
effect because the allocation is done from softirq context; see
__gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
that frames above 16KiB can cause NICs to reset (but not crash). Keep the
sanity limit at this level, as the memory layer can differ based on kernel
config.

[1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
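
For illustration only (the numbers are hypothetical, not from the patch):
on a device with a 1500 byte MTU, a TC-BPF program doing the following
used to be rejected by __bpf_skb_max_len(), but is now only bounded by the
SKB_MAX_ALLOC sanity limit:

  /* Grow a received packet beyond the device MTU; with this patch the
   * upper bound is BPF_SKB_MAX_LEN (SKB_MAX_ALLOC, 16KiB) and the real
   * limit is enforced deeper in the memory layer.
   */
  if (bpf_skb_change_tail(skb, 3000, 0))
          return TC_ACT_SHOT;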

V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..1ee97fdeea64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 	return 0;
 }
 
-static u32 __bpf_skb_max_len(const struct sk_buff *skb)
-{
-	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
-			  SKB_MAX_ALLOC;
-}
+#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC
 
 BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 	   u32, mode, u64, flags)
@@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 {
 	u32 len_cur, len_diff_abs = abs(len_diff);
 	u32 len_min = bpf_skb_net_base_len(skb);
-	u32 len_max = __bpf_skb_max_len(skb);
+	u32 len_max = BPF_SKB_MAX_LEN;
 	__be16 proto = skb->protocol;
 	bool shrink = len_diff < 0;
 	u32 off;
@@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 min_len = __bpf_skb_min_len(skb);
 	int ret;
 
@@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = {
 static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room,
 					u64 flags)
 {
-	u32 max_len = __bpf_skb_max_len(skb);
+	u32 max_len = BPF_SKB_MAX_LEN;
 	u32 new_len = skb->len + head_room;
 	int ret;
 




* [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-27 16:26 ` Jesper Dangaard Brouer
  2020-10-27 17:15   ` David Ahern
  2020-10-28 12:49   ` Dan Carpenter
  2020-10-27 16:27 ` [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:26 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The
BPF-prog doesn't know the MTU value that caused this rejection.

If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain
this MTU value as output, via a union with 'tot_len', as this is the value
used for the MTU lookup.
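
As a rough sketch of the intended use (not part of the patch; the ICMP
construction and the pkt_len variable are hypothetical), an XDP program
could feed the returned MTU into a PMTU reply:

  struct bpf_fib_lookup fib = {};
  int rc;

  fib.family  = AF_INET;
  fib.ifindex = ctx->ingress_ifindex;
  fib.tot_len = pkt_len;          /* length from the network header */
  /* ... fill in addresses, protocol and ports from the packet ... */

  rc = bpf_fib_lookup(ctx, &fib, sizeof(fib), 0);
  if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED) {
          /* fib.mtu now holds the MTU that was exceeded (union with
           * tot_len); use it for the ICMP frag-needed / ICMPv6
           * packet-too-big reply required for PMTU discovery.
           */
          __u32 mtu = fib.mtu;
          /* ... build the ICMP error using 'mtu' ... */
  }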

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   11 +++++++++--
 net/core/filter.c              |   17 ++++++++++++-----
 tools/include/uapi/linux/bpf.h |   11 +++++++++--
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e6ceac3f7d62..03c042e3a34c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2219,6 +2219,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index 1ee97fdeea64..caa427edc563 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5265,12 +5265,13 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	params->mtu = mtu;
 
 	return 0;
 }
@@ -5354,8 +5355,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5389,7 +5392,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5481,8 +5484,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5502,7 +5507,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5571,6 +5576,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
+
+		params->mtu = dev->mtu; /* union with tot_len */
 	}
 
 	return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e6ceac3f7d62..03c042e3a34c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2219,6 +2219,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */




* [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking
  2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
  2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
  2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-27 16:27 ` Jesper Dangaard Brouer
  2020-10-27 16:27 ` [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
  2020-10-27 16:27 ` [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
  4 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:27 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

The API is designed to help the BPF-programmer who wants to do packet
context size changes, which involve other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse the arguments
needed by these other helpers, and to perform the MTU check prior to doing
any actual size adjustment of the packet context.

It is on purpose that we allow the len adjustment to result in a negative
length, which will pass the MTU check. This might seem weird, but it is not
this helper's responsibility to "catch" wrong len_diff adjustments. Other
helpers will take care of these checks, if the BPF-programmer chooses to do
an actual size adjustment.
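
A minimal sketch of the intended call pattern (target_ifindex and len_diff
are placeholders supplied by the program, not part of this patch):

  __u32 mtu = 0;
  int ret;

  /* Query whether the packet, after the planned len_diff change, still
   * fits the MTU of the device we intend to redirect to.  For GSO skbs
   * the BPF_MTU_CHK_SEGS flag additionally validates segment lengths.
   */
  ret = bpf_check_mtu(skb, target_ifindex, &mtu, len_diff, BPF_MTU_CHK_SEGS);
  if (ret)        /* <0 invalid argument, >0 MTU violation ('mtu' is set) */
          return TC_ACT_SHOT;

  /* Safe to apply the same len_diff, e.g. via bpf_skb_adjust_room() */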

V4: Lots of changes
 - ifindex 0 now uses the current netdev for MTU lookup
 - rename helper from bpf_mtu_check to bpf_check_mtu
 - fix bug for GSO pkt length (as skb->len is total len)
 - remove __bpf_len_adj_positive, simply allow negative len adj

V3: Take L2/ETH_HLEN header size into account and document it.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h       |   70 +++++++++++++++++++++++
 net/core/filter.c              |  120 ++++++++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   70 +++++++++++++++++++++++
 3 files changed, 260 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 03c042e3a34c..c7ac1fab5e8b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3745,6 +3745,63 @@ union bpf_attr {
  * 	Return
  * 		The helper returns **TC_ACT_REDIRECT** on success or
  * 		**TC_ACT_SHOT** on error.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows checking the MTU prior to changing the packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if the helper
+ *		isn't used prior to a redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_RELAX**
+ *			This flag relaxes (increases) the MTU with room for one
+ *			VLAN header (4 bytes). This relaxation is also used by
+ *			the kernel's own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If the packet context contains extra packet segment buffers
+ *			(often known as a GSO skb), then the MTU check is partly
+ *			skipped, because in the transmit path it is possible for
+ *			the skb packet to get re-segmented (depending on net device
+ *			features).  This could still be an MTU violation, so this
+ *			flag enables performing the MTU check against segments,
+ *			with a different violation return code to tell it apart.
+ *
+ *		The *mtu_result* pointer contains the MTU value of the net
+ *		device including the L2 header size (usually 14 bytes Ethernet
+ *		header). The net device configured MTU is the L3 size, but as
+ *		XDP and TX length operate at L2, this helper includes the L2
+ *		header size in the reported MTU.
+ *
+ *	Return
+ *		* 0 on success, and the MTU value is populated in *mtu_result*.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate the
+ *		MTU value in the *mtu_result* pointer, as this can be needed
+ *		for implementing PMTU handling:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3903,6 +3960,7 @@ union bpf_attr {
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
 	FN(redirect_peer),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4927,6 +4985,18 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags */
+enum bpf_check_mtu_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_SEGS  = (1U << 1),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index caa427edc563..d66a9cba8e14 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5593,6 +5593,122 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
+{
+	struct net *netns = dev_net(dev_curr);
+	struct net_device *dev;
+	int mtu;
+
+	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
+	if (ifindex == 0)
+		dev = dev_curr;
+	else
+		dev = dev_get_by_index_rcu(netns, ifindex);
+
+	if (!dev)
+		return -ENODEV;
+
+	/* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
+	mtu = dev->mtu + dev->hard_header_len;
+
+	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
+	if (flags & BPF_MTU_CHK_RELAX)
+		mtu += VLAN_HLEN;
+
+	return mtu;
+}
+
+BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+	struct net_device *dev = skb->dev;
+	int len = skb->len;
+	int mtu;
+
+	if (flags & ~(BPF_MTU_CHK_RELAX | BPF_MTU_CHK_SEGS))
+		return -EINVAL;
+
+	mtu = __bpf_lookup_mtu(dev, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len += len_diff; /* len_diff can be negative; negative result passes check */
+	if (len <= mtu) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+		goto out;
+	}
+	/* At this point, skb->len exceeds MTU, but as it includes the length
+	 * of all segments, and the SKB can get re-segmented in the transmit
+	 * path (see validate_xmit_skb), we cannot reject MTU-check for GSO.
+	 */
+	if (skb_is_gso(skb)) {
+		ret = BPF_MTU_CHK_RET_SUCCESS;
+
+		/* SKB could get dropped later due to segs > MTU or lacking
+		 * features, thus allow BPF-prog to validate segs length here.
+		 */
+		if (flags & BPF_MTU_CHK_SEGS &&
+		    skb_gso_validate_network_len(skb, mtu)) {
+			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
+			goto out;
+		}
+	}
+out:
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	struct net_device *dev = xdp->rxq->dev;
+	int len = xdp->data_end - xdp->data;
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	int mtu;
+
+	/* XDP variant doesn't support multi-buffer segment check (yet) */
+	if (flags & ~BPF_MTU_CHK_RELAX)
+		return -EINVAL;
+
+	mtu = __bpf_lookup_mtu(dev, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len += len_diff; /* len_diff can be negative; negative result passes check */
+	if (len > mtu)
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_check_mtu_proto = {
+	.func		= bpf_skb_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_check_mtu_proto = {
+	.func		= bpf_xdp_check_mtu,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
 {
@@ -7158,6 +7274,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_skb_check_mtu_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
 	case BPF_FUNC_sk_storage_get:
@@ -7227,6 +7345,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
+	case BPF_FUNC_check_mtu:
+		return &bpf_xdp_check_mtu_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 03c042e3a34c..c7ac1fab5e8b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3745,6 +3745,63 @@ union bpf_attr {
  * 	Return
  * 		The helper returns **TC_ACT_REDIRECT** on success or
  * 		**TC_ACT_SHOT** on error.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows checking the MTU prior to changing the packet ctx.
+ *
+ *		Specifying *ifindex* zero means the MTU check is performed
+ *		against the current net device.  This is practical if the helper
+ *		isn't used prior to a redirect.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_RELAX**
+ *			This flag relaxes (increases) the MTU with room for one
+ *			VLAN header (4 bytes). This relaxation is also used by
+ *			the kernel's own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_SEGS**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If the packet context contains extra packet segment buffers
+ *			(often known as a GSO skb), then the MTU check is partly
+ *			skipped, because in the transmit path it is possible for
+ *			the skb packet to get re-segmented (depending on net device
+ *			features).  This could still be an MTU violation, so this
+ *			flag enables performing the MTU check against segments,
+ *			with a different violation return code to tell it apart.
+ *
+ *		The *mtu_result* pointer contains the MTU value of the net
+ *		device including the L2 header size (usually 14 bytes Ethernet
+ *		header). The net device configured MTU is the L3 size, but as
+ *		XDP and TX length operate at L2, this helper includes the L2
+ *		header size in the reported MTU.
+ *
+ *	Return
+ *		* 0 on success, and the MTU value is populated in *mtu_result*.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate the
+ *		MTU value in the *mtu_result* pointer, as this can be needed
+ *		for implementing PMTU handling:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3903,6 +3960,7 @@ union bpf_attr {
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
 	FN(redirect_peer),		\
+	FN(check_mtu),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4927,6 +4985,18 @@ struct bpf_redir_neigh {
 	};
 };
 
+/* bpf_check_mtu flags */
+enum bpf_check_mtu_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_SEGS  = (1U << 1),
+};
+
+enum bpf_check_mtu_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */




* [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2020-10-27 16:27 ` [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-27 16:27 ` Jesper Dangaard Brouer
  2020-10-27 16:27 ` [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
  4 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:27 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The use-case for dropping the MTU check when TC-BPF does redirect to
ingress is described by Eyal Birger in email[0]. The summary is the
ability to increase the packet size (e.g. with IPv6 headers for NAT64),
redirect the packet to ingress, and let the normal netstack fragment the
packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
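
A hedged sketch of that use-case (DEST_IFINDEX is a placeholder and the
header rewriting is omitted; this is not code from the patch):

  SEC("classifier")
  int tc_nat64_to_ingress(struct __sk_buff *skb)
  {
          /* Convert the skb from IPv4 to IPv6, which grows the packet by
           * 20 bytes of header room; the program must then fill in the
           * actual IPv6 header fields.
           */
          if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IPV6), 0))
                  return TC_ACT_SHOT;

          /* ... write the IPv6 header ... */

          /* Redirect to the ingress path of DEST_IFINDEX.  With this patch
           * the MTU check is skipped on this ingress redirect and the
           * normal netstack will fragment the now-larger packet if needed.
           */
          return bpf_redirect(DEST_IFINDEX, BPF_F_INGRESS);
  }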

V4:
 - Keep net_device "up" (IFF_UP) check.
 - Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |   31 +++++++++++++++++++++++++++++--
 net/core/dev.c            |   19 ++-----------------
 net/core/filter.c         |   14 +++++++++++---
 3 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 964b494b0e8d..bd02ddab8dfe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3891,11 +3891,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
+static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
+						 const struct sk_buff *skb,
+						 const bool check_mtu)
+{
+	const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+	unsigned int len;
+
+	if (!(dev->flags & IFF_UP))
+		return false;
+
+	if (!check_mtu)
+		return true;
+
+	len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+	if (skb->len <= len)
+		return true;
+
+	/* if TSO is enabled, we don't care about the length as the packet
+	 * could be forwarded without being segmented before
+	 */
+	if (skb_is_gso(skb))
+		return true;
+
+	return false;
+}
+
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool check_mtu)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9499a414d67e..445ccf92c149 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2188,28 +2188,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 
 bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 {
-	unsigned int len;
-
-	if (!(dev->flags & IFF_UP))
-		return false;
-
-	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
-	if (skb->len <= len)
-		return true;
-
-	/* if TSO is enabled, we don't care about the length as the packet
-	 * could be forwarded without being segmented before
-	 */
-	if (skb_is_gso(skb))
-		return true;
-
-	return false;
+	return __is_skb_forwardable(dev, skb, true);
 }
 EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index d66a9cba8e14..6557c4e8858c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;
@@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			goto out_drop;
 		dev = ops->ndo_get_peer_dev(dev);
 		if (unlikely(!dev ||
-			     !is_skb_forwardable(dev, skb) ||
+			     !__is_skb_forwardable(dev, skb, false) ||
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
 		skb->dev = dev;




* [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs
  2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2020-10-27 16:27 ` [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2020-10-27 16:27 ` Jesper Dangaard Brouer
  4 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:27 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This change makes it possible to identify SKBs that have been redirected
by TC-BPF (cls_act). This is needed for a number of cases:

(1) For collaborating with driver ifb net_devices.
(2) For avoiding starting a generic-XDP prog on a TC ingress redirect.

It is most important to fix the XDP case (2), because this can break
userspace when a driver gets support for native-XDP. Imagine userspace
loads an XDP prog on eth0, which falls back to generic-XDP, and it
processes TC-redirected packets. When the kernel is updated with native-XDP
support for eth0, the program no longer sees the TC-redirected packets.
Therefore it is important to keep the order intact: XDP runs before TC-BPF.
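
As an illustration of how the mark can be consumed (this consumer is a
hypothetical sketch, not part of this patch), a generic-XDP entry point
could skip SKBs that TC-BPF already redirected:

  /* SKBs marked via skb_set_redirected() in sch_handle_ingress() have
   * already passed (native or generic) XDP on the original device, so
   * don't run generic-XDP on them again after the TC redirect.
   */
  if (skb_is_redirected(skb))
          return XDP_PASS;

  return do_xdp_generic(xdp_prog, skb);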

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/dev.c    |    2 ++
 net/sched/Kconfig |    1 +
 2 files changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 445ccf92c149..930c165a607e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		return NULL;
 	case TC_ACT_REDIRECT:
 		/* No need to push/pop skb's mac_header here on egress! */
+		skb_set_redirected(skb, false);
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
@@ -4959,6 +4960,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		 * redirecting to another netdev
 		 */
 		__skb_push(skb, skb->mac_len);
+		skb_set_redirected(skb, true);
 		if (skb_do_redirect(skb) == -EAGAIN) {
 			__skb_pull(skb, skb->mac_len);
 			*another = true;
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..a1bbaa8fd054 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -384,6 +384,7 @@ config NET_SCH_INGRESS
 	depends on NET_CLS_ACT
 	select NET_INGRESS
 	select NET_EGRESS
+	select NET_REDIRECT
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,




* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-27 17:15   ` David Ahern
  2020-10-30 17:01     ` Jesper Dangaard Brouer
  2020-10-28 12:49   ` Dan Carpenter
  1 sibling, 1 reply; 11+ messages in thread
From: David Ahern @ 2020-10-27 17:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

On 10/27/20 10:26 AM, Jesper Dangaard Brouer wrote:
> The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The
> BPF-prog doesn't know the MTU value that caused this rejection.
> 
> If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> needs to know this MTU value for the ICMP packet.
> 
> This patch changes the lookup and result struct bpf_fib_lookup to contain
> this MTU value as output, via a union with 'tot_len', as this is the value
> used for the MTU lookup.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h       |   11 +++++++++--
>  net/core/filter.c              |   17 ++++++++++++-----
>  tools/include/uapi/linux/bpf.h |   11 +++++++++--
>  3 files changed, 30 insertions(+), 9 deletions(-)
> 


Reviewed-by: David Ahern <dsahern@kernel.org>




* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
  2020-10-27 17:15   ` David Ahern
@ 2020-10-28 12:49   ` Dan Carpenter
  2020-10-30 14:35     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 11+ messages in thread
From: Dan Carpenter @ 2020-10-28 12:49 UTC (permalink / raw)
  To: kbuild, Jesper Dangaard Brouer, bpf
  Cc: lkp, kbuild-all, Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek


Hi Jesper,

url:    https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/bpf-New-approach-for-BPF-MTU-handling/20201028-002919
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
config: i386-randconfig-m021-20201026 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

New smatch warnings:
net/core/filter.c:5395 bpf_ipv4_fib_lookup() error: uninitialized symbol 'mtu'.

vim +/mtu +5395 net/core/filter.c

87f5fc7e48dd317 David Ahern            2018-05-09  5281  static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
4f74fede40df8db David Ahern            2018-05-21  5282  			       u32 flags, bool check_mtu)
87f5fc7e48dd317 David Ahern            2018-05-09  5283  {
eba618abacade71 David Ahern            2019-04-02  5284  	struct fib_nh_common *nhc;
87f5fc7e48dd317 David Ahern            2018-05-09  5285  	struct in_device *in_dev;
87f5fc7e48dd317 David Ahern            2018-05-09  5286  	struct neighbour *neigh;
87f5fc7e48dd317 David Ahern            2018-05-09  5287  	struct net_device *dev;
87f5fc7e48dd317 David Ahern            2018-05-09  5288  	struct fib_result res;
87f5fc7e48dd317 David Ahern            2018-05-09  5289  	struct flowi4 fl4;
87f5fc7e48dd317 David Ahern            2018-05-09  5290  	int err;
4f74fede40df8db David Ahern            2018-05-21  5291  	u32 mtu;
                                                                ^^^^^^^^

87f5fc7e48dd317 David Ahern            2018-05-09  5292  
87f5fc7e48dd317 David Ahern            2018-05-09  5293  	dev = dev_get_by_index_rcu(net, params->ifindex);
87f5fc7e48dd317 David Ahern            2018-05-09  5294  	if (unlikely(!dev))
87f5fc7e48dd317 David Ahern            2018-05-09  5295  		return -ENODEV;
87f5fc7e48dd317 David Ahern            2018-05-09  5296  
87f5fc7e48dd317 David Ahern            2018-05-09  5297  	/* verify forwarding is enabled on this interface */
87f5fc7e48dd317 David Ahern            2018-05-09  5298  	in_dev = __in_dev_get_rcu(dev);
87f5fc7e48dd317 David Ahern            2018-05-09  5299  	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
4c79579b44b1876 David Ahern            2018-06-26  5300  		return BPF_FIB_LKUP_RET_FWD_DISABLED;
87f5fc7e48dd317 David Ahern            2018-05-09  5301  
87f5fc7e48dd317 David Ahern            2018-05-09  5302  	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
87f5fc7e48dd317 David Ahern            2018-05-09  5303  		fl4.flowi4_iif = 1;
87f5fc7e48dd317 David Ahern            2018-05-09  5304  		fl4.flowi4_oif = params->ifindex;
87f5fc7e48dd317 David Ahern            2018-05-09  5305  	} else {
87f5fc7e48dd317 David Ahern            2018-05-09  5306  		fl4.flowi4_iif = params->ifindex;
87f5fc7e48dd317 David Ahern            2018-05-09  5307  		fl4.flowi4_oif = 0;
87f5fc7e48dd317 David Ahern            2018-05-09  5308  	}
87f5fc7e48dd317 David Ahern            2018-05-09  5309  	fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
87f5fc7e48dd317 David Ahern            2018-05-09  5310  	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
87f5fc7e48dd317 David Ahern            2018-05-09  5311  	fl4.flowi4_flags = 0;
87f5fc7e48dd317 David Ahern            2018-05-09  5312  
87f5fc7e48dd317 David Ahern            2018-05-09  5313  	fl4.flowi4_proto = params->l4_protocol;
87f5fc7e48dd317 David Ahern            2018-05-09  5314  	fl4.daddr = params->ipv4_dst;
87f5fc7e48dd317 David Ahern            2018-05-09  5315  	fl4.saddr = params->ipv4_src;
87f5fc7e48dd317 David Ahern            2018-05-09  5316  	fl4.fl4_sport = params->sport;
87f5fc7e48dd317 David Ahern            2018-05-09  5317  	fl4.fl4_dport = params->dport;
1869e226a7b3ef7 David Ahern            2020-09-13  5318  	fl4.flowi4_multipath_hash = 0;
87f5fc7e48dd317 David Ahern            2018-05-09  5319  
87f5fc7e48dd317 David Ahern            2018-05-09  5320  	if (flags & BPF_FIB_LOOKUP_DIRECT) {
87f5fc7e48dd317 David Ahern            2018-05-09  5321  		u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
87f5fc7e48dd317 David Ahern            2018-05-09  5322  		struct fib_table *tb;
87f5fc7e48dd317 David Ahern            2018-05-09  5323  
87f5fc7e48dd317 David Ahern            2018-05-09  5324  		tb = fib_get_table(net, tbid);
87f5fc7e48dd317 David Ahern            2018-05-09  5325  		if (unlikely(!tb))
4c79579b44b1876 David Ahern            2018-06-26  5326  			return BPF_FIB_LKUP_RET_NOT_FWDED;
87f5fc7e48dd317 David Ahern            2018-05-09  5327  
87f5fc7e48dd317 David Ahern            2018-05-09  5328  		err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
87f5fc7e48dd317 David Ahern            2018-05-09  5329  	} else {
87f5fc7e48dd317 David Ahern            2018-05-09  5330  		fl4.flowi4_mark = 0;
87f5fc7e48dd317 David Ahern            2018-05-09  5331  		fl4.flowi4_secid = 0;
87f5fc7e48dd317 David Ahern            2018-05-09  5332  		fl4.flowi4_tun_key.tun_id = 0;
87f5fc7e48dd317 David Ahern            2018-05-09  5333  		fl4.flowi4_uid = sock_net_uid(net, NULL);
87f5fc7e48dd317 David Ahern            2018-05-09  5334  
87f5fc7e48dd317 David Ahern            2018-05-09  5335  		err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
87f5fc7e48dd317 David Ahern            2018-05-09  5336  	}
87f5fc7e48dd317 David Ahern            2018-05-09  5337  
4c79579b44b1876 David Ahern            2018-06-26  5338  	if (err) {
4c79579b44b1876 David Ahern            2018-06-26  5339  		/* map fib lookup errors to RTN_ type */
4c79579b44b1876 David Ahern            2018-06-26  5340  		if (err == -EINVAL)
4c79579b44b1876 David Ahern            2018-06-26  5341  			return BPF_FIB_LKUP_RET_BLACKHOLE;
4c79579b44b1876 David Ahern            2018-06-26  5342  		if (err == -EHOSTUNREACH)
4c79579b44b1876 David Ahern            2018-06-26  5343  			return BPF_FIB_LKUP_RET_UNREACHABLE;
4c79579b44b1876 David Ahern            2018-06-26  5344  		if (err == -EACCES)
4c79579b44b1876 David Ahern            2018-06-26  5345  			return BPF_FIB_LKUP_RET_PROHIBIT;
4c79579b44b1876 David Ahern            2018-06-26  5346  
4c79579b44b1876 David Ahern            2018-06-26  5347  		return BPF_FIB_LKUP_RET_NOT_FWDED;
4c79579b44b1876 David Ahern            2018-06-26  5348  	}
4c79579b44b1876 David Ahern            2018-06-26  5349  
4c79579b44b1876 David Ahern            2018-06-26  5350  	if (res.type != RTN_UNICAST)
4c79579b44b1876 David Ahern            2018-06-26  5351  		return BPF_FIB_LKUP_RET_NOT_FWDED;
87f5fc7e48dd317 David Ahern            2018-05-09  5352  
5481d73f81549e2 David Ahern            2019-06-03  5353  	if (fib_info_num_path(res.fi) > 1)
87f5fc7e48dd317 David Ahern            2018-05-09  5354  		fib_select_path(net, &res, &fl4, NULL);
87f5fc7e48dd317 David Ahern            2018-05-09  5355  
4f74fede40df8db David Ahern            2018-05-21  5356  	if (check_mtu) {
4f74fede40df8db David Ahern            2018-05-21  5357  		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27  5358  		if (params->tot_len > mtu) {
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27  5359  			params->mtu = mtu; /* union with tot_len */
4c79579b44b1876 David Ahern            2018-06-26  5360  			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
4f74fede40df8db David Ahern            2018-05-21  5361  		}
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27  5362  	}

"mtu" not initialized on else path.

4f74fede40df8db David Ahern            2018-05-21  5363  
eba618abacade71 David Ahern            2019-04-02  5364  	nhc = res.nhc;
87f5fc7e48dd317 David Ahern            2018-05-09  5365  
87f5fc7e48dd317 David Ahern            2018-05-09  5366  	/* do not handle lwt encaps right now */
eba618abacade71 David Ahern            2019-04-02  5367  	if (nhc->nhc_lwtstate)
4c79579b44b1876 David Ahern            2018-06-26  5368  		return BPF_FIB_LKUP_RET_UNSUPP_LWT;
87f5fc7e48dd317 David Ahern            2018-05-09  5369  
eba618abacade71 David Ahern            2019-04-02  5370  	dev = nhc->nhc_dev;
87f5fc7e48dd317 David Ahern            2018-05-09  5371  
87f5fc7e48dd317 David Ahern            2018-05-09  5372  	params->rt_metric = res.fi->fib_priority;
d1c362e1dd68a42 Toke Høiland-Jørgensen 2020-10-09  5373  	params->ifindex = dev->ifindex;
87f5fc7e48dd317 David Ahern            2018-05-09  5374  
87f5fc7e48dd317 David Ahern            2018-05-09  5375  	/* xdp and cls_bpf programs are run in RCU-bh so
87f5fc7e48dd317 David Ahern            2018-05-09  5376  	 * rcu_read_lock_bh is not needed here
87f5fc7e48dd317 David Ahern            2018-05-09  5377  	 */
6f5f68d05ec0f64 David Ahern            2019-04-05  5378  	if (likely(nhc->nhc_gw_family != AF_INET6)) {
6f5f68d05ec0f64 David Ahern            2019-04-05  5379  		if (nhc->nhc_gw_family)
6f5f68d05ec0f64 David Ahern            2019-04-05  5380  			params->ipv4_dst = nhc->nhc_gw.ipv4;
6f5f68d05ec0f64 David Ahern            2019-04-05  5381  
6f5f68d05ec0f64 David Ahern            2019-04-05  5382  		neigh = __ipv4_neigh_lookup_noref(dev,
6f5f68d05ec0f64 David Ahern            2019-04-05  5383  						 (__force u32)params->ipv4_dst);
6f5f68d05ec0f64 David Ahern            2019-04-05  5384  	} else {
6f5f68d05ec0f64 David Ahern            2019-04-05  5385  		struct in6_addr *dst = (struct in6_addr *)params->ipv6_dst;
6f5f68d05ec0f64 David Ahern            2019-04-05  5386  
6f5f68d05ec0f64 David Ahern            2019-04-05  5387  		params->family = AF_INET6;
6f5f68d05ec0f64 David Ahern            2019-04-05  5388  		*dst = nhc->nhc_gw.ipv6;
6f5f68d05ec0f64 David Ahern            2019-04-05  5389  		neigh = __ipv6_neigh_lookup_noref_stub(dev, dst);
6f5f68d05ec0f64 David Ahern            2019-04-05  5390  	}
6f5f68d05ec0f64 David Ahern            2019-04-05  5391  
4c79579b44b1876 David Ahern            2018-06-26  5392  	if (!neigh)
4c79579b44b1876 David Ahern            2018-06-26  5393  		return BPF_FIB_LKUP_RET_NO_NEIGH;
87f5fc7e48dd317 David Ahern            2018-05-09  5394  
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27 @5395  	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
                                                                                                                  ^^^
Uninitialized variable warning.

87f5fc7e48dd317 David Ahern            2018-05-09  5396  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-28 12:49   ` Dan Carpenter
@ 2020-10-30 14:35     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 14:35 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: kbuild, bpf, lkp, kbuild-all, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	brouer

On Wed, 28 Oct 2020 15:49:42 +0300
Dan Carpenter <dan.carpenter@oracle.com> wrote:

> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> New smatch warnings:
> net/core/filter.c:5395 bpf_ipv4_fib_lookup() error: uninitialized symbol 'mtu'.

I will fix and send V5.
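
One minimal way to silence it (illustrative only; V5 may structure the fix
differently) is to give 'mtu' a defined value on the !check_mtu path:

  	u32 mtu = 0;	/* stays 0 unless check_mtu computes a real value */

so that bpf_fib_set_fwd_params(params, neigh, dev, mtu) never reads an
uninitialized variable.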

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-27 17:15   ` David Ahern
@ 2020-10-30 17:01     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 17:01 UTC (permalink / raw)
  To: David Ahern
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, brouer

On Tue, 27 Oct 2020 11:15:31 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 10/27/20 10:26 AM, Jesper Dangaard Brouer wrote:
> > The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> > can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The
> > BPF-prog doesn't know the MTU value that caused this rejection.
> > 
> > If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> > needs to know this MTU value for the ICMP packet.
> > 
> > This patch changes the lookup and result struct bpf_fib_lookup to contain
> > this MTU value as output, via a union with 'tot_len', as this is the value
> > used for the MTU lookup.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/uapi/linux/bpf.h       |   11 +++++++++--
> >  net/core/filter.c              |   17 ++++++++++++-----
> >  tools/include/uapi/linux/bpf.h |   11 +++++++++--
> >  3 files changed, 30 insertions(+), 9 deletions(-)
> 
> Reviewed-by: David Ahern <dsahern@kernel.org>

Thanks a lot for the review.  I didn't carry your Reviewed-by over to V5
of this patch, as I changed the name of the output member from mtu to
mtu_result in V5.  Please review V5 and re-confirm your Reviewed-by there.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* RE: [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-30 19:24   ` John Fastabend
  0 siblings, 0 replies; 11+ messages in thread
From: John Fastabend @ 2020-10-30 19:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Jesper Dangaard Brouer wrote:
> Multiple BPF-helpers that can manipulate/increase the size of the SKB use
> __bpf_skb_max_len() as the max-length. This function limits the size
> against the current net_device MTU (skb->dev->mtu).
> 
> When a BPF-prog grows the packet size, it should not be limited to the
> MTU. The MTU is a transmit limitation, and software receiving this packet
> should be allowed to increase the size. Furthermore, the current MTU check
> in __bpf_skb_max_len uses the MTU from the ingress/current net_device,
> which in case of redirects is the wrong net_device.
> 
> Patch V4 keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
> is enforced elsewhere in the system. Jesper's testing[1] showed it was not
> possible to exceed 8KiB when expanding the SKB size via a BPF-helper. The
> limiting factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the
> SLUB allocator (CONFIG_SLUB) when PAGE_SIZE is 4096. This define is in
> effect because the allocation is done from softirq context; see
> __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
> that frames above 16KiB can cause NICs to reset (but not crash). Keep the
> sanity limit at this level, as the memory layer can differ based on kernel
> config.
> 
> [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
> 
> V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

