All of lore.kernel.org
* [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement
@ 2020-10-07 16:22 Jesper Dangaard Brouer
  2020-10-07 16:22 ` [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:22 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This patchset drops all the MTU checks in TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being
done against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU at the
netdev (via ifindex) and FIB route lookup level. This means some
BPF-helpers are added and extended to make it possible to do MTU checks
in the BPF-code. If the BPF-prog doesn't comply with the MTU, this is
enforced on the kernel side.

Since the MTU should only apply to transmitted packets, the MTU
enforcement is now done after the TC egress hook. This gives TC-BPF
programs the most flexibility and allows them to shrink the packet size
again in the egress hook prior to transmit.

This patchset is primarily focused on TC-BPF, but I've made sure that
the MTU BPF-helpers also work for XDP BPF-programs.
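To make the intended division of labor concrete, below is a toy userspace C model of the flow (all names are illustrative; none of this is kernel code): the BPF-prog is told the MTU and may shrink the packet to comply, and the kernel enforces the MTU afterwards (after the TC egress hook).

```c
#include <assert.h>
#include <stdbool.h>

/* BPF-side: a compliant prog shrinks the packet back to the MTU. */
static unsigned int model_bpf_prog(unsigned int pkt_len, unsigned int mtu)
{
	return pkt_len > mtu ? mtu : pkt_len;	/* shrink if needed */
}

/* Kernel-side enforcement after the egress hook; -90 models -EMSGSIZE. */
static int model_kernel_enforce(unsigned int pkt_len, unsigned int mtu)
{
	return pkt_len > mtu ? -90 : 0;		/* 0 models NET_XMIT_SUCCESS */
}

/* Transmit path: prog runs first, kernel check is the backstop. */
static int model_transmit(unsigned int pkt_len, unsigned int mtu,
			  bool prog_complies)
{
	if (prog_complies)
		pkt_len = model_bpf_prog(pkt_len, mtu);
	return model_kernel_enforce(pkt_len, mtu);
}
```

A prog that complies never hits the kernel-side drop; a non-compliant prog has its oversized packet rejected instead of silently violating the MTU.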

V2: New BPF-helper design

---

Jesper Dangaard Brouer (6):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: make it possible to identify BPF redirected SKBs
      bpf: Add MTU check for TC-BPF packets after egress hook
      bpf: drop MTU check when doing TC-BPF redirect to ingress


 include/linux/netdevice.h |    5 +-
 include/uapi/linux/bpf.h  |   68 ++++++++++++++++++++-
 net/core/dev.c            |   28 ++++++++
 net/core/filter.c         |  149 ++++++++++++++++++++++++++++++++++++++++++---
 net/sched/Kconfig         |    1 +
 5 files changed, 235 insertions(+), 16 deletions(-)

--


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
@ 2020-10-07 16:22 ` Jesper Dangaard Brouer
  2020-10-07 21:26   ` Daniel Borkmann
  2020-10-07 23:46   ` Maciej Żenczykowski
  2020-10-07 16:22 ` [PATCH bpf-next V2 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:22 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

Multiple BPF-helpers that can manipulate/increase the size of the SKB
use __bpf_skb_max_len() as the max-length. This function limits the size
against the current net_device MTU (skb->dev->mtu).

When a BPF-prog grows the packet size, it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this packet
should be allowed to increase the size. Furthermore, the current MTU
check in __bpf_skb_max_len uses the MTU from the ingress/current
net_device, which in case of redirects is the wrong net_device.

Keep a sanity max limit of IP_MAX_MTU, which is 64KiB.

In later patches we will enforce the MTU limitation when transmitting
packets.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
(imported from commit 37f8552786cf46588af52b77829b730dd14524d3)
---
 net/core/filter.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 05df73780dd3..fed239e77bdc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3476,8 +3476,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 
 static u32 __bpf_skb_max_len(const struct sk_buff *skb)
 {
-	return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
-			  SKB_MAX_ALLOC;
+	return IP_MAX_MTU;
 }
 
 BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next V2 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up
  2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
  2020-10-07 16:22 ` [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-07 16:22 ` Jesper Dangaard Brouer
  2020-10-07 16:22 ` [PATCH bpf-next V2 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:22 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED.  The
BPF-prog doesn't know the MTU value that caused this rejection.

If the BPF-prog wants to implement PMTU (Path MTU Discovery, RFC 1191),
it needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to carry
this MTU value as output, via a union with 'tot_len', as this is the
value used for the MTU lookup.
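The union semantics can be sketched with a simplified stand-in struct (not the real UAPI layout; field names follow the diff below): on an MTU violation, the same bytes that held the input tot_len are overwritten with the route MTU, so the caller reads params->mtu only when the lookup reported FRAG_NEEDED.

```c
#include <assert.h>

/* Simplified stand-in for struct bpf_fib_lookup; only the tot_len/mtu
 * union from this patch is modeled. */
struct model_fib_params {
	union {	/* used for MTU check */
		unsigned short tot_len;	/* input: pkt len from network hdr */
		unsigned short mtu;	/* output: MTU when check fails */
	};
};

enum { MODEL_LKUP_RET_SUCCESS = 0, MODEL_LKUP_RET_FRAG_NEEDED = 1 };

/* Model of the lookup path: on violation, overwrite tot_len with MTU. */
static int model_fib_lookup(struct model_fib_params *params,
			    unsigned short route_mtu)
{
	if (params->tot_len > route_mtu) {
		params->mtu = route_mtu;	/* union with tot_len */
		return MODEL_LKUP_RET_FRAG_NEEDED;
	}
	return MODEL_LKUP_RET_SUCCESS;
}
```

Reusing the bytes via a union keeps the UAPI struct size unchanged, at the cost of the input tot_len being clobbered on the failure path.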

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h |   11 +++++++++--
 net/core/filter.c        |   17 ++++++++++++-----
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c446394135be..50ce65e37b16 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2216,6 +2216,9 @@ union bpf_attr {
  *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
  *		  packet is not forwarded or needs assist from full stack
  *
+ *		If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ *		was exceeded and result params->mtu contains the MTU.
+ *
  * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
  *		Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4844,9 +4847,13 @@ struct bpf_fib_lookup {
 	__be16	sport;
 	__be16	dport;
 
-	/* total length of packet from network header - used for MTU check */
-	__u16	tot_len;
+	union {	/* used for MTU check */
+		/* input to lookup */
+		__u16	tot_len; /* total length of packet from network hdr */
 
+		/* output: MTU value (if requested check_mtu) */
+		__u16	mtu;
+	};
 	/* input: L3 device index for lookup
 	 * output: device index from FIB lookup
 	 */
diff --git a/net/core/filter.c b/net/core/filter.c
index fed239e77bdc..d84723f347c0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5185,13 +5185,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
 static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 				  const struct neighbour *neigh,
-				  const struct net_device *dev)
+				  const struct net_device *dev, u32 mtu)
 {
 	memcpy(params->dmac, neigh->ha, ETH_ALEN);
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
 	params->ifindex = dev->ifindex;
+	params->mtu = mtu;
 
 	return 0;
 }
@@ -5275,8 +5276,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	nhc = res.nhc;
@@ -5309,7 +5312,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5401,8 +5404,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
-		if (params->tot_len > mtu)
+		if (params->tot_len > mtu) {
+			params->mtu = mtu; /* union with tot_len */
 			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+		}
 	}
 
 	if (res.nh->fib_nh_lws)
@@ -5421,7 +5426,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (!neigh)
 		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return bpf_fib_set_fwd_params(params, neigh, dev);
+	return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
 }
 #endif
 
@@ -5490,6 +5495,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
 			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
+
+		params->mtu = dev->mtu; /* union with tot_len */
 	}
 
 	return rc;



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next V2 3/6] bpf: add BPF-helper for MTU checking
  2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
  2020-10-07 16:22 ` [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
  2020-10-07 16:22 ` [PATCH bpf-next V2 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-07 16:22 ` Jesper Dangaard Brouer
  2020-10-07 16:22 ` [PATCH bpf-next V2 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:22 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This BPF-helper bpf_mtu_check() works for both XDP and TC-BPF programs.

The API is designed to help the BPF-programmer who wants to do packet
context size changes, which involve other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse arguments
needed by these other helpers, and to perform the MTU check prior to
doing any actual size adjustment of the packet context.
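The intended helper semantics, including the BPF_MTU_CHK_RELAX relaxation, can be modeled in plain userspace C (illustrative names, simplified; the real helper also handles the GSO case):

```c
#include <assert.h>

#define MODEL_VLAN_HLEN 4

enum { MODEL_MTU_CHK_RELAX = 1u << 0 };
enum { MODEL_CHK_SUCCESS = 0, MODEL_CHK_FRAG_NEEDED = 1 };

struct model_dev { unsigned int mtu, hard_header_len; };

/* Model of bpf_mtu_check(): apply len_diff (clamped at zero), optionally
 * relax the MTU as the kernel's own forwarding checks do, then compare. */
static int model_bpf_mtu_check(const struct model_dev *dev,
			       unsigned int pkt_len, int len_diff,
			       unsigned long flags, unsigned int *mtu_result)
{
	unsigned int mtu = dev->mtu;
	long len = (long)pkt_len + len_diff;	/* len_diff can be negative */

	if (len < 0)
		len = 0;

	/* Same relaxation as xdp_ok_fwd_dev()/is_skb_forwardable() */
	if (flags & MODEL_MTU_CHK_RELAX)
		mtu += dev->hard_header_len + MODEL_VLAN_HLEN;

	*mtu_result = mtu;
	return (unsigned long)len > mtu ? MODEL_CHK_FRAG_NEEDED
					: MODEL_CHK_SUCCESS;
}
```

Passing the planned len_diff lets the prog reject a grow operation before touching the packet context, which is the intended usage pattern.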

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h |   57 ++++++++++++++++++++++
 net/core/filter.c        |  117 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 174 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 50ce65e37b16..64cdad06135e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3718,6 +3718,50 @@ union bpf_attr {
  *		never return NULL.
  *	Return
  *		A pointer pointing to the kernel percpu variable on this cpu.
+ *
+ * int bpf_mtu_check(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ *	Description
+ *		Check ctx packet size against MTU of net device (based on
+ *		*ifindex*).  This helper will likely be used in combination with
+ *		helpers that adjust/change the packet size.  The argument
+ *		*len_diff* can be used for querying with a planned size
+ *		change. This allows checking the MTU prior to changing the packet ctx.
+ *
+ *		The Linux kernel route table can configure MTUs on a more
+ *		specific per route level, which is not provided by this helper.
+ *		For route level MTU checks use the **bpf_fib_lookup**\ ()
+ *		helper.
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *		The *flags* argument can be a combination of one or more of the
+ *		following values:
+ *
+ *		**BPF_MTU_CHK_RELAX**
+ *			This flag relaxes or increases the MTU with room for one
+ *			VLAN header (4 bytes) and takes into account the net
+ *			device hard_header_len.  This relaxation is also used by
+ *			the kernel's own forwarding MTU checks.
+ *
+ *		**BPF_MTU_CHK_GSO**
+ *			This flag only works for *ctx* **struct sk_buff**.
+ *			If the packet context contains extra packet segment
+ *			buffers (often known as frags), then those are also checked
+ *			against the MTU size.
+ *
+ *	Return
+ *		* 0 on success, and the MTU value is populated in *mtu_result*.
+ *
+ *		* < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ *		MTU violations return positive values, but also populate the
+ *		MTU value in the *mtu_result* pointer, as this can be needed
+ *		for implementing PMTU handling:
+ *
+ *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ *		* **BPF_MTU_CHK_RET_GSO_TOOBIG**
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3875,6 +3919,7 @@ union bpf_attr {
 	FN(redirect_neigh),		\
 	FN(bpf_per_cpu_ptr),            \
 	FN(bpf_this_cpu_ptr),		\
+	FN(mtu_check),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4889,6 +4934,18 @@ struct bpf_fib_lookup {
 	__u8	dmac[6];     /* ETH_ALEN */
 };
 
+/* bpf_mtu_check flags */
+enum bpf_mtu_check_flags {
+	BPF_MTU_CHK_RELAX = (1U << 0),
+	BPF_MTU_CHK_GSO   = (1U << 1),
+};
+
+enum bpf_mtu_check_ret {
+	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
+	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+	BPF_MTU_CHK_RET_GSO_TOOBIG,   /* GSO re-segmentation needed to fwd */
+};
+
 enum bpf_task_fd_type {
 	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
 	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index d84723f347c0..54b779e34f83 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5512,6 +5512,119 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static int bpf_mtu_lookup(struct net *netns, u32 ifindex, u64 flags)
+{
+	struct net_device *dev;
+	int mtu;
+
+	dev = dev_get_by_index_rcu(netns, ifindex);
+	if (!dev)
+		return -ENODEV;
+
+	mtu = dev->mtu;
+
+	/*  Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
+	if (flags & BPF_MTU_CHK_RELAX)
+		mtu += dev->hard_header_len + VLAN_HLEN;
+
+	return mtu;
+}
+
+static unsigned int __bpf_len_adjust_positive(unsigned int len, int len_diff)
+{
+	int len_new = len + len_diff; /* notice len_diff can be negative */
+
+	if (len_new > 0)
+		return len_new;
+
+	return 0;
+}
+
+BPF_CALL_5(bpf_skb_mtu_check, struct sk_buff *, skb,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	struct net *netns = dev_net(skb->dev);
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	unsigned int len = skb->len;
+	int mtu;
+
+	if (flags & ~(BPF_MTU_CHK_RELAX|BPF_MTU_CHK_GSO))
+		return -EINVAL;
+
+	mtu = bpf_mtu_lookup(netns, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu;  /* errno */
+
+	len = __bpf_len_adjust_positive(len, len_diff);
+	if (len > mtu) {
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+		goto out;
+	}
+
+	if (flags & BPF_MTU_CHK_GSO &&
+	    skb_is_gso(skb) &&
+	    !skb_gso_validate_network_len(skb, mtu)) {
+		ret = BPF_MTU_CHK_RET_GSO_TOOBIG;
+		goto out;
+	}
+
+out:
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+BPF_CALL_5(bpf_xdp_mtu_check, struct xdp_buff *, xdp,
+	   u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+	unsigned int len = xdp->data_end - xdp->data;
+	struct net_device *dev = xdp->rxq->dev;
+	struct net *netns = dev_net(dev);
+	int ret = BPF_MTU_CHK_RET_SUCCESS;
+	int mtu;
+
+	/* XDP variant doesn't support multi-buffer segment check (yet) */
+	if (flags & ~BPF_MTU_CHK_RELAX)
+		return -EINVAL;
+
+	mtu = bpf_mtu_lookup(netns, ifindex, flags);
+	if (unlikely(mtu < 0))
+		return mtu; /* errno */
+
+	len = __bpf_len_adjust_positive(len, len_diff);
+	if (len > mtu) {
+		ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+		goto out;
+	}
+out:
+	if (mtu_result)
+		*mtu_result = mtu;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_mtu_check_proto = {
+	.func		= bpf_skb_mtu_check,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_mtu_check_proto = {
+	.func		= bpf_xdp_mtu_check,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_PTR_TO_MEM,
+	.arg4_type      = ARG_ANYTHING,
+	.arg5_type      = ARG_ANYTHING,
+};
+
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
 {
@@ -7075,6 +7188,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_mtu_check:
+		return &bpf_skb_mtu_check_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
 	case BPF_FUNC_sk_storage_get:
@@ -7144,6 +7259,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
+	case BPF_FUNC_mtu_check:
+		return &bpf_xdp_mtu_check_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next V2 4/6] bpf: make it possible to identify BPF redirected SKBs
  2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2020-10-07 16:22 ` [PATCH bpf-next V2 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-07 16:22 ` Jesper Dangaard Brouer
  2020-10-07 16:23 ` [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook Jesper Dangaard Brouer
  2020-10-07 16:23 ` [PATCH bpf-next V2 6/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
  5 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:22 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

This change makes it possible to identify SKBs that have been redirected
by TC-BPF (cls_act). This is needed for a number of cases.

(1) For collaborating with driver ifb net_devices.
(2) For avoiding running the generic-XDP prog on TC ingress redirect.
(3) The next MTU check patches need the ability to identify redirected SKBs.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/dev.c    |    2 ++
 net/sched/Kconfig |    1 +
 2 files changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9d55bf5d1a65..b433098896b2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3885,6 +3885,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 		return NULL;
 	case TC_ACT_REDIRECT:
 		/* No need to push/pop skb's mac_header here on egress! */
+		skb_set_redirected(skb, false);
 		skb_do_redirect(skb);
 		*ret = NET_XMIT_SUCCESS;
 		return NULL;
@@ -4974,6 +4975,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		 * redirecting to another netdev
 		 */
 		__skb_push(skb, skb->mac_len);
+		skb_set_redirected(skb, true);
 		skb_do_redirect(skb);
 		return NULL;
 	case TC_ACT_CONSUMED:
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..a1bbaa8fd054 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -384,6 +384,7 @@ config NET_SCH_INGRESS
 	depends on NET_CLS_ACT
 	select NET_INGRESS
 	select NET_EGRESS
+	select NET_REDIRECT
 	help
 	  Say Y here if you want to use classifiers for incoming and/or outgoing
 	  packets. This qdisc doesn't do anything else besides running classifiers,



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2020-10-07 16:22 ` [PATCH bpf-next V2 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
@ 2020-10-07 16:23 ` Jesper Dangaard Brouer
  2020-10-07 21:37   ` Daniel Borkmann
  2020-10-07 16:23 ` [PATCH bpf-next V2 6/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
  5 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:23 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The MTU should only apply to transmitted packets.

When TC-ingress redirects a packet to egress on another netdev, the
normal netstack MTU checks are skipped (and the driver level will not
catch any MTU violation; checked with ixgbe).

This patch chooses not to add an MTU check in the egress code path of
skb_do_redirect() prior to calling dev_queue_xmit(), because it is still
possible to run another BPF egress program that will shrink/consume
headers, which will make the packet comply with the netdev MTU. This
use-case might already be in production (if the ingress MTU is larger
than the egress MTU).

Instead, do the MTU check after the sch_handle_egress() step, for the
cases that require it.

These cases need a bit of explaining. Ingress-to-egress redirected
packets could be detected via the skb->tc_at_ingress bit, but that is
not reliable, because sch_handle_egress() could steal the packet and
redirect it (again) to another egress netdev, which would then have
skb->tc_at_ingress cleared. There is also the case of a TC-egress prog
increasing the packet size and then redirecting it egress. Thus, it is
more reliable to do the MTU check for any redirected packet (both
ingress and egress), which is detectable via skb_is_redirected() from
the earlier patch. Also handle the case where the egress BPF-prog
increased the size.

One advantage of this approach is that an ingress-to-egress BPF-prog can
send information via packet data. With the MTU checks removed from the
helpers, and also not done in the skb_do_redirect() call, an ingress
BPF-prog can communicate with an egress BPF-prog via packet data, as
long as the egress BPF-prog removes it prior to transmitting the packet.

Troubleshooting: MTU violations are recorded in the TX dropped counter,
and a kprobe on dev_queue_xmit() will see retval -EMSGSIZE.
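The enforcement condition can be modeled as follows (plain userspace C, illustrative names; -90 stands in for -EMSGSIZE, and the real code uses is_skb_forwardable() rather than a raw length compare):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the check added to __dev_queue_xmit(): the MTU check fires
 * only for redirected SKBs, or when the egress BPF prog grew the packet. */
static int model_queue_xmit(bool redirected, bool egress_prog_ran,
			    unsigned int len_before_egress,
			    unsigned int len_after_egress,
			    unsigned int dev_mtu)
{
	bool mtu_check = redirected;

	if (egress_prog_ran && len_after_egress > len_before_egress)
		mtu_check = true;	/* egress prog grew the packet */

	if (mtu_check && len_after_egress > dev_mtu)
		return -90;		/* -EMSGSIZE: counted as TX dropped */

	return 0;			/* models NET_XMIT_SUCCESS */
}
```

Locally generated, non-redirected packets skip the check entirely, so the fast path for ordinary traffic is unchanged apart from the flag test.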

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/dev.c |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b433098896b2..19406013f93e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
 	case TC_ACT_OK:
 	case TC_ACT_RECLASSIFY:
+		*ret = NET_XMIT_SUCCESS;
 		skb->tc_index = TC_H_MIN(cl_res.classid);
 		break;
 	case TC_ACT_SHOT:
@@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
+#ifdef CONFIG_NET_CLS_ACT
+	bool mtu_check = false;
+#endif
+	bool again = false;
 	struct Qdisc *q;
 	int rc = -ENOMEM;
-	bool again = false;
 
 	skb_reset_mac_header(skb);
 
@@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
+	mtu_check = skb_is_redirected(skb);
 	skb->tc_at_ingress = 0;
 # ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
+		unsigned int len_orig = skb->len;
+
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
+		/* BPF-prog ran and could have changed packet size beyond MTU */
+		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
+			mtu_check = true;
 	}
 # endif
+	/* MTU-check only happens on "last" net_device in a redirect sequence
+	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
+	 * either ingress or egress to another device).
+	 */
+	if (mtu_check && !is_skb_forwardable(dev, skb)) {
+		rc = -EMSGSIZE;
+		goto drop;
+	}
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
@@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	rc = -ENETDOWN;
 	rcu_read_unlock_bh();
-
+#ifdef CONFIG_NET_CLS_ACT
+drop:
+#endif
 	atomic_long_inc(&dev->tx_dropped);
 	kfree_skb_list(skb);
 	return rc;



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next V2 6/6] bpf: drop MTU check when doing TC-BPF redirect to ingress
  2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2020-10-07 16:23 ` [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook Jesper Dangaard Brouer
@ 2020-10-07 16:23 ` Jesper Dangaard Brouer
  5 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-07 16:23 UTC (permalink / raw)
  To: bpf
  Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
	John Fastabend, Jakub Kicinski, eyal.birger

The use-case for dropping the MTU check when TC-BPF ingress-redirects a
packet is described by Eyal Birger in an email[0]. The summary is the
ability to increase the packet size (e.g. with IPv6 headers for NAT64),
ingress-redirect the packet, and let the normal netstack fragment the
packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
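The effect on the forward path can be modeled as follows (plain userspace C, illustrative names; return values mimic NET_RX_SUCCESS/NET_RX_DROP): the MTU check becomes opt-in, and the BPF ingress-redirect path opts out so oversized packets reach the stack, which can fragment them.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of ____dev_forward_skb() after this patch: the MTU check is
 * gated on mtu_check; __dev_forward_skb() passes true (old behavior),
 * while the BPF ingress-redirect path (__bpf_rx_skb) passes false. */
static int model_forward_skb(unsigned int pkt_len, unsigned int dev_mtu,
			     bool mtu_check)
{
	if (mtu_check && pkt_len > dev_mtu)
		return 1;	/* models NET_RX_DROP */
	return 0;		/* models NET_RX_SUCCESS */
}
```
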

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |    5 +++--
 net/core/dev.c            |    2 +-
 net/core/filter.c         |   12 ++++++++++--
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 28cfa53daf72..58fb7b4869ba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3866,10 +3866,11 @@ bool is_skb_forwardable(const struct net_device *dev,
 			const struct sk_buff *skb);
 
 static __always_inline int ____dev_forward_skb(struct net_device *dev,
-					       struct sk_buff *skb)
+					       struct sk_buff *skb,
+					       const bool mtu_check)
 {
 	if (skb_orphan_frags(skb, GFP_ATOMIC) ||
-	    unlikely(!is_skb_forwardable(dev, skb))) {
+	    (mtu_check && unlikely(!is_skb_forwardable(dev, skb)))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 19406013f93e..bae95ae9aa96 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2209,7 +2209,7 @@ EXPORT_SYMBOL_GPL(is_skb_forwardable);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, true);
 
 	if (likely(!ret)) {
 		skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index 54b779e34f83..5516d4efe225 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
 
 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
+
+	if (likely(!ret)) {
+		skb->protocol = eth_type_trans(skb, dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		ret = netif_rx(skb);
+	}
+
+	return ret;
 }
 
 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	int ret = ____dev_forward_skb(dev, skb);
+	int ret = ____dev_forward_skb(dev, skb, false);
 
 	if (likely(!ret)) {
 		skb->dev = dev;



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-07 16:22 ` [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-07 21:26   ` Daniel Borkmann
  2020-10-07 23:46   ` Maciej Żenczykowski
  1 sibling, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-10-07 21:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

On 10/7/20 6:22 PM, Jesper Dangaard Brouer wrote:
[...]
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> (imported from commit 37f8552786cf46588af52b77829b730dd14524d3)

slipped in?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-07 16:23 ` [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook Jesper Dangaard Brouer
@ 2020-10-07 21:37   ` Daniel Borkmann
  2020-10-07 22:36     ` John Fastabend
  2020-10-08  8:30     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-10-07 21:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
[...]
>   net/core/dev.c |   24 ++++++++++++++++++++++--
>   1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b433098896b2..19406013f93e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>   	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
>   	case TC_ACT_OK:
>   	case TC_ACT_RECLASSIFY:
> +		*ret = NET_XMIT_SUCCESS;
>   		skb->tc_index = TC_H_MIN(cl_res.classid);
>   		break;
>   	case TC_ACT_SHOT:
> @@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   {
>   	struct net_device *dev = skb->dev;
>   	struct netdev_queue *txq;
> +#ifdef CONFIG_NET_CLS_ACT
> +	bool mtu_check = false;
> +#endif
> +	bool again = false;
>   	struct Qdisc *q;
>   	int rc = -ENOMEM;
> -	bool again = false;
>   
>   	skb_reset_mac_header(skb);
>   
> @@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   
>   	qdisc_pkt_len_init(skb);
>   #ifdef CONFIG_NET_CLS_ACT
> +	mtu_check = skb_is_redirected(skb);
>   	skb->tc_at_ingress = 0;
>   # ifdef CONFIG_NET_EGRESS
>   	if (static_branch_unlikely(&egress_needed_key)) {
> +		unsigned int len_orig = skb->len;
> +
>   		skb = sch_handle_egress(skb, &rc, dev);
>   		if (!skb)
>   			goto out;
> +		/* BPF-prog ran and could have changed packet size beyond MTU */
> +		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
> +			mtu_check = true;
>   	}
>   # endif
> +	/* MTU-check only happens on "last" net_device in a redirect sequence
> +	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
> +	 * either ingress or egress to another device).
> +	 */

Hmm, quite some overhead in the fast path. Also, won't this be checked multiple times
on stacked devices? :( Moreover, this misses the fact that 'real' qdiscs can have
filters attached too, which would come after this check. Can't this instead be in
the driver layer for those that really need it? I would probably only drop the check
as done in 1/6 and allow the BPF prog to do the validation if needed.

> +	if (mtu_check && !is_skb_forwardable(dev, skb)) {
> +		rc = -EMSGSIZE;
> +		goto drop;
> +	}
>   #endif
>   	/* If device/qdisc don't need skb->dst, release it right now while
>   	 * its hot in this cpu cache.
> @@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   
>   	rc = -ENETDOWN;
>   	rcu_read_unlock_bh();
> -
> +#ifdef CONFIG_NET_CLS_ACT
> +drop:
> +#endif
>   	atomic_long_inc(&dev->tx_dropped);
>   	kfree_skb_list(skb);
>   	return rc;
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-07 21:37   ` Daniel Borkmann
@ 2020-10-07 22:36     ` John Fastabend
  2020-10-07 23:52       ` Maciej Żenczykowski
  2020-10-08  8:30     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-10-07 22:36 UTC (permalink / raw)
  To: Daniel Borkmann, Jesper Dangaard Brouer, bpf
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
	Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger

Daniel Borkmann wrote:
> On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> [...]
> >   net/core/dev.c |   24 ++++++++++++++++++++++--
> >   1 file changed, 22 insertions(+), 2 deletions(-)

A couple of high-level comments. What's the problem with just letting the driver
consume the packet? I would chalk it up to a buggy BPF program that is
sending these packets. The drivers really shouldn't panic or do anything
horrible in this case, because even today I don't think we can be
100% certain the MTU on an skb matches the configured MTU. Imagine the case
where I change the MTU from 9kB->1500B: there will be some skbs in-flight with
the larger length and some with the shorter. If the drivers panic/fault or
otherwise do something horrible, that's not going to be friendly in the general
case regardless of what BPF does. And seeing that this type of config is all
done async, it's tricky (not practical) to flush any skbs in-flight.

I've spent many hours debugging these types of feature-flag and MTU-change
bugs on the driver side, and I'm not sure it can be resolved by
the stack easily. Better to just build drivers that can handle it IMO.

Do we know if sending >MTU size skbs to drivers causes problems in real
cases? I haven't tried on the NICs I have here, but I expect they should
be fine. Fine here being that the system keeps running as expected. Dropping
the skb on either the TX or RX side is expected. Even with this change,
though, it's possible for the skb to slip through if I configure the MTU on
a live system.

> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b433098896b2..19406013f93e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >   	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
> >   	case TC_ACT_OK:
> >   	case TC_ACT_RECLASSIFY:
> > +		*ret = NET_XMIT_SUCCESS;
> >   		skb->tc_index = TC_H_MIN(cl_res.classid);
> >   		break;
> >   	case TC_ACT_SHOT:
> > @@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   {
> >   	struct net_device *dev = skb->dev;
> >   	struct netdev_queue *txq;
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	bool mtu_check = false;
> > +#endif
> > +	bool again = false;
> >   	struct Qdisc *q;
> >   	int rc = -ENOMEM;
> > -	bool again = false;
> >   
> >   	skb_reset_mac_header(skb);
> >   
> > @@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	qdisc_pkt_len_init(skb);
> >   #ifdef CONFIG_NET_CLS_ACT
> > +	mtu_check = skb_is_redirected(skb);
> >   	skb->tc_at_ingress = 0;
> >   # ifdef CONFIG_NET_EGRESS
> >   	if (static_branch_unlikely(&egress_needed_key)) {
> > +		unsigned int len_orig = skb->len;
> > +
> >   		skb = sch_handle_egress(skb, &rc, dev);
> >   		if (!skb)
> >   			goto out;
> > +		/* BPF-prog ran and could have changed packet size beyond MTU */
> > +		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
> > +			mtu_check = true;
> >   	}
> >   # endif
> > +	/* MTU-check only happens on "last" net_device in a redirect sequence
> > +	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
> > +	 * either ingress or egress to another device).
> > +	 */
> 
> Hmm, quite some overhead in fast path. Also, won't this be checked multiple times
> on stacked devices? :( Moreover, this missed the fact that 'real' qdiscs can have
> filters attached too, this would come after this check. Can't this instead be in
> driver layer for those that really need it? I would probably only drop the check
> as done in 1/6 and allow the BPF prog to do the validation if needed.

Any checks like this should probably go in validate_xmit_skb_list(); this is
where we check other things (GSO, checksum, etc.) that depend on drivers. Worst
case, we could add a netif_needs_mtu() case in validate_xmit_skb() if drivers
really can't handle >MTU sizes.
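
As a rough illustration of what a check at that driver hand-off stage could look
like, here is a userspace model. All names are assumptions for illustration —
netif_needs_mtu(), the struct fields, and the GSO exemption are made up; the real
validate_xmit_skb() path operates on struct sk_buff / struct net_device:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical model of an MTU admittance check placed at
 * the validate_xmit_skb() stage.  All names here are illustrative.
 */
struct model_dev {
	unsigned int mtu;             /* L3 MTU, e.g. 1500 */
	unsigned int hard_header_len; /* L2 header length, 14 for Ethernet */
	bool needs_mtu_check;         /* opt-in flag for drivers that cannot
	                               * tolerate over-sized frames */
};

struct model_skb {
	unsigned int len; /* full L2 frame length */
	bool gso;         /* GSO skbs legitimately exceed the device MTU */
};

/* Return true if the frame may be handed to the driver. */
static bool xmit_mtu_ok(const struct model_dev *dev,
			const struct model_skb *skb)
{
	if (!dev->needs_mtu_check || skb->gso)
		return true;
	return skb->len <= dev->mtu + dev->hard_header_len;
}
```

With mtu=1500 and hard_header_len=14 this admits a 1514-byte frame and rejects
1515 bytes — but only for drivers that opted in, so the common fast path pays
nothing.
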

> 
> > +	if (mtu_check && !is_skb_forwardable(dev, skb)) {
> > +		rc = -EMSGSIZE;
> > +		goto drop;
> > +	}
> >   #endif
> >   	/* If device/qdisc don't need skb->dst, release it right now while
> >   	 * its hot in this cpu cache.
> > @@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	rc = -ENETDOWN;
> >   	rcu_read_unlock_bh();
> > -
> > +#ifdef CONFIG_NET_CLS_ACT
> > +drop:
> > +#endif
> >   	atomic_long_inc(&dev->tx_dropped);
> >   	kfree_skb_list(skb);
> >   	return rc;
> > 




* Re: [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-07 16:22 ` [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
  2020-10-07 21:26   ` Daniel Borkmann
@ 2020-10-07 23:46   ` Maciej Żenczykowski
  2020-10-08 11:06     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: Maciej Żenczykowski @ 2020-10-07 23:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, Linux NetDev, Daniel Borkmann, Alexei Starovoitov,
	Lorenz Bauer, Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	John Fastabend, Jakub Kicinski, Eyal Birger

>  static u32 __bpf_skb_max_len(const struct sk_buff *skb)
>  {
> -       return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> -                         SKB_MAX_ALLOC;
> +       return IP_MAX_MTU;
>  }

Shouldn't we just delete this helper instead and replace call sites?


* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-07 22:36     ` John Fastabend
@ 2020-10-07 23:52       ` Maciej Żenczykowski
  2020-10-08  3:19         ` John Fastabend
  0 siblings, 1 reply; 19+ messages in thread
From: Maciej Żenczykowski @ 2020-10-07 23:52 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Jesper Dangaard Brouer, bpf, Linux NetDev,
	Daniel Borkmann, Alexei Starovoitov, Lorenz Bauer,
	Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	Jakub Kicinski, Eyal Birger

On Wed, Oct 7, 2020 at 3:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Daniel Borkmann wrote:
> > On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> > [...]
> > >   net/core/dev.c |   24 ++++++++++++++++++++++--
> > >   1 file changed, 22 insertions(+), 2 deletions(-)
>
> Couple high-level comments. Whats the problem with just letting the driver
> consume the packet? I would chalk it up to a buggy BPF program that is
> sending these packets. The drivers really shouldn't panic or do anything
> horrible under this case because even today I don't think we can be
> 100% certain MTU on skb matches set MTU. Imagine the case where I change
> the MTU from 9kB->1500B there will be some skbs in-flight with the larger
> length and some with the shorter. If the drivers panic/fault or otherwise
> does something else horrible thats not going to be friendly in general case
> regardless of what BPF does. And seeing this type of config is all done
> async its tricky (not practical) to flush any skbs in-flight.
>
> I've spent many hours debugging these types of feature flag, mtu
> change bugs on the driver side I'm not sure it can be resolved by
> the stack easily. Better to just build drivers that can handle it IMO.
>
> Do we know if sending >MTU size skbs to drivers causes problems in real
> cases? I haven't tried on the NICs I have here, but I expect they should
> be fine. Fine here being system keeps running as expected. Dropping the
> skb either on TX or RX side is expected. Even with this change though
> its possible for the skb to slip through if I configure MTU on a live
> system.

I wholeheartedly agree with the above.

Ideally the only >MTU check should happen at driver admittance.
But again, ideally it should happen in some core stack location, not in
the driver itself.
However, due to both GSO and VLAN offload, even this is not trivial to do...
The MTU is L3, but drivers/hardware/the wire usually care about L2...
(because ultimately that's what gets received and must fit in receive buffers)
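
The L2-vs-L3 arithmetic being described can be made concrete. A sketch using
the standard Ethernet constants — the helper name is made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN  14 /* Ethernet header: dst MAC + src MAC + ethertype */
#define VLAN_HLEN  4 /* one 802.1Q tag */

/* net_device->mtu is an L3 payload limit; the frame the wire carries
 * (and that must fit the receiver's buffers) is L2.  Hypothetical
 * helper showing the difference.
 */
static unsigned int max_l2_frame_len(unsigned int mtu, bool vlan_tagged)
{
	return mtu + ETH_HLEN + (vlan_tagged ? VLAN_HLEN : 0);
}
```

So a plain 1500-byte MTU corresponds to a 1514-byte frame, or 1518 bytes with a
VLAN tag — which is exactly why a check written purely against dev->mtu can
reject valid frames.
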


* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-07 23:52       ` Maciej Żenczykowski
@ 2020-10-08  3:19         ` John Fastabend
  2020-10-08  8:07           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: John Fastabend @ 2020-10-08  3:19 UTC (permalink / raw)
  To: Maciej Żenczykowski, John Fastabend
  Cc: Daniel Borkmann, Jesper Dangaard Brouer, bpf, Linux NetDev,
	Daniel Borkmann, Alexei Starovoitov, Lorenz Bauer,
	Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	Jakub Kicinski, Eyal Birger

Maciej Żenczykowski wrote:
> On Wed, Oct 7, 2020 at 3:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Daniel Borkmann wrote:
> > > On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> > > [...]
> > > >   net/core/dev.c |   24 ++++++++++++++++++++++--
> > > >   1 file changed, 22 insertions(+), 2 deletions(-)
> >
> > Couple high-level comments. Whats the problem with just letting the driver
> > consume the packet? I would chalk it up to a buggy BPF program that is
> > sending these packets. The drivers really shouldn't panic or do anything
> > horrible under this case because even today I don't think we can be
> > 100% certain MTU on skb matches set MTU. Imagine the case where I change
> > the MTU from 9kB->1500B there will be some skbs in-flight with the larger
> > length and some with the shorter. If the drivers panic/fault or otherwise
> > does something else horrible thats not going to be friendly in general case
> > regardless of what BPF does. And seeing this type of config is all done
> > async its tricky (not practical) to flush any skbs in-flight.
> >
> > I've spent many hours debugging these types of feature flag, mtu
> > change bugs on the driver side I'm not sure it can be resolved by
> > the stack easily. Better to just build drivers that can handle it IMO.
> >
> > Do we know if sending >MTU size skbs to drivers causes problems in real
> > cases? I haven't tried on the NICs I have here, but I expect they should
> > be fine. Fine here being system keeps running as expected. Dropping the
> > skb either on TX or RX side is expected. Even with this change though
> > its possible for the skb to slip through if I configure MTU on a live
> > system.
> 
> I wholeheartedly agree with the above.
> 
> Ideally the only >mtu check should happen at driver admittance.
> But again ideally it should happen in some core stack location not in
> the driver itself.

Ideally maybe, but IMO we should just let the skb go to the driver
and let the driver sort it out. Even if this means pushing the packet
onto the wire, the switch will drop it, or the receiver will, etc. A
BPF program can do lots of horrible things that should never be
on the wire otherwise. MTU is just one of them, alongside sending corrupted
payloads, adding bogus headers, checksums, etc., so I don't think we can
reasonably protect against all of them.

Of course, if the driver is going to hang/panic then something needs
to be done. Perhaps a needs_mtu_check feature flag, although that's
not so nice either, so perhaps drivers just need to handle it themselves.
Also, as best I can tell the case could happen even today without BPF,
so the drivers should be prepared for it.

> However, due to both gso and vlan offload, even this is not trivial to do...
> The mtu is L3, but drivers/hardware/the wire usually care about L2...
> (because ultimately that's what gets received and must fit in receive buffers)


* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-08  3:19         ` John Fastabend
@ 2020-10-08  8:07           ` Jesper Dangaard Brouer
  2020-10-08  8:25             ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08  8:07 UTC (permalink / raw)
  To: John Fastabend
  Cc: Maciej Żenczykowski, Daniel Borkmann, bpf, Linux NetDev,
	Daniel Borkmann, Alexei Starovoitov, Lorenz Bauer,
	Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	Jakub Kicinski, Eyal Birger, brouer

On Wed, 07 Oct 2020 20:19:39 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Maciej Żenczykowski wrote:
> > On Wed, Oct 7, 2020 at 3:37 PM John Fastabend <john.fastabend@gmail.com> wrote:  
> > >
> > > Daniel Borkmann wrote:  
> > > > On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> > > > [...]  
> > > > >   net/core/dev.c |   24 ++++++++++++++++++++++--
> > > > >   1 file changed, 22 insertions(+), 2 deletions(-)  
> > >
> > > Couple high-level comments. Whats the problem with just letting the driver
> > > consume the packet? I would chalk it up to a buggy BPF program that is
> > > sending these packets. The drivers really shouldn't panic or do anything
> > > horrible under this case because even today I don't think we can be
> > > 100% certain MTU on skb matches set MTU. Imagine the case where I change
> > > the MTU from 9kB->1500B there will be some skbs in-flight with the larger
> > > length and some with the shorter. If the drivers panic/fault or otherwise
> > > does something else horrible thats not going to be friendly in general case
> > > regardless of what BPF does. And seeing this type of config is all done
> > > async its tricky (not practical) to flush any skbs in-flight.
> > >
> > > I've spent many hours debugging these types of feature flag, mtu
> > > change bugs on the driver side I'm not sure it can be resolved by
> > > the stack easily. Better to just build drivers that can handle it IMO.
> > >
> > > Do we know if sending >MTU size skbs to drivers causes problems in real
> > > cases? I haven't tried on the NICs I have here, but I expect they should
> > > be fine. Fine here being system keeps running as expected. Dropping the
> > > skb either on TX or RX side is expected. Even with this change though
> > > its possible for the skb to slip through if I configure MTU on a live
> > > system.  
> > 
> > I wholeheartedly agree with the above.
> > 
> > Ideally the only >mtu check should happen at driver admittance.
> > But again ideally it should happen in some core stack location not in
> > the driver itself.  
> 
> Ideally maybe, but IMO we should just let the skb go to the driver
> and let the driver sort it out. Even if this means pushing the packet
> onto the wire then the switch will drop it or the receiver, etc. A
> BPF program can do lots of horrible things that should never be
> on the wire otherwise. MTU is just one of them, but sending corrupted
> payloads, adding bogus headers, checksums etc. so I don't think we can
> reasonable protect against all of them.

That is a good point.

> Of course if the driver is going to hang/panic then something needs
> to be done. Perhaps a needs_mtu_check feature flag, although thats
> not so nice either so perhaps drivers just need to handle it themselves.
> Also even today the case could happen without BPF as best I can tell
> so the drivers should be prepared for it.

Yes, borderline cases do exist already today (like your reconfig with
in-flight packets), so I guess drivers should at least not hang/panic,
and be robust enough that we can drop this check.

I think you have convinced me that we can drop this check.  My only
concern is how people can troubleshoot this, but that is no different
from the current state.


> > However, due to both gso and vlan offload, even this is not trivial to do...
> > The mtu is L3, but drivers/hardware/the wire usually care about L2...

If net_device->mtu is L3 (1500) and XDP (and TC, right?) operate at L2,
that likely means that the "strict" bpf_mtu_check (in my BPF-helper) is
wrong, as the XDP (and TC) length at this point includes the 14-byte
Ethernet header.  I will check and fix.

Is this accounted for via net_device->hard_header_len ?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-08  8:07           ` Jesper Dangaard Brouer
@ 2020-10-08  8:25             ` Daniel Borkmann
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-10-08  8:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: Maciej Żenczykowski, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, Jakub Kicinski, Eyal Birger

On 10/8/20 10:07 AM, Jesper Dangaard Brouer wrote:
[...]
>>> However, due to both gso and vlan offload, even this is not trivial to do...
>>> The mtu is L3, but drivers/hardware/the wire usually care about L2...
> 
> If net_device->mtu is L3 (1500) and XDP (and TC, right?) operate at L2,
> that likely means that the "strict" bpf_mtu_check (in my BPF-helper) is
> wrong, as XDP (and TC) length at this point include the 14 bytes
> Ethernet header.  I will check and fix.

Yes, both operate at the L2 layer.

> Is this accounted for via net_device->hard_header_len ?

It is, see also ether_setup().
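
Put together, the non-strict check compares the L2 frame length against
mtu + hard_header_len, where ether_setup() initialises hard_header_len to 14
for Ethernet devices. A minimal userspace sketch — the function name is
illustrative, not the helper's actual name:

```c
#include <assert.h>

#define ETH_HLEN 14 /* what ether_setup() puts in dev->hard_header_len */

/* Sketch of the corrected MTU check for an L2 frame as seen by XDP/TC:
 * the frame length includes the Ethernet header, the device MTU does
 * not, so hard_header_len bridges the two.
 */
static int frame_exceeds_mtu(unsigned int frame_len, unsigned int mtu,
			     unsigned int hard_header_len)
{
	return frame_len > mtu + hard_header_len;
}
```
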


* Re: [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook
  2020-10-07 21:37   ` Daniel Borkmann
  2020-10-07 22:36     ` John Fastabend
@ 2020-10-08  8:30     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08  8:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
	shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
	eyal.birger, brouer

On Wed, 7 Oct 2020 23:37:00 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/7/20 6:23 PM, Jesper Dangaard Brouer wrote:
> [...]
> >   net/core/dev.c |   24 ++++++++++++++++++++++--
> >   1 file changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b433098896b2..19406013f93e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >   	switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
> >   	case TC_ACT_OK:
> >   	case TC_ACT_RECLASSIFY:
> > +		*ret = NET_XMIT_SUCCESS;
> >   		skb->tc_index = TC_H_MIN(cl_res.classid);
> >   		break;
> >   	case TC_ACT_SHOT:
> > @@ -4064,9 +4065,12 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   {
> >   	struct net_device *dev = skb->dev;
> >   	struct netdev_queue *txq;
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	bool mtu_check = false;
> > +#endif
> > +	bool again = false;
> >   	struct Qdisc *q;
> >   	int rc = -ENOMEM;
> > -	bool again = false;
> >   
> >   	skb_reset_mac_header(skb);
> >   
> > @@ -4082,14 +4086,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	qdisc_pkt_len_init(skb);
> >   #ifdef CONFIG_NET_CLS_ACT
> > +	mtu_check = skb_is_redirected(skb);
> >   	skb->tc_at_ingress = 0;
> >   # ifdef CONFIG_NET_EGRESS
> >   	if (static_branch_unlikely(&egress_needed_key)) {
> > +		unsigned int len_orig = skb->len;
> > +
> >   		skb = sch_handle_egress(skb, &rc, dev);
> >   		if (!skb)
> >   			goto out;
> > +		/* BPF-prog ran and could have changed packet size beyond MTU */
> > +		if (rc == NET_XMIT_SUCCESS && skb->len > len_orig)
> > +			mtu_check = true;
> >   	}
> >   # endif
> > +	/* MTU-check only happens on "last" net_device in a redirect sequence
> > +	 * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it
> > +	 * either ingress or egress to another device).
> > +	 */  
> 
> Hmm, quite some overhead in fast path. 

Not really; the normal fast-path already calls is_skb_forwardable(). And it
already happens in existing code, the ingress redirect code, which I
remove the call from in patch 6.

(I have considered inlining is_skb_forwardable() as an optimization for
the general netstack dev_forward_skb())

> Also, won't this be checked multiple times on stacked devices? :(

I don't think it will be checked multiple times, because we have a
skb_reset_redirect() in the ingress path (just after sch_handle_ingress()).
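
The once-per-redirect-sequence behaviour can be modelled in a few lines —
the names below are hypothetical; the real code tracks this via the skb's
redirected bit set by the redirect path and cleared by skb_reset_redirect():

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the redirect-flag lifecycle this patch relies on:
 * a BPF redirect marks the skb, the ingress path clears the mark, so
 * the egress MTU check can only fire on the last device in a redirect
 * sequence.
 */
struct flag_skb { bool redirected; };

static void bpf_redirect_mark(struct flag_skb *skb)  { skb->redirected = true; }
static void ingress_reset_mark(struct flag_skb *skb) { skb->redirected = false; }

static bool egress_wants_mtu_check(const struct flag_skb *skb)
{
	return skb->redirected;
}
```
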

> Moreover, this missed the fact that 'real' qdiscs can have
> filters attached too, this would come after this check. Can't this instead be in
> driver layer for those that really need it? I would probably only drop the check
> as done in 1/6 and allow the BPF prog to do the validation if needed.

See my other reply; this is likely what we will end up with.

> > +	if (mtu_check && !is_skb_forwardable(dev, skb)) {
> > +		rc = -EMSGSIZE;
> > +		goto drop;
> > +	}
> >   #endif
> >   	/* If device/qdisc don't need skb->dst, release it right now while
> >   	 * its hot in this cpu cache.
> > @@ -4157,7 +4175,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   
> >   	rc = -ENETDOWN;
> >   	rcu_read_unlock_bh();
> > -
> > +#ifdef CONFIG_NET_CLS_ACT
> > +drop:
> > +#endif
> >   	atomic_long_inc(&dev->tx_dropped);
> >   	kfree_skb_list(skb);
> >   	return rc;
> >   
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-07 23:46   ` Maciej Żenczykowski
@ 2020-10-08 11:06     ` Jesper Dangaard Brouer
  2020-10-08 12:33       ` Willem de Bruijn
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 11:06 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: bpf, Linux NetDev, Daniel Borkmann, Alexei Starovoitov,
	Lorenz Bauer, Shaun Crampton, Lorenzo Bianconi, Marek Majkowski,
	John Fastabend, Jakub Kicinski, Eyal Birger, brouer

On Wed, 7 Oct 2020 16:46:10 -0700
Maciej Żenczykowski <maze@google.com> wrote:

> >  static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> >  {
> > -       return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> > -                         SKB_MAX_ALLOC;
> > +       return IP_MAX_MTU;
> >  }  
> 
> Shouldn't we just delete this helper instead and replace call sites?

It does seem wrong to pass the skb argument into this function, as it is
no longer used...

I guess I can simply replace __bpf_skb_max_len() with IP_MAX_MTU.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-08 11:06     ` Jesper Dangaard Brouer
@ 2020-10-08 12:33       ` Willem de Bruijn
  2020-10-08 14:07         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Willem de Bruijn @ 2020-10-08 12:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Maciej Żenczykowski, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, John Fastabend,
	Jakub Kicinski, Eyal Birger

On Thu, Oct 8, 2020 at 7:06 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> On Wed, 7 Oct 2020 16:46:10 -0700
> Maciej Żenczykowski <maze@google.com> wrote:
>
> > >  static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> > >  {
> > > -       return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> > > -                         SKB_MAX_ALLOC;
> > > +       return IP_MAX_MTU;
> > >  }
> >
> > Shouldn't we just delete this helper instead and replace call sites?
>
> It does seem wrong to pass argument skb into this function, as it is
> no-longer used...
>
> Guess I can simply replace __bpf_skb_max_len with IP_MAX_MTU.

Should that be IP6_MAX_MTU, which is larger than IP_MAX_MTU?


* Re: [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len
  2020-10-08 12:33       ` Willem de Bruijn
@ 2020-10-08 14:07         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-08 14:07 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Maciej Żenczykowski, bpf, Linux NetDev, Daniel Borkmann,
	Alexei Starovoitov, Lorenz Bauer, Shaun Crampton,
	Lorenzo Bianconi, Marek Majkowski, John Fastabend,
	Jakub Kicinski, Eyal Birger, brouer

On Thu, 8 Oct 2020 08:33:04 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> On Thu, Oct 8, 2020 at 7:06 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > On Wed, 7 Oct 2020 16:46:10 -0700
> > Maciej Żenczykowski <maze@google.com> wrote:
> >  
> > > >  static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> > > >  {
> > > > -       return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> > > > -                         SKB_MAX_ALLOC;
> > > > +       return IP_MAX_MTU;
> > > >  }  
> > >
> > > Shouldn't we just delete this helper instead and replace call sites?  
> >
> > It does seem wrong to pass argument skb into this function, as it is
> > no-longer used...
> >
> > Guess I can simply replace __bpf_skb_max_len with IP_MAX_MTU.  
> 
> Should that be IP6_MAX_MTU, which is larger than IP_MAX_MTU?

Sure, I'll do that, and handle that it hides under CONFIG_IPV6.
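
A sketch of how the replacement could look. The numeric values mirror my
reading of the kernel's definitions at the time — IP_MAX_MTU is 0xFFFF and
IP6_MAX_MTU adds the 40-byte IPv6 header — and the BPF_SKB_MAX_LEN macro name
is made up, so verify both against the tree:

```c
#include <assert.h>

/* Userspace mirror of the kernel constants: IP_MAX_MTU lives in
 * include/net/ip.h, IP6_MAX_MTU in include/net/ip6_route.h and is
 * only available with CONFIG_IPV6, hence the fallback.
 */
#define IP_MAX_MTU	0xFFFFU          /* 65535 */
#define IP6_MAX_MTU	(0xFFFFU + 40U)  /* 65575: + sizeof(struct ipv6hdr) */

#ifdef CONFIG_IPV6
#define BPF_SKB_MAX_LEN	IP6_MAX_MTU
#else
#define BPF_SKB_MAX_LEN	IP_MAX_MTU
#endif
```
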

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



end of thread, other threads:[~2020-10-08 14:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-07 16:22 [PATCH bpf-next V2 0/6] bpf: New approach for BPF MTU handling and enforcement Jesper Dangaard Brouer
2020-10-07 16:22 ` [PATCH bpf-next V2 1/6] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2020-10-07 21:26   ` Daniel Borkmann
2020-10-07 23:46   ` Maciej Żenczykowski
2020-10-08 11:06     ` Jesper Dangaard Brouer
2020-10-08 12:33       ` Willem de Bruijn
2020-10-08 14:07         ` Jesper Dangaard Brouer
2020-10-07 16:22 ` [PATCH bpf-next V2 2/6] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2020-10-07 16:22 ` [PATCH bpf-next V2 3/6] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2020-10-07 16:22 ` [PATCH bpf-next V2 4/6] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
2020-10-07 16:23 ` [PATCH bpf-next V2 5/6] bpf: Add MTU check for TC-BPF packets after egress hook Jesper Dangaard Brouer
2020-10-07 21:37   ` Daniel Borkmann
2020-10-07 22:36     ` John Fastabend
2020-10-07 23:52       ` Maciej Żenczykowski
2020-10-08  3:19         ` John Fastabend
2020-10-08  8:07           ` Jesper Dangaard Brouer
2020-10-08  8:25             ` Daniel Borkmann
2020-10-08  8:30     ` Jesper Dangaard Brouer
2020-10-07 16:23 ` [PATCH bpf-next V2 6/6] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
