* [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling
@ 2020-10-27 16:26 Jesper Dangaard Brouer
2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
` (4 more replies)
0 siblings, 5 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:26 UTC
To: bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
This patchset drops all the MTU checks in the TC BPF-helpers that limit
growing the packet size. This is done because these BPF-helpers don't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.
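To illustrate the wrong-netdev problem, here is a minimal userspace sketch
(not kernel code). The struct and function names are hypothetical; only the
"mtu + hard_header_len" comparison is taken from the removed helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few net_device fields that matter here. */
struct model_netdev {
	uint32_t mtu;             /* L3 MTU as configured on the device */
	uint32_t hard_header_len; /* L2 header size, 14 for Ethernet */
};

/* Old behaviour: growth is capped by the device the packet is currently on. */
static inline bool old_helper_allows(const struct model_netdev *curr,
				     uint32_t new_len)
{
	return new_len <= curr->mtu + curr->hard_header_len;
}

/* What matters at transmit time: the limit of the redirect target. */
static inline bool fits_egress(const struct model_netdev *target,
			       uint32_t new_len)
{
	return new_len <= target->mtu + target->hard_header_len;
}
```

With an ingress device at MTU 1500 and a redirect target at MTU 9000, the
old check refuses a 2000 byte packet that the target could send. With the
devices swapped, it accepts a packet that the target cannot send.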
The new approach is to give BPF-programs knowledge about the MTU on a
netdev (via ifindex) and at the fib route lookup level. To that end, some
BPF-helpers are added and others extended, making it possible to do MTU
checks in the BPF-code.
If a BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped at some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks, specifically when doing TC-redirect from the ingress step
(sch_handle_ingress) into the egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kinds of
MTU violations.
One advantage of this approach is that an ingress BPF-prog can send
information to an egress BPF-prog via packet data. With the MTU checks
removed in the helpers, and also not done in the skb_do_redirect() call,
the ingress BPF-prog can grow the packet to carry this data, as long as
the egress BPF-prog removes it prior to transmitting the packet.
This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also work for XDP BPF-programs.
V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
---
Jesper Dangaard Brouer (5):
bpf: Remove MTU check in __bpf_skb_max_len
bpf: bpf_fib_lookup return MTU value as output when looked up
bpf: add BPF-helper for MTU checking
bpf: drop MTU check when doing TC-BPF redirect to ingress
bpf: make it possible to identify BPF redirected SKBs
include/linux/netdevice.h | 31 +++++++-
include/uapi/linux/bpf.h | 81 +++++++++++++++++++-
net/core/dev.c | 21 +----
net/core/filter.c | 163 ++++++++++++++++++++++++++++++++++++----
net/sched/Kconfig | 1
tools/include/uapi/linux/bpf.h | 81 +++++++++++++++++++-
6 files changed, 339 insertions(+), 39 deletions(-)
--
* [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len
2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
@ 2020-10-27 16:26 ` Jesper Dangaard Brouer
2020-10-30 19:24 ` John Fastabend
2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
` (3 subsequent siblings)
4 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:26 UTC
To: bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
Multiple BPF-helpers that can manipulate/increase the size of the SKB use
__bpf_skb_max_len() as the max-length. This function limits the size
against the current net_device MTU (skb->dev->mtu).
When a BPF-prog grows the packet size, it should not be limited to the
MTU. The MTU is a transmit limitation, and software receiving this packet
should be allowed to increase the size. Furthermore, the current MTU check
in __bpf_skb_max_len uses the MTU from the ingress/current net_device,
which in case of redirects is the wrong net_device.
Patch V4 keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
is elsewhere in the system. Jesper's testing[1] showed it was not possible
to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting
factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the
SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is in
effect because this is called from softirq context; see the code in
__gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
that frames above 16KiB can cause NICs to reset (but not crash). Keep the
sanity limit at this level, as the memory layer can differ based on kernel
config.
[1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
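The two limits quoted above can be reproduced with a small arithmetic
sketch. It assumes 4KiB pages and relies on my reading of the headers of
that era (KMALLOC_SHIFT_HIGH being PAGE_SHIFT + 1 under CONFIG_SLUB, and
SKB_MAX_ALLOC being four pages minus the skb_shared_info overhead). The
function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, i.e. PAGE_SHIFT == 12. */
enum { MODEL_PAGE_SHIFT = 12 };

/* Under SLUB the kmalloc caches are assumed to serve sizes up to two
 * pages: 1 << (PAGE_SHIFT + 1).
 */
static inline size_t model_kmalloc_max_cache_size(unsigned int page_shift)
{
	return (size_t)1 << (page_shift + 1);
}

/* Nominal SKB_MAX_ALLOC: four pages, before subtracting the
 * skb_shared_info overhead that the real macro accounts for.
 */
static inline size_t model_skb_max_alloc_nominal(unsigned int page_shift)
{
	return ((size_t)1 << page_shift) << 2;
}
```

For PAGE_SHIFT 12 this gives 8192 and 16384 bytes respectively, matching
the 8KiB observed limit and the 16KiB sanity limit in the description.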
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
net/core/filter.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ca5eecebacf..1ee97fdeea64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
return 0;
}
-static u32 __bpf_skb_max_len(const struct sk_buff *skb)
-{
- return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
-}
+#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC
BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
u32, mode, u64, flags)
@@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
{
u32 len_cur, len_diff_abs = abs(len_diff);
u32 len_min = bpf_skb_net_base_len(skb);
- u32 len_max = __bpf_skb_max_len(skb);
+ u32 len_max = BPF_SKB_MAX_LEN;
__be16 proto = skb->protocol;
bool shrink = len_diff < 0;
u32 off;
@@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len,
u64 flags)
{
- u32 max_len = __bpf_skb_max_len(skb);
+ u32 max_len = BPF_SKB_MAX_LEN;
u32 min_len = __bpf_skb_min_len(skb);
int ret;
@@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = {
static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room,
u64 flags)
{
- u32 max_len = __bpf_skb_max_len(skb);
+ u32 max_len = BPF_SKB_MAX_LEN;
u32 new_len = skb->len + head_room;
int ret;
* [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-27 16:26 ` Jesper Dangaard Brouer
2020-10-27 17:15 ` David Ahern
2020-10-28 12:49 ` Dan Carpenter
2020-10-27 16:27 ` [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
` (2 subsequent siblings)
4 siblings, 2 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:26 UTC
To: bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The
BPF-prog doesn't know the MTU value that caused this rejection.
If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
needs to know this MTU value for the ICMP packet.
This patch changes the lookup and result struct bpf_fib_lookup to contain
this MTU value as output, via a union with 'tot_len', as this is the value
used for the MTU lookup.
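A consequence of the union is that the input 'tot_len' is overwritten by
the output 'mtu'. The userspace sketch below models only that field. The
struct, enum and function names are hypothetical, and the return codes are
local to the model rather than the kernel's BPF_FIB_LKUP_RET_* values.

```c
#include <stdint.h>

/* Hypothetical return codes for this model only. */
enum model_fib_ret { MODEL_FIB_SUCCESS = 0, MODEL_FIB_FRAG_NEEDED = 1 };

/* Only the field touched by this patch is modelled. */
struct model_fib_lookup {
	union {
		uint16_t tot_len; /* input: packet length from network header */
		uint16_t mtu;     /* output: MTU found by the lookup */
	};
};

/* Compare the input length to the route MTU, then report the MTU through
 * the same storage. Both outcomes overwrite the input in this model.
 */
static inline enum model_fib_ret
model_fib_mtu_check(struct model_fib_lookup *params, uint32_t route_mtu)
{
	enum model_fib_ret ret = params->tot_len > route_mtu ?
				 MODEL_FIB_FRAG_NEEDED : MODEL_FIB_SUCCESS;

	params->mtu = (uint16_t)route_mtu; /* clobbers tot_len */
	return ret;
}
```

A caller that still needs the packet length after the lookup would have to
keep its own copy before calling the helper.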
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/uapi/linux/bpf.h | 11 +++++++++--
net/core/filter.c | 17 ++++++++++++-----
tools/include/uapi/linux/bpf.h | 11 +++++++++--
3 files changed, 30 insertions(+), 9 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e6ceac3f7d62..03c042e3a34c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2219,6 +2219,9 @@ union bpf_attr {
* * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
* packet is not forwarded or needs assist from full stack
*
+ * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ * was exceeded and result params->mtu contains the MTU.
+ *
* long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
* Description
* Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
__be16 sport;
__be16 dport;
- /* total length of packet from network header - used for MTU check */
- __u16 tot_len;
+ union { /* used for MTU check */
+ /* input to lookup */
+ __u16 tot_len; /* total length of packet from network hdr */
+ /* output: MTU value (if requested check_mtu) */
+ __u16 mtu;
+ };
/* input: L3 device index for lookup
* output: device index from FIB lookup
*/
diff --git a/net/core/filter.c b/net/core/filter.c
index 1ee97fdeea64..caa427edc563 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5265,12 +5265,13 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
const struct neighbour *neigh,
- const struct net_device *dev)
+ const struct net_device *dev, u32 mtu)
{
memcpy(params->dmac, neigh->ha, ETH_ALEN);
memcpy(params->smac, dev->dev_addr, ETH_ALEN);
params->h_vlan_TCI = 0;
params->h_vlan_proto = 0;
+ params->mtu = mtu;
return 0;
}
@@ -5354,8 +5355,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (check_mtu) {
mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
- if (params->tot_len > mtu)
+ if (params->tot_len > mtu) {
+ params->mtu = mtu; /* union with tot_len */
return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+ }
}
nhc = res.nhc;
@@ -5389,7 +5392,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (!neigh)
return BPF_FIB_LKUP_RET_NO_NEIGH;
- return bpf_fib_set_fwd_params(params, neigh, dev);
+ return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
}
#endif
@@ -5481,8 +5484,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (check_mtu) {
mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src);
- if (params->tot_len > mtu)
+ if (params->tot_len > mtu) {
+ params->mtu = mtu; /* union with tot_len */
return BPF_FIB_LKUP_RET_FRAG_NEEDED;
+ }
}
if (res.nh->fib_nh_lws)
@@ -5502,7 +5507,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
if (!neigh)
return BPF_FIB_LKUP_RET_NO_NEIGH;
- return bpf_fib_set_fwd_params(params, neigh, dev);
+ return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
}
#endif
@@ -5571,6 +5576,8 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
dev = dev_get_by_index_rcu(net, params->ifindex);
if (!is_skb_forwardable(dev, skb))
rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
+
+ params->mtu = dev->mtu; /* union with tot_len */
}
return rc;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e6ceac3f7d62..03c042e3a34c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2219,6 +2219,9 @@ union bpf_attr {
* * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
* packet is not forwarded or needs assist from full stack
*
+ * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU
+ * was exceeded and result params->mtu contains the MTU.
+ *
* long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
* Description
* Add an entry to, or update a sockhash *map* referencing sockets.
@@ -4872,9 +4875,13 @@ struct bpf_fib_lookup {
__be16 sport;
__be16 dport;
- /* total length of packet from network header - used for MTU check */
- __u16 tot_len;
+ union { /* used for MTU check */
+ /* input to lookup */
+ __u16 tot_len; /* total length of packet from network hdr */
+ /* output: MTU value (if requested check_mtu) */
+ __u16 mtu;
+ };
/* input: L3 device index for lookup
* output: device index from FIB lookup
*/
* [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking
2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-27 16:27 ` Jesper Dangaard Brouer
2020-10-27 16:27 ` [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2020-10-27 16:27 ` [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
4 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:27 UTC
To: bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
The API is designed to help the BPF-programmer who wants to do packet
context size changes, which involve other helpers. These other helpers
usually do a delta size adjustment. This helper also supports a delta
size (len_diff), which allows the BPF-programmer to reuse arguments needed
by these other helpers, and perform the MTU check prior to doing any actual
size adjustment of the packet context.
It is on purpose that we allow the len adjustment to become a negative
result, which will pass the MTU check. This might seem weird, but it's not
this helper's responsibility to "catch" wrong len_diff adjustments. Other
helpers will take care of these checks, if the BPF-programmer chooses to do
the actual size adjustment.
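The check semantics described above can be modelled in userspace as
follows. This is a sketch of the non-GSO path only, with hypothetical
names; it is not the helper itself and cannot be loaded as a BPF program.

```c
#include <stdint.h>

enum { MODEL_VLAN_HLEN = 4 };           /* room for one VLAN tag */
enum { MODEL_MTU_CHK_RELAX = 1u << 0 }; /* models BPF_MTU_CHK_RELAX */
enum model_mtu_ret { MODEL_MTU_OK = 0, MODEL_MTU_FRAG_NEEDED = 1 };

/* pkt_len is the L2 length of the packet, len_diff the planned change.
 * The reported MTU includes the L2 header, as the helper documents.
 */
static inline enum model_mtu_ret
model_check_mtu(int pkt_len, uint32_t dev_mtu, uint32_t hard_header_len,
		int32_t len_diff, uint64_t flags, uint32_t *mtu_result)
{
	int mtu = (int)(dev_mtu + hard_header_len);

	if (flags & MODEL_MTU_CHK_RELAX)
		mtu += MODEL_VLAN_HLEN;

	if (mtu_result)
		*mtu_result = (uint32_t)mtu;

	/* A negative pkt_len + len_diff passes: catching that is left to
	 * the helper that performs the actual size adjustment.
	 */
	return pkt_len + len_diff <= mtu ? MODEL_MTU_OK : MODEL_MTU_FRAG_NEEDED;
}
```

For a 1514 byte frame on a device with MTU 1500 and a 14 byte L2 header,
len_diff 0 passes, len_diff 4 fails, and len_diff 4 with the relax flag
passes again because the limit becomes 1518.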
V4: Lot of changes
- ifindex 0 now uses current netdev for MTU lookup
- rename helper from bpf_mtu_check to bpf_check_mtu
- fix bug for GSO pkt length (as skb->len is total len)
- remove __bpf_len_adj_positive, simply allow negative len adj
V3: Take L2/ETH_HLEN header size into account and document it.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/uapi/linux/bpf.h | 70 +++++++++++++++++++++++
net/core/filter.c | 120 ++++++++++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 70 +++++++++++++++++++++++
3 files changed, 260 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 03c042e3a34c..c7ac1fab5e8b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3745,6 +3745,63 @@ union bpf_attr {
* Return
* The helper returns **TC_ACT_REDIRECT** on success or
* **TC_ACT_SHOT** on error.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ * Description
+ * Check ctx packet size against MTU of net device (based on
+ * *ifindex*). This helper will likely be used in combination with
+ * helpers that adjust/change the packet size. The argument
+ * *len_diff* can be used for querying with a planned size
+ * change. This allows to check MTU prior to changing packet ctx.
+ *
+ * Specifying *ifindex* zero means the MTU check is performed
+ * against the current net device. This is practical if this isn't
+ * used prior to redirect.
+ *
+ * The Linux kernel route table can configure MTUs on a more
+ * specific per route level, which is not provided by this helper.
+ * For route level MTU checks use the **bpf_fib_lookup**\ ()
+ * helper.
+ *
+ * *ctx* is either **struct xdp_md** for XDP programs or
+ * **struct sk_buff** for tc cls_act programs.
+ *
+ * The *flags* argument can be a combination of one or more of the
+ * following values:
+ *
+ * **BPF_MTU_CHK_RELAX**
+ * This flag relaxes or increases the MTU with room for one
+ * VLAN header (4 bytes). This relaxation is also used by
+ * the kernel's own forwarding MTU checks.
+ *
+ * **BPF_MTU_CHK_SEGS**
+ * This flag only works for *ctx* **struct sk_buff**.
+ * If packet context contains extra packet segment buffers
+ * (often known as GSO skb), then MTU check is partly
+ * skipped, because in transmit path it is possible for the
+ * skb packet to get re-segmented (depending on net device
+ * features). This could still be a MTU violation, so this
+ * flag enables performing MTU check against segments, with
+ * a different violation return code to tell it apart.
+ *
+ * The *mtu_result* pointer contains the MTU value of the net
+ * device including the L2 header size (usually 14 bytes Ethernet
+ * header). The net device configured MTU is the L3 size, but as
+ * XDP and TX length operate at L2 this helper includes L2 header
+ * size in reported MTU.
+ *
+ * Return
+ * * 0 on success, and populate MTU value in *mtu_result* pointer.
+ *
+ * * < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ * MTU violations return positive values, but also populate MTU
+ * value in *mtu_result* pointer, as this can be needed for
+ * implementing PMTU handling:
+ *
+ * * **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ * * **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3903,6 +3960,7 @@ union bpf_attr {
FN(bpf_per_cpu_ptr), \
FN(bpf_this_cpu_ptr), \
FN(redirect_peer), \
+ FN(check_mtu), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4927,6 +4985,18 @@ struct bpf_redir_neigh {
};
};
+/* bpf_check_mtu flags*/
+enum bpf_check_mtu_flags {
+ BPF_MTU_CHK_RELAX = (1U << 0),
+ BPF_MTU_CHK_SEGS = (1U << 1),
+};
+
+enum bpf_check_mtu_ret {
+ BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */
+ BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */
+ BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */
+};
+
enum bpf_task_fd_type {
BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */
BPF_FD_TYPE_TRACEPOINT, /* tp name */
diff --git a/net/core/filter.c b/net/core/filter.c
index caa427edc563..d66a9cba8e14 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5593,6 +5593,122 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
.arg4_type = ARG_ANYTHING,
};
+static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
+{
+ struct net *netns = dev_net(dev_curr);
+ struct net_device *dev;
+ int mtu;
+
+ /* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
+ if (ifindex == 0)
+ dev = dev_curr;
+ else
+ dev = dev_get_by_index_rcu(netns, ifindex);
+
+ if (!dev)
+ return -ENODEV;
+
+ /* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
+ mtu = dev->mtu + dev->hard_header_len;
+
+ /* Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
+ if (flags & BPF_MTU_CHK_RELAX)
+ mtu += VLAN_HLEN;
+
+ return mtu;
+}
+
+BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
+ u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+ int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+ struct net_device *dev = skb->dev;
+ int len = skb->len;
+ int mtu;
+
+ if (flags & ~(BPF_MTU_CHK_RELAX | BPF_MTU_CHK_SEGS))
+ return -EINVAL;
+
+ mtu = __bpf_lookup_mtu(dev, ifindex, flags);
+ if (unlikely(mtu < 0))
+ return mtu; /* errno */
+
+ len += len_diff; /* len_diff can be negative, minus result pass check */
+ if (len <= mtu) {
+ ret = BPF_MTU_CHK_RET_SUCCESS;
+ goto out;
+ }
+ /* At this point, skb->len exceeds MTU, but as it includes length of all
+ * segments, and SKB can get re-segmented in transmit path (see
+ * validate_xmit_skb), we cannot reject MTU-check for GSO packets.
+ */
+ if (skb_is_gso(skb)) {
+ ret = BPF_MTU_CHK_RET_SUCCESS;
+
+ /* SKB could get dropped later due to segs > MTU or lacking
+ * features, thus allow BPF-prog to validate segs length here.
+ */
+ if (flags & BPF_MTU_CHK_SEGS &&
+ !skb_gso_validate_network_len(skb, mtu)) {
+ ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
+ goto out;
+ }
+ }
+out:
+ if (mtu_result)
+ *mtu_result = mtu;
+
+ return ret;
+}
+
+BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp,
+ u32, ifindex, u32 *, mtu_result, s32, len_diff, u64, flags)
+{
+ struct net_device *dev = xdp->rxq->dev;
+ int len = xdp->data_end - xdp->data;
+ int ret = BPF_MTU_CHK_RET_SUCCESS;
+ int mtu;
+
+ /* XDP variant doesn't support multi-buffer segment check (yet) */
+ if (flags & ~BPF_MTU_CHK_RELAX)
+ return -EINVAL;
+
+ mtu = __bpf_lookup_mtu(dev, ifindex, flags);
+ if (unlikely(mtu < 0))
+ return mtu; /* errno */
+
+ len += len_diff; /* len_diff can be negative, minus result pass check */
+ if (len > mtu)
+ ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
+
+ if (mtu_result)
+ *mtu_result = mtu;
+
+ return ret;
+}
+
+static const struct bpf_func_proto bpf_skb_check_mtu_proto = {
+ .func = bpf_skb_check_mtu,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_MEM,
+ .arg4_type = ARG_ANYTHING,
+ .arg5_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto bpf_xdp_check_mtu_proto = {
+ .func = bpf_xdp_check_mtu,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_MEM,
+ .arg4_type = ARG_ANYTHING,
+ .arg5_type = ARG_ANYTHING,
+};
+
#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
{
@@ -7158,6 +7274,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_socket_uid_proto;
case BPF_FUNC_fib_lookup:
return &bpf_skb_fib_lookup_proto;
+ case BPF_FUNC_check_mtu:
+ return &bpf_skb_check_mtu_proto;
case BPF_FUNC_sk_fullsock:
return &bpf_sk_fullsock_proto;
case BPF_FUNC_sk_storage_get:
@@ -7227,6 +7345,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_xdp_adjust_tail_proto;
case BPF_FUNC_fib_lookup:
return &bpf_xdp_fib_lookup_proto;
+ case BPF_FUNC_check_mtu:
+ return &bpf_xdp_check_mtu_proto;
#ifdef CONFIG_INET
case BPF_FUNC_sk_lookup_udp:
return &bpf_xdp_sk_lookup_udp_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 03c042e3a34c..c7ac1fab5e8b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3745,6 +3745,63 @@ union bpf_attr {
* Return
* The helper returns **TC_ACT_REDIRECT** on success or
* **TC_ACT_SHOT** on error.
+ *
+ * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_result, s32 len_diff, u64 flags)
+ * Description
+ * Check ctx packet size against MTU of net device (based on
+ * *ifindex*). This helper will likely be used in combination with
+ * helpers that adjust/change the packet size. The argument
+ * *len_diff* can be used for querying with a planned size
+ * change. This allows to check MTU prior to changing packet ctx.
+ *
+ * Specifying *ifindex* zero means the MTU check is performed
+ * against the current net device. This is practical if this isn't
+ * used prior to redirect.
+ *
+ * The Linux kernel route table can configure MTUs on a more
+ * specific per route level, which is not provided by this helper.
+ * For route level MTU checks use the **bpf_fib_lookup**\ ()
+ * helper.
+ *
+ * *ctx* is either **struct xdp_md** for XDP programs or
+ * **struct sk_buff** for tc cls_act programs.
+ *
+ * The *flags* argument can be a combination of one or more of the
+ * following values:
+ *
+ * **BPF_MTU_CHK_RELAX**
+ * This flag relaxes or increases the MTU with room for one
+ * VLAN header (4 bytes). This relaxation is also used by
+ * the kernel's own forwarding MTU checks.
+ *
+ * **BPF_MTU_CHK_SEGS**
+ * This flag only works for *ctx* **struct sk_buff**.
+ * If packet context contains extra packet segment buffers
+ * (often known as GSO skb), then MTU check is partly
+ * skipped, because in transmit path it is possible for the
+ * skb packet to get re-segmented (depending on net device
+ * features). This could still be a MTU violation, so this
+ * flag enables performing MTU check against segments, with
+ * a different violation return code to tell it apart.
+ *
+ * The *mtu_result* pointer contains the MTU value of the net
+ * device including the L2 header size (usually 14 bytes Ethernet
+ * header). The net device configured MTU is the L3 size, but as
+ * XDP and TX length operate at L2 this helper includes L2 header
+ * size in reported MTU.
+ *
+ * Return
+ * * 0 on success, and populate MTU value in *mtu_result* pointer.
+ *
+ * * < 0 if any input argument is invalid (*mtu_result* not updated)
+ *
+ * MTU violations return positive values, but also populate MTU
+ * value in *mtu_result* pointer, as this can be needed for
+ * implementing PMTU handling:
+ *
+ * * **BPF_MTU_CHK_RET_FRAG_NEEDED**
+ * * **BPF_MTU_CHK_RET_SEGS_TOOBIG**
+ *
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3903,6 +3960,7 @@ union bpf_attr {
FN(bpf_per_cpu_ptr), \
FN(bpf_this_cpu_ptr), \
FN(redirect_peer), \
+ FN(check_mtu), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4927,6 +4985,18 @@ struct bpf_redir_neigh {
};
};
+/* bpf_check_mtu flags*/
+enum bpf_check_mtu_flags {
+ BPF_MTU_CHK_RELAX = (1U << 0),
+ BPF_MTU_CHK_SEGS = (1U << 1),
+};
+
+enum bpf_check_mtu_ret {
+ BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */
+ BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */
+ BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */
+};
+
enum bpf_task_fd_type {
BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */
BPF_FD_TYPE_TRACEPOINT, /* tp name */
* [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress
2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
` (2 preceding siblings ...)
2020-10-27 16:27 ` [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
@ 2020-10-27 16:27 ` Jesper Dangaard Brouer
2020-10-27 16:27 ` [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
4 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:27 UTC
To: bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
The use-case for dropping the MTU check when TC-BPF does redirect to
ingress is described by Eyal Birger in email[0]. In summary, it is the
ability to increase packet size (e.g. with IPv6 headers for NAT64),
redirect the packet to ingress, and let the normal netstack fragment the
packet as needed.
[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
V4:
- Keep net_device "up" (IFF_UP) check.
- Adjustment to handle bpf_redirect_peer() helper
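The resulting forwardability decision can be sketched in userspace as
below. The struct and function names are hypothetical; the logic follows
the __is_skb_forwardable() helper added by this patch, where the BPF
ingress redirect paths pass check_mtu=false and all other callers keep
passing true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the fields the check reads. */
struct model_dev {
	bool up;                  /* models dev->flags & IFF_UP */
	uint32_t mtu;
	uint32_t hard_header_len;
};

struct model_skb {
	uint32_t len;
	bool is_gso;
};

static inline bool model_is_forwardable(const struct model_dev *dev,
					const struct model_skb *skb,
					bool check_mtu)
{
	const uint32_t vlan_hdr_len = 4; /* VLAN_HLEN */

	if (!dev->up)
		return false; /* the "up" check is always kept */

	if (!check_mtu)
		return true;  /* BPF redirect to ingress skips the MTU part */

	if (skb->len <= dev->mtu + dev->hard_header_len + vlan_hdr_len)
		return true;

	/* GSO packets may still be segmented later, so they pass. */
	return skb->is_gso;
}
```

An oversized non-GSO packet is thus rejected with check_mtu=true but
accepted with check_mtu=false, while a device that is down rejects in
both cases.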
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/linux/netdevice.h | 31 +++++++++++++++++++++++++++++--
net/core/dev.c | 19 ++-----------------
net/core/filter.c | 14 +++++++++++---
3 files changed, 42 insertions(+), 22 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 964b494b0e8d..bd02ddab8dfe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3891,11 +3891,38 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
bool is_skb_forwardable(const struct net_device *dev,
const struct sk_buff *skb);
+static __always_inline bool __is_skb_forwardable(const struct net_device *dev,
+ const struct sk_buff *skb,
+ const bool check_mtu)
+{
+ const u32 vlan_hdr_len = 4; /* VLAN_HLEN */
+ unsigned int len;
+
+ if (!(dev->flags & IFF_UP))
+ return false;
+
+ if (!check_mtu)
+ return true;
+
+ len = dev->mtu + dev->hard_header_len + vlan_hdr_len;
+ if (skb->len <= len)
+ return true;
+
+ /* if TSO is enabled, we don't care about the length as the packet
+ * could be forwarded without being segmented before
+ */
+ if (skb_is_gso(skb))
+ return true;
+
+ return false;
+}
+
static __always_inline int ____dev_forward_skb(struct net_device *dev,
- struct sk_buff *skb)
+ struct sk_buff *skb,
+ const bool check_mtu)
{
if (skb_orphan_frags(skb, GFP_ATOMIC) ||
- unlikely(!is_skb_forwardable(dev, skb))) {
+ unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) {
atomic_long_inc(&dev->rx_dropped);
kfree_skb(skb);
return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9499a414d67e..445ccf92c149 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2188,28 +2188,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
{
- unsigned int len;
-
- if (!(dev->flags & IFF_UP))
- return false;
-
- len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
- if (skb->len <= len)
- return true;
-
- /* if TSO is enabled, we don't care about the length as the packet
- * could be forwarded without being segmented before
- */
- if (skb_is_gso(skb))
- return true;
-
- return false;
+ return __is_skb_forwardable(dev, skb, true);
}
EXPORT_SYMBOL_GPL(is_skb_forwardable);
int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{
- int ret = ____dev_forward_skb(dev, skb);
+ int ret = ____dev_forward_skb(dev, skb, true);
if (likely(!ret)) {
skb->protocol = eth_type_trans(skb, dev);
diff --git a/net/core/filter.c b/net/core/filter.c
index d66a9cba8e14..6557c4e8858c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2083,13 +2083,21 @@ static const struct bpf_func_proto bpf_csum_level_proto = {
static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
{
- return dev_forward_skb(dev, skb);
+ int ret = ____dev_forward_skb(dev, skb, false);
+
+ if (likely(!ret)) {
+ skb->protocol = eth_type_trans(skb, dev);
+ skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+ ret = netif_rx(skb);
+ }
+
+ return ret;
}
static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
struct sk_buff *skb)
{
- int ret = ____dev_forward_skb(dev, skb);
+ int ret = ____dev_forward_skb(dev, skb, false);
if (likely(!ret)) {
skb->dev = dev;
@@ -2480,7 +2488,7 @@ int skb_do_redirect(struct sk_buff *skb)
goto out_drop;
dev = ops->ndo_get_peer_dev(dev);
if (unlikely(!dev ||
- !is_skb_forwardable(dev, skb) ||
+ !__is_skb_forwardable(dev, skb, false) ||
net_eq(net, dev_net(dev))))
goto out_drop;
skb->dev = dev;
* [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs
2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
` (3 preceding siblings ...)
2020-10-27 16:27 ` [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
@ 2020-10-27 16:27 ` Jesper Dangaard Brouer
4 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-27 16:27 UTC
To: bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
This change makes it possible to identify SKBs that have been redirected
by TC-BPF (cls_act). This is needed for a number of cases.
(1) For collaborating with driver ifb net_devices.
(2) For avoiding starting generic-XDP prog on TC ingress redirect.
It is most important to fix XDP case(2), because this can break userspace
when a driver gets support for native-XDP. Imagine userspace loads an XDP
prog on eth0, which falls back to generic-XDP, and it processes
TC-redirected packets. When the kernel is updated with native-XDP support
for eth0, the program no longer sees the TC-redirected packets. Therefore
it is important to keep the order intact; that XDP runs before TC-BPF.
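As a rough userspace model of the marking, consider the sketch below. The
struct and function names are hypothetical. The consumer side, where
generic-XDP skips SKBs that carry the mark, is an assumption about how the
mark is used and is not part of this diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two sk_buff bits involved. */
struct model_skb_flags {
	bool redirected;
	bool from_ingress;
};

/* Models skb_set_redirected(skb, from_ingress) as called in
 * sch_handle_ingress() (true) and sch_handle_egress() (false).
 */
static inline void model_set_redirected(struct model_skb_flags *skb,
					bool from_ingress)
{
	skb->redirected = true;
	skb->from_ingress = from_ingress;
}

/* Assumed consumer: generic-XDP only runs on SKBs that were not
 * reinjected by a TC redirect.
 */
static inline bool model_run_generic_xdp(const struct model_skb_flags *skb)
{
	return !skb->redirected;
}
```

With this, a program attached via generic-XDP sees the same packets as it
would with native-XDP, since TC-redirected SKBs are skipped either way.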
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
net/core/dev.c | 2 ++
net/sched/Kconfig | 1 +
2 files changed, 3 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index 445ccf92c149..930c165a607e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
return NULL;
case TC_ACT_REDIRECT:
/* No need to push/pop skb's mac_header here on egress! */
+ skb_set_redirected(skb, false);
skb_do_redirect(skb);
*ret = NET_XMIT_SUCCESS;
return NULL;
@@ -4959,6 +4960,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
* redirecting to another netdev
*/
__skb_push(skb, skb->mac_len);
+ skb_set_redirected(skb, true);
if (skb_do_redirect(skb) == -EAGAIN) {
__skb_pull(skb, skb->mac_len);
*another = true;
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..a1bbaa8fd054 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -384,6 +384,7 @@ config NET_SCH_INGRESS
depends on NET_CLS_ACT
select NET_INGRESS
select NET_EGRESS
+ select NET_REDIRECT
help
Say Y here if you want to use classifiers for incoming and/or outgoing
packets. This qdisc doesn't do anything else besides running classifiers,
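For readers following along, the marking added in the two hunks above can be modelled in userspace roughly as follows. This is only a sketch: struct skb_model, skb_model_set_redirected() and should_run_generic_xdp() are hypothetical names invented for illustration, not the kernel's struct sk_buff or its helpers, and the real receive path does considerably more.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for struct sk_buff (illustration only). */
struct skb_model {
	bool redirected;	/* packet was redirected by TC */
	bool from_ingress;	/* redirect was issued from the ingress hook */
};

/* Models the marking done just before skb_do_redirect() in the diff above. */
void skb_model_set_redirected(struct skb_model *skb, bool from_ingress)
{
	skb->redirected = true;
	skb->from_ingress = from_ingress;
}

/* Models a receive path that skips generic-XDP for TC-redirected packets,
 * so generic-XDP and native-XDP see the same set of packets.
 */
bool should_run_generic_xdp(const struct skb_model *skb)
{
	return !skb->redirected;
}
```

With this model, a packet marked on the ingress path (from_ingress = true) would bypass generic-XDP on re-injection, which matches case (2) in the commit message.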
* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
@ 2020-10-27 17:15 ` David Ahern
2020-10-30 17:01 ` Jesper Dangaard Brouer
2020-10-28 12:49 ` Dan Carpenter
1 sibling, 1 reply; 11+ messages in thread
From: David Ahern @ 2020-10-27 17:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer, bpf
Cc: netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb, shaun,
Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
eyal.birger
On 10/27/20 10:26 AM, Jesper Dangaard Brouer wrote:
> The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog
> doesn't know the MTU value that caused this rejection.
>
> If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> needs to know this MTU value for the ICMP packet.
>
> The patch changes the lookup and result struct bpf_fib_lookup to contain this
> MTU value as output via a union with 'tot_len', as this is the value used for
> the MTU lookup.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> include/uapi/linux/bpf.h | 11 +++++++++--
> net/core/filter.c | 17 ++++++++++++-----
> tools/include/uapi/linux/bpf.h | 11 +++++++++--
> 3 files changed, 30 insertions(+), 9 deletions(-)
>
Reviewed-by: David Ahern <dsahern@kernel.org>
* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2020-10-27 17:15 ` David Ahern
@ 2020-10-28 12:49 ` Dan Carpenter
2020-10-30 14:35 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 11+ messages in thread
From: Dan Carpenter @ 2020-10-28 12:49 UTC (permalink / raw)
To: kbuild, Jesper Dangaard Brouer, bpf
Cc: lkp, kbuild-all, Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek
Hi Jesper,
url: https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/bpf-New-approach-for-BPF-MTU-handling/20201028-002919
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
config: i386-randconfig-m021-20201026 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
New smatch warnings:
net/core/filter.c:5395 bpf_ipv4_fib_lookup() error: uninitialized symbol 'mtu'.
vim +/mtu +5395 net/core/filter.c
87f5fc7e48dd317 David Ahern 2018-05-09 5281 static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
4f74fede40df8db David Ahern 2018-05-21 5282 u32 flags, bool check_mtu)
87f5fc7e48dd317 David Ahern 2018-05-09 5283 {
eba618abacade71 David Ahern 2019-04-02 5284 struct fib_nh_common *nhc;
87f5fc7e48dd317 David Ahern 2018-05-09 5285 struct in_device *in_dev;
87f5fc7e48dd317 David Ahern 2018-05-09 5286 struct neighbour *neigh;
87f5fc7e48dd317 David Ahern 2018-05-09 5287 struct net_device *dev;
87f5fc7e48dd317 David Ahern 2018-05-09 5288 struct fib_result res;
87f5fc7e48dd317 David Ahern 2018-05-09 5289 struct flowi4 fl4;
87f5fc7e48dd317 David Ahern 2018-05-09 5290 int err;
4f74fede40df8db David Ahern 2018-05-21 5291 u32 mtu;
^^^^^^^^
87f5fc7e48dd317 David Ahern 2018-05-09 5292
87f5fc7e48dd317 David Ahern 2018-05-09 5293 dev = dev_get_by_index_rcu(net, params->ifindex);
87f5fc7e48dd317 David Ahern 2018-05-09 5294 if (unlikely(!dev))
87f5fc7e48dd317 David Ahern 2018-05-09 5295 return -ENODEV;
87f5fc7e48dd317 David Ahern 2018-05-09 5296
87f5fc7e48dd317 David Ahern 2018-05-09 5297 /* verify forwarding is enabled on this interface */
87f5fc7e48dd317 David Ahern 2018-05-09 5298 in_dev = __in_dev_get_rcu(dev);
87f5fc7e48dd317 David Ahern 2018-05-09 5299 if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
4c79579b44b1876 David Ahern 2018-06-26 5300 return BPF_FIB_LKUP_RET_FWD_DISABLED;
87f5fc7e48dd317 David Ahern 2018-05-09 5301
87f5fc7e48dd317 David Ahern 2018-05-09 5302 if (flags & BPF_FIB_LOOKUP_OUTPUT) {
87f5fc7e48dd317 David Ahern 2018-05-09 5303 fl4.flowi4_iif = 1;
87f5fc7e48dd317 David Ahern 2018-05-09 5304 fl4.flowi4_oif = params->ifindex;
87f5fc7e48dd317 David Ahern 2018-05-09 5305 } else {
87f5fc7e48dd317 David Ahern 2018-05-09 5306 fl4.flowi4_iif = params->ifindex;
87f5fc7e48dd317 David Ahern 2018-05-09 5307 fl4.flowi4_oif = 0;
87f5fc7e48dd317 David Ahern 2018-05-09 5308 }
87f5fc7e48dd317 David Ahern 2018-05-09 5309 fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
87f5fc7e48dd317 David Ahern 2018-05-09 5310 fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
87f5fc7e48dd317 David Ahern 2018-05-09 5311 fl4.flowi4_flags = 0;
87f5fc7e48dd317 David Ahern 2018-05-09 5312
87f5fc7e48dd317 David Ahern 2018-05-09 5313 fl4.flowi4_proto = params->l4_protocol;
87f5fc7e48dd317 David Ahern 2018-05-09 5314 fl4.daddr = params->ipv4_dst;
87f5fc7e48dd317 David Ahern 2018-05-09 5315 fl4.saddr = params->ipv4_src;
87f5fc7e48dd317 David Ahern 2018-05-09 5316 fl4.fl4_sport = params->sport;
87f5fc7e48dd317 David Ahern 2018-05-09 5317 fl4.fl4_dport = params->dport;
1869e226a7b3ef7 David Ahern 2020-09-13 5318 fl4.flowi4_multipath_hash = 0;
87f5fc7e48dd317 David Ahern 2018-05-09 5319
87f5fc7e48dd317 David Ahern 2018-05-09 5320 if (flags & BPF_FIB_LOOKUP_DIRECT) {
87f5fc7e48dd317 David Ahern 2018-05-09 5321 u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
87f5fc7e48dd317 David Ahern 2018-05-09 5322 struct fib_table *tb;
87f5fc7e48dd317 David Ahern 2018-05-09 5323
87f5fc7e48dd317 David Ahern 2018-05-09 5324 tb = fib_get_table(net, tbid);
87f5fc7e48dd317 David Ahern 2018-05-09 5325 if (unlikely(!tb))
4c79579b44b1876 David Ahern 2018-06-26 5326 return BPF_FIB_LKUP_RET_NOT_FWDED;
87f5fc7e48dd317 David Ahern 2018-05-09 5327
87f5fc7e48dd317 David Ahern 2018-05-09 5328 err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
87f5fc7e48dd317 David Ahern 2018-05-09 5329 } else {
87f5fc7e48dd317 David Ahern 2018-05-09 5330 fl4.flowi4_mark = 0;
87f5fc7e48dd317 David Ahern 2018-05-09 5331 fl4.flowi4_secid = 0;
87f5fc7e48dd317 David Ahern 2018-05-09 5332 fl4.flowi4_tun_key.tun_id = 0;
87f5fc7e48dd317 David Ahern 2018-05-09 5333 fl4.flowi4_uid = sock_net_uid(net, NULL);
87f5fc7e48dd317 David Ahern 2018-05-09 5334
87f5fc7e48dd317 David Ahern 2018-05-09 5335 err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
87f5fc7e48dd317 David Ahern 2018-05-09 5336 }
87f5fc7e48dd317 David Ahern 2018-05-09 5337
4c79579b44b1876 David Ahern 2018-06-26 5338 if (err) {
4c79579b44b1876 David Ahern 2018-06-26 5339 /* map fib lookup errors to RTN_ type */
4c79579b44b1876 David Ahern 2018-06-26 5340 if (err == -EINVAL)
4c79579b44b1876 David Ahern 2018-06-26 5341 return BPF_FIB_LKUP_RET_BLACKHOLE;
4c79579b44b1876 David Ahern 2018-06-26 5342 if (err == -EHOSTUNREACH)
4c79579b44b1876 David Ahern 2018-06-26 5343 return BPF_FIB_LKUP_RET_UNREACHABLE;
4c79579b44b1876 David Ahern 2018-06-26 5344 if (err == -EACCES)
4c79579b44b1876 David Ahern 2018-06-26 5345 return BPF_FIB_LKUP_RET_PROHIBIT;
4c79579b44b1876 David Ahern 2018-06-26 5346
4c79579b44b1876 David Ahern 2018-06-26 5347 return BPF_FIB_LKUP_RET_NOT_FWDED;
4c79579b44b1876 David Ahern 2018-06-26 5348 }
4c79579b44b1876 David Ahern 2018-06-26 5349
4c79579b44b1876 David Ahern 2018-06-26 5350 if (res.type != RTN_UNICAST)
4c79579b44b1876 David Ahern 2018-06-26 5351 return BPF_FIB_LKUP_RET_NOT_FWDED;
87f5fc7e48dd317 David Ahern 2018-05-09 5352
5481d73f81549e2 David Ahern 2019-06-03 5353 if (fib_info_num_path(res.fi) > 1)
87f5fc7e48dd317 David Ahern 2018-05-09 5354 fib_select_path(net, &res, &fl4, NULL);
87f5fc7e48dd317 David Ahern 2018-05-09 5355
4f74fede40df8db David Ahern 2018-05-21 5356 if (check_mtu) {
4f74fede40df8db David Ahern 2018-05-21 5357 mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27 5358 if (params->tot_len > mtu) {
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27 5359 params->mtu = mtu; /* union with tot_len */
4c79579b44b1876 David Ahern 2018-06-26 5360 return BPF_FIB_LKUP_RET_FRAG_NEEDED;
4f74fede40df8db David Ahern 2018-05-21 5361 }
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27 5362 }
"mtu" not initialized on else path.
4f74fede40df8db David Ahern 2018-05-21 5363
eba618abacade71 David Ahern 2019-04-02 5364 nhc = res.nhc;
87f5fc7e48dd317 David Ahern 2018-05-09 5365
87f5fc7e48dd317 David Ahern 2018-05-09 5366 /* do not handle lwt encaps right now */
eba618abacade71 David Ahern 2019-04-02 5367 if (nhc->nhc_lwtstate)
4c79579b44b1876 David Ahern 2018-06-26 5368 return BPF_FIB_LKUP_RET_UNSUPP_LWT;
87f5fc7e48dd317 David Ahern 2018-05-09 5369
eba618abacade71 David Ahern 2019-04-02 5370 dev = nhc->nhc_dev;
87f5fc7e48dd317 David Ahern 2018-05-09 5371
87f5fc7e48dd317 David Ahern 2018-05-09 5372 params->rt_metric = res.fi->fib_priority;
d1c362e1dd68a42 Toke Høiland-Jørgensen 2020-10-09 5373 params->ifindex = dev->ifindex;
87f5fc7e48dd317 David Ahern 2018-05-09 5374
87f5fc7e48dd317 David Ahern 2018-05-09 5375 /* xdp and cls_bpf programs are run in RCU-bh so
87f5fc7e48dd317 David Ahern 2018-05-09 5376 * rcu_read_lock_bh is not needed here
87f5fc7e48dd317 David Ahern 2018-05-09 5377 */
6f5f68d05ec0f64 David Ahern 2019-04-05 5378 if (likely(nhc->nhc_gw_family != AF_INET6)) {
6f5f68d05ec0f64 David Ahern 2019-04-05 5379 if (nhc->nhc_gw_family)
6f5f68d05ec0f64 David Ahern 2019-04-05 5380 params->ipv4_dst = nhc->nhc_gw.ipv4;
6f5f68d05ec0f64 David Ahern 2019-04-05 5381
6f5f68d05ec0f64 David Ahern 2019-04-05 5382 neigh = __ipv4_neigh_lookup_noref(dev,
6f5f68d05ec0f64 David Ahern 2019-04-05 5383 (__force u32)params->ipv4_dst);
6f5f68d05ec0f64 David Ahern 2019-04-05 5384 } else {
6f5f68d05ec0f64 David Ahern 2019-04-05 5385 struct in6_addr *dst = (struct in6_addr *)params->ipv6_dst;
6f5f68d05ec0f64 David Ahern 2019-04-05 5386
6f5f68d05ec0f64 David Ahern 2019-04-05 5387 params->family = AF_INET6;
6f5f68d05ec0f64 David Ahern 2019-04-05 5388 *dst = nhc->nhc_gw.ipv6;
6f5f68d05ec0f64 David Ahern 2019-04-05 5389 neigh = __ipv6_neigh_lookup_noref_stub(dev, dst);
6f5f68d05ec0f64 David Ahern 2019-04-05 5390 }
6f5f68d05ec0f64 David Ahern 2019-04-05 5391
4c79579b44b1876 David Ahern 2018-06-26 5392 if (!neigh)
4c79579b44b1876 David Ahern 2018-06-26 5393 return BPF_FIB_LKUP_RET_NO_NEIGH;
87f5fc7e48dd317 David Ahern 2018-05-09 5394
88ffc2c2e37ebb3 Jesper Dangaard Brouer 2020-10-27 @5395 return bpf_fib_set_fwd_params(params, neigh, dev, mtu);
^^^
Uninitialized variable warning.
87f5fc7e48dd317 David Ahern 2018-05-09 5396 }
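The warning boils down to this pattern: mtu is assigned only inside the if (check_mtu) branch, yet it is passed on unconditionally at the end of the function. Below is a minimal userspace sketch of one way to make that well defined, namely initializing mtu to 0. All names here (fib_lookup_model, route_mtu, set_fwd_params, the LKUP_* codes) are hypothetical stand-ins, and this is not necessarily the fix that ends up in V5.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel return codes (illustration only). */
enum { LKUP_SUCCESS = 0, LKUP_FRAG_NEEDED = 1 };

struct lookup_params {
	uint16_t tot_len;	/* input: L3 length to check */
	uint16_t mtu;		/* output: MTU the check was made against */
};

/* Hypothetical stand-in for bpf_fib_set_fwd_params(): records the MTU. */
static int set_fwd_params(struct lookup_params *params, uint32_t mtu)
{
	params->mtu = (uint16_t)mtu;
	return LKUP_SUCCESS;
}

int fib_lookup_model(struct lookup_params *params, uint32_t route_mtu,
		     bool check_mtu)
{
	uint32_t mtu = 0;	/* initialized, so the !check_mtu path is defined */

	if (check_mtu) {
		mtu = route_mtu;
		if (params->tot_len > mtu) {
			params->mtu = (uint16_t)mtu;
			return LKUP_FRAG_NEEDED;
		}
	}

	/* Reached with or without the MTU check; mtu stays 0 when unchecked. */
	return set_fwd_params(params, mtu);
}
```

In this sketch a caller that skips the check gets 0 back as the MTU, which at least is a deterministic value rather than stack garbage.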
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
2020-10-28 12:49 ` Dan Carpenter
@ 2020-10-30 14:35 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 14:35 UTC (permalink / raw)
To: Dan Carpenter
Cc: kbuild, bpf, lkp, kbuild-all, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
brouer
On Wed, 28 Oct 2020 15:49:42 +0300
Dan Carpenter <dan.carpenter@oracle.com> wrote:
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
>
> New smatch warnings:
> net/core/filter.c:5395 bpf_ipv4_fib_lookup() error: uninitialized symbol 'mtu'.
I will fix and send V5.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up
2020-10-27 17:15 ` David Ahern
@ 2020-10-30 17:01 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2020-10-30 17:01 UTC (permalink / raw)
To: David Ahern
Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov, maze, lmb,
shaun, Lorenzo Bianconi, marek, John Fastabend, Jakub Kicinski,
eyal.birger, brouer
On Tue, 27 Oct 2020 11:15:31 -0600
David Ahern <dsahern@gmail.com> wrote:
> On 10/27/20 10:26 AM, Jesper Dangaard Brouer wrote:
> > The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup)
> > can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog
> > doesn't know the MTU value that caused this rejection.
> >
> > If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it
> > needs to know this MTU value for the ICMP packet.
> >
> > The patch changes the lookup and result struct bpf_fib_lookup to contain this
> > MTU value as output via a union with 'tot_len', as this is the value used for
> > the MTU lookup.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> > include/uapi/linux/bpf.h | 11 +++++++++--
> > net/core/filter.c | 17 ++++++++++++-----
> > tools/include/uapi/linux/bpf.h | 11 +++++++++--
> > 3 files changed, 30 insertions(+), 9 deletions(-)
>
> Reviewed-by: David Ahern <dsahern@kernel.org>
Thanks a lot for the review. I didn't carry your Reviewed-by over to V5 of
this patch, because I renamed the output member from mtu to mtu_result in
V5. Please review V5 and give your Reviewed-by again if you agree.
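To illustrate what sharing storage between the input length and the output MTU means for a caller, here is a cut-down userspace model. It is only a sketch: struct fib_lookup_model, check_len_model() and the route_mtu parameter are hypothetical, and the real struct bpf_fib_lookup has many more members.

```c
#include <stdint.h>

/* Hypothetical, cut-down model of the relevant part of struct bpf_fib_lookup. */
struct fib_lookup_model {
	union {
		uint16_t tot_len;	/* input: L3 length of the packet */
		uint16_t mtu_result;	/* output: MTU found by the lookup */
	};
};

/* Returns 1 if the packet would need fragmentation, 0 otherwise.
 * route_mtu is a hypothetical parameter standing in for the FIB result.
 */
int check_len_model(struct fib_lookup_model *params, uint16_t route_mtu)
{
	uint16_t len = params->tot_len;	/* read the input before overwriting */

	params->mtu_result = route_mtu;	/* clobbers tot_len: same storage */
	return len > route_mtu;
}
```

The practical consequence in this model is that a BPF-prog must save tot_len itself if it still needs the value after the call.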
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* RE: [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len
2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
@ 2020-10-30 19:24 ` John Fastabend
0 siblings, 0 replies; 11+ messages in thread
From: John Fastabend @ 2020-10-30 19:24 UTC (permalink / raw)
To: Jesper Dangaard Brouer, bpf
Cc: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
Alexei Starovoitov, maze, lmb, shaun, Lorenzo Bianconi, marek,
John Fastabend, Jakub Kicinski, eyal.birger
Jesper Dangaard Brouer wrote:
> Multiple BPF-helpers that can manipulate/increase the size of the SKB use
> __bpf_skb_max_len() as the max-length. This function limits the size against
> the current net_device MTU (skb->dev->mtu).
>
> When a BPF-prog grows the packet size, it should not be limited to the
> MTU. The MTU is a transmit limitation, and software receiving this packet
> should be allowed to increase the size. Furthermore, the current MTU check in
> __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in
> case of redirects is the wrong net_device.
>
> Patch V4 keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit
> is elsewhere in the system. Jesper's testing[1] showed it was not possible
> to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting
> factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the
> SLUB-allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096. This define is
> in effect because this is called from softirq context; see the code in
> __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed
> that frames above 16KiB can cause NICs to reset (but not crash). Keep this
> sanity limit at this level, as the memory layer can differ based on kernel
> config.
>
> [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests
>
> V3: replace __bpf_skb_max_len() with define and use IPv6 max MTU size.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
Acked-by: John Fastabend <john.fastabend@gmail.com>
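As a quick sanity check of the two sizes quoted above, the arithmetic can be written out as below. This is a sketch under stated assumptions: that the SLUB cache limit is two pages, i.e. 1 << (PAGE_SHIFT + 1), and that the 16KiB figure corresponds to an order-2 (four page) allocation before any per-SKB overhead is subtracted. The function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: the SLUB kmalloc cache limit is two pages. */
size_t kmalloc_max_cache_size_model(unsigned int page_shift)
{
	return (size_t)1 << (page_shift + 1);
}

/* Assumption: the sanity limit is bounded by an order-2 allocation
 * (four pages), ignoring any per-SKB overhead.
 */
size_t order2_bytes_model(unsigned int page_shift)
{
	return ((size_t)1 << page_shift) << 2;
}
```

With page_shift = 12 (PAGE_SIZE 4096) these give 8192 and 16384 bytes, matching the 8KiB and 16KiB figures in the commit message.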
Thread overview: 11+ messages
2020-10-27 16:26 [PATCH bpf-next V4 0/5] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2020-10-27 16:26 ` [PATCH bpf-next V4 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2020-10-30 19:24 ` John Fastabend
2020-10-27 16:26 ` [PATCH bpf-next V4 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2020-10-27 17:15 ` David Ahern
2020-10-30 17:01 ` Jesper Dangaard Brouer
2020-10-28 12:49 ` Dan Carpenter
2020-10-30 14:35 ` Jesper Dangaard Brouer
2020-10-27 16:27 ` [PATCH bpf-next V4 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2020-10-27 16:27 ` [PATCH bpf-next V4 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2020-10-27 16:27 ` [PATCH bpf-next V4 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer