From: sdf@google.com
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org, razor@blackwall.org, ast@kernel.org,
andrii@kernel.org, martin.lau@linux.dev,
john.fastabend@gmail.com, joannelkoong@gmail.com,
memxor@gmail.com, toke@redhat.com, joe@cilium.io,
netdev@vger.kernel.org
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
Date: Tue, 4 Oct 2022 17:55:42 -0700 [thread overview]
Message-ID: <YzzWDqAmN5DRTupQ@google.com> (raw)
In-Reply-To: <20221004231143.19190-2-daniel@iogearbox.net>
On 10/05, Daniel Borkmann wrote:
> This work refactors and adds a lightweight extension to the tc BPF
> ingress and egress data path side for allowing BPF programs via an
> fd-based attach/detach API. The main goal behind this work, which we
> also presented at LPC [0] this year, is to eventually add support for
> BPF links for tc BPF programs in a second step; this prep work is
> required for the latter, which allows for a model of safe ownership
> and program detachment. Given the vast rise in tc BPF users in cloud
> native / Kubernetes environments, this becomes necessary to avoid
> hard-to-debug incidents either through stale leftover programs or 3rd
> party applications stepping on each other's toes. Further details on
> the BPF link rationale are in the next patch.
>
> For the current tc framework, there is no change in behavior with this
> change, and neither does it touch on tc core kernel APIs. The gist of
> this patch is that the ingress and egress hooks get a lightweight,
> qdisc-less extension for BPF to attach its tc BPF programs, in other
> words, a minimal tc-layer entry point for BPF. As part of the feedback
> from LPC, there was a suggestion to provide a name for this
> infrastructure to more easily differentiate between the classic
> cls_bpf attachment and the fd-based API. For most, the XDP vs tc layer
> is already the default mental model for the packet processing
> pipeline. We refactored this with an xtc internal prefix aka 'express
> traffic control' in order to avoid deviating too far ('express' given
> its more lightweight/faster entry point).
>
> For the ingress and egress xtc points, the device holds a
> cache-friendly array with programs. Same as with classic tc, programs
> are attached with a prio that can be specified or auto-allocated
> through an idr, and the program return code determines whether to
> continue in the pipeline or to terminate processing. With the
> TC_ACT_UNSPEC code, processing continues (as is the case today). The
> goal was to have maximum compatibility with existing tc BPF programs,
> so they don't need to be adapted. Compatibility to call into classic
> tcf_classify() is also provided in order to allow successive migration,
> or for both to cleanly co-exist where needed, given it is one logical
> layer. The fd-based API is behind a static key, so that when unused
> the code is also not entered. The struct xtc_entry's program array is
> currently static, but could be made dynamic if necessary at a point in
> the future. Desire has also been expressed for future work to adapt a
> similar framework for XDP to allow multi-attach from the in-kernel
> side, too.
>
> Tested with the tc-testing selftest suite, which fully passes, as well
> as with the tc BPF tests from the BPF CI.
>
> [0] https://lpc.events/event/16/contributions/1353/
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> MAINTAINERS | 4 +-
> include/linux/bpf.h | 1 +
> include/linux/netdevice.h | 14 +-
> include/linux/skbuff.h | 4 +-
> include/net/sch_generic.h | 2 +-
> include/net/xtc.h | 181 ++++++++++++++++++++++
> include/uapi/linux/bpf.h | 35 ++++-
> kernel/bpf/Kconfig | 1 +
> kernel/bpf/Makefile | 1 +
> kernel/bpf/net.c | 274 +++++++++++++++++++++++++++++++++
> kernel/bpf/syscall.c | 24 ++-
> net/Kconfig | 5 +
> net/core/dev.c | 262 +++++++++++++++++++------------
> net/core/filter.c | 4 +-
> net/sched/Kconfig | 4 +-
> net/sched/sch_ingress.c | 48 +++++-
> tools/include/uapi/linux/bpf.h | 35 ++++-
> 17 files changed, 769 insertions(+), 130 deletions(-)
> create mode 100644 include/net/xtc.h
> create mode 100644 kernel/bpf/net.c
> diff --git a/MAINTAINERS b/MAINTAINERS
> index e55a4d47324c..bb63d8d000ea 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3850,13 +3850,15 @@ S: Maintained
> F: kernel/trace/bpf_trace.c
> F: kernel/bpf/stackmap.c
> -BPF [NETWORKING] (tc BPF, sock_addr)
> +BPF [NETWORKING] (xtc & tc BPF, sock_addr)
> M: Martin KaFai Lau <martin.lau@linux.dev>
> M: Daniel Borkmann <daniel@iogearbox.net>
> R: John Fastabend <john.fastabend@gmail.com>
> L: bpf@vger.kernel.org
> L: netdev@vger.kernel.org
> S: Maintained
> +F: include/net/xtc.h
> +F: kernel/bpf/net.c
> F: net/core/filter.c
> F: net/sched/act_bpf.c
> F: net/sched/cls_bpf.c
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e7d46d16032..71e5f43db378 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1473,6 +1473,7 @@ struct bpf_prog_array_item {
> union {
> struct bpf_cgroup_storage *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
> u64 bpf_cookie;
> + u32 bpf_priority;
> };
> };
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eddf8ee270e7..43bbb2303e57 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1880,8 +1880,7 @@ enum netdev_ml_priv_type {
> *
> * @rx_handler: handler for received packets
> * @rx_handler_data: XXX: need comments on this one
> - * @miniq_ingress: ingress/clsact qdisc specific data for
> - * ingress processing
> + * @xtc_ingress: BPF/clsact qdisc specific data for ingress processing
> * @ingress_queue: XXX: need comments on this one
> * @nf_hooks_ingress: netfilter hooks executed for ingress packets
> * @broadcast: hw bcast address
> @@ -1902,8 +1901,7 @@ enum netdev_ml_priv_type {
> * @xps_maps: all CPUs/RXQs maps for XPS device
> *
> * @xps_maps: XXX: need comments on this one
> - * @miniq_egress: clsact qdisc specific data for
> - * egress processing
> + * @xtc_egress: BPF/clsact qdisc specific data for egress processing
> * @nf_hooks_egress: netfilter hooks executed for egress packets
> * @qdisc_hash: qdisc hash table
> * @watchdog_timeo: Represents the timeout that is used by
> @@ -2191,8 +2189,8 @@ struct net_device {
> rx_handler_func_t __rcu *rx_handler;
> void __rcu *rx_handler_data;
> -#ifdef CONFIG_NET_CLS_ACT
> - struct mini_Qdisc __rcu *miniq_ingress;
> +#ifdef CONFIG_NET_XGRESS
> + struct xtc_entry __rcu *xtc_ingress;
> #endif
> struct netdev_queue __rcu *ingress_queue;
> #ifdef CONFIG_NETFILTER_INGRESS
> @@ -2220,8 +2218,8 @@ struct net_device {
> #ifdef CONFIG_XPS
> struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
> #endif
> -#ifdef CONFIG_NET_CLS_ACT
> - struct mini_Qdisc __rcu *miniq_egress;
> +#ifdef CONFIG_NET_XGRESS
> + struct xtc_entry __rcu *xtc_egress;
> #endif
> #ifdef CONFIG_NETFILTER_EGRESS
> struct nf_hook_entries __rcu *nf_hooks_egress;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 9fcf534f2d92..a9ff7a1996e9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -955,7 +955,7 @@ struct sk_buff {
> __u8 csum_level:2;
> __u8 dst_pending_confirm:1;
> __u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
> __u8 tc_skip_classify:1;
> __u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */
> #endif
> @@ -983,7 +983,7 @@ struct sk_buff {
> __u8 slow_gro:1;
> __u8 csum_not_inet:1;
> -#ifdef CONFIG_NET_SCHED
> +#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> __u16 tc_index; /* traffic control index */
> #endif
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index d5517719af4e..bc5c1da2d30f 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -693,7 +693,7 @@ int skb_do_redirect(struct sk_buff *);
> static inline bool skb_at_tc_ingress(const struct sk_buff *skb)
> {
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
> return skb->tc_at_ingress;
> #else
> return false;
> diff --git a/include/net/xtc.h b/include/net/xtc.h
> new file mode 100644
> index 000000000000..627dc18aa433
> --- /dev/null
> +++ b/include/net/xtc.h
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2022 Isovalent */
> +#ifndef __NET_XTC_H
> +#define __NET_XTC_H
> +
> +#include <linux/idr.h>
> +#include <linux/bpf.h>
> +
> +#include <net/sch_generic.h>
> +
> +#define XTC_MAX_ENTRIES 30
> +/* Adds 1 NULL entry. */
> +#define XTC_MAX (XTC_MAX_ENTRIES + 1)
> +
> +struct xtc_entry {
> + struct bpf_prog_array_item items[XTC_MAX] ____cacheline_aligned;
> + struct xtc_entry_pair *parent;
> +};
> +
> +struct mini_Qdisc;
> +
> +struct xtc_entry_pair {
> + struct rcu_head rcu;
> + struct idr idr;
> + struct mini_Qdisc *miniq;
> + struct xtc_entry a;
> + struct xtc_entry b;
> +};
> +
> +static inline void xtc_set_ingress(struct sk_buff *skb, bool ingress)
> +{
> +#ifdef CONFIG_NET_XGRESS
> + skb->tc_at_ingress = ingress;
> +#endif
> +}
> +
> +#ifdef CONFIG_NET_XGRESS
> +void xtc_inc(void);
> +void xtc_dec(void);
> +
> +static inline void
> +dev_xtc_entry_update(struct net_device *dev, struct xtc_entry *entry,
> + bool ingress)
> +{
> + ASSERT_RTNL();
> + if (ingress)
> + rcu_assign_pointer(dev->xtc_ingress, entry);
> + else
> + rcu_assign_pointer(dev->xtc_egress, entry);
> + synchronize_rcu();
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_peer(const struct xtc_entry *entry)
> +{
> + if (entry == &entry->parent->a)
> + return &entry->parent->b;
> + else
> + return &entry->parent->a;
> +}
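(Side note for anyone reading along: the a/b pair above is a double
buffer; updates build the inactive copy and then flip it in. A minimal
sketch of the sequence the attach path later in this patch follows,
assuming rtnl is held and using only the helpers from this header:)

  static void xtc_flip_sketch(struct net_device *dev)
  {
          struct xtc_entry *active, *peer;

          active = rcu_dereference_rtnl(dev->xtc_ingress);
          peer = dev_xtc_entry_peer(active);     /* the inactive copy */
          dev_xtc_entry_clear(peer);
          /* ... fill peer->items from active->items plus the change ... */
          dev_xtc_entry_update(dev, peer, true); /* publish, then synchronize_rcu() */
  }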
> +
> +static inline struct xtc_entry *dev_xtc_entry_create(void)
> +{
> + struct xtc_entry_pair *pair = kzalloc(sizeof(*pair), GFP_KERNEL);
> +
> + if (pair) {
> + pair->a.parent = pair;
> + pair->b.parent = pair;
> + idr_init(&pair->idr);
> + return &pair->a;
> + }
> + return NULL;
> +}
> +
> +static inline struct xtc_entry *dev_xtc_entry_fetch(struct net_device *dev,
> + bool ingress, bool *created)
> +{
> + struct xtc_entry *entry = ingress ?
> + rcu_dereference_rtnl(dev->xtc_ingress) :
> + rcu_dereference_rtnl(dev->xtc_egress);
> +
> + *created = false;
> + if (!entry) {
> + entry = dev_xtc_entry_create();
> + if (!entry)
> + return NULL;
> + *created = true;
> + }
> + return entry;
> +}
> +
> +static inline void dev_xtc_entry_clear(struct xtc_entry *entry)
> +{
> + memset(entry->items, 0, sizeof(entry->items));
> +}
> +
> +static inline int dev_xtc_entry_prio_new(struct xtc_entry *entry, u32 prio,
> + struct bpf_prog *prog)
> +{
> + int ret;
> +
> + if (prio == 0)
> + prio = 1;
> + ret = idr_alloc_u32(&entry->parent->idr, prog, &prio, U32_MAX,
> + GFP_KERNEL);
> + return ret < 0 ? ret : prio;
> +}
> +
> +static inline void dev_xtc_entry_prio_set(struct xtc_entry *entry, u32 prio,
> + struct bpf_prog *prog)
> +{
> + idr_replace(&entry->parent->idr, prog, prio);
> +}
> +
> +static inline void dev_xtc_entry_prio_del(struct xtc_entry *entry, u32 prio)
> +{
> + idr_remove(&entry->parent->idr, prio);
> +}
> +
> +static inline void dev_xtc_entry_free(struct xtc_entry *entry)
> +{
> + idr_destroy(&entry->parent->idr);
> + kfree_rcu(entry->parent, rcu);
> +}
> +
> +static inline u32 dev_xtc_entry_total(struct xtc_entry *entry)
> +{
> + const struct bpf_prog_array_item *item;
> + const struct bpf_prog *prog;
> + u32 num = 0;
> +
> + item = &entry->items[0];
> + while ((prog = READ_ONCE(item->prog))) {
> + num++;
> + item++;
> + }
> + return num;
> +}
> +
> +static inline enum tc_action_base xtc_action_code(struct sk_buff *skb, int code)
> +{
> + switch (code) {
> + case TC_PASS:
> + skb->tc_index = qdisc_skb_cb(skb)->tc_classid;
> + fallthrough;
> + case TC_DROP:
> + case TC_REDIRECT:
> + return code;
> + case TC_NEXT:
> + default:
> + return TC_NEXT;
> + }
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int xtc_prog_detach(const union bpf_attr *attr);
> +int xtc_prog_query(const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
> +void dev_xtc_uninstall(struct net_device *dev);
> +#else
> +static inline int xtc_prog_attach(const union bpf_attr *attr,
> + struct bpf_prog *prog)
> +{
> + return -EINVAL;
> +}
> +
> +static inline int xtc_prog_detach(const union bpf_attr *attr)
> +{
> + return -EINVAL;
> +}
> +
> +static inline int xtc_prog_query(const union bpf_attr *attr,
> + union bpf_attr __user *uattr)
> +{
> + return -EINVAL;
> +}
> +
> +static inline void dev_xtc_uninstall(struct net_device *dev)
> +{
> +}
> +#endif /* CONFIG_NET_XGRESS */
> +#endif /* __NET_XTC_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
> BPF_PERF_EVENT,
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> + BPF_NET_INGRESS,
> + BPF_NET_EGRESS,
> __MAX_BPF_ATTACH_TYPE
> };
> @@ -1399,14 +1401,20 @@ union bpf_attr {
> };
> struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> - __u32 target_fd; /* container object to attach to */
> + union {
> + __u32 target_fd; /* container object to attach to */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_bpf_fd; /* eBPF program to attach */
> __u32 attach_type;
> __u32 attach_flags;
> - __u32 replace_bpf_fd; /* previously attached eBPF
[..]
> + union {
> + __u32 attach_priority;
> + __u32 replace_bpf_fd; /* previously attached eBPF
> * program to replace if
> * BPF_F_REPLACE is used
> */
> + };
The series looks exciting; I haven't had a chance to look deeply yet,
but will try to find some time this week.

We chatted briefly about priority during the talk, so let's maybe
discuss it here some more?
As a user, I still really have no clue what priority to use. We already
have this problem with tc, and we'll seemingly have the same problem
here? I guess it's even more relevant in k8s, because internally at G we
can control the users.
Is it worth at least trying to provide some default bands / guidance?
For example, having SEC('tc/ingress') receive attach_priority=124 by
default? Maybe we can even have something like 'tc/ingress_first' get
attach_priority=1 and 'tc/ingress_last' get attach_priority=254?
(The names are arbitrary; we can do something better.)
ingress_first/ingress_last can be used by some monitoring jobs; the rest
can use the default 124. If somebody really needs a custom priority,
they can manually use something around 124/2 to trigger before the
'default' priority, or around 124+124/2 to trigger after?
Thoughts? Is it worth it? Do we care?
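To make it concrete, the BPF side of what I have in mind would be
something like the below. This is just a sketch: the section names and
the priority numbers are made up, and no such libbpf convention exists
today; TC_NEXT comes from the new uapi header in this series.

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc/ingress")        /* libbpf would default to attach_priority=124 */
  int prog_default(struct __sk_buff *skb) { return TC_NEXT; }

  SEC("tc/ingress_first")  /* attach_priority=1, e.g. for monitoring */
  int prog_first(struct __sk_buff *skb) { return TC_NEXT; }

  SEC("tc/ingress_last")   /* attach_priority=254 */
  int prog_last(struct __sk_buff *skb) { return TC_NEXT; }

  char LICENSE[] SEC("license") = "GPL";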
> };
> struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
> } info;
> struct { /* anonymous struct used by BPF_PROG_QUERY command */
> - __u32 target_fd; /* container object to query */
> + union {
> + __u32 target_fd; /* container object to query */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_type;
> __u32 query_flags;
> __u32 attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
> };
> };
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> + TC_NEXT = -1,
> + TC_PASS = 0,
> + TC_DROP = 2,
> + TC_REDIRECT = 7,
> +};
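(For readers: with the mapping above, a minimal program would look like
the following sketch; the return codes come from the enum above, the
rest is the usual tc BPF boilerplate.)

  #include <linux/bpf.h>
  #include <bpf/bpf_endian.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int block_non_ip(struct __sk_buff *skb)
  {
          /* Terminate processing with a verdict ... */
          if (skb->protocol != bpf_htons(0x0800 /* ETH_P_IP */))
                  return TC_DROP;
          /* ... or continue with the next program in the array. */
          return TC_NEXT;
  }

  char LICENSE[] SEC("license") = "GPL";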
> +
> struct bpf_xdp_sock {
> __u32 queue_id;
> };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
> __be32 flow_label;
> };
> +struct bpf_query_info {
> + __u32 prog_id;
> + __u32 prio;
> +};
> +
> struct bpf_func_info {
> __u32 insn_off;
> __u32 type_id;
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..6a906ff93006 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -31,6 +31,7 @@ config BPF_SYSCALL
> select TASKS_TRACE_RCU
> select BINARY_PRINTF
> select NET_SOCK_MSG if NET
> + select NET_XGRESS if NET
> select PAGE_POOL if NET
> default n
> help
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 341c94f208f4..76c3f9d4e2f3 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
> obj-$(CONFIG_BPF_SYSCALL) += offload.o
> obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o
> +obj-$(CONFIG_BPF_SYSCALL) += net.o
> endif
> ifeq ($(CONFIG_PERF_EVENTS),y)
> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> diff --git a/kernel/bpf/net.c b/kernel/bpf/net.c
> new file mode 100644
> index 000000000000..ab9a9dee615b
> --- /dev/null
> +++ b/kernel/bpf/net.c
> @@ -0,0 +1,274 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Isovalent */
> +
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/xtc.h>
> +
> +static int __xtc_prog_attach(struct net_device *dev, bool ingress, u32 limit,
> + struct bpf_prog *nprog, u32 prio, u32 flags)
> +{
> + struct bpf_prog_array_item *item, *tmp;
> + struct xtc_entry *entry, *peer;
> + struct bpf_prog *oprog;
> + bool created;
> + int i, j;
> +
> + ASSERT_RTNL();
> +
> + entry = dev_xtc_entry_fetch(dev, ingress, &created);
> + if (!entry)
> + return -ENOMEM;
> + for (i = 0; i < limit; i++) {
> + item = &entry->items[i];
> + oprog = item->prog;
> + if (!oprog)
> + break;
> + if (item->bpf_priority == prio) {
> + if (flags & BPF_F_REPLACE) {
> + /* Pairs with READ_ONCE() in xtc_run_progs(). */
> + WRITE_ONCE(item->prog, nprog);
> + bpf_prog_put(oprog);
> + dev_xtc_entry_prio_set(entry, prio, nprog);
> + return prio;
> + }
> + return -EBUSY;
> + }
> + }
> + if (dev_xtc_entry_total(entry) >= limit)
> + return -ENOSPC;
> + prio = dev_xtc_entry_prio_new(entry, prio, nprog);
> + if (prio < 0) {
> + if (created)
> + dev_xtc_entry_free(entry);
> + return -ENOMEM;
> + }
> + peer = dev_xtc_entry_peer(entry);
> + dev_xtc_entry_clear(peer);
> + for (i = 0, j = 0; i < limit; i++, j++) {
> + item = &entry->items[i];
> + tmp = &peer->items[j];
> + oprog = item->prog;
> + if (!oprog) {
> + if (i == j) {
> + tmp->prog = nprog;
> + tmp->bpf_priority = prio;
> + }
> + break;
> + } else if (item->bpf_priority < prio) {
> + tmp->prog = oprog;
> + tmp->bpf_priority = item->bpf_priority;
> + } else if (item->bpf_priority > prio) {
> + if (i == j) {
> + tmp->prog = nprog;
> + tmp->bpf_priority = prio;
> + tmp = &peer->items[++j];
> + }
> + tmp->prog = oprog;
> + tmp->bpf_priority = item->bpf_priority;
> + }
> + }
> + dev_xtc_entry_update(dev, peer, ingress);
> + if (ingress)
> + net_inc_ingress_queue();
> + else
> + net_inc_egress_queue();
> + xtc_inc();
> + return prio;
> +}
> +
> +int xtc_prog_attach(const union bpf_attr *attr, struct bpf_prog *nprog)
> +{
> + struct net *net = current->nsproxy->net_ns;
> + bool ingress = attr->attach_type == BPF_NET_INGRESS;
> + struct net_device *dev;
> + int ret;
> +
> + if (attr->attach_flags & ~BPF_F_REPLACE)
> + return -EINVAL;
> + rtnl_lock();
> + dev = __dev_get_by_index(net, attr->target_ifindex);
> + if (!dev) {
> + rtnl_unlock();
> + return -EINVAL;
> + }
> + ret = __xtc_prog_attach(dev, ingress, XTC_MAX_ENTRIES, nprog,
> + attr->attach_priority, attr->attach_flags);
> + rtnl_unlock();
> + return ret;
> +}
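FWIW, if I read the uapi changes right, the userspace side of an attach
then looks roughly like this (raw-syscall sketch, since there is no
libbpf wrapper yet):

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int xtc_attach(int prog_fd, int ifindex, __u32 prio)
  {
          union bpf_attr attr = {};

          attr.target_ifindex  = ifindex;
          attr.attach_bpf_fd   = prog_fd;
          attr.attach_type     = BPF_NET_INGRESS;
          attr.attach_priority = prio; /* 0 auto-allocates via the idr */

          /* On success this returns the (auto-)allocated priority. */
          return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
  }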
> +
> +static int __xtc_prog_detach(struct net_device *dev, bool ingress, u32 limit,
> + u32 prio)
> +{
> + struct bpf_prog_array_item *item, *tmp;
> + struct bpf_prog *oprog, *fprog = NULL;
> + struct xtc_entry *entry, *peer;
> + int i, j;
> +
> + ASSERT_RTNL();
> +
> + entry = ingress ?
> + rcu_dereference_rtnl(dev->xtc_ingress) :
> + rcu_dereference_rtnl(dev->xtc_egress);
> + if (!entry)
> + return -ENOENT;
> + peer = dev_xtc_entry_peer(entry);
> + dev_xtc_entry_clear(peer);
> + for (i = 0, j = 0; i < limit; i++) {
> + item = &entry->items[i];
> + tmp = &peer->items[j];
> + oprog = item->prog;
> + if (!oprog)
> + break;
> + if (item->bpf_priority != prio) {
> + tmp->prog = oprog;
> + tmp->bpf_priority = item->bpf_priority;
> + j++;
> + } else {
> + fprog = oprog;
> + }
> + }
> + if (fprog) {
> + dev_xtc_entry_prio_del(peer, prio);
> + if (dev_xtc_entry_total(peer) == 0 && !entry->parent->miniq)
> + peer = NULL;
> + dev_xtc_entry_update(dev, peer, ingress);
> + bpf_prog_put(fprog);
> + if (!peer)
> + dev_xtc_entry_free(entry);
> + if (ingress)
> + net_dec_ingress_queue();
> + else
> + net_dec_egress_queue();
> + xtc_dec();
> + return 0;
> + }
> + return -ENOENT;
> +}
> +
> +int xtc_prog_detach(const union bpf_attr *attr)
> +{
> + struct net *net = current->nsproxy->net_ns;
> + bool ingress = attr->attach_type == BPF_NET_INGRESS;
> + struct net_device *dev;
> + int ret;
> +
> + if (attr->attach_flags || !attr->attach_priority)
> + return -EINVAL;
> + rtnl_lock();
> + dev = __dev_get_by_index(net, attr->target_ifindex);
> + if (!dev) {
> + rtnl_unlock();
> + return -EINVAL;
> + }
> + ret = __xtc_prog_detach(dev, ingress, XTC_MAX_ENTRIES,
> + attr->attach_priority);
> + rtnl_unlock();
> + return ret;
> +}
> +
> +static void __xtc_prog_detach_all(struct net_device *dev, bool ingress, u32 limit)
> +{
> + struct bpf_prog_array_item *item;
> + struct xtc_entry *entry;
> + struct bpf_prog *prog;
> + int i;
> +
> + ASSERT_RTNL();
> +
> + entry = ingress ?
> + rcu_dereference_rtnl(dev->xtc_ingress) :
> + rcu_dereference_rtnl(dev->xtc_egress);
> + if (!entry)
> + return;
> + dev_xtc_entry_update(dev, NULL, ingress);
> + for (i = 0; i < limit; i++) {
> + item = &entry->items[i];
> + prog = item->prog;
> + if (!prog)
> + break;
> + dev_xtc_entry_prio_del(entry, item->bpf_priority);
> + bpf_prog_put(prog);
> + if (ingress)
> + net_dec_ingress_queue();
> + else
> + net_dec_egress_queue();
> + xtc_dec();
> + }
> + dev_xtc_entry_free(entry);
> +}
> +
> +void dev_xtc_uninstall(struct net_device *dev)
> +{
> + __xtc_prog_detach_all(dev, true, XTC_MAX_ENTRIES + 1);
> + __xtc_prog_detach_all(dev, false, XTC_MAX_ENTRIES + 1);
> +}
> +
> +static int
> +__xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
> + struct net_device *dev, bool ingress, u32 limit)
> +{
> + struct bpf_query_info info, __user *uinfo;
> + struct bpf_prog_array_item *item;
> + struct xtc_entry *entry;
> + struct bpf_prog *prog;
> + u32 i, flags = 0, cnt;
> + int ret = 0;
> +
> + ASSERT_RTNL();
> +
> + entry = ingress ?
> + rcu_dereference_rtnl(dev->xtc_ingress) :
> + rcu_dereference_rtnl(dev->xtc_egress);
> + if (!entry)
> + return -ENOENT;
> + cnt = dev_xtc_entry_total(entry);
> + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
> + return -EFAULT;
> + if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt)))
> + return -EFAULT;
> + uinfo = u64_to_user_ptr(attr->query.prog_ids);
> + if (attr->query.prog_cnt == 0 || !uinfo || !cnt)
> + /* return early if user requested only program count + flags */
> + return 0;
> + if (attr->query.prog_cnt < cnt) {
> + cnt = attr->query.prog_cnt;
> + ret = -ENOSPC;
> + }
> + for (i = 0; i < limit; i++) {
> + item = &entry->items[i];
> + prog = item->prog;
> + if (!prog)
> + break;
> + info.prog_id = prog->aux->id;
> + info.prio = item->bpf_priority;
> + if (copy_to_user(uinfo + i, &info, sizeof(info)))
> + return -EFAULT;
> + if (i + 1 == cnt)
> + break;
> + }
> + return ret;
> +}
> +
> +int xtc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> + struct net *net = current->nsproxy->net_ns;
> + bool ingress = attr->query.attach_type == BPF_NET_INGRESS;
> + struct net_device *dev;
> + int ret;
> +
> + if (attr->query.query_flags || attr->query.attach_flags)
> + return -EINVAL;
> + rtnl_lock();
> + dev = __dev_get_by_index(net, attr->query.target_ifindex);
> + if (!dev) {
> + rtnl_unlock();
> + return -EINVAL;
> + }
> + ret = __xtc_prog_query(attr, uattr, dev, ingress, XTC_MAX_ENTRIES);
> + rtnl_unlock();
> + return ret;
> +}
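And the query side for completeness (again a sketch; 30 is hardcoded
below only because XTC_MAX_ENTRIES is kernel-internal, not uapi):

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static void xtc_dump(int ifindex)
  {
          struct bpf_query_info info[30] = {};
          union bpf_attr attr = {};
          __u32 i;

          attr.query.target_ifindex = ifindex;
          attr.query.attach_type    = BPF_NET_INGRESS;
          attr.query.prog_ids       = (__u64)(unsigned long)info;
          attr.query.prog_cnt       = 30;

          if (syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)))
                  return;
          /* the kernel writes back the number of entries filled */
          for (i = 0; i < attr.query.prog_cnt; i++)
                  printf("prog id %u prio %u\n", info[i].prog_id, info[i].prio);
  }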
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 7b373a5e861f..a0a670b964bb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -36,6 +36,8 @@
> #include <linux/memcontrol.h>
> #include <linux/trace_events.h>
> +#include <net/xtc.h>
> +
> #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
> (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
> (map)->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> @@ -3448,6 +3450,9 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
> return BPF_PROG_TYPE_XDP;
> case BPF_LSM_CGROUP:
> return BPF_PROG_TYPE_LSM;
> + case BPF_NET_INGRESS:
> + case BPF_NET_EGRESS:
> + return BPF_PROG_TYPE_SCHED_CLS;
> default:
> return BPF_PROG_TYPE_UNSPEC;
> }
[..]
> @@ -3466,18 +3471,15 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> if (CHECK_ATTR(BPF_PROG_ATTACH))
> return -EINVAL;
> -
> if (attr->attach_flags & ~BPF_F_ATTACH_MASK)
> return -EINVAL;
> ptype = attach_type_to_prog_type(attr->attach_type);
> if (ptype == BPF_PROG_TYPE_UNSPEC)
> return -EINVAL;
> -
> prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
> if (IS_ERR(prog))
> return PTR_ERR(prog);
> -
> if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) {
> bpf_prog_put(prog);
> return -EINVAL;
This whole chunk (whitespace-only changes) can probably be dropped?
> @@ -3508,16 +3510,18 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> ret = cgroup_bpf_prog_attach(attr, ptype, prog);
> break;
> + case BPF_PROG_TYPE_SCHED_CLS:
> + ret = xtc_prog_attach(attr, prog);
> + break;
> default:
> ret = -EINVAL;
> }
> -
> - if (ret)
> + if (ret < 0)
> bpf_prog_put(prog);
> return ret;
> }
> -#define BPF_PROG_DETACH_LAST_FIELD attach_type
> +#define BPF_PROG_DETACH_LAST_FIELD replace_bpf_fd
> static int bpf_prog_detach(const union bpf_attr *attr)
> {
> @@ -3527,6 +3531,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> return -EINVAL;
> ptype = attach_type_to_prog_type(attr->attach_type);
> + if (ptype != BPF_PROG_TYPE_SCHED_CLS &&
> + (attr->attach_flags || attr->replace_bpf_fd))
> + return -EINVAL;
> switch (ptype) {
> case BPF_PROG_TYPE_SK_MSG:
> @@ -3545,6 +3552,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> case BPF_PROG_TYPE_SOCK_OPS:
> case BPF_PROG_TYPE_LSM:
> return cgroup_bpf_prog_detach(attr, ptype);
> + case BPF_PROG_TYPE_SCHED_CLS:
> + return xtc_prog_detach(attr);
> default:
> return -EINVAL;
> }
> @@ -3598,6 +3607,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
> case BPF_SK_MSG_VERDICT:
> case BPF_SK_SKB_VERDICT:
> return sock_map_bpf_prog_query(attr, uattr);
> + case BPF_NET_INGRESS:
> + case BPF_NET_EGRESS:
> + return xtc_prog_query(attr, uattr);
> default:
> return -EINVAL;
> }
> diff --git a/net/Kconfig b/net/Kconfig
> index 48c33c222199..b7a9cd174464 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -52,6 +52,11 @@ config NET_INGRESS
> config NET_EGRESS
> bool
> +config NET_XGRESS
> + select NET_INGRESS
> + select NET_EGRESS
> + bool
> +
> config NET_REDIRECT
> bool
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa53830d0683..552b805c27dd 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -107,6 +107,7 @@
> #include <net/pkt_cls.h>
> #include <net/checksum.h>
> #include <net/xfrm.h>
> +#include <net/xtc.h>
> #include <linux/highmem.h>
> #include <linux/init.h>
> #include <linux/module.h>
> @@ -154,7 +155,6 @@
> #include "dev.h"
> #include "net-sysfs.h"
> -
> static DEFINE_SPINLOCK(ptype_lock);
> struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
> struct list_head ptype_all __read_mostly; /* Taps */
> @@ -3935,69 +3935,199 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
> EXPORT_SYMBOL(dev_loopback_xmit);
> #ifdef CONFIG_NET_EGRESS
> -static struct sk_buff *
> -sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +static struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> + int qm = skb_get_queue_mapping(skb);
> +
> + return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
> +static bool netdev_xmit_txqueue_skipped(void)
> +{
> + return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +void netdev_xmit_skip_txqueue(bool skip)
> +{
> + __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> +#endif /* CONFIG_NET_EGRESS */
> +
> +#ifdef CONFIG_NET_XGRESS
> +static int tc_run(struct xtc_entry *entry, struct sk_buff *skb)
> {
> + int ret = TC_ACT_UNSPEC;
> #ifdef CONFIG_NET_CLS_ACT
> - struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
> - struct tcf_result cl_res;
> + struct mini_Qdisc *miniq = rcu_dereference_bh(entry->parent->miniq);
> + struct tcf_result res;
> if (!miniq)
> - return skb;
> + return ret;
> - /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
> tc_skb_cb(skb)->mru = 0;
> tc_skb_cb(skb)->post_ct = false;
> - mini_qdisc_bstats_cpu_update(miniq, skb);
> - switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> + mini_qdisc_bstats_cpu_update(miniq, skb);
> + ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> + /* Only tcf related quirks below. */
> + switch (ret) {
> + case TC_ACT_SHOT:
> + mini_qdisc_qstats_cpu_drop(miniq);
> + break;
> case TC_ACT_OK:
> case TC_ACT_RECLASSIFY:
> - skb->tc_index = TC_H_MIN(cl_res.classid);
> + skb->tc_index = TC_H_MIN(res.classid);
> break;
> + }
> +#endif /* CONFIG_NET_CLS_ACT */
> + return ret;
> +}
> +
> +static DEFINE_STATIC_KEY_FALSE(xtc_needed_key);
> +
> +void xtc_inc(void)
> +{
> + static_branch_inc(&xtc_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(xtc_inc);
> +
> +void xtc_dec(void)
> +{
> + static_branch_dec(&xtc_needed_key);
> +}
> +EXPORT_SYMBOL_GPL(xtc_dec);
> +
> +static __always_inline enum tc_action_base
> +xtc_run(const struct xtc_entry *entry, struct sk_buff *skb,
> + const bool needs_mac)
> +{
> + const struct bpf_prog_array_item *item;
> + const struct bpf_prog *prog;
> + int ret = TC_NEXT;
> +
> + if (needs_mac)
> + __skb_push(skb, skb->mac_len);
> + item = &entry->items[0];
> + while ((prog = READ_ONCE(item->prog))) {
> + bpf_compute_data_pointers(skb);
> + ret = bpf_prog_run(prog, skb);
> + if (ret != TC_NEXT)
> + break;
> + item++;
> + }
> + if (needs_mac)
> + __skb_pull(skb, skb->mac_len);
> + return xtc_action_code(skb, ret);
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> + struct net_device *orig_dev, bool *another)
> +{
> + struct xtc_entry *entry = rcu_dereference_bh(skb->dev->xtc_ingress);
> + int sch_ret;
> +
> + if (!entry)
> + return skb;
> + if (*pt_prev) {
> + *ret = deliver_skb(skb, *pt_prev, orig_dev);
> + *pt_prev = NULL;
> + }
> +
> + qdisc_skb_cb(skb)->pkt_len = skb->len;
> + xtc_set_ingress(skb, true);
> +
> + if (static_branch_unlikely(&xtc_needed_key)) {
> + sch_ret = xtc_run(entry, skb, true);
> + if (sch_ret != TC_ACT_UNSPEC)
> + goto ingress_verdict;
> + }
> + sch_ret = tc_run(entry, skb);
> +ingress_verdict:
> + switch (sch_ret) {
> + case TC_ACT_REDIRECT:
> + /* skb_mac_header check was done by BPF, so we can safely
> + * push the L2 header back before redirecting to another
> + * netdev.
> + */
> + __skb_push(skb, skb->mac_len);
> + if (skb_do_redirect(skb) == -EAGAIN) {
> + __skb_pull(skb, skb->mac_len);
> + *another = true;
> + break;
> + }
> + return NULL;
> case TC_ACT_SHOT:
> - mini_qdisc_qstats_cpu_drop(miniq);
> - *ret = NET_XMIT_DROP;
> - kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> + kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> return NULL;
> + /* used by tc_run */
> case TC_ACT_STOLEN:
> case TC_ACT_QUEUED:
> case TC_ACT_TRAP:
> - *ret = NET_XMIT_SUCCESS;
> consume_skb(skb);
> + fallthrough;
> + case TC_ACT_CONSUMED:
> return NULL;
> + }
> +
> + return skb;
> +}
> +
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> +{
> + struct xtc_entry *entry = rcu_dereference_bh(dev->xtc_egress);
> + int sch_ret;
> +
> + if (!entry)
> + return skb;
> +
> + /* qdisc_skb_cb(skb)->pkt_len & xtc_set_ingress() was
> + * already set by the caller.
> + */
> + if (static_branch_unlikely(&xtc_needed_key)) {
> + sch_ret = xtc_run(entry, skb, false);
> + if (sch_ret != TC_ACT_UNSPEC)
> + goto egress_verdict;
> + }
> + sch_ret = tc_run(entry, skb);
> +egress_verdict:
> + switch (sch_ret) {
> case TC_ACT_REDIRECT:
> + *ret = NET_XMIT_SUCCESS;
> /* No need to push/pop skb's mac_header here on egress! */
> skb_do_redirect(skb);
> + return NULL;
> + case TC_ACT_SHOT:
> + *ret = NET_XMIT_DROP;
> + kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> + return NULL;
> + /* used by tc_run */
> + case TC_ACT_STOLEN:
> + case TC_ACT_QUEUED:
> + case TC_ACT_TRAP:
> *ret = NET_XMIT_SUCCESS;
> return NULL;
> - default:
> - break;
> }
> -#endif /* CONFIG_NET_CLS_ACT */
> return skb;
> }
> -
> -static struct netdev_queue *
> -netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> -{
> - int qm = skb_get_queue_mapping(skb);
> -
> - return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> -}
> -
> -static bool netdev_xmit_txqueue_skipped(void)
> +#else
> +static __always_inline struct sk_buff *
> +sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> + struct net_device *orig_dev, bool *another)
> {
> - return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> + return skb;
> }
> -void netdev_xmit_skip_txqueue(bool skip)
> +static __always_inline struct sk_buff *
> +sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> {
> - __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> + return skb;
> }
> -EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
> -#endif /* CONFIG_NET_EGRESS */
> +#endif /* CONFIG_NET_XGRESS */
> #ifdef CONFIG_XPS
> static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
> @@ -4181,9 +4311,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> skb_update_prio(skb);
> qdisc_pkt_len_init(skb);
> -#ifdef CONFIG_NET_CLS_ACT
> - skb->tc_at_ingress = 0;
> -#endif
> + xtc_set_ingress(skb, false);
> #ifdef CONFIG_NET_EGRESS
> if (static_branch_unlikely(&egress_needed_key)) {
> if (nf_hook_egress_active()) {
> @@ -5101,68 +5229,6 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
> EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
> #endif
> -static inline struct sk_buff *
> -sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> - struct net_device *orig_dev, bool *another)
> -{
> -#ifdef CONFIG_NET_CLS_ACT
> - struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
> - struct tcf_result cl_res;
> -
> - /* If there's at least one ingress present somewhere (so
> - * we get here via enabled static key), remaining devices
> - * that are not configured with an ingress qdisc will bail
> - * out here.
> - */
> - if (!miniq)
> - return skb;
> -
> - if (*pt_prev) {
> - *ret = deliver_skb(skb, *pt_prev, orig_dev);
> - *pt_prev = NULL;
> - }
> -
> - qdisc_skb_cb(skb)->pkt_len = skb->len;
> - tc_skb_cb(skb)->mru = 0;
> - tc_skb_cb(skb)->post_ct = false;
> - skb->tc_at_ingress = 1;
> - mini_qdisc_bstats_cpu_update(miniq, skb);
> -
> - switch (tcf_classify(skb, miniq->block, miniq->filter_list, &cl_res, false)) {
> - case TC_ACT_OK:
> - case TC_ACT_RECLASSIFY:
> - skb->tc_index = TC_H_MIN(cl_res.classid);
> - break;
> - case TC_ACT_SHOT:
> - mini_qdisc_qstats_cpu_drop(miniq);
> - kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> - return NULL;
> - case TC_ACT_STOLEN:
> - case TC_ACT_QUEUED:
> - case TC_ACT_TRAP:
> - consume_skb(skb);
> - return NULL;
> - case TC_ACT_REDIRECT:
> - /* skb_mac_header check was done by cls/act_bpf, so
> - * we can safely push the L2 header back before
> - * redirecting to another netdev
> - */
> - __skb_push(skb, skb->mac_len);
> - if (skb_do_redirect(skb) == -EAGAIN) {
> - __skb_pull(skb, skb->mac_len);
> - *another = true;
> - break;
> - }
> - return NULL;
> - case TC_ACT_CONSUMED:
> - return NULL;
> - default:
> - break;
> - }
> -#endif /* CONFIG_NET_CLS_ACT */
> - return skb;
> -}
> -
> /**
> * netdev_is_rx_handler_busy - check if receive handler is registered
> * @dev: device to check
> @@ -10832,7 +10898,7 @@ void unregister_netdevice_many(struct list_head *head)
> /* Shutdown queueing discipline. */
> dev_shutdown(dev);
> -
> + dev_xtc_uninstall(dev);
> dev_xdp_uninstall(dev);
> netdev_offload_xstats_disable_all(dev);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bb0136e7a8e4..ac4bb016c5ee 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9132,7 +9132,7 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
> __u8 value_reg = si->dst_reg;
> __u8 skb_reg = si->src_reg;
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
> /* If the tstamp_type is read,
> * the bpf prog is aware the tstamp could have delivery time.
> * Thus, read skb->tstamp as is if tstamp_type_access is true.
> @@ -9166,7 +9166,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
> __u8 value_reg = si->src_reg;
> __u8 skb_reg = si->dst_reg;
> -#ifdef CONFIG_NET_CLS_ACT
> +#ifdef CONFIG_NET_XGRESS
> /* If the tstamp_type is read,
> * the bpf prog is aware the tstamp could have delivery time.
> * Thus, write skb->tstamp as is if tstamp_type_access is true.
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 1e8ab4749c6c..c1b8f2e7d966 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -382,8 +382,7 @@ config NET_SCH_FQ_PIE
> config NET_SCH_INGRESS
> tristate "Ingress/classifier-action Qdisc"
> depends on NET_CLS_ACT
> - select NET_INGRESS
> - select NET_EGRESS
> + select NET_XGRESS
> help
> Say Y here if you want to use classifiers for incoming and/or outgoing
> packets. This qdisc doesn't do anything else besides running
> classifiers,
> @@ -753,6 +752,7 @@ config NET_EMATCH_IPT
> config NET_CLS_ACT
> bool "Actions"
> select NET_CLS
> + select NET_XGRESS
> help
> Say Y here if you want to use traffic control actions. Actions
> get attached to classifiers and are invoked after a successful
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..3bd37ee898ce 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -13,6 +13,7 @@
> #include <net/netlink.h>
> #include <net/pkt_sched.h>
> #include <net/pkt_cls.h>
> +#include <net/xtc.h>
> struct ingress_sched_data {
> struct tcf_block *block;
> @@ -78,11 +79,19 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
> {
> struct ingress_sched_data *q = qdisc_priv(sch);
> struct net_device *dev = qdisc_dev(sch);
> + struct xtc_entry *entry;
> + bool created;
> int err;
> net_inc_ingress_queue();
> - mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> + entry = dev_xtc_entry_fetch(dev, true, &created);
> + if (!entry)
> + return -ENOMEM;
> +
> + mini_qdisc_pair_init(&q->miniqp, sch, &entry->parent->miniq);
> + if (created)
> + dev_xtc_entry_update(dev, entry, true);
> q->block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
> q->block_info.chain_head_change = clsact_chain_head_change;
> @@ -93,15 +102,20 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
> return err;
> mini_qdisc_pair_block_init(&q->miniqp, q->block);
> -
> return 0;
> }
> static void ingress_destroy(struct Qdisc *sch)
> {
> struct ingress_sched_data *q = qdisc_priv(sch);
> + struct net_device *dev = qdisc_dev(sch);
> + struct xtc_entry *entry = rtnl_dereference(dev->xtc_ingress);
> tcf_block_put_ext(q->block, sch, &q->block_info);
> + if (entry && dev_xtc_entry_total(entry) == 0) {
> + dev_xtc_entry_update(dev, NULL, true);
> + dev_xtc_entry_free(entry);
> + }
> net_dec_ingress_queue();
> }
> @@ -217,12 +231,20 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
> {
> struct clsact_sched_data *q = qdisc_priv(sch);
> struct net_device *dev = qdisc_dev(sch);
> + struct xtc_entry *entry;
> + bool created;
> int err;
> net_inc_ingress_queue();
> net_inc_egress_queue();
> - mini_qdisc_pair_init(&q->miniqp_ingress, sch, &dev->miniq_ingress);
> + entry = dev_xtc_entry_fetch(dev, true, &created);
> + if (!entry)
> + return -ENOMEM;
> +
> + mini_qdisc_pair_init(&q->miniqp_ingress, sch, &entry->parent->miniq);
> + if (created)
> + dev_xtc_entry_update(dev, entry, true);
> q->ingress_block_info.binder_type =
> FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
> q->ingress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -235,7 +257,13 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
> mini_qdisc_pair_block_init(&q->miniqp_ingress, q->ingress_block);
> - mini_qdisc_pair_init(&q->miniqp_egress, sch, &dev->miniq_egress);
> + entry = dev_xtc_entry_fetch(dev, false, &created);
> + if (!entry)
> + return -ENOMEM;
> +
> + mini_qdisc_pair_init(&q->miniqp_egress, sch, &entry->parent->miniq);
> + if (created)
> + dev_xtc_entry_update(dev, entry, false);
> q->egress_block_info.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS;
> q->egress_block_info.chain_head_change = clsact_chain_head_change;
> @@ -247,9 +275,21 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
> static void clsact_destroy(struct Qdisc *sch)
> {
> struct clsact_sched_data *q = qdisc_priv(sch);
> + struct net_device *dev = qdisc_dev(sch);
> + struct xtc_entry *ingress_entry = rtnl_dereference(dev->xtc_ingress);
> + struct xtc_entry *egress_entry = rtnl_dereference(dev->xtc_egress);
> tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
> + if (egress_entry && dev_xtc_entry_total(egress_entry) == 0) {
> + dev_xtc_entry_update(dev, NULL, false);
> + dev_xtc_entry_free(egress_entry);
> + }
> +
> tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
> + if (ingress_entry && dev_xtc_entry_total(ingress_entry) == 0) {
> + dev_xtc_entry_update(dev, NULL, true);
> + dev_xtc_entry_free(ingress_entry);
> + }
> net_dec_ingress_queue();
> net_dec_egress_queue();
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 51b9aa640ad2..de1f5546bcfe 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1025,6 +1025,8 @@ enum bpf_attach_type {
> BPF_PERF_EVENT,
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> + BPF_NET_INGRESS,
> + BPF_NET_EGRESS,
> __MAX_BPF_ATTACH_TYPE
> };
> @@ -1399,14 +1401,20 @@ union bpf_attr {
> };
> struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
> - __u32 target_fd; /* container object to attach to */
> + union {
> + __u32 target_fd; /* container object to attach to */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_bpf_fd; /* eBPF program to attach */
> __u32 attach_type;
> __u32 attach_flags;
> - __u32 replace_bpf_fd; /* previously attached eBPF
> + union {
> + __u32 attach_priority;
> + __u32 replace_bpf_fd; /* previously attached eBPF
> * program to replace if
> * BPF_F_REPLACE is used
> */
> + };
> };
> struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
> @@ -1452,7 +1460,10 @@ union bpf_attr {
> } info;
> struct { /* anonymous struct used by BPF_PROG_QUERY command */
> - __u32 target_fd; /* container object to query */
> + union {
> + __u32 target_fd; /* container object to query */
> + __u32 target_ifindex; /* target ifindex */
> + };
> __u32 attach_type;
> __u32 query_flags;
> __u32 attach_flags;
> @@ -6038,6 +6049,19 @@ struct bpf_sock_tuple {
> };
> };
> +/* (Simplified) user return codes for tc prog type.
> + * A valid tc program must return one of these defined values. All other
> + * return codes are reserved for future use. Must remain compatible with
> + * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
> + * return codes are mapped to TC_NEXT.
> + */
> +enum tc_action_base {
> + TC_NEXT = -1,
> + TC_PASS = 0,
> + TC_DROP = 2,
> + TC_REDIRECT = 7,
> +};
> +
> struct bpf_xdp_sock {
> __u32 queue_id;
> };
> @@ -6804,6 +6828,11 @@ struct bpf_flow_keys {
> __be32 flow_label;
> };
> +struct bpf_query_info {
> + __u32 prog_id;
> + __u32 prio;
> +};
> +
> struct bpf_func_info {
> __u32 insn_off;
> __u32 type_id;
> --
> 2.34.1