This series adds support for executing multiple XDP programs on a single
interface in sequence, through the use of chain calls, as discussed at the
Linux Plumbers Conference last month:

https://linuxplumbersconf.org/event/4/contributions/460/

# HIGH-LEVEL IDEA

The basic idea is to express the chain call sequence through a special map
type, which contains a mapping from a (program, return code) tuple to
another program to run next in the sequence. Userspace can populate this
map to express arbitrary call sequences, and update the sequence by
updating or replacing the map.

The actual execution of the program sequence is done in bpf_prog_run_xdp(),
which will look up the chain sequence map, and if found, will loop through
calls to BPF_PROG_RUN, looking up the next XDP program in the sequence
based on the previous program ID and return code.

An XDP chain call map can be installed on an interface by means of a new
netlink attribute containing an fd pointing to a chain call map. This can
be supplied along with the XDP prog fd, so that a chain map is always
installed together with an XDP program.

# PERFORMANCE

I performed a simple performance test to get an initial feel for the
overhead of the chain call mechanism. This test consists of running only
two programs in sequence: one that returns XDP_PASS and another that
returns XDP_DROP. I then measure the drop PPS performance and compare it
to a baseline of just a single program that only returns XDP_DROP.

For comparison, a test case that uses regular eBPF tail calls to sequence
two programs together is also included. Finally, because 'perf' showed
that the hashmap lookup was the largest single source of overhead, I also
added a test case where I removed the jhash() call from the hashmap code,
and just use the u32 key directly as an index into the hash bucket
structure.

The performance for these different cases is as follows (with retpolines
disabled):

| Test case                       | Perf      | Add. overhead | Total overhead |
|---------------------------------+-----------+---------------+----------------|
| Before patch (XDP DROP program) | 31.0 Mpps |               |                |
| After patch (XDP DROP program)  | 28.9 Mpps |        2.3 ns |         2.3 ns |
| XDP tail call                   | 26.6 Mpps |        3.0 ns |         5.3 ns |
| XDP chain call (no jhash)       | 19.6 Mpps |       13.4 ns |        18.7 ns |
| XDP chain call (this series)    | 17.0 Mpps |        7.9 ns |        26.6 ns |

From this it is clear that there is some overhead from this mechanism, but
the jhash removal example indicates that it is probably possible to
optimise the code to the point where the overhead becomes low enough to be
acceptable.

# PATCH SET STRUCTURE

This series is structured as follows:

- Patch 1: Prerequisite
- Patch 2: New map type
- Patch 3: Netlink hooks to install the chain call map
- Patch 4: Core chain call logic
- Patches 5-7: Bookkeeping updates to tools
- Patch 8: Libbpf support for installing chain call maps
- Patch 9: Selftests with example user space code

The whole series is also available in my git repo on kernel.org:
https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/log/?h=xdp-multiprog-01

---

Toke Høiland-Jørgensen (9):
      hashtab: Add new bpf_map_fd_put_value op
      xdp: Add new xdp_chain_map type for specifying XDP call sequences
      xdp: Support setting and getting device chain map
      xdp: Implement chain call logic to support multiple programs on one interface
      tools/include/uapi: Add XDP chain map definitions
      tools/libbpf_probes: Add support for xdp_chain map type
      bpftool: Add definitions for xdp_chain map type
      libbpf: Add support for setting and getting XDP chain maps
      selftests: Add tests for XDP chain calls

 include/linux/bpf.h                             |  10 +
 include/linux/bpf_types.h                       |   1
 include/linux/filter.h                          |  26 ++
 include/linux/netdevice.h                       |   3
 include/uapi/linux/bpf.h                        |  12 +
 include/uapi/linux/if_link.h                    |   2
 kernel/bpf/hashtab.c                            | 169 +++++++++++++-
 kernel/bpf/map_in_map.c                         |   7 +
 kernel/bpf/map_in_map.h                         |   1
 kernel/bpf/syscall.c                            |  11 +
 net/core/dev.c                                  |  42 +++-
 net/core/rtnetlink.c                            |  23 ++
 tools/bpf/bpftool/Documentation/bpftool-map.rst |   4
 tools/bpf/bpftool/bash-completion/bpftool       |   2
 tools/bpf/bpftool/map.c                         |   3
 tools/include/uapi/linux/bpf.h                  |  12 +
 tools/include/uapi/linux/if_link.h              |   2
 tools/lib/bpf/libbpf.h                          |   4
 tools/lib/bpf/libbpf.map                        |   2
 tools/lib/bpf/libbpf_probes.c                   |   4
 tools/lib/bpf/netlink.c                         |  49 ++++
 tools/testing/selftests/bpf/.gitignore          |   1
 tools/testing/selftests/bpf/Makefile            |   3
 tools/testing/selftests/bpf/progs/xdp_dummy.c   |   6 +
 tools/testing/selftests/bpf/test_maps.c         |  45 ++++
 tools/testing/selftests/bpf/test_xdp_chain.sh   |  77 +++++++
 tools/testing/selftests/bpf/xdp_chain.c         | 271 +++++++++++++++++++++++
 27 files changed, 765 insertions(+), 27 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_xdp_chain.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_chain.c
From: Toke Høiland-Jørgensen <toke@redhat.com>

The fd type maps all resolve the fd into a pointer to the underlying
object when it is inserted into the map, and store that pointer as the
real array value. The htab code assumes that the map value is this
single pointer, and dereferences it before passing it to the map
fd_put_ptr() op.

For xdp chain maps we want to be able to store multiple pointers, so we
need to get the pointer to the map value store, not the dereferenced
pointer to the actual object. So add a new, more general
bpf_map_fd_put_value() op that takes the map value instead of the
dereferenced pointer, and use this on map element free.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf.h     |  1 +
 kernel/bpf/hashtab.c    | 16 ++++++----------
 kernel/bpf/map_in_map.c |  7 +++++++
 kernel/bpf/map_in_map.h |  1 +
 4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b9d22338606..be3e9e9109c7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -50,6 +50,7 @@ struct bpf_map_ops {
 	void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
 				int fd);
 	void (*map_fd_put_ptr)(void *ptr);
+	void (*map_fd_put_value)(void *value);
 	u32 (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf);
 	u32 (*map_fd_sys_lookup_elem)(void *ptr);
 	void (*map_seq_show_elem)(struct bpf_map *map, void *key,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a62c8c9..113e1286e184 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -87,11 +87,6 @@ static inline void __percpu *htab_elem_get_ptr(struct htab_elem *l, u32 key_size
 	return *(void __percpu **)(l->key + key_size);
 }
 
-static void *fd_htab_map_get_ptr(const struct bpf_map *map, struct htab_elem *l)
-{
-	return *(void **)(l->key + roundup(map->key_size, 8));
-}
-
 static struct htab_elem *get_htab_elem(struct bpf_htab *htab, int i)
 {
 	return (struct htab_elem *) (htab->elems + i * htab->elem_size);
@@ -679,10 +674,10 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	struct bpf_map *map = &htab->map;
 
-	if (map->ops->map_fd_put_ptr) {
-		void *ptr = fd_htab_map_get_ptr(map, l);
+	if (map->ops->map_fd_put_value) {
+		void *value = l->key + round_up(map->key_size, 8);
 
-		map->ops->map_fd_put_ptr(ptr);
+		map->ops->map_fd_put_value(value);
 	}
 
 	if (htab_is_prealloc(htab)) {
@@ -1400,9 +1395,9 @@ static void fd_htab_map_free(struct bpf_map *map)
 		head = select_bucket(htab, i);
 
 		hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
-			void *ptr = fd_htab_map_get_ptr(map, l);
+			void *value = l->key + round_up(map->key_size, 8);
 
-			map->ops->map_fd_put_ptr(ptr);
+			map->ops->map_fd_put_value(value);
 		}
 	}
 
@@ -1510,6 +1505,7 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_delete_elem = htab_map_delete_elem,
 	.map_fd_get_ptr = bpf_map_fd_get_ptr,
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
+	.map_fd_put_value = bpf_map_fd_put_value,
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index fab4fb134547..1b4e8b6da777 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -114,6 +114,13 @@ void bpf_map_fd_put_ptr(void *ptr)
 	bpf_map_put(ptr);
 }
 
+void bpf_map_fd_put_value(void *value)
+{
+	void **ptr = value;
+
+	bpf_map_fd_put_ptr(*ptr);
+}
+
 u32 bpf_map_fd_sys_lookup_elem(void *ptr)
 {
 	return ((struct bpf_map *)ptr)->id;
diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h
index a507bf6ef8b9..68d1a52e1757 100644
--- a/kernel/bpf/map_in_map.h
+++ b/kernel/bpf/map_in_map.h
@@ -16,6 +16,7 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
 			const struct bpf_map *meta1);
 void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd);
 void bpf_map_fd_put_ptr(void *ptr);
+void bpf_map_fd_put_value(void *value);
 u32 bpf_map_fd_sys_lookup_elem(void *ptr);
 
 #endif
From: Toke Høiland-Jørgensen <toke@redhat.com>

To specify call sequences for XDP programs, we need to do a lookup after
an XDP program runs to get the next program to run. Introduce a new map
type for this purpose, which allows specifying one action per return
code of the original program.

The new map type is derived from the hashmap type, and uses the program
ID as the lookup key. The value is a struct of program pointers
specifying the next program in the call sequence for each return code of
the previous program. A special wildcard pointer can also be specified,
which will match on all program return codes if no more specific entry
is included in the structure (i.e., LPM-style matching).

Userspace fills in a struct of program fds and passes that as the value,
and the map converts the fds to pointers before inserting the structure
into the map.

Using the map type directly from an eBPF program is disallowed; instead,
a subsequent commit will add lookup code to the XDP dispatch function,
and the ability for userspace to set the current chain call map when
attaching an XDP program to an interface.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> --- include/linux/bpf.h | 9 +++ include/linux/bpf_types.h | 1 include/uapi/linux/bpf.h | 12 ++++ kernel/bpf/hashtab.c | 153 +++++++++++++++++++++++++++++++++++++++++++++ kernel/bpf/syscall.c | 11 +++ 5 files changed, 186 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index be3e9e9109c7..e72702c4cb12 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -689,6 +689,15 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file, void *key, void *value, u64 map_flags); int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); +int bpf_xdp_chain_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags); +int bpf_xdp_chain_map_lookup_elem(struct bpf_map *map, void *key, void *value); +struct bpf_prog *bpf_xdp_chain_map_get_prog(struct bpf_map *map, + u32 prev_id, + enum xdp_action action); + + + int bpf_get_file_flag(int flags); int bpf_check_uarg_tail_zero(void __user *uaddr, size_t expected_size, size_t actual_size); diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 36a9c2325176..95650ed86a12 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -63,6 +63,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops) #ifdef CONFIG_NET BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_XDP_CHAIN, xdp_chain_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) #if defined(CONFIG_BPF_STREAM_PARSER) BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 77c6be96d676..8b336fb68880 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -136,6 +136,7 @@ enum bpf_map_type { BPF_MAP_TYPE_STACK, BPF_MAP_TYPE_SK_STORAGE, BPF_MAP_TYPE_DEVMAP_HASH, + BPF_MAP_TYPE_XDP_CHAIN, }; /* Note that tracing related programs such as 
@@ -3153,6 +3154,17 @@ enum xdp_action { XDP_PASS, XDP_TX, XDP_REDIRECT, + + __XDP_ACT_MAX /* leave at end */ +}; +#define XDP_ACT_MAX (__XDP_ACT_MAX - 1) + +struct xdp_chain_acts { + __u32 wildcard_act; + __u32 drop_act; + __u32 pass_act; + __u32 tx_act; + __u32 redirect_act; }; /* user accessible metadata for XDP packet hook diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 113e1286e184..ab855095c830 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1510,3 +1510,156 @@ const struct bpf_map_ops htab_of_maps_map_ops = { .map_gen_lookup = htab_of_map_gen_lookup, .map_check_btf = map_check_no_btf, }; + +struct xdp_chain_table { + struct bpf_prog *wildcard_act; + struct bpf_prog *act[XDP_ACT_MAX]; +}; + +static int xdp_chain_map_alloc_check(union bpf_attr *attr) +{ + BUILD_BUG_ON(sizeof(struct xdp_chain_table) / sizeof(void *) != + sizeof(struct xdp_chain_acts) / sizeof(u32)); + + if (attr->key_size != sizeof(u32) || + attr->value_size != sizeof(struct xdp_chain_acts)) + return -EINVAL; + + attr->value_size = sizeof(struct xdp_chain_table); + return htab_map_alloc_check(attr); +} + +struct bpf_prog *bpf_xdp_chain_map_get_prog(struct bpf_map *map, + u32 prev_id, + enum xdp_action action) +{ + struct xdp_chain_table *tab; + void *ptr; + + ptr = htab_map_lookup_elem(map, &prev_id); + + if (!ptr) + return NULL; + + tab = READ_ONCE(ptr); + return tab->act[action - 1] ?: tab->wildcard_act; +} +EXPORT_SYMBOL_GPL(bpf_xdp_chain_map_get_prog); + +/* only called from syscall */ +int bpf_xdp_chain_map_lookup_elem(struct bpf_map *map, void *key, void *value) +{ + struct xdp_chain_acts *act = value; + struct xdp_chain_table *tab; + void *ptr; + u32 *cur; + int i; + + ptr = htab_map_lookup_elem(map, key); + if (!ptr) + return -ENOENT; + + tab = READ_ONCE(ptr); + + if (tab->wildcard_act) + act->wildcard_act = tab->wildcard_act->aux->id; + + cur = &act->drop_act; + for (i = 0; i < XDP_ACT_MAX; i++, cur++) + if(tab->act[i]) + *cur = tab->act[i]->aux->id; + + 
return 0; +} + +static void *xdp_chain_map_get_ptr(int fd) +{ + struct bpf_prog *prog = bpf_prog_get(fd); + + if (IS_ERR(prog)) + return prog; + + if (prog->type != BPF_PROG_TYPE_XDP || + bpf_prog_is_dev_bound(prog->aux)) { + bpf_prog_put(prog); + return ERR_PTR(-EINVAL); + } + + return prog; +} + +static void xdp_chain_map_put_ptrs(void *value) +{ + struct xdp_chain_table *tab = value; + int i; + + for (i = 0; i < XDP_ACT_MAX; i++) + if (tab->act[i]) + bpf_prog_put(tab->act[i]); + + if (tab->wildcard_act) + bpf_prog_put(tab->wildcard_act); +} + +/* only called from syscall */ +int bpf_xdp_chain_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags) +{ + struct xdp_chain_acts *act = value; + struct xdp_chain_table tab = {}; + u32 lookup_key = *((u32*)key); + u32 *cur = &act->drop_act; + bool found_val = false; + int ret, i; + void *ptr; + + if (!lookup_key) + return -EINVAL; + + if (act->wildcard_act) { + ptr = xdp_chain_map_get_ptr(act->wildcard_act); + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + tab.wildcard_act = ptr; + found_val = true; + } + + for (i = 0; i < XDP_ACT_MAX; i++, cur++) { + if (*cur) { + ptr = xdp_chain_map_get_ptr(*cur); + if (IS_ERR(ptr)) { + ret = PTR_ERR(ptr); + goto out_err; + } + tab.act[i] = ptr; + found_val = true; + } + } + + if (!found_val) { + ret = -EINVAL; + goto out_err; + } + + ret = htab_map_update_elem(map, key, &tab, map_flags); + if (ret) + goto out_err; + + return ret; + +out_err: + xdp_chain_map_put_ptrs(&tab); + + return ret; +} + + +const struct bpf_map_ops xdp_chain_map_ops = { + .map_alloc_check = xdp_chain_map_alloc_check, + .map_alloc = htab_map_alloc, + .map_free = fd_htab_map_free, + .map_get_next_key = htab_map_get_next_key, + .map_delete_elem = htab_map_delete_elem, + .map_fd_put_value = xdp_chain_map_put_ptrs, + .map_check_btf = map_check_no_btf, +}; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 82eabd4e38ad..c9afaace048d 100644 --- a/kernel/bpf/syscall.c +++ 
b/kernel/bpf/syscall.c @@ -778,6 +778,8 @@ static int map_lookup_elem(union bpf_attr *attr) value_size = round_up(map->value_size, 8) * num_possible_cpus(); else if (IS_FD_MAP(map)) value_size = sizeof(u32); + else if (map->map_type == BPF_MAP_TYPE_XDP_CHAIN) + value_size = sizeof(struct xdp_chain_acts); else value_size = map->value_size; @@ -806,6 +808,8 @@ static int map_lookup_elem(union bpf_attr *attr) err = bpf_fd_array_map_lookup_elem(map, key, value); } else if (IS_FD_HASH(map)) { err = bpf_fd_htab_map_lookup_elem(map, key, value); + } else if (map->map_type == BPF_MAP_TYPE_XDP_CHAIN) { + err = bpf_xdp_chain_map_lookup_elem(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) { err = bpf_fd_reuseport_array_lookup_elem(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_QUEUE || @@ -908,6 +912,8 @@ static int map_update_elem(union bpf_attr *attr) map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY || map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) value_size = round_up(map->value_size, 8) * num_possible_cpus(); + else if (map->map_type == BPF_MAP_TYPE_XDP_CHAIN) + value_size = sizeof(struct xdp_chain_acts); else value_size = map->value_size; @@ -954,6 +960,11 @@ static int map_update_elem(union bpf_attr *attr) err = bpf_fd_htab_map_update_elem(map, f.file, key, value, attr->flags); rcu_read_unlock(); + } else if (map->map_type == BPF_MAP_TYPE_XDP_CHAIN) { + rcu_read_lock(); + err = bpf_xdp_chain_map_update_elem(map, key, value, + attr->flags); + rcu_read_unlock(); } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) { /* rcu_read_lock() is not needed */ err = bpf_fd_reuseport_array_update_elem(map, key, value,
From: Toke Høiland-Jørgensen <toke@redhat.com>

This adds support to rtnetlink for setting and getting the per-device
XDP chain map. The map is set by means of a new netlink attribute
containing an fd referring to a BPF map of the XDP chain type. If such
an attribute is included, the map will be inserted into the struct
net_device so that the XDP chain call code will pick it up on program
execution.

To prevent old userspace programs that do not understand the chain map
attribute from messing up the chain call order, a netlink message with
no chain map attribute set will be rejected if a chain map has already
been installed.

When installing a new chain call map, an XDP program fd must also be
provided, otherwise the operation will be rejected.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/netdevice.h    |  3 ++-
 include/uapi/linux/if_link.h |  2 ++
 net/core/dev.c               | 42 ++++++++++++++++++++++++++++++++++++------
 net/core/rtnetlink.c         | 23 +++++++++++++++++++++++
 4 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 48cc71aae466..60f3b510b175 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1941,6 +1941,7 @@ struct net_device {
 	unsigned int		real_num_rx_queues;
 
 	struct bpf_prog __rcu	*xdp_prog;
+	struct bpf_map __rcu	*xdp_chain_map;
 	unsigned long		gro_flush_timeout;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
@@ -3702,7 +3703,7 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
-		      int fd, u32 flags);
+		      int prog_fd, int chain_map_fd, u32 flags);
 u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
 		    enum bpf_netdev_command cmd);
 int xdp_umem_query(struct net_device *dev, u16 queue_id);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8aec8769d944..e7a704387608 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -976,6 +976,8 @@ enum { IFLA_XDP_DRV_PROG_ID, IFLA_XDP_SKB_PROG_ID, IFLA_XDP_HW_PROG_ID, + IFLA_XDP_CHAIN_MAP_FD, + IFLA_XDP_CHAIN_MAP_ID, __IFLA_XDP_MAX, }; diff --git a/net/core/dev.c b/net/core/dev.c index 7a456c6a7ad8..0aa5106339e7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8177,9 +8177,15 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op, static void dev_xdp_uninstall(struct net_device *dev) { + struct bpf_map *chain_map = NULL; struct netdev_bpf xdp; bpf_op_t ndo_bpf; + /* Remove chain map */ + rcu_swap_protected(dev->xdp_chain_map, chain_map, 1); + if(chain_map) + bpf_map_put_with_uref(chain_map); + /* Remove generic XDP */ WARN_ON(dev_xdp_install(dev, generic_xdp_install, NULL, 0, NULL)); @@ -8207,15 +8213,17 @@ static void dev_xdp_uninstall(struct net_device *dev) * dev_change_xdp_fd - set or clear a bpf program for a device rx path * @dev: device * @extack: netlink extended ack - * @fd: new program fd or negative value to clear + * @prog_fd: new program fd or negative value to clear + * @chain_map_fd: new chain map fd or negative value to clear * @flags: xdp-related flags * * Set or clear a bpf program for a device */ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, - int fd, u32 flags) + int prog_fd, int chain_map_fd, u32 flags) { const struct net_device_ops *ops = dev->netdev_ops; + struct bpf_map *chain_map = NULL; enum bpf_netdev_command query; struct bpf_prog *prog = NULL; bpf_op_t bpf_op, bpf_chk; @@ -8237,7 +8245,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, if (bpf_op == bpf_chk) bpf_chk = generic_xdp_install; - if (fd >= 0) { + if (prog_fd >= 0) { u32 prog_id; if (!offload && __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) { @@ -8251,7 +8259,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, return -EBUSY; } - 
prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP, + prog = bpf_prog_get_type_dev(prog_fd, BPF_PROG_TYPE_XDP, bpf_op == ops->ndo_bpf); if (IS_ERR(prog)) return PTR_ERR(prog); @@ -8267,13 +8275,35 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, return 0; } } else { + if (chain_map_fd >= 0) + return -EINVAL; + if (!__dev_xdp_query(dev, bpf_op, query)) return 0; } + if (chain_map_fd >= 0) { + chain_map = bpf_map_get_with_uref(chain_map_fd); + if (IS_ERR(chain_map)) + return PTR_ERR(chain_map); + + if (chain_map->map_type != BPF_MAP_TYPE_XDP_CHAIN) { + NL_SET_ERR_MSG(extack, "invalid chain map type"); + bpf_map_put_with_uref(chain_map); + return -EINVAL; + } + } + err = dev_xdp_install(dev, bpf_op, extack, flags, prog); - if (err < 0 && prog) - bpf_prog_put(prog); + if (err < 0) { + if (prog) + bpf_prog_put(prog); + } else { + rcu_swap_protected(dev->xdp_chain_map, chain_map, 1); + } + + if(chain_map) + bpf_map_put_with_uref(chain_map); return err; } diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 49fa910b58af..d6123efdff80 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1427,6 +1427,7 @@ static int rtnl_xdp_report_one(struct sk_buff *skb, struct net_device *dev, static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) { + struct bpf_map *chain_map; struct nlattr *xdp; u32 prog_id; int err; @@ -1461,6 +1462,13 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) goto err_cancel; } + chain_map = rcu_dereference(dev->xdp_chain_map); + if (chain_map) { + err = nla_put_u32(skb, IFLA_XDP_CHAIN_MAP_ID, chain_map->id); + if (err) + goto err_cancel; + } + nla_nest_end(skb, xdp); return 0; @@ -2756,12 +2764,27 @@ static int do_setlink(const struct sk_buff *skb, } if (xdp[IFLA_XDP_FD]) { + int chain_map_fd = -1; + + if (xdp[IFLA_XDP_CHAIN_MAP_FD]) { + chain_map_fd = nla_get_s32(xdp[IFLA_XDP_CHAIN_MAP_FD]); + } else if (dev->xdp_chain_map) { + NL_SET_ERR_MSG(extack, "no chain map 
attribute, but chain map loaded"); + err = -EINVAL; + goto errout; + } + err = dev_change_xdp_fd(dev, extack, nla_get_s32(xdp[IFLA_XDP_FD]), + chain_map_fd, xdp_flags); if (err) goto errout; status |= DO_SETLINK_NOTIFY; + } else if (xdp[IFLA_XDP_CHAIN_MAP_FD]) { + err = -EINVAL; + NL_SET_ERR_MSG(extack, "chain map attribute invalid without prog fd"); + goto errout; } }
From: Toke Høiland-Jørgensen <toke@redhat.com>

This adds support for executing multiple XDP programs on a single
interface using the chain call map type introduced in the previous
commits. The logic is added as an extension of bpf_prog_run_xdp(), which
will loop through the call sequence specified by the chain call map
installed on the current interface. The overhead when no chain map is
installed is only a single pointer dereference.

The total call sequence length is limited to 32 programs, and the call
sequence will be aborted and XDP_ABORTED returned if it is exceeded.
Likewise, if a program in the sequence returns XDP_ABORTED, the whole
sequence will be aborted immediately, on the assumption that this
indicates a fault somewhere in the system.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/filter.h | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2ce57645f3cd..8a79ddd4f7b5 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -693,6 +693,7 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return res;
 }
 
+#define BPF_XDP_MAX_CHAIN_CALLS 32
 static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 					    struct xdp_buff *xdp)
 {
@@ -702,7 +703,30 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 	 * already takes rcu_read_lock() when fetching the program, so
 	 * it's not necessary here anymore.
 	 */
-	return BPF_PROG_RUN(prog, xdp);
+
+	int i = BPF_XDP_MAX_CHAIN_CALLS;
+	struct bpf_map *chain_map;
+	u32 ret;
+
+	chain_map = rcu_dereference(xdp->rxq->dev->xdp_chain_map);
+	if (!chain_map)
+		return BPF_PROG_RUN(prog, xdp);
+
+	do {
+		if (!--i) {
+			ret = XDP_ABORTED;
+			goto out;
+		}
+
+		ret = BPF_PROG_RUN(prog, xdp);
+		if (ret == XDP_ABORTED)
+			goto out;
+
+		prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
+	} while(prog);
+
+out:
+	return ret;
 }
 
 static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
From: Toke Høiland-Jørgensen <toke@redhat.com> This syncs the XDP chain-map related UAPI definitions into tools/include/uapi. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> --- tools/include/uapi/linux/bpf.h | 12 ++++++++++++ tools/include/uapi/linux/if_link.h | 2 ++ 2 files changed, 14 insertions(+) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 77c6be96d676..8b336fb68880 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -136,6 +136,7 @@ enum bpf_map_type { BPF_MAP_TYPE_STACK, BPF_MAP_TYPE_SK_STORAGE, BPF_MAP_TYPE_DEVMAP_HASH, + BPF_MAP_TYPE_XDP_CHAIN, }; /* Note that tracing related programs such as @@ -3153,6 +3154,17 @@ enum xdp_action { XDP_PASS, XDP_TX, XDP_REDIRECT, + + __XDP_ACT_MAX /* leave at end */ +}; +#define XDP_ACT_MAX (__XDP_ACT_MAX - 1) + +struct xdp_chain_acts { + __u32 wildcard_act; + __u32 drop_act; + __u32 pass_act; + __u32 tx_act; + __u32 redirect_act; }; /* user accessible metadata for XDP packet hook diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h index 4a8c02cafa9a..7387d2371489 100644 --- a/tools/include/uapi/linux/if_link.h +++ b/tools/include/uapi/linux/if_link.h @@ -974,6 +974,8 @@ enum { IFLA_XDP_DRV_PROG_ID, IFLA_XDP_SKB_PROG_ID, IFLA_XDP_HW_PROG_ID, + IFLA_XDP_CHAIN_MAP_FD, + IFLA_XDP_CHAIN_MAP_ID, __IFLA_XDP_MAX, };
From: Toke Høiland-Jørgensen <toke@redhat.com> This adds support for the BPF_MAP_TYPE_XDP_CHAIN map type to libbpf_probes.c. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> --- tools/lib/bpf/libbpf_probes.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index 4b0b0364f5fc..266ae6e78f88 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -231,6 +231,10 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex) if (btf_fd < 0) return false; break; + case BPF_MAP_TYPE_XDP_CHAIN: + key_size = sizeof(__u32); + value_size = sizeof(struct xdp_chain_acts); + break; case BPF_MAP_TYPE_UNSPEC: case BPF_MAP_TYPE_HASH: case BPF_MAP_TYPE_ARRAY:
From: Toke Høiland-Jørgensen <toke@redhat.com> Adds bash completion and definition types for the xdp_chain map type to bpftool. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> --- tools/bpf/bpftool/Documentation/bpftool-map.rst | 4 ++-- tools/bpf/bpftool/bash-completion/bpftool | 2 +- tools/bpf/bpftool/map.c | 3 ++- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst index 1c0f7146aab0..9aefcef50b53 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst @@ -47,8 +47,8 @@ MAP COMMANDS | *TYPE* := { **hash** | **array** | **prog_array** | **perf_event_array** | **percpu_hash** | | **percpu_array** | **stack_trace** | **cgroup_array** | **lru_hash** | | **lru_percpu_hash** | **lpm_trie** | **array_of_maps** | **hash_of_maps** -| | **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash** -| | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** +| | **devmap** | **devmap_hash** | **xdp_chain** | **sockmap** | **cpumap** | **xskmap** +| | **sockhash** | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** | | **queue** | **stack** } DESCRIPTION diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index 70493a6da206..95f19191b8d1 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -489,7 +489,7 @@ _bpftool() perf_event_array percpu_hash percpu_array \ stack_trace cgroup_array lru_hash \ lru_percpu_hash lpm_trie array_of_maps \ - hash_of_maps devmap devmap_hash sockmap cpumap \ + hash_of_maps devmap devmap_hash xdp_chain sockmap cpumap \ xskmap sockhash cgroup_storage reuseport_sockarray \ percpu_cgroup_storage queue stack' -- \ "$cur" ) ) diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index de61d73b9030..97b5b42df79c 
100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -38,6 +38,7 @@ const char * const map_type_name[] = { [BPF_MAP_TYPE_HASH_OF_MAPS] = "hash_of_maps", [BPF_MAP_TYPE_DEVMAP] = "devmap", [BPF_MAP_TYPE_DEVMAP_HASH] = "devmap_hash", + [BPF_MAP_TYPE_XDP_CHAIN] = "xdp_chain", [BPF_MAP_TYPE_SOCKMAP] = "sockmap", [BPF_MAP_TYPE_CPUMAP] = "cpumap", [BPF_MAP_TYPE_XSKMAP] = "xskmap", @@ -1326,7 +1327,7 @@ static int do_help(int argc, char **argv) " TYPE := { hash | array | prog_array | perf_event_array | percpu_hash |\n" " percpu_array | stack_trace | cgroup_array | lru_hash |\n" " lru_percpu_hash | lpm_trie | array_of_maps | hash_of_maps |\n" - " devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n" + " devmap | devmap_hash | xdp_chain | sockmap | cpumap | xskmap | sockhash |\n" " cgroup_storage | reuseport_sockarray | percpu_cgroup_storage }\n" " " HELP_SPEC_OPTIONS "\n" "",
From: Toke Høiland-Jørgensen <toke@redhat.com> This adds two new API functions for setting and getting XDP programs with associated chain map IDs. Programs are expected to pair them, so that if a program uses the chain map-aware setter, it should also use the associated getter. Programs using the old non-chain aware variants of the functions will not set the XDP chain map attribute on the netlink message, resulting in the kernel rejecting the command if a chain map has already been loaded. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> --- tools/lib/bpf/libbpf.h | 4 ++++ tools/lib/bpf/libbpf.map | 2 ++ tools/lib/bpf/netlink.c | 49 ++++++++++++++++++++++++++++++++++++++++++---- 3 files changed, 51 insertions(+), 4 deletions(-) diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index e8f70977d137..0a459840e32c 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -357,7 +357,11 @@ LIBBPF_API int bpf_prog_load(const char *file, enum bpf_prog_type type, struct bpf_object **pobj, int *prog_fd); LIBBPF_API int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags); +LIBBPF_API int bpf_set_link_xdp_chain(int ifindex, int prog_fd, int chain_map_fd, + __u32 flags); LIBBPF_API int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags); +LIBBPF_API int bpf_get_link_xdp_chain(int ifindex, __u32 *prog_id, __u32 *chain_map_id, + __u32 flags); struct perf_buffer; diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 8d10ca03d78d..59f412680292 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -192,4 +192,6 @@ LIBBPF_0.0.5 { } LIBBPF_0.0.4; LIBBPF_0.0.6 { + bpf_set_link_xdp_chain; + bpf_get_link_xdp_chain; } LIBBPF_0.0.5; diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c index ce3ec81b71c0..c6f63bdab2e6 100644 --- a/tools/lib/bpf/netlink.c +++ b/tools/lib/bpf/netlink.c @@ -25,6 +25,7 @@ struct xdp_id_md { int ifindex; __u32 flags; __u32 id; + __u32 chain_map_id; }; int 
libbpf_netlink_open(__u32 *nl_pid) @@ -128,7 +129,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq, return ret; } -int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) +static int __bpf_set_link_xdp_fd(int ifindex, int *prog_fd, int *chain_map_fd, + __u32 flags) { int sock, seq = 0, ret; struct nlattr *nla, *nla_xdp; @@ -162,9 +164,19 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); nla_xdp->nla_type = IFLA_XDP_FD; nla_xdp->nla_len = NLA_HDRLEN + sizeof(int); - memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd)); + memcpy((char *)nla_xdp + NLA_HDRLEN, prog_fd, sizeof(*prog_fd)); nla->nla_len += nla_xdp->nla_len; + if (chain_map_fd) { + /* add XDP chain map */ + nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); + nla_xdp->nla_type = IFLA_XDP_CHAIN_MAP_FD; + nla_xdp->nla_len = NLA_HDRLEN + sizeof(int); + memcpy((char *)nla_xdp + NLA_HDRLEN, chain_map_fd, + sizeof(*chain_map_fd)); + nla->nla_len += nla_xdp->nla_len; + } + /* if user passed in any flags, add those too */ if (flags) { nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); @@ -187,6 +199,17 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) return ret; } +int bpf_set_link_xdp_chain(int ifindex, int prog_fd, int chain_map_fd, + __u32 flags) +{ + return __bpf_set_link_xdp_fd(ifindex, &prog_fd, &chain_map_fd, flags); +} + +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) +{ + return __bpf_set_link_xdp_fd(ifindex, &fd, NULL, flags); +} + static int __dump_link_nlmsg(struct nlmsghdr *nlh, libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie) { @@ -247,10 +270,13 @@ static int get_xdp_id(void *cookie, void *msg, struct nlattr **tb) xdp_id->id = libbpf_nla_getattr_u32(xdp_tb[xdp_attr]); + if (xdp_tb[IFLA_XDP_CHAIN_MAP_ID]) + xdp_id->chain_map_id = libbpf_nla_getattr_u32(xdp_tb[IFLA_XDP_CHAIN_MAP_ID]); + return 0; } -int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags) +static int 
__bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 *chain_map_id, __u32 flags) { struct xdp_id_md xdp_id = {}; int sock, ret; @@ -274,13 +300,28 @@ int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags) xdp_id.flags = flags; ret = libbpf_nl_get_link(sock, nl_pid, get_xdp_id, &xdp_id); - if (!ret) + if (!ret) { *prog_id = xdp_id.id; + if (chain_map_id) + *chain_map_id = xdp_id.chain_map_id; + } + close(sock); return ret; } +int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags) +{ + return __bpf_get_link_xdp_id(ifindex, prog_id, NULL, flags); +} + +int bpf_get_link_xdp_chain(int ifindex, __u32 *prog_id, __u32 *chain_map_id, + __u32 flags) +{ + return __bpf_get_link_xdp_id(ifindex, prog_id, chain_map_id, flags); +} + int libbpf_nl_get_link(int sock, unsigned int nl_pid, libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie) {
From: Toke Høiland-Jørgensen <toke@redhat.com> This adds new self tests for the XDP chain call functionality. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> --- tools/testing/selftests/bpf/.gitignore | 1 tools/testing/selftests/bpf/Makefile | 3 tools/testing/selftests/bpf/progs/xdp_dummy.c | 6 + tools/testing/selftests/bpf/test_maps.c | 45 ++++ tools/testing/selftests/bpf/test_xdp_chain.sh | 77 +++++++ tools/testing/selftests/bpf/xdp_chain.c | 271 +++++++++++++++++++++++++ 6 files changed, 402 insertions(+), 1 deletion(-) create mode 100755 tools/testing/selftests/bpf/test_xdp_chain.sh create mode 100644 tools/testing/selftests/bpf/xdp_chain.c diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index 7470327edcfe..e9d2d765cc8f 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -39,3 +39,4 @@ libbpf.so.* test_hashmap test_btf_dump xdping +xdp_chain diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 6889c19a628c..97e8f6ae4a15 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test test_sock test_btf test_sockmap get_cgroup_id_user test_socket_cookie \ test_cgroup_storage test_select_reuseport test_section_names \ test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \ - test_btf_dump test_cgroup_attach xdping + test_btf_dump test_cgroup_attach xdping xdp_chain BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c))) TEST_GEN_FILES = $(BPF_OBJ_FILES) @@ -71,6 +71,7 @@ TEST_PROGS := test_kmod.sh \ test_tc_tunnel.sh \ test_tc_edt.sh \ test_xdping.sh \ + test_xdp_chain.sh \ test_bpftool_build.sh TEST_PROGS_EXTENDED := with_addr.sh \ diff --git a/tools/testing/selftests/bpf/progs/xdp_dummy.c b/tools/testing/selftests/bpf/progs/xdp_dummy.c index 
43b0ef1001ed..454a1f0763a1 100644 --- a/tools/testing/selftests/bpf/progs/xdp_dummy.c +++ b/tools/testing/selftests/bpf/progs/xdp_dummy.c @@ -10,4 +10,10 @@ int xdp_dummy_prog(struct xdp_md *ctx) return XDP_PASS; } +SEC("xdp_drop") +int xdp_drop_prog(struct xdp_md *ctx) +{ + return XDP_DROP; +} + char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c index e1f1becda529..44f2f8548a24 100644 --- a/tools/testing/selftests/bpf/test_maps.c +++ b/tools/testing/selftests/bpf/test_maps.c @@ -523,6 +523,50 @@ static void test_devmap_hash(unsigned int task, void *data) close(fd); } +static void test_xdp_chain_map(unsigned int task, void *data) +{ + const char *file = "./xdp_dummy.o"; + struct xdp_chain_acts acts = {}; + struct bpf_object *obj; + int map_fd, prog_fd; + __u32 key = 0; + int err; + + map_fd = bpf_create_map(BPF_MAP_TYPE_XDP_CHAIN, sizeof(key), sizeof(acts), + 2, 0); + if (map_fd < 0) { + printf("Failed to create xdp_chain map '%s'!\n", strerror(errno)); + exit(1); + } + + err = bpf_prog_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd); + if (err < 0) { + printf("Failed to load dummy prog: '%s'!\n", strerror(errno)); + exit(1); + } + + /* Try inserting NULL key/value - should fail */ + assert(bpf_map_update_elem(map_fd, &key, &acts, 0) == -1 && errno == EINVAL); + + /* Try inserting NULL value - should fail */ + key = 1; + assert(bpf_map_update_elem(map_fd, &key, &acts, 0) == -1 && errno == EINVAL); + + /* Try a real insert - should succeed */ + acts.wildcard_act = prog_fd; + assert(bpf_map_update_elem(map_fd, &key, &acts, 0) == 0); + + /* Replace with full table */ + acts.drop_act = acts.pass_act = acts.tx_act = acts.redirect_act = prog_fd; + assert(bpf_map_update_elem(map_fd, &key, &acts, 0) == 0); + + /* Try deleting element */ + assert(bpf_map_delete_elem(map_fd, &key) == 0); + + bpf_object__close(obj); + close(map_fd); +} + static void test_queuemap(unsigned int task, void *data) { 
const int MAP_SIZE = 32; @@ -1700,6 +1744,7 @@ static void run_all_tests(void) test_devmap(0, NULL); test_devmap_hash(0, NULL); + test_xdp_chain_map(0, NULL); test_sockmap(0, NULL); test_map_large(); diff --git a/tools/testing/selftests/bpf/test_xdp_chain.sh b/tools/testing/selftests/bpf/test_xdp_chain.sh new file mode 100755 index 000000000000..3997655d4e45 --- /dev/null +++ b/tools/testing/selftests/bpf/test_xdp_chain.sh @@ -0,0 +1,77 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# xdp_chain tests +# Here we setup and teardown configuration required to run +# xdp_chain, exercising its options. +# +# Setup is similar to xdping tests. +# +# Topology: +# --------- +# root namespace | tc_ns0 namespace +# | +# ---------- | ---------- +# | veth1 | --------- | veth0 | +# ---------- peer ---------- +# +# Device Configuration +# -------------------- +# Root namespace with BPF +# Device names and addresses: +# veth1 IP: 10.1.1.200 +# +# Namespace tc_ns0 with BPF +# Device names and addresses: +# veth0 IPv4: 10.1.1.100 +# xdp_chain binary run inside this +# + +readonly TARGET_IP="10.1.1.100" +readonly TARGET_NS="xdp_ns0" + +readonly LOCAL_IP="10.1.1.200" + +setup() +{ + ip netns add $TARGET_NS + ip link add veth0 type veth peer name veth1 + ip link set veth0 netns $TARGET_NS + ip netns exec $TARGET_NS ip addr add ${TARGET_IP}/24 dev veth0 + ip addr add ${LOCAL_IP}/24 dev veth1 + ip netns exec $TARGET_NS ip link set veth0 up + ip link set veth1 up +} + +cleanup() +{ + set +e + ip netns delete $TARGET_NS 2>/dev/null + ip link del veth1 2>/dev/null +} + +die() +{ + echo "$@" >&2 + exit 1 +} + +test() +{ + args="$1" + + ip netns exec $TARGET_NS ./xdp_chain $args || die "XDP chain test error" +} + +set -e + +server_pid=0 + +trap cleanup EXIT + +setup + +test "-I veth0 -S $LOCAL_IP" + +echo "OK. 
All tests passed" +exit 0 diff --git a/tools/testing/selftests/bpf/xdp_chain.c b/tools/testing/selftests/bpf/xdp_chain.c new file mode 100644 index 000000000000..11f78bb1c2e7 --- /dev/null +++ b/tools/testing/selftests/bpf/xdp_chain.c @@ -0,0 +1,271 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. */ + +#include <linux/bpf.h> +#include <linux/if_link.h> +#include <arpa/inet.h> +#include <assert.h> +#include <errno.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <libgen.h> +#include <sys/resource.h> +#include <net/if.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <netdb.h> + +#include "bpf/bpf.h" +#include "bpf/libbpf.h" + +static int ifindex; +static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST; +static char *dest = NULL, *ifname = NULL; + +static void cleanup(int sig) +{ + int ret; + + fprintf(stderr, " Cleaning up\n"); + if ((ret = bpf_set_link_xdp_chain(ifindex, -1, -1, xdp_flags))) + fprintf(stderr, "Warning: Unable to clear XDP prog: %s\n", + strerror(-ret)); + if (sig) + exit(1); +} + +static void show_usage(const char *prog) +{ + fprintf(stderr, + "usage: %s [OPTS] -I interface destination\n\n" + "OPTS:\n" + " -I interface interface name\n" + " -N Run in driver mode\n" + " -S Run in skb mode\n" + " -p pin_path path to pin chain call map\n" + " -x Exit after setup\n" + " -c Cleanup and exit\n", + prog); +} + +static int run_ping(bool should_fail, const char *msg) +{ + char cmd[256]; + bool success; + int ret; + + snprintf(cmd, sizeof(cmd), "ping -c 1 -W 1 -I %s %s >/dev/null", ifname, dest); + + printf(" %s: ", msg); + + ret = system(cmd); + + success = (!!ret == should_fail); + printf(success ? 
"PASS\n" : "FAIL\n"); + + return !success; +} + +int main(int argc, char **argv) +{ + __u32 mode_flags = XDP_FLAGS_DRV_MODE | XDP_FLAGS_SKB_MODE; + int pass_prog_fd = -1, drop_prog_fd = -1, map_fd = -1; + const char *filename = "xdp_dummy.o", *pin_path = NULL; + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + struct bpf_program *pass_prog, *drop_prog; + __u32 map_key, prog_id, chain_map_id; + struct xdp_chain_acts acts = {}; + struct bpf_prog_info info = {}; + __u32 info_len = sizeof(info); + const char *optstr = "I:NSp:xc"; + bool setup_only = false, cleanup_only = false; + struct bpf_object *obj; + int opt, ret = 1; + + while ((opt = getopt(argc, argv, optstr)) != -1) { + switch (opt) { + case 'I': + ifname = optarg; + ifindex = if_nametoindex(ifname); + if (!ifindex) { + fprintf(stderr, "Could not get interface %s\n", + ifname); + return 1; + } + break; + case 'N': + xdp_flags |= XDP_FLAGS_DRV_MODE; + break; + case 'S': + xdp_flags |= XDP_FLAGS_SKB_MODE; + break; + case 'x': + setup_only = true; + break; + case 'c': + cleanup_only = true; + break; + case 'p': + pin_path = optarg; + break; + default: + show_usage(basename(argv[0])); + return 1; + } + } + + if (!ifname) { + show_usage(basename(argv[0])); + return 1; + } + + if (cleanup_only) { + if (pin_path) + unlink(pin_path); + cleanup(0); + return 0; + } + + if (!setup_only && optind == argc) { + show_usage(basename(argv[0])); + return 1; + } + dest = argv[optind]; + + if ((xdp_flags & mode_flags) == mode_flags) { + fprintf(stderr, "-N or -S can be specified, not both.\n"); + show_usage(basename(argv[0])); + return 1; + } + + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + perror("setrlimit(RLIMIT_MEMLOCK)"); + return 1; + } + + if (bpf_prog_load(filename, BPF_PROG_TYPE_XDP, &obj, &pass_prog_fd)) { + fprintf(stderr, "load of %s failed\n", filename); + return 1; + } + + pass_prog = bpf_object__find_program_by_title(obj, "xdp_dummy"); + drop_prog = bpf_object__find_program_by_title(obj, "xdp_drop"); + + if 
(!pass_prog || !drop_prog) { + fprintf(stderr, "could not find xdp programs\n"); + return 1; + } + pass_prog_fd = bpf_program__fd(pass_prog); + drop_prog_fd = bpf_program__fd(drop_prog); + if (pass_prog_fd < 0 || drop_prog_fd < 0) { + fprintf(stderr, "could not find xdp programs\n"); + goto done; + } + + ret = bpf_obj_get_info_by_fd(pass_prog_fd, &info, &info_len); + if (ret) { + fprintf(stderr, "unable to get program ID from kernel\n"); + goto done; + } + map_key = info.id; + map_fd = bpf_create_map(BPF_MAP_TYPE_XDP_CHAIN, + sizeof(map_key), sizeof(acts), + 2, 0); + + if (map_fd < 0) { + fprintf(stderr, "unable to create chain call map: %s\n", strerror(errno)); + goto done; + } + + if (pin_path && (ret = bpf_obj_pin(map_fd, pin_path))) { + fprintf(stderr, "unable to pin map at %s: %s\n", pin_path, + strerror(errno)); + goto done; + } + + +#define RUN_PING(should_fail, err) if ((ret = run_ping(should_fail, err))) goto done; + + if (!setup_only) { + RUN_PING(false, "Pre-setup ping test"); + + signal(SIGINT, cleanup); + signal(SIGTERM, cleanup); + } + + if ((ret = bpf_set_link_xdp_chain(ifindex, pass_prog_fd, map_fd, xdp_flags)) < 0) { + fprintf(stderr, "Link set xdp fd failed for %s: %s\n", ifname, + strerror(-ret)); + goto done; + } + + if ((ret = bpf_get_link_xdp_chain(ifindex, &prog_id, &chain_map_id, 0)) < 0) { + fprintf(stderr, "Unable to get xdp IDs for %s: '%s'\n", ifname, strerror(-ret)); + goto done; + } + printf(" XDP prog ID: %u Chain map ID: %u\n", prog_id, chain_map_id); + + if (!setup_only) { + sleep(1); + RUN_PING(false, "Empty map test"); + } + + acts.wildcard_act = drop_prog_fd; + if (bpf_map_update_elem(map_fd, &map_key, &acts, 0)) { + fprintf(stderr, "unable to insert into map: %s\n", strerror(errno)); + goto done; + } + + if (setup_only) { + printf("Setup done; exiting.\n"); + ret = 0; + goto done; + } + + sleep(1); + + RUN_PING(true, "Wildcard act test"); + + if (bpf_map_delete_elem(map_fd, &map_key)) { + fprintf(stderr, "unable to delete from 
map: %s\n", strerror(errno)); + goto done; + } + sleep(1); + + RUN_PING(false, "Post-delete map test"); + + acts.wildcard_act = 0; + acts.pass_act = drop_prog_fd; + if (bpf_map_update_elem(map_fd, &map_key, &acts, 0)) { + fprintf(stderr, "unable to insert into map: %s\n", strerror(errno)); + goto done; + } + sleep(1); + + RUN_PING(true, "Pass act test"); + + + if ((ret = bpf_set_link_xdp_chain(ifindex, -1, -1, xdp_flags)) < 0) { + fprintf(stderr, "Link clear xdp fd failed for %s: '%s'\n", ifname, strerror(-ret)); + goto done; + } + sleep(1); + + RUN_PING(false, "Post-delete prog test"); + + +done: + cleanup(ret); + + if (pass_prog_fd > 0) + close(pass_prog_fd); + if (drop_prog_fd > 0) + close(drop_prog_fd); + if (map_fd > 0) + close(map_fd); + + return ret; +}
On Wed, 2 Oct 2019, Toke Høiland-Jørgensen wrote: > This series adds support for executing multiple XDP programs on a single > interface in sequence, through the use of chain calls, as discussed at the Linux > Plumbers Conference last month: > > https://linuxplumbersconf.org/event/4/contributions/460/ > > # HIGH-LEVEL IDEA > > The basic idea is to express the chain call sequence through a special map type, > which contains a mapping from a (program, return code) tuple to another program > to run in next in the sequence. Userspace can populate this map to express > arbitrary call sequences, and update the sequence by updating or replacing the > map. > > The actual execution of the program sequence is done in bpf_prog_run_xdp(), > which will lookup the chain sequence map, and if found, will loop through calls > to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the > previous program ID and return code. > > An XDP chain call map can be installed on an interface by means of a new netlink > attribute containing an fd pointing to a chain call map. This can be supplied > along with the XDP prog fd, so that a chain map is always installed together > with an XDP program. > This is great stuff Toke! One thing that wasn't immediately clear to me - and this may be just me - is the relationship between program behaviour for the XDP_DROP case and chain call execution.
My initial thought was that a program in the chain XDP_DROP'ping the packet would terminate the call chain, but on looking at patch #4 it seems that the only way the call chain execution is terminated is if - XDP_ABORTED is returned from a program in the call chain; or - the map entry for the next program (determined by the return value of the current program) is empty; or - we run out of entries in the map The return value of the last-executed program in the chain seems to be what determines packet processing behaviour after executing the chain (_DROP, _TX, _PASS, etc). So there's no way to both XDP_PASS and XDP_TX a packet from the same chain, right? Just want to make sure I've got the semantics correct. Thanks! Alan
Alan Maguire <alan.maguire@oracle.com> writes: > On Wed, 2 Oct 2019, Toke Høiland-Jørgensen wrote: > >> This series adds support for executing multiple XDP programs on a single >> interface in sequence, through the use of chain calls, as discussed at the Linux >> Plumbers Conference last month: >> >> https://linuxplumbersconf.org/event/4/contributions/460/ >> >> # HIGH-LEVEL IDEA >> >> The basic idea is to express the chain call sequence through a special map type, >> which contains a mapping from a (program, return code) tuple to another program >> to run in next in the sequence. Userspace can populate this map to express >> arbitrary call sequences, and update the sequence by updating or replacing the >> map. >> >> The actual execution of the program sequence is done in bpf_prog_run_xdp(), >> which will lookup the chain sequence map, and if found, will loop through calls >> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the >> previous program ID and return code. >> >> An XDP chain call map can be installed on an interface by means of a new netlink >> attribute containing an fd pointing to a chain call map. This can be supplied >> along with the XDP prog fd, so that a chain map is always installed together >> with an XDP program. >> > > This is great stuff Toke! Thanks! :) > One thing that wasn't immediately clear to me - and this may be just > me - is the relationship between program behaviour for the XDP_DROP > case and chain call execution. My initial thought was that a program > in the chain XDP_DROP'ping the packet would terminate the call chain, > but on looking at patch #4 it seems that the only way the call chain > execution is terminated is if > > - XDP_ABORTED is returned from a program in the call chain; or Yes. Not actually sure about this one... 
> - the map entry for the next program (determined by the return value > of the current program) is empty; or This will be the common exit condition, I expect > - we run out of entries in the map You mean if we run the iteration counter to zero, right? > The return value of the last-executed program in the chain seems to be > what determines packet processing behaviour after executing the chain > (_DROP, _TX, _PASS, etc). So there's no way to both XDP_PASS and > XDP_TX a packet from the same chain, right? Just want to make sure > I've got the semantics correct. Thanks! Yeah, you've got all this right. The chain call mechanism itself doesn't change any of the underlying fundamentals of XDP. I.e., each packet gets exactly one verdict. For chaining actual XDP programs that do different things to the packet, I expect that the most common use case will be to only run the next program if the previous one returns XDP_PASS. That will make the most semantic sense I think. But there are also use cases where one would want to match on the other return codes; such as packet capture, for instance, where one might install a capture program that would carry forward the previous return code, but do something to the packet (throw it out to userspace) first. For the latter use case, the question is if we need to expose the previous return code to the program when it runs. You can do things without it (by just using a different program per return code), but it may simplify things if we just expose the return code. However, since this will also change the semantics for running programs, I decided to leave that off for now. -Toke
On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> From: Toke Høiland-Jørgensen <toke@redhat.com>
>
> This adds support to rtnetlink for setting and getting the per-device XDP
> chain map. The map is set by means of a new netlink attribute that contains
> a pointer to a BPF map of the XDP chain type. If such an attribute is
> included, it will be inserted into the struct net_device so that the XDP
> chain call code will pick it up on program execution.
>
> To prevent old userspace programs that do not understand the chain map
> attribute from messing up the chain call order, a netlink message with no
> chain map attribute set will be rejected if a chain map has already been
> installed.
>
> When installing a new chain call map, an XDP program fd must also be
> provided, otherwise the operation will be rejected.
Why is the program required? I kind of expected the chain call map
to override any program.
--
Lorenz Bauer | Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
www.cloudflare.com
On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > index 113e1286e184..ab855095c830 100644 > --- a/kernel/bpf/hashtab.c > +++ b/kernel/bpf/hashtab.c > @@ -1510,3 +1510,156 @@ const struct bpf_map_ops htab_of_maps_map_ops = { > .map_gen_lookup = htab_of_map_gen_lookup, > .map_check_btf = map_check_no_btf, > }; > + > +struct xdp_chain_table { > + struct bpf_prog *wildcard_act; > + struct bpf_prog *act[XDP_ACT_MAX]; > +}; > + > +static int xdp_chain_map_alloc_check(union bpf_attr *attr) > +{ > + BUILD_BUG_ON(sizeof(struct xdp_chain_table) / sizeof(void *) != > + sizeof(struct xdp_chain_acts) / sizeof(u32)); > + > + if (attr->key_size != sizeof(u32) || > + attr->value_size != sizeof(struct xdp_chain_acts)) > + return -EINVAL; How are we going to extend xdp_chain_acts if a new XDP action is introduced? > + > + attr->value_size = sizeof(struct xdp_chain_table); > + return htab_map_alloc_check(attr); > +} > + > +struct bpf_prog *bpf_xdp_chain_map_get_prog(struct bpf_map *map, > + u32 prev_id, > + enum xdp_action action) > +{ > + struct xdp_chain_table *tab; > + void *ptr; > + > + ptr = htab_map_lookup_elem(map, &prev_id); > + > + if (!ptr) > + return NULL; > + > + tab = READ_ONCE(ptr); > + return tab->act[action - 1] ?: tab->wildcard_act; > +} > +EXPORT_SYMBOL_GPL(bpf_xdp_chain_map_get_prog); > + > +/* only called from syscall */ > +int bpf_xdp_chain_map_lookup_elem(struct bpf_map *map, void *key, void *value) > +{ > + struct xdp_chain_acts *act = value; > + struct xdp_chain_table *tab; > + void *ptr; > + u32 *cur; > + int i; > + > + ptr = htab_map_lookup_elem(map, key); > + if (!ptr) > + return -ENOENT; > + > + tab = READ_ONCE(ptr); > + > + if (tab->wildcard_act) > + act->wildcard_act = tab->wildcard_act->aux->id; > + > + cur = &act->drop_act; > + for (i = 0; i < XDP_ACT_MAX; i++, cur++) > + if(tab->act[i]) > + *cur = tab->act[i]->aux->id; For completeness, zero out *cur in the else 
case? > + > + return 0; > +} > + > +static void *xdp_chain_map_get_ptr(int fd) > +{ > + struct bpf_prog *prog = bpf_prog_get(fd); > + > + if (IS_ERR(prog)) > + return prog; > + > + if (prog->type != BPF_PROG_TYPE_XDP || > + bpf_prog_is_dev_bound(prog->aux)) { > + bpf_prog_put(prog); > + return ERR_PTR(-EINVAL); > + } > + > + return prog; > +} > + > +static void xdp_chain_map_put_ptrs(void *value) > +{ > + struct xdp_chain_table *tab = value; > + int i; > + > + for (i = 0; i < XDP_ACT_MAX; i++) > + if (tab->act[i]) > + bpf_prog_put(tab->act[i]); > + > + if (tab->wildcard_act) > + bpf_prog_put(tab->wildcard_act); > +} > + > +/* only called from syscall */ > +int bpf_xdp_chain_map_update_elem(struct bpf_map *map, void *key, void *value, > + u64 map_flags) Nit: check that map_flags == 0 > +{ > + struct xdp_chain_acts *act = value; > + struct xdp_chain_table tab = {}; > + u32 lookup_key = *((u32*)key); > + u32 *cur = &act->drop_act; > + bool found_val = false; > + int ret, i; > + void *ptr; > + > + if (!lookup_key) > + return -EINVAL; Is it possible to check that this is a valid prog id / fd or whatever it is? > + > + if (act->wildcard_act) { If this is an fd, 0 is a valid value no? 
> + ptr = xdp_chain_map_get_ptr(act->wildcard_act); > + if (IS_ERR(ptr)) > + return PTR_ERR(ptr); > + tab.wildcard_act = ptr; > + found_val = true; > + } > + > + for (i = 0; i < XDP_ACT_MAX; i++, cur++) { > + if (*cur) { > + ptr = xdp_chain_map_get_ptr(*cur); > + if (IS_ERR(ptr)) { > + ret = PTR_ERR(ptr); > + goto out_err; > + } > + tab.act[i] = ptr; > + found_val = true; > + } > + } > + > + if (!found_val) { > + ret = -EINVAL; > + goto out_err; > + } > + > + ret = htab_map_update_elem(map, key, &tab, map_flags); > + if (ret) > + goto out_err; > + > + return ret; > + > +out_err: > + xdp_chain_map_put_ptrs(&tab); > + > + return ret; > +} > + > + > +const struct bpf_map_ops xdp_chain_map_ops = { > + .map_alloc_check = xdp_chain_map_alloc_check, > + .map_alloc = htab_map_alloc, > + .map_free = fd_htab_map_free, > + .map_get_next_key = htab_map_get_next_key, > + .map_delete_elem = htab_map_delete_elem, > + .map_fd_put_value = xdp_chain_map_put_ptrs, > + .map_check_btf = map_check_no_btf, > +}; -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com
Toke Høiland-Jørgensen wrote:
> Alan Maguire <alan.maguire@oracle.com> writes:
>
> > On Wed, 2 Oct 2019, Toke Høiland-Jørgensen wrote:
> >
> >> This series adds support for executing multiple XDP programs on a single
> >> interface in sequence, through the use of chain calls, as discussed at the Linux
> >> Plumbers Conference last month:
> >>
> >> https://linuxplumbersconf.org/event/4/contributions/460/
> >>
> >> # HIGH-LEVEL IDEA
> >>
> >> The basic idea is to express the chain call sequence through a special map type,
> >> which contains a mapping from a (program, return code) tuple to another program
> >> to run in next in the sequence. Userspace can populate this map to express
> >> arbitrary call sequences, and update the sequence by updating or replacing the
> >> map.
> >>
> >> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
> >> which will lookup the chain sequence map, and if found, will loop through calls
> >> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
> >> previous program ID and return code.
> >>
> >> An XDP chain call map can be installed on an interface by means of a new netlink
> >> attribute containing an fd pointing to a chain call map. This can be supplied
> >> along with the XDP prog fd, so that a chain map is always installed together
> >> with an XDP program.
> >>
> >
> > This is great stuff Toke!
>
> Thanks! :)
>
> > One thing that wasn't immediately clear to me - and this may be just
> > me - is the relationship between program behaviour for the XDP_DROP
> > case and chain call execution. My initial thought was that a program
> > in the chain XDP_DROP'ping the packet would terminate the call chain,
> > but on looking at patch #4 it seems that the only way the call chain
> > execution is terminated is if
> >
> > - XDP_ABORTED is returned from a program in the call chain; or
>
> Yes. Not actually sure about this one...
>
> > - the map entry for the next program (determined by the return value
> > of the current program) is empty; or
>
> This will be the common exit condition, I expect
>
> > - we run out of entries in the map
>
> You mean if we run the iteration counter to zero, right?
>
> > The return value of the last-executed program in the chain seems to be
> > what determines packet processing behaviour after executing the chain
> > (_DROP, _TX, _PASS, etc). So there's no way to both XDP_PASS and
> > XDP_TX a packet from the same chain, right? Just want to make sure
> > I've got the semantics correct. Thanks!
>
> Yeah, you've got all this right. The chain call mechanism itself doesn't
> change any of the underlying fundamentals of XDP. I.e., each packet gets
> exactly one verdict.
>
> For chaining actual XDP programs that do different things to the packet,
> I expect that the most common use case will be to only run the next
> program if the previous one returns XDP_PASS. That will make the most
> semantic sense I think.
>
> But there are also use cases where one would want to match on the other
> return codes; such as packet capture, for instance, where one might
> install a capture program that would carry forward the previous return
> code, but do something to the packet (throw it out to userspace) first.
>
> For the latter use case, the question is if we need to expose the
> previous return code to the program when it runs. You can do things
> without it (by just using a different program per return code), but it
> may simplify things if we just expose the return code. However, since
> this will also change the semantics for running programs, I decided to
> leave that off for now.
>
> -Toke
In other cases where programs (e.g. cgroups) are run in an array the
return codes are 'AND'ed together so that we get
result1 & result2 & ... & resultN
The result is if any program returns a drop then the packet should
be dropped, but all programs at least "see" the packet. There was
a lot of debate over this semantic and I think in hindsight it
actually works pretty well so any chaining in XDP should also keep
those semantics. For things like redirect we can already, even in
the same program, do multiple calls to the redirect helper and it
will overwrite the last call so those semantics can stay the same.
.John
On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > This series adds support for executing multiple XDP programs on a single > interface in sequence, through the use of chain calls, as discussed at the Linux > Plumbers Conference last month: Hi Toke, Thanks for posting the patch set! As I mentioned, this is a big pain point for us as well. Right now, all of our different XDP components are managed by a single daemon called (drumroll) xdpd. We'd like to separate this out into individual bits and pieces that operate separately from each other. I've looked at the kernel side of your patch set, here are my thoughts: > # HIGH-LEVEL IDEA > > The basic idea is to express the chain call sequence through a special map type, > which contains a mapping from a (program, return code) tuple to another program > to run in next in the sequence. Userspace can populate this map to express > arbitrary call sequences, and update the sequence by updating or replacing the > map. How do you imagine this would work in practice? From what I can tell, the map is keyed by program id, which makes it difficult to reliably construct the control flow that I want. As an example, I'd like to split the daemon into two parts, A and B, which I want to be completely independent. So: - A runs before B, if both are active - If A is not active, B is called first - If B is not active, only A is called Both A and B may at any point in time replace their current XDP program with a new one. This means that there is no stable program ID I can use. Another problem is deletions: if I delete A (because that component is shut down) B will never be called, since the program ID that linked B into the control flow is gone. This means that B needs to know about A and vice versa.
> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
> which will lookup the chain sequence map, and if found, will loop through calls
> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
> previous program ID and return code.

I think that the tail call chain is useful for other eBPF programs as well.
How hard is it to turn the logic in bpf_prog_run_xdp into a helper instead?

Lorenz

--
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
Toke Høiland-Jørgensen wrote:
> This series adds support for executing multiple XDP programs on a single
> interface in sequence, through the use of chain calls, as discussed at the Linux
> Plumbers Conference last month:
>
> https://linuxplumbersconf.org/event/4/contributions/460/
>
> # HIGH-LEVEL IDEA
>
> The basic idea is to express the chain call sequence through a special map type,
> which contains a mapping from a (program, return code) tuple to another program
> to run in next in the sequence. Userspace can populate this map to express
> arbitrary call sequences, and update the sequence by updating or replacing the
> map.
>
> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
> which will lookup the chain sequence map, and if found, will loop through calls
> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
> previous program ID and return code.
>
> An XDP chain call map can be installed on an interface by means of a new netlink
> attribute containing an fd pointing to a chain call map. This can be supplied
> along with the XDP prog fd, so that a chain map is always installed together
> with an XDP program.
>
> # PERFORMANCE
>
> I performed a simple performance test to get an initial feel for the overhead of
> the chain call mechanism. This test consists of running only two programs in
> sequence: One that returns XDP_PASS and another that returns XDP_DROP. I then
> measure the drop PPS performance and compare it to a baseline of just a single
> program that only returns XDP_DROP.
>
> For comparison, a test case that uses regular eBPF tail calls to sequence two
> programs together is also included. Finally, because 'perf' showed that the
> hashmap lookup was the largest single source of overhead, I also added a test
> case where I removed the jhash() call from the hashmap code, and just use the
> u32 key directly as an index into the hash bucket structure.
>
> The performance for these different cases is as follows (with retpolines disabled):

retpolines enabled would also be interesting.

> | Test case                       | Perf      | Add. overhead | Total overhead |
> |---------------------------------+-----------+---------------+----------------|
> | Before patch (XDP DROP program) | 31.0 Mpps |               |                |
> | After patch (XDP DROP program)  | 28.9 Mpps |        2.3 ns |         2.3 ns |

IMO even 1 Mpps overhead is too much for a feature that is primarily about
ease of use. Sacrificing performance to make userland a bit easier is hard
to justify for me when XDP _is_ singularly about performance. Also that is
nearly 10% overhead, which is fairly large. So I think going forward the
performance gap needs to be removed.

> | XDP tail call                   | 26.6 Mpps |        3.0 ns |         5.3 ns |
> | XDP chain call (no jhash)       | 19.6 Mpps |       13.4 ns |        18.7 ns |
> | XDP chain call (this series)    | 17.0 Mpps |        7.9 ns |        26.6 ns |
>
> From this it is clear that while there is some overhead from this mechanism; but
> the jhash removal example indicates that it is probably possible to optimise the
> code to the point where the overhead becomes low enough that it is acceptable.

I'm missing why 'in theory' at least this can't be made as fast as tail
calls? Again I can't see why someone would lose 30% of their performance
when a userland program could populate a tail call map for the same effect.
Sure, userland would also have to enforce some program standards/conventions,
but it could be done, and at 30% overhead that pain is probably worth it IMO.
My thinking though is that if we are a bit clever, chaining and tail calls
could be performance-wise equivalent?

I'll go read the patches now ;)

.John

> # PATCH SET STRUCTURE
> This series is structured as follows:
>
> - Patch 1: Prerequisite
> - Patch 2: New map type
> - Patch 3: Netlink hooks to install the chain call map
> - Patch 4: Core chain call logic
> - Patch 5-7: Bookkeeping updates to tools
> - Patch 8: Libbpf support for installing chain call maps
> - Patch 9: Selftests with example user space code
>
> The whole series is also available in my git repo on kernel.org:
> https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/log/?h=xdp-multiprog-01
Hi Toke,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/xdp-Support-multiple-programs-on-a-single-interface-through-chain-calls/20191003-005238
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from include/net/sock.h:59:0,
                    from include/linux/tcp.h:19,
                    from include/linux/ipv6.h:87,
                    from include/net/ipv6.h:12,
                    from include/linux/sunrpc/clnt.h:28,
                    from include/linux/nfs_fs.h:32,
                    from init/do_mounts.c:23:
   include/linux/filter.h: In function 'bpf_prog_run_xdp':
>> include/linux/filter.h:725:10: error: implicit declaration of function 'bpf_xdp_chain_map_get_prog' [-Werror=implicit-function-declaration]
      prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/filter.h:725:8: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
           ^
   cc1: some warnings being treated as errors

vim +/bpf_xdp_chain_map_get_prog +725 include/linux/filter.h

   695
   696  #define BPF_XDP_MAX_CHAIN_CALLS 32
   697  static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
   698                                              struct xdp_buff *xdp)
   699  {
   700          /* Caller needs to hold rcu_read_lock() (!), otherwise program
   701           * can be released while still running, or map elements could be
   702           * freed early while still having concurrent users. XDP fastpath
   703           * already takes rcu_read_lock() when fetching the program, so
   704           * it's not necessary here anymore.
   705           */
   706
   707          int i = BPF_XDP_MAX_CHAIN_CALLS;
   708          struct bpf_map *chain_map;
   709          u32 ret;
   710
   711          chain_map = rcu_dereference(xdp->rxq->dev->xdp_chain_map);
   712          if (!chain_map)
   713                  return BPF_PROG_RUN(prog, xdp);
   714
   715          do {
   716                  if (!--i) {
   717                          ret = XDP_ABORTED;
   718                          goto out;
   719                  }
   720
   721                  ret = BPF_PROG_RUN(prog, xdp);
   722                  if (ret == XDP_ABORTED)
   723                          goto out;
   724
 > 725                  prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
   726          } while(prog);
   727
   728  out:
   729          return ret;
   730  }
   731

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
Hi Toke,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/xdp-Support-multiple-programs-on-a-single-interface-through-chain-calls/20191003-005238
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: sparc64-allnoconfig (attached as .config)
compiler: sparc64-linux-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=sparc64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/net/sock.h:59:0,
                    from include/linux/tcp.h:19,
                    from include/linux/ipv6.h:87,
                    from include/net/ipv6.h:12,
                    from include/linux/sunrpc/clnt.h:28,
                    from include/linux/nfs_fs.h:32,
                    from arch/sparc/kernel/sys_sparc32.c:25:
   include/linux/filter.h: In function 'bpf_prog_run_xdp':
   include/linux/filter.h:725:10: error: implicit declaration of function 'bpf_xdp_chain_map_get_prog' [-Werror=implicit-function-declaration]
      prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/filter.h:725:8: error: assignment makes pointer from integer without a cast [-Werror=int-conversion]
      prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
           ^
   cc1: all warnings being treated as errors

vim +725 include/linux/filter.h

   695
   696  #define BPF_XDP_MAX_CHAIN_CALLS 32
   697  static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
   698                                              struct xdp_buff *xdp)
   699  {
   700          /* Caller needs to hold rcu_read_lock() (!), otherwise program
   701           * can be released while still running, or map elements could be
   702           * freed early while still having concurrent users. XDP fastpath
   703           * already takes rcu_read_lock() when fetching the program, so
   704           * it's not necessary here anymore.
   705           */
   706
   707          int i = BPF_XDP_MAX_CHAIN_CALLS;
   708          struct bpf_map *chain_map;
   709          u32 ret;
   710
   711          chain_map = rcu_dereference(xdp->rxq->dev->xdp_chain_map);
   712          if (!chain_map)
   713                  return BPF_PROG_RUN(prog, xdp);
   714
   715          do {
   716                  if (!--i) {
   717                          ret = XDP_ABORTED;
   718                          goto out;
   719                  }
   720
   721                  ret = BPF_PROG_RUN(prog, xdp);
   722                  if (ret == XDP_ABORTED)
   723                          goto out;
   724
 > 725                  prog = bpf_xdp_chain_map_get_prog(chain_map, prog->aux->id, ret);
   726          } while(prog);
   727
   728  out:
   729          return ret;
   730  }
   731

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
Hi Toke,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/xdp-Support-multiple-programs-on-a-single-interface-through-chain-calls/20191003-005238
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=ia64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   net/core/dev.c: In function 'dev_xdp_uninstall':
>> net/core/dev.c:8187:3: error: implicit declaration of function 'bpf_map_put_with_uref' [-Werror=implicit-function-declaration]
      bpf_map_put_with_uref(chain_map);
      ^~~~~~~~~~~~~~~~~~~~~
   net/core/dev.c: In function 'dev_change_xdp_fd':
>> net/core/dev.c:8286:15: error: implicit declaration of function 'bpf_map_get_with_uref'; did you mean 'bpf_obj_get_user'? [-Werror=implicit-function-declaration]
      chain_map = bpf_map_get_with_uref(chain_map_fd);
                  ^~~~~~~~~~~~~~~~~~~~~
                  bpf_obj_get_user
>> net/core/dev.c:8286:13: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      chain_map = bpf_map_get_with_uref(chain_map_fd);
                ^
   cc1: some warnings being treated as errors

vim +/bpf_map_put_with_uref +8187 net/core/dev.c

  8177
  8178  static void dev_xdp_uninstall(struct net_device *dev)
  8179  {
  8180          struct bpf_map *chain_map = NULL;
  8181          struct netdev_bpf xdp;
  8182          bpf_op_t ndo_bpf;
  8183
  8184          /* Remove chain map */
  8185          rcu_swap_protected(dev->xdp_chain_map, chain_map, 1);
  8186          if(chain_map)
> 8187                  bpf_map_put_with_uref(chain_map);
  8188
  8189          /* Remove generic XDP */
  8190          WARN_ON(dev_xdp_install(dev, generic_xdp_install, NULL, 0, NULL));
  8191
  8192          /* Remove from the driver */
  8193          ndo_bpf = dev->netdev_ops->ndo_bpf;
  8194          if (!ndo_bpf)
  8195                  return;
  8196
  8197          memset(&xdp, 0, sizeof(xdp));
  8198          xdp.command = XDP_QUERY_PROG;
  8199          WARN_ON(ndo_bpf(dev, &xdp));
  8200          if (xdp.prog_id)
  8201                  WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
  8202                                          NULL));
  8203
  8204          /* Remove HW offload */
  8205          memset(&xdp, 0, sizeof(xdp));
  8206          xdp.command = XDP_QUERY_PROG_HW;
  8207          if (!ndo_bpf(dev, &xdp) && xdp.prog_id)
  8208                  WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
  8209                                          NULL));
  8210  }
  8211
  8212  /**
  8213   * dev_change_xdp_fd - set or clear a bpf program for a device rx path
  8214   * @dev: device
  8215   * @extack: netlink extended ack
  8216   * @prog_fd: new program fd or negative value to clear
  8217   * @chain_map_fd: new chain map fd or negative value to clear
  8218   * @flags: xdp-related flags
  8219   *
  8220   * Set or clear a bpf program for a device
  8221   */
  8222  int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
  8223                        int prog_fd, int chain_map_fd, u32 flags)
  8224  {
  8225          const struct net_device_ops *ops = dev->netdev_ops;
  8226          struct bpf_map *chain_map = NULL;
  8227          enum bpf_netdev_command query;
  8228          struct bpf_prog *prog = NULL;
  8229          bpf_op_t bpf_op, bpf_chk;
  8230          bool offload;
  8231          int err;
  8232
  8233          ASSERT_RTNL();
  8234
  8235          offload = flags & XDP_FLAGS_HW_MODE;
  8236          query = offload ? XDP_QUERY_PROG_HW : XDP_QUERY_PROG;
  8237
  8238          bpf_op = bpf_chk = ops->ndo_bpf;
  8239          if (!bpf_op && (flags & (XDP_FLAGS_DRV_MODE | XDP_FLAGS_HW_MODE))) {
  8240                  NL_SET_ERR_MSG(extack, "underlying driver does not support XDP in native mode");
  8241                  return -EOPNOTSUPP;
  8242          }
  8243          if (!bpf_op || (flags & XDP_FLAGS_SKB_MODE))
  8244                  bpf_op = generic_xdp_install;
  8245          if (bpf_op == bpf_chk)
  8246                  bpf_chk = generic_xdp_install;
  8247
  8248          if (prog_fd >= 0) {
  8249                  u32 prog_id;
  8250
  8251                  if (!offload && __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) {
  8252                          NL_SET_ERR_MSG(extack, "native and generic XDP can't be active at the same time");
  8253                          return -EEXIST;
  8254                  }
  8255
  8256                  prog_id = __dev_xdp_query(dev, bpf_op, query);
  8257                  if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && prog_id) {
  8258                          NL_SET_ERR_MSG(extack, "XDP program already attached");
  8259                          return -EBUSY;
  8260                  }
  8261
  8262                  prog = bpf_prog_get_type_dev(prog_fd, BPF_PROG_TYPE_XDP,
  8263                                               bpf_op == ops->ndo_bpf);
  8264                  if (IS_ERR(prog))
  8265                          return PTR_ERR(prog);
  8266
  8267                  if (!offload && bpf_prog_is_dev_bound(prog->aux)) {
  8268                          NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported");
  8269                          bpf_prog_put(prog);
  8270                          return -EINVAL;
  8271                  }
  8272
  8273                  if (prog->aux->id == prog_id) {
  8274                          bpf_prog_put(prog);
  8275                          return 0;
  8276                  }
  8277          } else {
  8278                  if (chain_map_fd >= 0)
  8279                          return -EINVAL;
  8280
  8281                  if (!__dev_xdp_query(dev, bpf_op, query))
  8282                          return 0;
  8283          }
  8284
  8285          if (chain_map_fd >= 0) {
> 8286                  chain_map = bpf_map_get_with_uref(chain_map_fd);
  8287                  if (IS_ERR(chain_map))
  8288                          return PTR_ERR(chain_map);
  8289
  8290                  if (chain_map->map_type != BPF_MAP_TYPE_XDP_CHAIN) {
  8291                          NL_SET_ERR_MSG(extack, "invalid chain map type");
  8292                          bpf_map_put_with_uref(chain_map);
  8293                          return -EINVAL;
  8294                  }
  8295          }
  8296
  8297          err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
  8298          if (err < 0) {
  8299                  if (prog)
  8300                          bpf_prog_put(prog);
  8301          } else {
  8302                  rcu_swap_protected(dev->xdp_chain_map, chain_map, 1);
  8303          }
  8304
  8305          if(chain_map)
  8306                  bpf_map_put_with_uref(chain_map);
  8307
  8308          return err;
  8309  }
  8310

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
Lorenz Bauer <lmb@cloudflare.com> writes:

> On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 113e1286e184..ab855095c830 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -1510,3 +1510,156 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
>>  	.map_gen_lookup = htab_of_map_gen_lookup,
>>  	.map_check_btf = map_check_no_btf,
>>  };
>> +
>> +struct xdp_chain_table {
>> +	struct bpf_prog *wildcard_act;
>> +	struct bpf_prog *act[XDP_ACT_MAX];
>> +};
>> +
>> +static int xdp_chain_map_alloc_check(union bpf_attr *attr)
>> +{
>> +	BUILD_BUG_ON(sizeof(struct xdp_chain_table) / sizeof(void *) !=
>> +		     sizeof(struct xdp_chain_acts) / sizeof(u32));
>> +
>> +	if (attr->key_size != sizeof(u32) ||
>> +	    attr->value_size != sizeof(struct xdp_chain_acts))
>> +		return -EINVAL;
>
> How are we going to extend xdp_chain_acts if a new XDP action is
> introduced?

By just checking the size and reacting appropriately? Don't think that is
problematic, just takes a few if statements here?

>> +
>> +	attr->value_size = sizeof(struct xdp_chain_table);
>> +	return htab_map_alloc_check(attr);
>> +}
>> +
>> +struct bpf_prog *bpf_xdp_chain_map_get_prog(struct bpf_map *map,
>> +					    u32 prev_id,
>> +					    enum xdp_action action)
>> +{
>> +	struct xdp_chain_table *tab;
>> +	void *ptr;
>> +
>> +	ptr = htab_map_lookup_elem(map, &prev_id);
>> +
>> +	if (!ptr)
>> +		return NULL;
>> +
>> +	tab = READ_ONCE(ptr);
>> +	return tab->act[action - 1] ?: tab->wildcard_act;
>> +}
>> +EXPORT_SYMBOL_GPL(bpf_xdp_chain_map_get_prog);
>> +
>> +/* only called from syscall */
>> +int bpf_xdp_chain_map_lookup_elem(struct bpf_map *map, void *key, void *value)
>> +{
>> +	struct xdp_chain_acts *act = value;
>> +	struct xdp_chain_table *tab;
>> +	void *ptr;
>> +	u32 *cur;
>> +	int i;
>> +
>> +	ptr = htab_map_lookup_elem(map, key);
>> +	if (!ptr)
>> +		return -ENOENT;
>> +
>> +	tab = READ_ONCE(ptr);
>> +
>> +	if (tab->wildcard_act)
>> +		act->wildcard_act = tab->wildcard_act->aux->id;
>> +
>> +	cur = &act->drop_act;
>> +	for (i = 0; i < XDP_ACT_MAX; i++, cur++)
>> +		if(tab->act[i])
>> +			*cur = tab->act[i]->aux->id;
>
> For completeness, zero out *cur in the else case?

Ah, good point. I assumed the caller was kzalloc'ing; seems that's not the
case; will fix.

>> +
>> +	return 0;
>> +}
>> +
>> +static void *xdp_chain_map_get_ptr(int fd)
>> +{
>> +	struct bpf_prog *prog = bpf_prog_get(fd);
>> +
>> +	if (IS_ERR(prog))
>> +		return prog;
>> +
>> +	if (prog->type != BPF_PROG_TYPE_XDP ||
>> +	    bpf_prog_is_dev_bound(prog->aux)) {
>> +		bpf_prog_put(prog);
>> +		return ERR_PTR(-EINVAL);
>> +	}
>> +
>> +	return prog;
>> +}
>> +
>> +static void xdp_chain_map_put_ptrs(void *value)
>> +{
>> +	struct xdp_chain_table *tab = value;
>> +	int i;
>> +
>> +	for (i = 0; i < XDP_ACT_MAX; i++)
>> +		if (tab->act[i])
>> +			bpf_prog_put(tab->act[i]);
>> +
>> +	if (tab->wildcard_act)
>> +		bpf_prog_put(tab->wildcard_act);
>> +}
>> +
>> +/* only called from syscall */
>> +int bpf_xdp_chain_map_update_elem(struct bpf_map *map, void *key, void *value,
>> +				  u64 map_flags)
>
> Nit: check that map_flags == 0

Yup, will fix.

>> +{
>> +	struct xdp_chain_acts *act = value;
>> +	struct xdp_chain_table tab = {};
>> +	u32 lookup_key = *((u32*)key);
>> +	u32 *cur = &act->drop_act;
>> +	bool found_val = false;
>> +	int ret, i;
>> +	void *ptr;
>> +
>> +	if (!lookup_key)
>> +		return -EINVAL;
>
> Is it possible to check that this is a valid prog id / fd or whatever
> it is?

I suppose we could. The reason I didn't was that I figured it would be valid
to insert IDs into the map that don't exist (yet, or any longer).
Theoretically, if you can predict the program ID, you could insert it into
the map before it's allocated. There's no clearing of maps when program IDs
are freed either.

>> +
>> +	if (act->wildcard_act) {
>
> If this is an fd, 0 is a valid value no?

Yes, technically. But if we actually allow fd 0, that means the userspace
caller can't just pass a 0-initialised struct, but will have to loop through
and explicitly set everything to -1. Whereas if we just disallow fd 0, that
is no longer a problem (and normally you have to jump through some hoops to
end up with a bpf program pointer as standard input, no?).

It's not ideal, but I figured this was the least ugly way to do it. If you
have a better idea I'm all ears? :)

>> +		ptr = xdp_chain_map_get_ptr(act->wildcard_act);
>> +		if (IS_ERR(ptr))
>> +			return PTR_ERR(ptr);
>> +		tab.wildcard_act = ptr;
>> +		found_val = true;
>> +	}
>> +
>> +	for (i = 0; i < XDP_ACT_MAX; i++, cur++) {
>> +		if (*cur) {
>> +			ptr = xdp_chain_map_get_ptr(*cur);
>> +			if (IS_ERR(ptr)) {
>> +				ret = PTR_ERR(ptr);
>> +				goto out_err;
>> +			}
>> +			tab.act[i] = ptr;
>> +			found_val = true;
>> +		}
>> +	}
>> +
>> +	if (!found_val) {
>> +		ret = -EINVAL;
>> +		goto out_err;
>> +	}
>> +
>> +	ret = htab_map_update_elem(map, key, &tab, map_flags);
>> +	if (ret)
>> +		goto out_err;
>> +
>> +	return ret;
>> +
>> +out_err:
>> +	xdp_chain_map_put_ptrs(&tab);
>> +
>> +	return ret;
>> +}
>> +
>> +
>> +const struct bpf_map_ops xdp_chain_map_ops = {
>> +	.map_alloc_check = xdp_chain_map_alloc_check,
>> +	.map_alloc = htab_map_alloc,
>> +	.map_free = fd_htab_map_free,
>> +	.map_get_next_key = htab_map_get_next_key,
>> +	.map_delete_elem = htab_map_delete_elem,
>> +	.map_fd_put_value = xdp_chain_map_put_ptrs,
>> +	.map_check_btf = map_check_no_btf,
>> +};
>
> --
> Lorenz Bauer | Systems Engineer
> 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
>
> www.cloudflare.com
Hi Toke,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/xdp-Support-multiple-programs-on-a-single-interface-through-chain-calls/20191003-005238
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-s2-201939 (attached as .config)
compiler: gcc-5 (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   net/core/dev.c: In function 'dev_xdp_uninstall':
   net/core/dev.c:8187:3: error: implicit declaration of function 'bpf_map_put_with_uref' [-Werror=implicit-function-declaration]
      bpf_map_put_with_uref(chain_map);
      ^
   net/core/dev.c: In function 'dev_change_xdp_fd':
>> net/core/dev.c:8286:15: error: implicit declaration of function 'bpf_map_get_with_uref' [-Werror=implicit-function-declaration]
      chain_map = bpf_map_get_with_uref(chain_map_fd);
                  ^
   net/core/dev.c:8286:13: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      chain_map = bpf_map_get_with_uref(chain_map_fd);
                ^
   cc1: some warnings being treated as errors

vim +/bpf_map_get_with_uref +8286 net/core/dev.c

  8177
  8178  static void dev_xdp_uninstall(struct net_device *dev)
  8179  {
  8180          struct bpf_map *chain_map = NULL;
  8181          struct netdev_bpf xdp;
  8182          bpf_op_t ndo_bpf;
  8183
  8184          /* Remove chain map */
  8185          rcu_swap_protected(dev->xdp_chain_map, chain_map, 1);
  8186          if(chain_map)
> 8187                  bpf_map_put_with_uref(chain_map);
  8188
  8189          /* Remove generic XDP */
  8190          WARN_ON(dev_xdp_install(dev, generic_xdp_install, NULL, 0, NULL));
  8191
  8192          /* Remove from the driver */
  8193          ndo_bpf = dev->netdev_ops->ndo_bpf;
  8194          if (!ndo_bpf)
  8195                  return;
  8196
  8197          memset(&xdp, 0, sizeof(xdp));
  8198          xdp.command = XDP_QUERY_PROG;
  8199          WARN_ON(ndo_bpf(dev, &xdp));
  8200          if (xdp.prog_id)
  8201                  WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
  8202                                          NULL));
  8203
  8204          /* Remove HW offload */
  8205          memset(&xdp, 0, sizeof(xdp));
  8206          xdp.command = XDP_QUERY_PROG_HW;
  8207          if (!ndo_bpf(dev, &xdp) && xdp.prog_id)
  8208                  WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
  8209                                          NULL));
  8210  }
  8211
  8212  /**
  8213   * dev_change_xdp_fd - set or clear a bpf program for a device rx path
  8214   * @dev: device
  8215   * @extack: netlink extended ack
  8216   * @prog_fd: new program fd or negative value to clear
  8217   * @chain_map_fd: new chain map fd or negative value to clear
  8218   * @flags: xdp-related flags
  8219   *
  8220   * Set or clear a bpf program for a device
  8221   */
  8222  int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
  8223                        int prog_fd, int chain_map_fd, u32 flags)
  8224  {
  8225          const struct net_device_ops *ops = dev->netdev_ops;
  8226          struct bpf_map *chain_map = NULL;
  8227          enum bpf_netdev_command query;
  8228          struct bpf_prog *prog = NULL;
  8229          bpf_op_t bpf_op, bpf_chk;
  8230          bool offload;
  8231          int err;
  8232
  8233          ASSERT_RTNL();
  8234
  8235          offload = flags & XDP_FLAGS_HW_MODE;
  8236          query = offload ? XDP_QUERY_PROG_HW : XDP_QUERY_PROG;
  8237
  8238          bpf_op = bpf_chk = ops->ndo_bpf;
  8239          if (!bpf_op && (flags & (XDP_FLAGS_DRV_MODE | XDP_FLAGS_HW_MODE))) {
  8240                  NL_SET_ERR_MSG(extack, "underlying driver does not support XDP in native mode");
  8241                  return -EOPNOTSUPP;
  8242          }
  8243          if (!bpf_op || (flags & XDP_FLAGS_SKB_MODE))
  8244                  bpf_op = generic_xdp_install;
  8245          if (bpf_op == bpf_chk)
  8246                  bpf_chk = generic_xdp_install;
  8247
  8248          if (prog_fd >= 0) {
  8249                  u32 prog_id;
  8250
  8251                  if (!offload && __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) {
  8252                          NL_SET_ERR_MSG(extack, "native and generic XDP can't be active at the same time");
  8253                          return -EEXIST;
  8254                  }
  8255
  8256                  prog_id = __dev_xdp_query(dev, bpf_op, query);
  8257                  if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && prog_id) {
  8258                          NL_SET_ERR_MSG(extack, "XDP program already attached");
  8259                          return -EBUSY;
  8260                  }
  8261
  8262                  prog = bpf_prog_get_type_dev(prog_fd, BPF_PROG_TYPE_XDP,
  8263                                               bpf_op == ops->ndo_bpf);
  8264                  if (IS_ERR(prog))
  8265                          return PTR_ERR(prog);
  8266
  8267                  if (!offload && bpf_prog_is_dev_bound(prog->aux)) {
  8268                          NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported");
  8269                          bpf_prog_put(prog);
  8270                          return -EINVAL;
  8271                  }
  8272
  8273                  if (prog->aux->id == prog_id) {
  8274                          bpf_prog_put(prog);
  8275                          return 0;
  8276                  }
  8277          } else {
  8278                  if (chain_map_fd >= 0)
  8279                          return -EINVAL;
  8280
  8281                  if (!__dev_xdp_query(dev, bpf_op, query))
  8282                          return 0;
  8283          }
  8284
  8285          if (chain_map_fd >= 0) {
> 8286                  chain_map = bpf_map_get_with_uref(chain_map_fd);
  8287                  if (IS_ERR(chain_map))
  8288                          return PTR_ERR(chain_map);
  8289
  8290                  if (chain_map->map_type != BPF_MAP_TYPE_XDP_CHAIN) {
  8291                          NL_SET_ERR_MSG(extack, "invalid chain map type");
  8292                          bpf_map_put_with_uref(chain_map);
  8293                          return -EINVAL;
  8294                  }
  8295          }
  8296
  8297          err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
  8298          if (err < 0) {
  8299                  if (prog)
  8300                          bpf_prog_put(prog);
  8301          } else {
  8302                  rcu_swap_protected(dev->xdp_chain_map, chain_map, 1);
  8303          }
  8304
  8305          if(chain_map)
  8306                  bpf_map_put_with_uref(chain_map);
  8307
  8308          return err;
  8309  }
  8310

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
Lorenz Bauer <lmb@cloudflare.com> writes:
> On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>
>> This adds support to rtnetlink for setting and getting the per-device XDP
>> chain map. The map is set by means of a new netlink attribute that contains
>> a pointer to a BPF map of the XDP chain type. If such an attribute is
>> included, it will be inserted into the struct net_device so that the XDP
>> chain call code will pick it up on program execution.
>>
>> To prevent old userspace programs that do not understand the chain map
>> attribute from messing up the chain call order, a netlink message with no
>> chain map attribute set will be rejected if a chain map has already been
>> installed.
>>
>> When installing a new chain call map, an XDP program fd must also be
>> provided, otherwise the operation will be rejected.
>
> Why is the program required? I kind of expected the chain call map to
> override any program.
Mainly because drivers expect to attach an XDP program, not a map. We
could conceivably emulate this by pulling out the first program from the
map on attach (after we define semantics for this; just lookup ID 0?)
and pass that down to the driver. But then we'd either need to make that
first program immutable in the map, or we'd need to propagate map
updates all the way down to the driver. Or alternatively, we'd need to
teach all the drivers how to attach a map instead of a program. Lots of
cans of worms there I'd rather not open... :)
A userspace application or library could emulate all of this, though, by
just accepting "attach" of a map, and turning that into the right
netlink message. It would have the same problem re: map updates, but if
it controls those that should be doable...
-Toke
John Fastabend <john.fastabend@gmail.com> writes:
> Toke Høiland-Jørgensen wrote:
>> Alan Maguire <alan.maguire@oracle.com> writes:
>>
>> > On Wed, 2 Oct 2019, Toke Høiland-Jørgensen wrote:
>> >
>> >> This series adds support for executing multiple XDP programs on a single
>> >> interface in sequence, through the use of chain calls, as discussed at the Linux
>> >> Plumbers Conference last month:
>> >>
>> >> https://linuxplumbersconf.org/event/4/contributions/460/
>> >>
>> >> # HIGH-LEVEL IDEA
>> >>
>> >> The basic idea is to express the chain call sequence through a special map type,
>> >> which contains a mapping from a (program, return code) tuple to another program
>> >> to run in next in the sequence. Userspace can populate this map to express
>> >> arbitrary call sequences, and update the sequence by updating or replacing the
>> >> map.
>> >>
>> >> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
>> >> which will lookup the chain sequence map, and if found, will loop through calls
>> >> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
>> >> previous program ID and return code.
>> >>
>> >> An XDP chain call map can be installed on an interface by means of a new netlink
>> >> attribute containing an fd pointing to a chain call map. This can be supplied
>> >> along with the XDP prog fd, so that a chain map is always installed together
>> >> with an XDP program.
>> >>
>> >
>> > This is great stuff Toke!
>>
>> Thanks! :)
>>
>> > One thing that wasn't immediately clear to me - and this may be just
>> > me - is the relationship between program behaviour for the XDP_DROP
>> > case and chain call execution. My initial thought was that a program
>> > in the chain XDP_DROP'ping the packet would terminate the call chain,
>> > but on looking at patch #4 it seems that the only way the call chain
>> > execution is terminated is if
>> >
>> > - XDP_ABORTED is returned from a program in the call chain; or
>>
>> Yes. Not actually sure about this one...
>>
>> > - the map entry for the next program (determined by the return value
>> > of the current program) is empty; or
>>
>> This will be the common exit condition, I expect
>>
>> > - we run out of entries in the map
>>
>> You mean if we run the iteration counter to zero, right?
>>
>> > The return value of the last-executed program in the chain seems to be
>> > what determines packet processing behaviour after executing the chain
>> > (_DROP, _TX, _PASS, etc). So there's no way to both XDP_PASS and
>> > XDP_TX a packet from the same chain, right? Just want to make sure
>> > I've got the semantics correct. Thanks!
>>
>> Yeah, you've got all this right. The chain call mechanism itself doesn't
>> change any of the underlying fundamentals of XDP. I.e., each packet gets
>> exactly one verdict.
>>
>> For chaining actual XDP programs that do different things to the packet,
>> I expect that the most common use case will be to only run the next
>> program if the previous one returns XDP_PASS. That will make the most
>> semantic sense I think.
>>
>> But there are also use cases where one would want to match on the other
>> return codes; such as packet capture, for instance, where one might
>> install a capture program that would carry forward the previous return
>> code, but do something to the packet (throw it out to userspace) first.
>>
>> For the latter use case, the question is if we need to expose the
>> previous return code to the program when it runs. You can do things
>> without it (by just using a different program per return code), but it
>> may simplify things if we just expose the return code. However, since
>> this will also change the semantics for running programs, I decided to
>> leave that off for now.
>>
>> -Toke
>
> In other cases where programs (e.g. cgroups) are run in an array the
> return codes are 'AND'ed together so that we get
>
> result1 & result2 & ... & resultN
How would that work with multiple programs, though? PASS -> DROP seems
obvious, but what if the first program returns TX? Also, programs may
want to be able to actually override return codes (e.g., say you want to
turn DROPs into REDIRECTs, to get all your dropped packets mirrored to
your IDS or something).
-Toke
> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> This series adds support for executing multiple XDP programs on a single
> interface in sequence, through the use of chain calls, as discussed at the Linux
> Plumbers Conference last month:
>
> https://linuxplumbersconf.org/event/4/contributions/460/
>
> # HIGH-LEVEL IDEA
>
> The basic idea is to express the chain call sequence through a special map type,
> which contains a mapping from a (program, return code) tuple to another program
> to run next in the sequence. Userspace can populate this map to express
> arbitrary call sequences, and update the sequence by updating or replacing the
> map.
>
> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
> which will lookup the chain sequence map, and if found, will loop through calls
> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
> previous program ID and return code.
>
> An XDP chain call map can be installed on an interface by means of a new netlink
> attribute containing an fd pointing to a chain call map. This can be supplied
> along with the XDP prog fd, so that a chain map is always installed together
> with an XDP program.

Interesting work!

Quick question: can we achieve the same by adding a "retval to call_tail_next"
map to each program? I think one issue is how to avoid loop like A->B->C->A,
but this should be solvable?

> # PERFORMANCE
>
> I performed a simple performance test to get an initial feel for the overhead of
> the chain call mechanism. This test consists of running only two programs in
> sequence: One that returns XDP_PASS and another that returns XDP_DROP.
>
> [...]
>
> From this it is clear that while there is some overhead from this mechanism,
> the jhash removal example indicates that it is probably possible to optimise the
> code to the point where the overhead becomes low enough that it is acceptable.

I think we can probably re-jit multiple programs into one based on the
mapping, which should give the best performance.

Thanks,
Song
Lorenz Bauer <lmb@cloudflare.com> writes:

> On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This series adds support for executing multiple XDP programs on a single
>> interface in sequence, through the use of chain calls, as discussed at the Linux
>> Plumbers Conference last month:
>
> Hi Toke,
>
> Thanks for posting the patch set! As I mentioned, this is a big pain
> point for us as well. Right now, all of our different XDP components
> are managed by a single daemon called (drumroll) xdpd. We'd like to
> separate this out into individual bits and pieces that operate
> separately from each other.

Yes, I'm counting on you guys to be one of the potential users and help
me iron out the API and semantics! ;)

> I've looked at the kernel side of your patch set, here are my thoughts:
>
>> # HIGH-LEVEL IDEA
>>
>> The basic idea is to express the chain call sequence through a special map type,
>> which contains a mapping from a (program, return code) tuple to another program
>> to run next in the sequence. Userspace can populate this map to express
>> arbitrary call sequences, and update the sequence by updating or replacing the
>> map.
>
> How do you imagine this would work in practice? From what I can tell,
> the map is keyed by program id, which makes it difficult to reliably
> construct the control flow that I want.

The way I've been picturing this would work, is that programs would
cooperatively manage this by updating the data structures to insert and
remove themselves as needed. I haven't actually fleshed this out in
code yet (that's next on my agenda), but I imagine it would be
something like this:

> As an example, I'd like to split the daemon into two parts, A and B,
> which I want to be completely independent. So:
>
> - A runs before B, if both are active
> - If A is not active, B is called first
> - If B is not active, only A is called
>
> Both A and B may at any point in time replace their current XDP
> program with a new one. This means that there is no stable program ID
> I can use.

They would have to cooperate. Either by a central management daemon, or
by using the chain call map as a data structure to coordinate. If A
switches out its program, it would need to first load the new version,
then update the map to replace the old ID with the new one.

> Another problem are deletions: if I delete A (because that component
> is shut down) B will never be called, since the program ID that linked
> B into the control flow is gone. This means that B needs to know about
> A and vice versa.

Say both programs are running. If B shuts down, it just removes itself
from the map. If A shuts down, it also needs to remove itself from the
map, and move up all its "descendants" in the call sequence. So in this
case, A would look up itself in the chain call map, find the next
program in the sequence, and install that as the new program on the
interface.

The operations are kinda like linked list manipulations. I had some
examples in my slides at LPC:

- Simple updates: *linked-list like* operations (map stays the same)

  # Insert after id 3
  --> id = load(prog.o);
  --> map_update(map, {3, PASS}, id)   # atomic update

  # Insert before id 2
  --> id = load(prog.o);
  --> map_update(map, {id, PASS}, 2);  # no effect on chain sequence
  --> map_update(map, {1, PASS}, id);  # atomic update

- More complex operations: *replace the whole thing*

  # Replace ID 3 with new program
  --> id = load(prog.o); map = new_map();
  --> map_update(map, {1, PASS}, 2);
  --> map_update(map, {1, TX}, id);
  --> map_update(map, {2, PASS}, id);
  --> xdp_attach(eth0, 1, map, FORCE); # atomic replace

The tricky part is avoiding read-modify-update races. I was imagining
that we could provide "cmpxchg-like" semantics by adding an "expected
old value" to the netlink load and/or to map updates. The kernel could
then ensure that the update fails if the actual value being replaced
has changed since userspace read it, in which case userspace could
retry the whole operation.

>> The actual execution of the program sequence is done in
>> bpf_prog_run_xdp(), which will lookup the chain sequence map, and if
>> found, will loop through calls to BPF_PROG_RUN, looking up the next
>> XDP program in the sequence based on the previous program ID and
>> return code.
>
> I think that the tail call chain is useful for other eBPF programs as
> well. How hard is it to turn the logic in bpf_prog_run_xdp into a
> helper instead?

Heh. In principle, I guess we could move the logic *into* BPF_PROG_RUN
instead of wrapping around it. The tricky part would be figuring out
where to stash the map itself.

One way to do this, which your other question about why both prog fd
and map fd was needed made me think about, is to make programs and
sequence maps interchangeable *everywhere*. I.e., every time we attach
an eBPF program anywhere, we pass an fd to that program. So from a UAPI
PoV, we could just extend that so all attach points would accept
*either* a program fd *or* a chain call map fd, and then do the
appropriate thing.

This is quite intrusive, though, so let's leave that aside for now and
maybe come back to it in the future if this turns out to be a huge
success :)

-Toke
> On Oct 2, 2019, at 11:38 AM, Song Liu <songliubraving@fb.com> wrote:
>
>
>
>> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This series adds support for executing multiple XDP programs on a single
>> interface in sequence, through the use of chain calls, as discussed at the Linux
>> Plumbers Conference last month:
>>
>> https://linuxplumbersconf.org/event/4/contributions/460/
>>
>> # HIGH-LEVEL IDEA
>>
>> The basic idea is to express the chain call sequence through a special map type,
>> which contains a mapping from a (program, return code) tuple to another program
>> to run next in the sequence. Userspace can populate this map to express
>> arbitrary call sequences, and update the sequence by updating or replacing the
>> map.
>>
>> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
>> which will lookup the chain sequence map, and if found, will loop through calls
>> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
>> previous program ID and return code.
>>
>> An XDP chain call map can be installed on an interface by means of a new netlink
>> attribute containing an fd pointing to a chain call map. This can be supplied
>> along with the XDP prog fd, so that a chain map is always installed together
>> with an XDP program.
>
> Interesting work!
>
> Quick question: can we achieve the same by adding a "retval to call_tail_next"
> map to each program? I think one issue is how to avoid loop like A->B->C->A,
> but this should be solvable?
Also, could you please share a real world example? I saw the example from LPC
slides, but I am more curious about what each program does in real use cases.
Thanks,
Song
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> This series adds support for executing multiple XDP programs on a single
>> interface in sequence, through the use of chain calls, as discussed at the Linux
>> Plumbers Conference last month:
>>
>> https://linuxplumbersconf.org/event/4/contributions/460/
>>
>> [...]
>>
>> The performance for these different cases is as follows (with retpolines disabled):
>
> retpolines enabled would also be interesting.

Yes, I'll try that as well.

>> | Test case                       | Perf      | Add. overhead | Total overhead |
>> |---------------------------------+-----------+---------------+----------------|
>> | Before patch (XDP DROP program) | 31.0 Mpps |               |                |
>> | After patch (XDP DROP program)  | 28.9 Mpps | 2.3 ns        | 2.3 ns         |
>
> IMO even 1 Mpps overhead is too much for a feature that is primarily
> about ease of use. Sacrificing performance to make userland a bit
> easier is hard to justify for me when XDP _is_ singularly about
> performance. Also that is nearly 10% overhead which is fairly large.
> So I think going forward the performance gap needs to be removed.

It's not just about ease of use. If we really want XDP to catch on and
be integrated into third-party applications, we *need* to solve the
multi-program problem. Otherwise users are going to have to choose
between running their DDOS protection application and their IDS; which
means no one is going to be able to rely on XDP support, and so no one
is going to bother to implement it. Or at least, I'm afraid it'll end
up that way.

That being said, I agree that the performance impact should be kept at
an absolute minimum. In terms of code it already is: it's a single if
statement to check if a chain map is loaded. I haven't looked into why
that takes 2.3 ns to do yet. I suspect it may just be because we're
taking a cache miss: the chain map pointer is not stored anywhere near
the rest of the XDP data structures...

>> | XDP tail call                   | 26.6 Mpps | 3.0 ns        | 5.3 ns         |
>> | XDP chain call (no jhash)       | 19.6 Mpps | 13.4 ns       | 18.7 ns        |
>> | XDP chain call (this series)    | 17.0 Mpps | 7.9 ns        | 26.6 ns        |
>>
>> From this it is clear that while there is some overhead from this mechanism,
>> the jhash removal example indicates that it is probably possible to optimise the
>> code to the point where the overhead becomes low enough that it is acceptable.
>
> I'm missing why 'in theory' at least this can't be made as-fast as
> tail calls? Again I can't see why someone would lose 30% of their
> performance when a userland program could populate a tail call map for
> the same effect. Sure userland would also have to enforce some program
> standards/conventions but it could be done and at 30% overhead that
> pain is probably worth it IMO.
>
> My thinking though is if we are a bit clever chaining and tail calls
> could be performance-wise equivalent?

Oh, I totally agree that 30% overhead is way too much. I just
prioritised getting this to work and sending it out for comments to get
the conversation going around the API before I spend too much time
optimising the performance bits.

I think it should be possible to get it pretty close to pure tail
calls. A tail call compiles down to a direct jump instruction without
leaving the BPF execution environment, though, so maybe not *exactly*
as fast. But I'd settle for an overhead of, say, a handful of
nanoseconds compared to a pure tail call. Certainly not the tens of
nanoseconds I get right now.

Also, bear in mind that an overhead of say 4-6 nanoseconds only
translates into such a large percentage-wise overhead in this test
because the XDP programs being run don't actually do anything - not
even touch the packet data. As the programs themselves get larger, the
percentage-wise overhead becomes smaller as well. So a test like this
shows the worst-case performance.

-Toke
On Wed, Oct 02, 2019 at 09:43:49AM -0700, John Fastabend wrote:
> Toke Høiland-Jørgensen wrote:
> > This series adds support for executing multiple XDP programs on a single
> > interface in sequence, through the use of chain calls, as discussed at the Linux
> > Plumbers Conference last month:
> >
> > https://linuxplumbersconf.org/event/4/contributions/460/
> >
> > # HIGH-LEVEL IDEA
> >
> > The basic idea is to express the chain call sequence through a special map type,
> > which contains a mapping from a (program, return code) tuple to another program
> > to run next in the sequence. Userspace can populate this map to express
> > arbitrary call sequences, and update the sequence by updating or replacing the
> > map.
> >
> > The actual execution of the program sequence is done in bpf_prog_run_xdp(),
> > which will lookup the chain sequence map, and if found, will loop through calls
> > to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
> > previous program ID and return code.
> >
> > An XDP chain call map can be installed on an interface by means of a new netlink
> > attribute containing an fd pointing to a chain call map. This can be supplied
> > along with the XDP prog fd, so that a chain map is always installed together
> > with an XDP program.
> >
> > # PERFORMANCE
> >
> > I performed a simple performance test to get an initial feel for the overhead of
> > the chain call mechanism. This test consists of running only two programs in
> > sequence: One that returns XDP_PASS and another that returns XDP_DROP. I then
> > measure the drop PPS performance and compare it to a baseline of just a single
> > program that only returns XDP_DROP.
> >
> > For comparison, a test case that uses regular eBPF tail calls to sequence two
> > programs together is also included. Finally, because 'perf' showed that the
> > hashmap lookup was the largest single source of overhead, I also added a test
> > case where I removed the jhash() call from the hashmap code, and just use the
> > u32 key directly as an index into the hash bucket structure.
> >
> > The performance for these different cases is as follows (with retpolines disabled):
>
> retpolines enabled would also be interesting.
>
> >
> > | Test case | Perf | Add. overhead | Total overhead |
> > |---------------------------------+-----------+---------------+----------------|
> > | Before patch (XDP DROP program) | 31.0 Mpps | | |
> > | After patch (XDP DROP program) | 28.9 Mpps | 2.3 ns | 2.3 ns |
>
> IMO even 1 Mpps overhead is too much for a feature that is primarily about
> ease of use. Sacrificing performance to make userland a bit easier is hard
> to justify for me when XDP _is_ singularly about performance. Also that is
> nearly 10% overhead which is fairly large. So I think going forward the
> performance gap needs to be removed.
Fully agree, for the case where this facility is not used, it must have
*zero* overhead. This is /one/ map flavor, in future there will be other
facilities with different use-cases, but we cannot place them all into
the critical fast-path. Given this is BPF, we have the flexibility that
this can be hidden behind the scenes by rewriting and therefore only add
overhead when used.
What I also see as a red flag with this proposal is the fact that it's
tied to XDP only because you need to go and hack bpf_prog_run_xdp() all
the way to fetch xdp->rxq->dev->xdp_chain_map even though the map/concept
itself is rather generic and could be used in various other program types
as well. I'm very sure that once there, people would request it. Therefore,
better to explore a way where this has no changes to BPF_PROG_RUN() similar
to the original tail call work.
Thanks,
Daniel
Song Liu <songliubraving@fb.com> writes:

>> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This series adds support for executing multiple XDP programs on a single
>> interface in sequence, through the use of chain calls, as discussed at the Linux
>> Plumbers Conference last month:
>>
>> https://linuxplumbersconf.org/event/4/contributions/460/
>>
>> [...]
>
> Interesting work!
>
> Quick question: can we achieve the same by adding a "retval to
> call_tail_next" map to each program?

Hmm, that's an interesting idea; I hadn't thought of that. As long as
that map can be manipulated outside of the program itself, it may work.

I wonder how complex it gets to modify the call sequence, though; say
you want to change A->B->C to A->C->B - how do you do that without
interrupting the sequence while you're modifying things? Or is it OK if
that is not possible?

> I think one issue is how to avoid loop like A->B->C->A, but this
> should be solvable?

Well, for tail calls there's already a counter that breaks the sequence
after a certain number of calls. We could do the same here.

>> # PERFORMANCE
>>
>> [...]
>>
>> From this it is clear that while there is some overhead from this mechanism,
>> the jhash removal example indicates that it is probably possible to optimise the
>> code to the point where the overhead becomes low enough that it is acceptable.
>
> I think we can probably re-jit multiple programs into one based on the
> mapping, which should give the best performance.

Yeah, integrating this into the jit+verifier would obviously give the
best performance. But I wanted to avoid that because I viewed this as
an XDP-specific feature, and I didn't want to add more complexity to
the already somewhat complex verifier.

However, if there's really interest in having this be a general feature
outside of XDP, I guess I can look at that again.

-Toke
Song Liu <songliubraving@fb.com> writes:

>> On Oct 2, 2019, at 11:38 AM, Song Liu <songliubraving@fb.com> wrote:
>>
>>> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> This series adds support for executing multiple XDP programs on a single
>>> interface in sequence, through the use of chain calls, as discussed at the Linux
>>> Plumbers Conference last month:
>>>
>>> https://linuxplumbersconf.org/event/4/contributions/460/
>>>
>>> [...]
>>
>> Interesting work!
>>
>> Quick question: can we achieve the same by adding a "retval to call_tail_next"
>> map to each program? I think one issue is how to avoid loop like A->B->C->A,
>> but this should be solvable?
>
> Also, could you please share a real world example? I saw the example
> from LPC slides, but I am more curious about what each program does
> in real use cases.

The only concrete program that I have that needs this is xdpcap:

https://github.com/cloudflare/xdpcap

Right now that needs to be integrated into the calling program to work;
I want to write a tool like it, but that can insert itself before or
after arbitrary XDP programs.

Lorenz, can you say more about your use case? :)

-Toke
Daniel Borkmann <daniel@iogearbox.net> writes:
> On Wed, Oct 02, 2019 at 09:43:49AM -0700, John Fastabend wrote:
>> Toke Høiland-Jørgensen wrote:
>> > This series adds support for executing multiple XDP programs on a single
>> > interface in sequence, through the use of chain calls, as discussed at the Linux
>> > Plumbers Conference last month:
>> >
>> > https://linuxplumbersconf.org/event/4/contributions/460/
>> >
>> > # HIGH-LEVEL IDEA
>> >
>> > The basic idea is to express the chain call sequence through a special map type,
>> > which contains a mapping from a (program, return code) tuple to another program
>> > to run in next in the sequence. Userspace can populate this map to express
>> > arbitrary call sequences, and update the sequence by updating or replacing the
>> > map.
>> >
>> > The actual execution of the program sequence is done in bpf_prog_run_xdp(),
>> > which will lookup the chain sequence map, and if found, will loop through calls
>> > to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
>> > previous program ID and return code.
>> >
>> > An XDP chain call map can be installed on an interface by means of a new netlink
>> > attribute containing an fd pointing to a chain call map. This can be supplied
>> > along with the XDP prog fd, so that a chain map is always installed together
>> > with an XDP program.
>> >
>> > # PERFORMANCE
>> >
>> > I performed a simple performance test to get an initial feel for the overhead of
>> > the chain call mechanism. This test consists of running only two programs in
>> > sequence: One that returns XDP_PASS and another that returns XDP_DROP. I then
>> > measure the drop PPS performance and compare it to a baseline of just a single
>> > program that only returns XDP_DROP.
>> >
>> > For comparison, a test case that uses regular eBPF tail calls to sequence two
>> > programs together is also included. Finally, because 'perf' showed that the
>> > hashmap lookup was the largest single source of overhead, I also added a test
>> > case where I removed the jhash() call from the hashmap code, and just use the
>> > u32 key directly as an index into the hash bucket structure.
>> >
>> > The performance for these different cases is as follows (with retpolines disabled):
>>
>> retpolines enabled would also be interesting.
>>
>> >
>> > | Test case | Perf | Add. overhead | Total overhead |
>> > |---------------------------------+-----------+---------------+----------------|
>> > | Before patch (XDP DROP program) | 31.0 Mpps | | |
>> > | After patch (XDP DROP program) | 28.9 Mpps | 2.3 ns | 2.3 ns |
>>
>> IMO even 1 Mpps overhead is too much for a feature that is primarily about
>> ease of use. Sacrificing performance to make userland a bit easier is hard
>> to justify for me when XDP _is_ singularly about performance. Also that is
>> nearly 10% overhead which is fairly large. So I think going forward the
>> performance gap needs to be removed.
>
> Fully agree, for the case where this facility is not used, it must
> have *zero* overhead. This is /one/ map flavor, in future there will
> be other facilities with different use-cases, but we cannot place them
> all into the critical fast-path. Given this is BPF, we have the
> flexibility that this can be hidden behind the scenes by rewriting and
> therefore only add overhead when used.
>
> What I also see as a red flag with this proposal is the fact that it's
> tied to XDP only because you need to go and hack bpf_prog_run_xdp()
> all the way to fetch xdp->rxq->dev->xdp_chain_map even though the
> map/concept itself is rather generic and could be used in various
> other program types as well. I'm very sure that once there, people
> would request it. Therefore, better to explore a way where this has no
> changes to BPF_PROG_RUN() similar to the original tail call work.
As I said in the other reply, I actually went out of my way to make this
XDP only. But since you're now the third person requesting it not be, I
guess I'll take the hint and look at a more general way to hook this in :)
-Toke
On Wed, Oct 02, 2019 at 09:15:22PM +0200, Daniel Borkmann wrote:
> On Wed, Oct 02, 2019 at 09:43:49AM -0700, John Fastabend wrote:
> > Toke Høiland-Jørgensen wrote:
> > > This series adds support for executing multiple XDP programs on a single
> > > interface in sequence, through the use of chain calls, as discussed at
> > > the Linux Plumbers Conference last month:
> > >
> > > https://linuxplumbersconf.org/event/4/contributions/460/
> > >
[...]
> > > The performance for these different cases is as follows (with retpolines
> > > disabled):
> >
> > retpolines enabled would also be interesting.
> >
> > > | Test case                       | Perf      | Add. overhead | Total overhead |
> > > |---------------------------------+-----------+---------------+----------------|
> > > | Before patch (XDP DROP program) | 31.0 Mpps |               |                |
> > > | After patch (XDP DROP program)  | 28.9 Mpps | 2.3 ns        | 2.3 ns         |
> >
> > IMO even 1 Mpps overhead is too much for a feature that is primarily about
> > ease of use. Sacrificing performance to make userland a bit easier is hard
> > to justify for me when XDP _is_ singularly about performance. Also that is
> > nearly 10% overhead, which is fairly large. So I think going forward the
> > performance gap needs to be removed.
>
> Fully agree, for the case where this facility is not used, it must have
> *zero* overhead. This is /one/ map flavor; in the future there will be other
> facilities with different use-cases, but we cannot place them all into
> the critical fast-path. Given this is BPF, we have the flexibility that
> this can be hidden behind the scenes by rewriting, and therefore only add
> overhead when used.
>
> What I also see as a red flag with this proposal is the fact that it's
> tied to XDP only, because you need to go and hack bpf_prog_run_xdp() all
> the way to fetch xdp->rxq->dev->xdp_chain_map, even though the map/concept
> itself is rather generic and could be used in various other program types
> as well. I'm very sure that once there, people would request it. Therefore,
> better to explore a way where this has no changes to BPF_PROG_RUN(), similar
> to the original tail call work.

Two +1s.

1. New features have to have zero overhead when not used. This is not
   negotiable.
2. Prog chaining is not xdp specific.

Two years ago I was thinking about extending the tail_call mechanism like this:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=prog_chain&id=f54f45d00f91e083f6aec2abe35b6f0be52ae85b

and the program would call the new helper 'bpf_tail_call_next()' to jump
into the next program. Sample code is here:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=prog_chain&id=933a93208f1bd60a9707dc3f51a9aa457c86be87

In my benchmarks it was faster than the existing bpf_tail_call via prog_array
(and fit the rule of zero overhead when not used).

While coding it I figured that we can do proper indirect calls instead,
which would be an even cleaner solution. It would support arbitrary program
chaining and calling. The verifier back then didn't have enough infra to
support indirect calls.

I suggest looking into implementing indirect calls instead of hacking
custom prog chaining logic via maps.
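For reference, the mechanism under discussion — the per-(program, return code) lookup loop in bpf_prog_run_xdp() — can be sketched as a small user-space simulation. All names, types, and bounds here are illustrative rather than the kernel's; it only mirrors the exit conditions mentioned in the thread (XDP_ABORTED, an empty map entry, or the iteration counter running out):

```c
enum xdp_action { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

#define MAX_PROGS       8
#define MAX_ACTS        5
#define MAX_CHAIN_DEPTH 32  /* bound the loop, like the patch's iteration counter */

typedef enum xdp_action (*xdp_prog_fn)(void);

/* Chain map: (prog id, return code) -> next prog id; 0 means "no entry". */
struct chain_map {
    int next[MAX_PROGS][MAX_ACTS];
};

/* Run prog `id`, then any successors found in the chain map, and return
 * the final verdict. */
static enum xdp_action run_chain(xdp_prog_fn *progs, struct chain_map *map, int id)
{
    enum xdp_action ret = XDP_ABORTED;
    int depth = MAX_CHAIN_DEPTH;

    while (id && depth--) {
        ret = progs[id]();
        if (ret == XDP_ABORTED)
            break;
        id = map->next[id][ret];  /* empty entry (0) terminates the chain */
    }
    return ret;
}

/* The two-program benchmark from the cover letter: PASS chained to DROP. */
static enum xdp_action prog_pass(void) { return XDP_PASS; }
static enum xdp_action prog_drop(void) { return XDP_DROP; }

static enum xdp_action demo(void)
{
    xdp_prog_fn progs[MAX_PROGS] = { 0 };
    struct chain_map map = { { { 0 } } };

    progs[1] = prog_pass;
    progs[2] = prog_drop;
    map.next[1][XDP_PASS] = 2;  /* after prog 1 returns XDP_PASS, run prog 2 */

    return run_chain(progs, &map, 1);
}
```

As in the benchmark, the final verdict for this two-program chain is the second program's XDP_DROP; each packet still gets exactly one verdict.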
> On Oct 2, 2019, at 12:23 PM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Song Liu <songliubraving@fb.com> writes:
>
>>> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> This series adds support for executing multiple XDP programs on a single
>>> interface in sequence, through the use of chain calls, as discussed at the Linux
>>> Plumbers Conference last month:
>>>
>>> https://linuxplumbersconf.org/event/4/contributions/460/
>>>
>>> # HIGH-LEVEL IDEA
>>>
>>> The basic idea is to express the chain call sequence through a special map type,
>>> which contains a mapping from a (program, return code) tuple to another program
>>> to run in next in the sequence. Userspace can populate this map to express
>>> arbitrary call sequences, and update the sequence by updating or replacing the
>>> map.
>>>
>>> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
>>> which will lookup the chain sequence map, and if found, will loop through calls
>>> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
>>> previous program ID and return code.
>>>
>>> An XDP chain call map can be installed on an interface by means of a new netlink
>>> attribute containing an fd pointing to a chain call map. This can be supplied
>>> along with the XDP prog fd, so that a chain map is always installed together
>>> with an XDP program.
>>
>> Interesting work!
>>
>> Quick question: can we achieve the same by adding a "retval to
>> call_tail_next" map to each program?
>
> Hmm, that's an interesting idea; I hadn't thought of that. As long as
> that map can be manipulated outside of the program itself, it may work.
> I wonder how complex it gets to modify the call sequence, though; say
> you want to change A->B->C to A->C->B - how do you do that without
> interrupting the sequence while you're modifying things? Or is it OK if
> that is not possible?
We can always load another copy of B and C, say D == B, and E == C. And
make it A->E->D.
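One way to read Song's suggestion, in line with the cover letter's note that userspace can "update the sequence by updating or replacing the map", is to build the replacement sequence fully off to the side (loading duplicate programs where needed) and then publish it with a single pointer swap, so no packet ever observes a half-edited chain. A rough user-space sketch, with all names illustrative:

```c
#include <stdatomic.h>

struct chain_map {
    int next[8][5];  /* (prog id, return code) -> next prog id */
};

/* Pointer the datapath dereferences on every packet. */
static _Atomic(struct chain_map *) active_map;

/* Publish a fully built replacement map with one atomic store, so a
 * reader always sees either the complete old or complete new sequence. */
static void install_chain_map(struct chain_map *new_map)
{
    atomic_store_explicit(&active_map, new_map, memory_order_release);
}

static struct chain_map *get_chain_map(void)
{
    return atomic_load_explicit(&active_map, memory_order_acquire);
}

static struct chain_map map_old, map_new;

static int demo_swap(void)
{
    install_chain_map(&map_old);
    if (get_chain_map() != &map_old)
        return 0;
    /* ... populate map_new off to the side, then flip ... */
    install_chain_map(&map_new);
    return get_chain_map() == &map_new;
}
```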
Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
>
> > Toke Høiland-Jørgensen wrote:
> >> Alan Maguire <alan.maguire@oracle.com> writes:
> >>
> >> > On Wed, 2 Oct 2019, Toke Høiland-Jørgensen wrote:
> >> >
> >> >> This series adds support for executing multiple XDP programs on a single
> >> >> interface in sequence, through the use of chain calls, as discussed at the Linux
> >> >> Plumbers Conference last month:
> >> >>
> >> >> https://linuxplumbersconf.org/event/4/contributions/460/
> >> >>
> >> >> # HIGH-LEVEL IDEA
> >> >>
> >> >> The basic idea is to express the chain call sequence through a special map type,
> >> >> which contains a mapping from a (program, return code) tuple to another program
> >> >> to run in next in the sequence. Userspace can populate this map to express
> >> >> arbitrary call sequences, and update the sequence by updating or replacing the
> >> >> map.
> >> >>
> >> >> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
> >> >> which will lookup the chain sequence map, and if found, will loop through calls
> >> >> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
> >> >> previous program ID and return code.
> >> >>
> >> >> An XDP chain call map can be installed on an interface by means of a new netlink
> >> >> attribute containing an fd pointing to a chain call map. This can be supplied
> >> >> along with the XDP prog fd, so that a chain map is always installed together
> >> >> with an XDP program.
> >> >>
> >> >
> >> > This is great stuff Toke!
> >>
> >> Thanks! :)
> >>
> >> > One thing that wasn't immediately clear to me - and this may be just
> >> > me - is the relationship between program behaviour for the XDP_DROP
> >> > case and chain call execution. My initial thought was that a program
> >> > in the chain XDP_DROP'ping the packet would terminate the call chain,
> >> > but on looking at patch #4 it seems that the only way the call chain
> >> > execution is terminated is if
> >> >
> >> > - XDP_ABORTED is returned from a program in the call chain; or
> >>
> >> Yes. Not actually sure about this one...
> >>
> >> > - the map entry for the next program (determined by the return value
> >> > of the current program) is empty; or
> >>
> >> This will be the common exit condition, I expect
> >>
> >> > - we run out of entries in the map
> >>
> >> You mean if we run the iteration counter to zero, right?
> >>
> >> > The return value of the last-executed program in the chain seems to be
> >> > what determines packet processing behaviour after executing the chain
> >> > (_DROP, _TX, _PASS, etc). So there's no way to both XDP_PASS and
> >> > XDP_TX a packet from the same chain, right? Just want to make sure
> >> > I've got the semantics correct. Thanks!
> >>
> >> Yeah, you've got all this right. The chain call mechanism itself doesn't
> >> change any of the underlying fundamentals of XDP. I.e., each packet gets
> >> exactly one verdict.
> >>
> >> For chaining actual XDP programs that do different things to the packet,
> >> I expect that the most common use case will be to only run the next
> >> program if the previous one returns XDP_PASS. That will make the most
> >> semantic sense I think.
> >>
> >> But there are also use cases where one would want to match on the other
> >> return codes; such as packet capture, for instance, where one might
> >> install a capture program that would carry forward the previous return
> >> code, but do something to the packet (throw it out to userspace) first.
> >>
> >> For the latter use case, the question is if we need to expose the
> >> previous return code to the program when it runs. You can do things
> >> without it (by just using a different program per return code), but it
> >> may simplify things if we just expose the return code. However, since
> >> this will also change the semantics for running programs, I decided to
> >> leave that off for now.
> >>
> >> -Toke
> >
> > In other cases where programs (e.g. cgroups) are run in an array the
> > return codes are 'AND'ed together so that we get
> >
> > result1 & result2 & ... & resultN
>
> How would that work with multiple programs, though? PASS -> DROP seems
> obvious, but what if the first program returns TX? Also, programs may
> want to be able to actually override return codes (e.g., say you want to
> turn DROPs into REDIRECTs, to get all your dropped packets mirrored to
> your IDS or something).
In general I think either you hard code a precedence that will have to
be overly conservative because if one program (your firewall) tells XDP
to drop the packet and some other program redirects it, passes, etc. that
seems incorrect to me. Or you get creative with the precedence rules and
they become complex and difficult to manage, where a drop will drop a packet
unless a previous/preceding program redirects it, etc. I think any hard
coded precedence you come up with will make someone happy and some other
user annoyed, defeating the programmability of BPF.
Better if it's programmable. I would prefer to pass the context into the
next program; then programs can build their own semantics. Then leave
the & of return codes so any program can, if needed, really drop a packet.
The context could be pushed into a shared memory region and then it
doesn't even need to be part of the program signature.
.John
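The shared-memory-context idea could look roughly like the following user-space sketch. The struct, the program signatures, and the policy shown are all hypothetical, not anything from the patch set; the point is only that each program can read the previous verdict and build its own semantics:

```c
enum xdp_action { XDP_ABORTED = 0, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Hypothetical shared region written before each chained program runs. */
struct chain_ctx {
    enum xdp_action prev_ret;
};

/* Capture-style program from the thread: mirror the packet somewhere,
 * then carry the previous verdict forward unchanged. */
static enum xdp_action capture_prog(struct chain_ctx *ctx)
{
    /* ... push packet to userspace here ... */
    return ctx->prev_ret;
}

/* Firewall-style program that really wants the packet gone, regardless
 * of what ran before it. */
static enum xdp_action strict_drop(struct chain_ctx *ctx)
{
    (void)ctx;
    return XDP_DROP;
}

static enum xdp_action demo_capture(void)
{
    struct chain_ctx ctx = { .prev_ret = XDP_TX };
    return capture_prog(&ctx);  /* TX verdict survives the capture step */
}

static enum xdp_action demo_firewall(void)
{
    struct chain_ctx ctx = { .prev_ret = XDP_PASS };
    return strict_drop(&ctx);   /* previous PASS is overridden */
}
```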
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>
[...]
> In general I think either you hard code a precedence that will have to
> be overly conservative, because if one program (your firewall) tells
> XDP to drop the packet and some other program redirects it, passes,
> etc., that seems incorrect to me. Or you get creative with the
> precedence rules and they become complex and difficult to manage,
> where a drop will drop a packet unless a previous/preceding program
> redirects it, etc. I think any hard coded precedence you come up with
> will make someone happy and some other user annoyed, defeating the
> programmability of BPF.

Yeah, exactly. That's basically why I punted on that completely.
Besides, technically you can get this by just installing different
programs in each slot if you really need it.

> Better if it's programmable. I would prefer to pass the context into
> the next program; then programs can build their own semantics. Then
> leave the & of return codes so any program can, if needed, really drop
> a packet. The context could be pushed into a shared memory region and
> then it doesn't even need to be part of the program signature.

Since it seems I'll be going down the rabbit hole of baking this into
the BPF execution environment itself, I guess I'll keep this in mind as
well. Either by stuffing the previous program return code into the
context object(s), or by adding a new helper to retrieve it.

-Toke
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Oct 02, 2019 at 09:15:22PM +0200, Daniel Borkmann wrote:
>
[...]
> two +1s.
>
> 1. new features have to have zero overhead when not used. this is not
>    negotiable.
> 2. prog chaining is not xdp specific.
>
> two years ago I was thinking about extending tail_call mechanism like this:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=prog_chain&id=f54f45d00f91e083f6aec2abe35b6f0be52ae85b
>
> and the program would call the new helper 'bpf_tail_call_next()' to jump
> into the next program. Sample code is here:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=prog_chain&id=933a93208f1bd60a9707dc3f51a9aa457c86be87

Ah, that's a useful starting point, thanks!

> In my benchmarks it was faster than existing bpf_tail_call via prog_array.
> (And fit the rule of zero overhead when not used.)
>
> While coding it I figured that we can do proper indirect calls instead,
> which would be an even cleaner solution. It would support arbitrary
> program chaining and calling.
>
> The verifier back then didn't have enough infra to support indirect calls.
> I suggest to look into implementing indirect calls instead of hacking
> custom prog chaining logic via maps.

One constraint I have is that the admin needs to be able to define the
program execution chain without any cooperation from the program
authors. So it can't rely on programs calling a helper. But I guess we
can achieve that by injecting code into the program when the feature is
turned on...

I'll go jump down the BPF execution environment rabbit hole and see what
comes back up! :)

-Toke
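As a thought experiment, the helper-based chaining Alexei links to might look like the following user-space analogue (all names here are hypothetical stand-ins, not the actual prog_chain branch code). It also illustrates Toke's constraint: the chain only progresses if each program explicitly calls the helper, so it needs cooperation from program authors unless such a call is injected:

```c
typedef int (*prog_fn)(void *ctx);

struct prog_chain {
    prog_fn progs[8];  /* NULL-terminated ordered list set by the admin */
    int cur;           /* index of the currently running program */
};

/* Analogue of a 'bpf_tail_call_next()'-style helper: run the next
 * program in the list, or fall back to the caller's own verdict. */
static int tail_call_next(struct prog_chain *c, void *ctx, int own_ret)
{
    prog_fn next = c->progs[c->cur + 1];

    if (!next)
        return own_ret;
    c->cur++;
    return next(ctx);
}

static int prog_b(void *ctx)
{
    (void)ctx;
    return 2;  /* e.g. XDP_PASS */
}

static int prog_a(void *ctx)
{
    struct prog_chain *c = ctx;
    /* do own work, then explicitly hand off to the next program */
    return tail_call_next(c, c, 1 /* e.g. XDP_DROP */);
}

static int run_demo(void)
{
    struct prog_chain c = { .progs = { prog_a, prog_b }, .cur = 0 };
    return c.progs[0](&c);
}
```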
Song Liu <songliubraving@fb.com> writes:
>> On Oct 2, 2019, at 12:23 PM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Song Liu <songliubraving@fb.com> writes:
>>
>>>> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> This series adds support for executing multiple XDP programs on a single
>>>> interface in sequence, through the use of chain calls, as discussed at the Linux
>>>> Plumbers Conference last month:
>>>>
>>>> https://linuxplumbersconf.org/event/4/contributions/460/
>>>>
>>>> # HIGH-LEVEL IDEA
>>>>
>>>> The basic idea is to express the chain call sequence through a special map type,
>>>> which contains a mapping from a (program, return code) tuple to another program
>>>> to run in next in the sequence. Userspace can populate this map to express
>>>> arbitrary call sequences, and update the sequence by updating or replacing the
>>>> map.
>>>>
>>>> The actual execution of the program sequence is done in bpf_prog_run_xdp(),
>>>> which will lookup the chain sequence map, and if found, will loop through calls
>>>> to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the
>>>> previous program ID and return code.
>>>>
>>>> An XDP chain call map can be installed on an interface by means of a new netlink
>>>> attribute containing an fd pointing to a chain call map. This can be supplied
>>>> along with the XDP prog fd, so that a chain map is always installed together
>>>> with an XDP program.
>>>
>>> Interesting work!
>>>
>>> Quick question: can we achieve the same by adding a "retval to
>>> call_tail_next" map to each program?
>>
>> Hmm, that's an interesting idea; I hadn't thought of that. As long as
>> that map can be manipulated outside of the program itself, it may work.
>> I wonder how complex it gets to modify the call sequence, though; say
>> you want to change A->B->C to A->C->B - how do you do that without
>> interrupting the sequence while you're modifying things? Or is it OK if
>> that is not possible?
>
> We can always load another copy of B and C, say D == B, and E == C. And
> make it A->E->D.
Yes, thinking some more about this I don't think it's actually a
problem. I'll go prototype something...
On Wed, 02 Oct 2019 21:25:28 +0200 Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Song Liu <songliubraving@fb.com> writes:
>
> >> On Oct 2, 2019, at 11:38 AM, Song Liu <songliubraving@fb.com> wrote:
> >>
> >>> On Oct 2, 2019, at 6:30 AM, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>>
> >>> This series adds support for executing multiple XDP programs on a single
> >>> interface in sequence, through the use of chain calls, as discussed at
> >>> the Linux Plumbers Conference last month:
> >>>
> >>> [1] https://linuxplumbersconf.org/event/4/contributions/460/

[2] Slides: http://people.netfilter.org/hawk/presentations/LinuxPlumbers2019/xdp-distro-view.pdf
[3] Source: https://github.com/xdp-project/xdp-project/tree/master/conference/LinuxPlumbers2019

[...]

> > Also, could you please share a real world example? I saw the example
> > from the LPC slides, but I am more curious about what each program
> > does in real use cases.
>
> The only concrete program that I have that needs this is xdpcap:
> https://github.com/cloudflare/xdpcap
>
> Right now that needs to be integrated into the calling program to work;
> I want to write a tool like it, but that can insert itself before or
> after arbitrary XDP programs.

The other real-world use-case is Facebook's katran, you should be aware:
https://github.com/facebookincubator/katran

It might be important to understand that the patchset/intent is a hybrid
that satisfies both xdpcap ([2] slide 26) and katran ([2] slide 27); see
the later slides for how this is done. Notice a requirement there is that
users don't (need to) modify the BPF ELF file to make it cooperate with
this system.

The katran use-case is to chain several eBPF programs. The xdpcap
use-case is to trap any XDP return action code (and tcpdump via the perf
event ring buffer). For system administrators, the xdpcap use-case is
something we hear about all the time, so it is one of the missing
features for XDP. As Toke also wrote, we want to extend this to ALSO be
able to see/dump the packet BEFORE a given XDP program.

> Lorenz, can you say more about your use case? :)

AFAIK Cloudflare also have a chaining eBPF program use-case for XDP. I
could not find the blog post.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
On Thu, 03 Oct 2019 09:48:22 +0200 Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> John Fastabend <john.fastabend@gmail.com> writes:
>
[...]
> >> > In other cases where programs (e.g. cgroups) are run in an array the
> >> > return codes are 'AND'ed together so that we get
> >> >
> >> > result1 & result2 & ... & resultN

But the XDP return codes are not bit values, so an AND operation doesn't
make sense to me.

[...]
> > In general I think either you hard code a precedence that will have to
> > be overly conservative, because if one program (your firewall) tells
> > XDP to drop the packet and some other program redirects it, passes,
> > etc., that seems incorrect to me. Or you get creative with the
> > precedence rules and they become complex and difficult to manage,
> > where a drop will drop a packet unless a previous/preceding program
> > redirects it, etc. I think any hard coded precedence you come up with
> > will make someone happy and some other user annoyed, defeating the
> > programmability of BPF.
>
> Yeah, exactly. That's basically why I punted on that completely.
> Besides, technically you can get this by just installing different
> programs in each slot if you really need it.

I would really like to avoid hard coding precedence. I know it is
"challenging" that we want to allow overruling any XDP return code, but
I think it makes sense and it is the most flexible solution.

> > Better if it's programmable. I would prefer to pass the context into
> > the next program; then programs can build their own semantics. Then
> > leave the & of return codes so any program can, if needed, really
> > drop a packet. The context could be pushed into a shared memory
> > region and then it doesn't even need to be part of the program
> > signature.
>
> Since it seems I'll be going down the rabbit hole of baking this into
> the BPF execution environment itself, I guess I'll keep this in mind as
> well. Either by stuffing the previous program return code into the
> context object(s), or by adding a new helper to retrieve it.

I would like to see the ability to retrieve the previous program's
return code, and a new helper would be the simplest approach, as this
could potentially simplify and compact the data structure.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
On Thu, Oct 3, 2019 at 1:53 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> The xdpcap use-case is to trap any XDP return action code (and tcpdump
> via perf event ring_buffer). For system administrators the xdpcap
> use-case is something we hear about all the time, so one of the missing
> features for XDP. As Toke also wrote, we want to extend this to ALSO
> be-able to see/dump the packet BEFORE a given XDP program.
It sounds to me that 'xdpdump/xdpcap' (tcpdump equivalent) is
the only use case both you and Toke are advocating for.
I think we can already do such a case without new kernel code:
- retrieve prog_id of the program attached to given xdp ifindex
- convert to fd
- create prog_array of one element and store that prog_fd
- create xdpdump bpf prog that prints to ring buffer
and tail_calls into that prog_array
- replace xdp prog on that ifindex
Now it sees all the traffic first and existing xdp progs keep working.
What am I missing?
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Thu, Oct 3, 2019 at 1:53 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>> The xdpcap use-case is to trap any XDP return action code (and tcpdump
>> via perf event ring_buffer). For system administrators the xdpcap
>> use-case is something we hear about all the time, so one of the missing
>> features for XDP. As Toke also wrote, we want to extend this to ALSO
>> be-able to see/dump the packet BEFORE a given XDP program.
>
> It sounds to me that 'xdpdump/xdpcap' (tcpdump equivalent) is
> the only use case both you and Toke are advocating for.
> I think we can already do such a case without new kernel code:
> - retrieve prog_id of the program attached to given xdp ifindex
> - convert to fd
> - create prog_array of one element and store that prog_fd
> - create xdpdump bpf prog that prints to ring buffer
> and tail_calls into that prog_array
> - replace xdp prog on that ifindex
>
> Now it sees all the traffic first and existing xdp progs keep working.
> What am I missing?
Yeah, that takes care of the "run xdpdump as the first thing" use case.
But we also want to be able to run it *after* another program, *without*
modifying that program to add a tail call.
More generally, we want to be able to chain XDP programs from multiple
sources in arbitrary ways. Let me try to provide a more fleshed-out
usage example:
Say a distro ships MyFirewall and MyIDS, two different upstream
projects, both of which support XDP acceleration. MyFirewall has
specified in its documentation that its XDP program will return XDP_PASS
for anything that it has determined should not be dropped. So the
sysadmin decides he wants to enable both, and of course he wants both to
be XDP-accelerated.
This particular sysadmin doesn't want to expend IDS resources on traffic
that the firewall has already decided to drop, so he'll just install the
firewall first, and then run the IDS on any traffic that gets XDP_PASS.
So he installs IDS as a chain-call XDP program on the XDP_PASS action
after the firewall.
Another sysadmin might be more paranoid (or have more CPU resources
available), and so he wants to run the IDS first, and the firewall
afterwards. So he installs the two XDP programs in the reverse order, by
chaining the firewall to the IDS' XDP_PASS action.
At the same time, the sysadmin wants to inspect what the firewall is
actually dropping, so he fires up xdpdump and tells it to show him
everything dropped by the firewall. The xdpdump tool does this by
attaching itself as a chain call program to the XDP_DROP action of the
firewall program.
In all cases, the sysadmin can't (or doesn't want to) modify any of the
XDP programs. In fact, they may just be installed as pre-compiled .so
BPF files on his system. So he needs to be able to configure the call
chain of different programs without modifying the eBPF program source
code.
This is basically what we're trying to support with XDP chain calls
(which I guess is now turning into more general eBPF chain calls). I
think it is doable with something based on the BPF_PROG_CHAIN_* series
you posted a link to earlier; but instead of having an explicit
tail_call_next() helper, I'll just make the verifier insert the chain
calls before each BPF_EXIT instruction when this feature is turned on.
Do you see any reason why this wouldn't work?
-Toke
On 03/10/2019 15:33, Toke Høiland-Jørgensen wrote:
> In all cases, the sysadmin can't (or doesn't want to) modify any of the
> XDP programs. In fact, they may just be installed as pre-compiled .so
> BPF files on his system. So he needs to be able to configure the call
> chain of different programs without modifying the eBPF program source
> code.
Perhaps I'm being dumb, but can't we solve this if we make linking work?
I.e. myIDS.so has ids_main() function, myFirewall.so has firewall()
function, and sysadmin writes a little XDP prog to call these:
int main(struct xdp_md *ctx)
{
    int rc = firewall(ctx), rc2;

    switch (rc) {
    case XDP_DROP:
    case XDP_ABORTED:
    default:
        return rc;
    case XDP_PASS:
        return ids_main(ctx);
    case XDP_TX:
    case XDP_REDIRECT:
        rc2 = ids_main(ctx);
        if (rc2 == XDP_PASS)
            return rc;
        return rc2;
    }
}
Now he compiles this and links it against those .so files, giving him
a new object file which he can then install.
(One problem which does spring to mind is that the .so files may very
inconsiderately both name their entry points main(), which makes
linking against both of them rather challenging. But I think that
can be worked around with a sufficiently clever linker).
-Ed
On Thu, 3 Oct 2019 15:53:50 +0100
Edward Cree <ecree@solarflare.com> wrote:

> On 03/10/2019 15:33, Toke Høiland-Jørgensen wrote:
> > In all cases, the sysadmin can't (or doesn't want to) modify any of the
> > XDP programs. In fact, they may just be installed as pre-compiled .so
> > BPF files on his system. So he needs to be able to configure the call
> > chain of different programs without modifying the eBPF program source
> > code.
>
> Perhaps I'm being dumb, but can't we solve this if we make linking work?
> I.e. myIDS.so has ids_main() function, myFirewall.so has firewall()
> function, and sysadmin writes a little XDP prog to call these:
>
> int main(struct xdp_md *ctx)
> {
>     int rc = firewall(ctx), rc2;
>
>     switch (rc) {
>     case XDP_DROP:
>     case XDP_ABORTED:
>     default:
>         return rc;
>     case XDP_PASS:
>         return ids_main(ctx);
>     case XDP_TX:
>     case XDP_REDIRECT:
>         rc2 = ids_main(ctx);
>         if (rc2 == XDP_PASS)
>             return rc;
>         return rc2;
>     }
> }
>
> Now he compiles this and links it against those .so files, giving him
> a new object file which he can then install.

Sorry, but I don't think this makes sense. We are not dealing with .so
(shared-object) files. These are BPF ELF object files, which are not
really ELF objects in the normal sense.

The role of libbpf is to load the BPF elements in the ELF file via the
BPF syscall. First the maps are loaded (each getting back an FD from the
BPF syscall), and then the program sections are "opened" and ELF
relocation is performed, replacing map accesses with map FDs, before the
BPF byte-code is loaded via the BPF syscall.

The next issue is that the program loading the BPF ELF object will get
the map FDs, which it expects and needs to query and also populate with,
for example, its config. Thus, you would need to pass the FDs back to
the original userspace application.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Edward Cree wrote:
> On 03/10/2019 15:33, Toke Høiland-Jørgensen wrote:
> > In all cases, the sysadmin can't (or doesn't want to) modify any of the
> > XDP programs. In fact, they may just be installed as pre-compiled .so
> > BPF files on his system. So he needs to be able to configure the call
> > chain of different programs without modifying the eBPF program source
> > code.
>
> Perhaps I'm being dumb, but can't we solve this if we make linking work?
> I.e. myIDS.so has ids_main() function, myFirewall.so has firewall()
> function, and sysadmin writes a little XDP prog to call these:
>
> int main(struct xdp_md *ctx)
> {
>     int rc = firewall(ctx), rc2;
>
>     switch (rc) {
>     case XDP_DROP:
>     case XDP_ABORTED:
>     default:
>         return rc;
>     case XDP_PASS:
>         return ids_main(ctx);
>     case XDP_TX:
>     case XDP_REDIRECT:
>         rc2 = ids_main(ctx);
>         if (rc2 == XDP_PASS)
>             return rc;
>         return rc2;
>     }
> }
>
> Now he compiles this and links it against those .so files, giving him
> a new object file which he can then install.
>
> (One problem which does spring to mind is that the .so files may very
> inconsiderately both name their entry points main(), which makes
> linking against both of them rather challenging. But I think that
> can be worked around with a sufficiently clever linker).

I agree, but the same could be done today if ids_main and firewall were
inline functions. The admin can write their little program like the
above and just '#include firewall', '#include ids'. Then you don't need
linking, although it does make things nicer.

> -Ed
Jesper Dangaard Brouer wrote:
> On Thu, 03 Oct 2019 09:48:22 +0200
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> [...]
>
> > >> > In other cases where programs (e.g. cgroups) are run in an array the
> > >> > return codes are 'AND'ed together so that we get
> > >> >
> > >> > result1 & result2 & ... & resultN
>
> But the XDP return codes are not bit values, so an AND operation
> doesn't make sense to me.
>
> > >> How would that work with multiple programs, though? PASS -> DROP seems
> > >> obvious, but what if the first program returns TX? Also, programs may
> > >> want to be able to actually override return codes (e.g., say you want to
> > >> turn DROPs into REDIRECTs, to get all your dropped packets mirrored to
> > >> your IDS or something).
> > >
> > > In general I think either you hard code a precedence that will have to
> > > be overly conservative, because if one program (your firewall) tells
> > > XDP to drop the packet and some other program redirects it, passes,
> > > etc., that seems incorrect to me. Or you get creative with the
> > > precedence rules and they become complex and difficult to manage,
> > > where a drop will drop a packet unless a previous/preceding program
> > > redirects it, etc. I think any hard-coded precedence you come up with
> > > will make someone happy and some other user annoyed, defeating the
> > > programmability of BPF.
> >
> > Yeah, exactly. That's basically why I punted on that completely.
> > Besides, technically you can get this by just installing different
> > programs in each slot if you really need it.
>
> I would really like to avoid hard-coding precedence. I know it is
> "challenging" that we want to allow overruling any XDP return code, but
> I think it makes sense and it is the most flexible solution.
>
> > > Better if its programmable. I would prefer to pass the context into
> > > the next program then programs can build their own semantics. Then
> > > leave the & of return codes so any program can if needed really drop a
> > > packet. The context could be pushed into a shared memory region and
> > > then it doesn't even need to be part of the program signature.
> >
> > Since it seems I'll be going down the rabbit hole of baking this into
> > the BPF execution environment itself, I guess I'll keep this in mind as
> > well. Either by stuffing the previous program return code into the
> > context object(s), or by adding a new helper to retrieve it.
>
> I would like to see the ability to retrieve previous program return
> code, and a new helper would be the simplest approach. As this could
> potentially simplify and compact the data-structure.

But I'm a bit lost here. I think, similar to Alexei's comment, it
sounds like you want 'something' that implements a graph of functions,
where depending on the return code of a function you call the next
function in the graph. And you want the edges on that graph to be
programmable, so that depending on the administrator's configuration
the graph may be built differently?

The more I think about this, the more it seems that you want a BPF
program that is generated from a configuration file to chain together
whatever the administrator asks for. Then you don't need any kernel
bits other than what we have. Of course we can optimize with shared
libs and such as we go, but I'm not seeing the missing piece.

Use proper calls instead of tail calls and you can get return codes to
handle however you like. It requires a bit of cooperation from the IDS
writers to build you a function to call, but that should simply be
their normal XDP hook. Maybe the verifier needs a bit of extra smarts
to follow ctx types into calls, I'm not sure.

User says via some fancy GUI or YAML or ... build this:

                                 ----- XDP_PASS
                                 |
  XDP_firewall --- pass ---> XDP_IDS ------ XDP_DROP
       |
       -- drop ------------------> XDP_DROP

Then generate the file and load it. This seems to be the advantage of
BPF here: you can build whatever you like, make the rules however
complex you like, and build any precedence on the return codes you
like.

.John

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
John Fastabend <john.fastabend@gmail.com> writes:
> Edward Cree wrote:
>> On 03/10/2019 15:33, Toke Høiland-Jørgensen wrote:
>> > In all cases, the sysadmin can't (or doesn't want to) modify any of the
>> > XDP programs. In fact, they may just be installed as pre-compiled .so
>> > BPF files on his system. So he needs to be able to configure the call
>> > chain of different programs without modifying the eBPF program source
>> > code.
>> Perhaps I'm being dumb, but can't we solve this if we make linking work?
>> I.e. myIDS.so has ids_main() function, myFirewall.so has firewall()
>> function, and sysadmin writes a little XDP prog to call these:
>>
>> int main(struct xdp_md *ctx)
>> {
>>     int rc = firewall(ctx), rc2;
>>
>>     switch (rc) {
>>     case XDP_DROP:
>>     case XDP_ABORTED:
>>     default:
>>         return rc;
>>     case XDP_PASS:
>>         return ids_main(ctx);
>>     case XDP_TX:
>>     case XDP_REDIRECT:
>>         rc2 = ids_main(ctx);
>>         if (rc2 == XDP_PASS)
>>             return rc;
>>         return rc2;
>>     }
>> }
>>
>> Now he compiles this and links it against those .so files, giving him
>> a new object file which he can then install.
>>
>> (One problem which does spring to mind is that the .so files may very
>> inconsiderately both name their entry points main(), which makes
>> linking against both of them rather challenging. But I think that
>> can be worked around with a sufficiently clever linker).
>
> I agree but the same could be done today if ids_main and firewall
> were inline functions. Admin can write their little program like above
> and just '#include firewall', '#include ids'. Then you don't need
> linking although it does make things nicer.
(Replying to both you and Edward here, as what you suggested in your
other email is basically the same).
Yes, you are right that we could technically do this entirely in
userspace with a sufficiently smart loader. This was even in the
presentation at LPC (see slide 34: "Implementation option #1: userspace
only").
The reason we want to add kernel support is that it solves several
important problems:
- Centralisation: To do this entirely in userspace, everything has to go
  through the same loader that builds the "dispatcher program".
- Enforcement: Related to the point above - if an XDP-enabled
  application loads its own XDP program without going through the
  central loader, we can't load another program after that.
- State inspection: If userspace builds and loads a single dispatch
  program, it is difficult to go from that back to the call sequence
  graph.
- Dynamic changes: While the dispatch program can be rebuilt from a
  config file, it becomes difficult to do dynamic changes to the
  sequence (such as temporarily attaching xdpdump to the XDP_DROP action
  of an already-loaded program while debugging).
Having the mechanism be in-kernel makes solving these problems a lot
easier, because the kernel can be responsible for state management, and
it can enforce the chain call execution logic.
The fact that Lorenz et al are interested in this feature (even though
they are essentially already doing what you suggested, by having a
centralised daemon to manage all XDP programs), tells me that having
kernel support for this is the right thing to do.
-Toke
On 04/10/2019 09:09, Toke Høiland-Jørgensen wrote:
> Having the mechanism be in-kernel makes solving these problems a lot
> easier, because the kernel can be responsible for state management, and
> it can enforce the chain call execution logic.

I would argue this isn't mechanism, but rather policy, because the
mechanism we already have is sufficient to express the policy.

Enforcement is easily dealt with: you just don't give people the caps/
perms to load XDP programs directly, so the only way they can do it is
via your loader (which you give them a socket or dbus or something to
talk to). (Whereas your chain map doesn't really 'enforce' anything;
anyone who can add themselves to the chain can also remove others.)

Then state inspection happens by querying the loader; if we assume that
the distro provided the central loader, then they can also include the
query in their standard system-dump tools.

Dynamic changes would just mean compiling a new dispatcher, then
atomically replacing the old prog with the new (which we can already
do), since the central loader daemon knows the call graph and can make
changed versions easily.

Centralisation is something that happens normally in userspace; just
count how many daemons your favourite init system runs to administer
system resources and multiplex requests. Probably we'd end up with one
or two standard loaders and interfaces to them.

In any case, it seems like XDP users in userspace still need to
communicate with each other in order to update the chain map (which
seems to rely on knowing where one's own program fits into it); you
suggest they might communicate through the chain map itself, and then
veer off into the weeds of finding race-free ways of doing that. This
seems (to me) needlessly complex.

Incidentally, there's also a performance advantage to an eBPF
dispatcher, because it means the calls to the individual programs can be
JITted and therefore be direct, whereas an in-kernel data-driven
dispatcher has to use indirect calls (*waves at spectre*).

> The fact that Lorenz et al are interested in this feature (even though
> they are essentially already doing what you suggested, by having a
> centralised daemon to manage all XDP programs), tells me that having
> kernel support for this is the right thing to do.

Maybe Lorenz could describe what he sees as the difficulties with the
centralised daemon approach. In what ways is his current "xdpd"
solution unsatisfactory?

-Ed
On Fri, 4 Oct 2019 at 11:34, Edward Cree <ecree@solarflare.com> wrote:
>
> Enforcement is easily dealt with: you just don't give people the caps/
> perms to load XDP programs directly, so the only way they can do it is
> via your loader (which you give them a socket or dbus or something to
> talk to).

Writing this daemon is actually harder than it sounds. Loading eBPF
programs can become fairly complex, with eBPF maps being shared between
different programs. If you want to support all use cases (which you kind
of have to) then you'll end up writing an RPC wrapper for libbpf, which
sounds very painful to me. So I think for this to work at all, loading
has to happen in the user space components. Only construction of the
control flow should be centralised.

This has the knock-on effect that these components need CAP_NET_ADMIN,
since too much of eBPF relies on having that capability right now:
various map types, safety features applied to non-root eBPF, etc. Given
time this will be fixed, and maybe these programs can then just have
CAP_BPF or whatever.

I chatted with my colleague Arthur, and we think this might work if all
programs are forced to comply with the xdpcap-style tail call map: a
prog array with MAX_XDP_ACTION slots, which each program calls into via

    tail_call(map, action);
    return action; // to handle the empty slot case

You could then send (program fd, tail call map fd) along with a priority
of some sort via SCM_RIGHTS. The daemon can then update the tail call
maps as needed. The problem is that this only allows for linear (not
tree-like) control flow. We'll try and hack up a POC to see if it works
at all.

> In any case, it seems like XDP users in userspace still need to
> communicate with each other in order to update the chain map (which
> seems to rely on knowing where one's own program fits into it); you
> suggest they might communicate through the chain map itself, and then
> veer off into the weeds of finding race-free ways of doing that. This
> seems (to me) needlessly complex.

I agree.

> Incidentally, there's also a performance advantage to an eBPF dispatcher,
> because it means the calls to the individual programs can be JITted and
> therefore be direct, whereas an in-kernel data-driven dispatcher has to
> use indirect calls (*waves at spectre*).

This is if we somehow got full-blown calls between distinct eBPF
programs?

> Maybe Lorenz could describe what he sees as the difficulties with the
> centralised daemon approach. In what ways is his current "xdpd"
> solution unsatisfactory?

xdpd contains the logic to load and install all the different XDP
programs we have. If we want to change one of them we have to redeploy
the whole thing. Same if we want to add one. It also makes life-cycle
management harder than it should be. So our xdpd is not at all like the
"loader" you envision.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
www.cloudflare.com
On 04/10/2019 16:58, Lorenz Bauer wrote:
> If you want to support all use cases (which you kind of have to) then
> you'll end up writing an RPC wrapper for libbpf,

Yes, you more-or-less would need that. Though I think you can e.g. have
the clients load & pin their own maps and then pass the map fds over
SCM_RIGHTS (though I'm not sure if our current permissions system is
granular enough for that).

> which sounds very painful to me.

I might be being naïve, but it doesn't sound more painful than is normal
for userland. I mean, what operations have you got:

* create/destroy map (maybe, see above)
* load prog (pass it an fd from which it can read an ELF, and more fds
  for the maps it uses. Everything else, e.g. BTFs, can just live in the
  ELF.)
* destroy prog
* bind prog to hook (admittedly there's a long list of hooks, but this
  is only to cover the XDP ones, so basically we just have to specify
  interface and generic/driver/hw)

That doesn't seem like it presents great difficulties?

>> Incidentally, there's also a performance advantage to an eBPF dispatcher,
>> because it means the calls to the individual programs can be JITted and
>> therefore be direct, whereas an in-kernel data-driven dispatcher has to
>> use indirect calls (*waves at spectre*).
> This is if we somehow got full blown calls between distinct eBPF programs?

No, I'm talking about doing a linker step (using the 'full-blown calls'
_within_ an eBPF program that Alexei added a few months back) before the
program is submitted to the kernel. So the BPF_CALL|BPF_PSEUDO_CALL insn
gets JITed to a direct call.

(Although I also think full-blown dynamically-linked calls ought not to
be impossible, *if* we restrict them to taking a ctx and returning a
u64, in which case the callee can be verified as though it were a normal
program, and the caller's verifier just treats the program as returning
an unknown scalar. The devil is in the details, though, and it seems
no-one's quite wanted it enough to do the work required to make it
happen.)

>> Maybe Lorenz could describe what he sees as the difficulties with the
>> centralised daemon approach. In what ways is his current "xdpd"
>> solution unsatisfactory?
> xdpd contains the logic to load and install all the different XDP programs
> we have. If we want to change one of them we have to redeploy the whole
> thing. Same if we want to add one. It also makes life-cycle management
> harder than it should be. So our xdpd is not at all like the "loader"
> you envision.

OK, but in that case xdpd isn't evidence that the "loader" approach
doesn't work, so I still think it should be tried before we go to the
lengths of pushing something into the kernel (that we then have to
maintain forever).

No promises, but I might find the time to put together a strawman
implementation of the loader, to show how I envisage it working.

-Ed
On Mon, 7 Oct 2019 at 17:43, Edward Cree <ecree@solarflare.com> wrote:
>
> I might be being naïve, but it doesn't sound more painful than is normal
> for userland. I mean, what operations have you got-
> * create/destroy map (maybe, see above)
> * load prog (pass it an fd from which it can read an ELF, and more fds
>   for the maps it uses. Everything else, e.g. BTFs, can just live in the
>   ELF.)
> * destroy prog
> * bind prog to hook (admittedly there's a long list of hooks, but this is
>   only to cover the XDP ones, so basically we just have to specify
>   interface and generic/driver/hw)
> -that doesn't seem like it presents great difficulties?

Sure, but this is the simplest, not necessarily realistic use case.
There is a reason that libbpf has the API it has. For example, we patch
our eBPF before loading it. I'm sure there are other complications,
which is why I prefer to keep loading my own programs.

> No, I'm talking about doing a linker step (using the 'full-blown calls'
> _within_ an eBPF program that Alexei added a few months back) before the
> program is submitted to the kernel. So the BPF_CALL|BPF_PSEUDO_CALL insn
> gets JITed to a direct call.

Ah, I see. I'm not sure whether this restriction has been lifted, but
those calls are incompatible with tail calls. So we wouldn't be able to
use this.

> OK, but in that case xdpd isn't evidence that the "loader" approach doesn't
> work, so I still think it should be tried before we go to the lengths of
> pushing something into the kernel (that we then have to maintain forever).

Maybe this came across the wrong way, I never said it is. Merely that
it's the status quo we'd like to move away from. If we can achieve that
in userspace, great.

Lorenz

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
www.cloudflare.com
On 07/10/2019 18:12, Lorenz Bauer wrote:
> Sure, but this is the simplest, not necessarily realistic use case. There
> is a reason that libbpf has the API it has. For example, we patch our
> eBPF before loading it. I'm sure there are other complications, which is
> why I prefer to keep loading my own programs.

Any reason why you can't have the client patch the eBPF (possibly with
libbpf) before supplying the patched object file to the loaderiser?

>> No, I'm talking about doing a linker step (using the 'full-blown calls'
>> _within_ an eBPF program that Alexei added a few months back) before the
>> program is submitted to the kernel. So the BPF_CALL|BPF_PSEUDO_CALL insn
>> gets JITed to a direct call.

> Ah, I see. I'm not sure whether this restriction has been lifted, but those
> calls are incompatible with tail calls. So we wouldn't be able to use this.

Indeed, tail calls don't fit into my scheme, because being a tail-call from
the subprogram doesn't make you a tail-call from the dispatcher program.
But AIUI tail calls are only in use today in various work-arounds for the
lack of proper linking (including dynamic linking). If we supported that,
would you still need them?

>> OK, but in that case xdpd isn't evidence that the "loader" approach doesn't
>> work, so I still think it should be tried before we go to the lengths of
>> pushing something into the kernel (that we then have to maintain forever).

> Maybe this came across the wrong way, I never said it is.

No, you didn't (sorry). Toke somewhat implied it, which is what I was
responding to there.

-Ed
On Mon, Oct 07, 2019 at 05:43:44PM +0100, Edward Cree wrote:
>
> (Although I also think full-blown dynamically-linked calls ought not to be
> impossible, *if* we restrict them to taking a ctx and returning a u64, in
> which case the callee can be verified as though it were a normal program,
> and the caller's verifier just treats the program as returning an unknown
> scalar. The devil is in the details, though, and it seems no-one's quite
> wanted it enough to do the work required to make it happen.)
Absolutely.
Full dynamic linking and libraries is on todo list.
It's taking a long time, since it needs to be powerful and generic from day one.
If we do 'pass ctx only and return u64' as a stop gap, it will be just as limited
as existing bpf_tail_calls.
bpf_tail_call api was badly designed.
I couldn't figure out how to make tail calls safe and generic, so I came up
with this bpf_tail_call hack, and bpf usability suffers for it.
The verifier is different now. It's powerful enough to do true calls and jumps.
Especially with BTF it can see and track all types.
True tail calls will be seen as indirect jump assembler insn.
True indirect calls will be seen as indirect call assembler insn.
The choice of opcode encoding is clear.
The verifier, interpreter, and JIT work is what's left.
It won't take long.