* [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup

From: Lorenz Bauer @ 2020-03-10 17:47 UTC
To: Alexei Starovoitov, Daniel Borkmann
Cc: kernel-team, Lorenz Bauer, netdev, bpf

We want to use sockhash and sockmap to build the control plane for
our upcoming BPF socket dispatch work. We realised that it's difficult
to resize or otherwise rebuild these maps if needed, because there is
no way to get at their contents. This patch set allows a privileged
user to retrieve fds from these map types, which removes this obstacle.

The approach here is different from that of program arrays and nested
maps, which return an ID that can be turned into an fd using the
BPF_*_GET_FD_BY_ID syscall. Sockets have IDs in the form of cookies,
but there seems to be no way to go from a socket cookie to a struct
socket or struct file. Hence we return an fd directly.

If unprivileged access is desired, the user can create the map with
value_size = 8, which makes lookup return the socket cookie instead.
It would be nicer if this behaviour were controllable at the time of
calling bpf_map_lookup_elem, but I've not found a good solution for
this.

Patches 1-3 do a bit of clean-up, but I'm happy to drop them if they
don't make sense. Patches 4-5 are the interesting bit.
Lorenz Bauer (5):
  bpf: add map_copy_value hook
  bpf: convert queue and stack map to map_copy_value
  bpf: convert sock map and hash to map_copy_value
  bpf: sockmap, sockhash: return file descriptors from privileged lookup
  bpf: sockmap, sockhash: test looking up fds

 include/linux/bpf-cgroup.h                    |  5 --
 include/linux/bpf.h                           | 21 +-----
 include/linux/bpf_types.h                     |  2 +-
 kernel/bpf/arraymap.c                         | 13 +++-
 kernel/bpf/bpf_struct_ops.c                   |  7 +-
 kernel/bpf/hashtab.c                          | 10 ++-
 kernel/bpf/local_storage.c                    | 14 +++-
 kernel/bpf/queue_stack_maps.c                 | 18 +++++
 kernel/bpf/reuseport_array.c                  |  5 +-
 kernel/bpf/syscall.c                          | 23 +------
 net/core/sock_map.c                           | 67 ++++++++++++-----
 .../selftests/bpf/prog_tests/sockmap_listen.c | 26 +++++--
 12 files changed, 130 insertions(+), 81 deletions(-)

--
2.20.1
* [PATCH 1/5] bpf: add map_copy_value hook

From: Lorenz Bauer @ 2020-03-10 17:47 UTC
To: Alexei Starovoitov, Daniel Borkmann
Cc: kernel-team, Lorenz Bauer, netdev, bpf, linux-kernel

bpf_map_copy_value has a lot of special cases for different map types
that want more control than map_lookup_elem provides. On closer
inspection, almost all of them follow the pattern

    int func(struct bpf_map *, void *, void *)

Introduce a new member map_copy_value to struct bpf_map_ops, and
convert the current special cases to use it.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
---
 include/linux/bpf-cgroup.h   |  5 -----
 include/linux/bpf.h          | 21 +--------------------
 include/linux/bpf_types.h    |  2 +-
 kernel/bpf/arraymap.c        | 13 ++++++++++---
 kernel/bpf/bpf_struct_ops.c  |  7 ++++---
 kernel/bpf/hashtab.c         | 10 +++++++---
 kernel/bpf/local_storage.c   | 14 +++++++++++++-
 kernel/bpf/reuseport_array.c |  5 +++--
 kernel/bpf/syscall.c         | 24 ++++--------------------
 9 files changed, 43 insertions(+), 58 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index a7cd5c7a2509..6741a6c460f6 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -162,7 +162,6 @@ void bpf_cgroup_storage_unlink(struct bpf_cgroup_storage *storage);
 int bpf_cgroup_storage_assign(struct bpf_prog_aux *aux, struct bpf_map *map);
 void bpf_cgroup_storage_release(struct bpf_prog_aux *aux, struct bpf_map *map);
 
-int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 				     void *value, u64 flags);
 
@@ -370,10 +369,6 @@ static inline struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(
 	struct bpf_prog *prog, enum bpf_cgroup_storage_type stype) { return NULL; }
 static inline void bpf_cgroup_storage_free(
 	struct bpf_cgroup_storage *storage) {}
-static inline int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key,
-						 void *value) {
-	return 0;
-}
 static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
 						   void *key, void *value,
 						   u64 flags) { return 0; }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 94a329b9da81..ad9f3be830f0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -44,6 +44,7 @@ struct bpf_map_ops {
 	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
 	void (*map_release_uref)(struct bpf_map *map);
 	void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
+	int (*map_copy_value)(struct bpf_map *map, void *key, void *value);
 	int (*map_lookup_batch)(struct bpf_map *map, const union bpf_attr *attr,
 				union bpf_attr __user *uattr);
 	int (*map_lookup_and_delete_batch)(struct bpf_map *map,
@@ -741,8 +742,6 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log);
 bool bpf_struct_ops_get(const void *kdata);
 void bpf_struct_ops_put(const void *kdata);
-int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
-				       void *value);
 static inline bool bpf_try_module_get(const void *data, struct module *owner)
 {
 	if (owner == BPF_MODULE_OWNER)
@@ -774,12 +773,6 @@ static inline void bpf_module_put(const void *data, struct module *owner)
 {
 	module_put(owner);
 }
-static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
-						     void *key,
-						     void *value)
-{
-	return -EINVAL;
-}
 #endif
 
 struct bpf_array {
@@ -1082,8 +1075,6 @@ struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
-int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
-int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
 			   u64 flags);
 int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
@@ -1093,10 +1084,8 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value);
 
 int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 				 void *key, void *value, u64 map_flags);
-int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
 int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 				void *key, void *value, u64 map_flags);
-int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
 
 int bpf_get_file_flag(int flags);
 int bpf_check_uarg_tail_zero(void __user *uaddr, size_t expected_size,
@@ -1437,8 +1426,6 @@ static inline int sock_map_get_from_fd(const union bpf_attr *attr,
 
 #if defined(CONFIG_INET) && defined(CONFIG_BPF_SYSCALL)
 void bpf_sk_reuseport_detach(struct sock *sk);
-int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, void *key,
-				       void *value);
 int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key,
 				       void *value, u64 map_flags);
 #else
@@ -1447,12 +1434,6 @@ static inline void bpf_sk_reuseport_detach(struct sock *sk)
 }
 
 #ifdef CONFIG_BPF_SYSCALL
-static inline int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map,
-						     void *key, void *value)
-{
-	return -EOPNOTSUPP;
-}
-
 static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
 						     void *key, void *value,
 						     u64 map_flags)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c81d4ece79a4..4949638cd049 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -81,7 +81,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
-BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, percpu_cgroup_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 95d77770353c..58a0a8b3abe3 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -249,7 +249,8 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
 	return this_cpu_ptr(array->pptrs[index & array->index_mask]);
 }
 
-int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value)
+static int percpu_array_map_copy_value(struct bpf_map *map, void *key,
+				       void *value)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
@@ -513,6 +514,7 @@ const struct bpf_map_ops percpu_array_map_ops = {
 	.map_free = array_map_free,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = percpu_array_map_lookup_elem,
+	.map_copy_value = percpu_array_map_copy_value,
 	.map_update_elem = array_map_update_elem,
 	.map_delete_elem = array_map_delete_elem,
 	.map_seq_show_elem = percpu_array_map_seq_show_elem,
@@ -550,7 +552,8 @@ static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
 }
 
 /* only called from syscall */
-int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
+static int fd_array_map_lookup_elem_sys_copy(struct bpf_map *map, void *key,
+					     void *value)
 {
 	void **elem, *ptr;
 	int ret = 0;
@@ -561,7 +564,7 @@ int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
 	rcu_read_lock();
 	elem = array_map_lookup_elem(map, key);
 	if (elem && (ptr = READ_ONCE(*elem)))
-		*value = map->ops->map_fd_sys_lookup_elem(ptr);
+		*(u32 *)value = map->ops->map_fd_sys_lookup_elem(ptr);
 	else
 		ret = -ENOENT;
 	rcu_read_unlock();
@@ -872,6 +875,7 @@ const struct bpf_map_ops prog_array_map_ops = {
 	.map_poke_run = prog_array_map_poke_run,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = fd_array_map_lookup_elem,
+	.map_copy_value = fd_array_map_lookup_elem_sys_copy,
 	.map_delete_elem = fd_array_map_delete_elem,
 	.map_fd_get_ptr = prog_fd_array_get_ptr,
 	.map_fd_put_ptr = prog_fd_array_put_ptr,
@@ -962,6 +966,7 @@ const struct bpf_map_ops perf_event_array_map_ops = {
 	.map_free = fd_array_map_free,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = fd_array_map_lookup_elem,
+	.map_copy_value = fd_array_map_lookup_elem_sys_copy,
 	.map_delete_elem = fd_array_map_delete_elem,
 	.map_fd_get_ptr = perf_event_fd_array_get_ptr,
 	.map_fd_put_ptr = perf_event_fd_array_put_ptr,
@@ -995,6 +1000,7 @@ const struct bpf_map_ops cgroup_array_map_ops = {
 	.map_free = cgroup_fd_array_free,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = fd_array_map_lookup_elem,
+	.map_copy_value = fd_array_map_lookup_elem_sys_copy,
 	.map_delete_elem = fd_array_map_delete_elem,
 	.map_fd_get_ptr = cgroup_fd_array_get_ptr,
 	.map_fd_put_ptr = cgroup_fd_array_put_ptr,
@@ -1078,6 +1084,7 @@ const struct bpf_map_ops array_of_maps_map_ops = {
 	.map_free = array_of_map_free,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = array_of_map_lookup_elem,
+	.map_copy_value = fd_array_map_lookup_elem_sys_copy,
 	.map_delete_elem = fd_array_map_delete_elem,
 	.map_fd_get_ptr = bpf_map_fd_get_ptr,
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index ca5cc8cdb6eb..cc1d7d1077c1 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -238,8 +238,8 @@ static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
 	return 0;
 }
 
-int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
-				       void *value)
+static int bpf_struct_ops_map_copy_value(struct bpf_map *map, void *key,
+					 void *value)
 {
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
 	struct bpf_struct_ops_value *uvalue, *kvalue;
@@ -509,7 +509,7 @@ static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
 	if (!value)
 		return;
 
-	err = bpf_struct_ops_map_sys_lookup_elem(map, key, value);
+	err = bpf_struct_ops_map_copy_value(map, key, value);
 	if (!err) {
 		btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
 				  value, m);
@@ -609,6 +609,7 @@ const struct bpf_map_ops bpf_struct_ops_map_ops = {
 	.map_free = bpf_struct_ops_map_free,
 	.map_get_next_key = bpf_struct_ops_map_get_next_key,
 	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
+	.map_copy_value = bpf_struct_ops_map_copy_value,
 	.map_delete_elem = bpf_struct_ops_map_delete_elem,
 	.map_update_elem = bpf_struct_ops_map_update_elem,
 	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d541c8486c95..f5452a8a5177 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1664,7 +1664,8 @@ static void *htab_lru_percpu_map_lookup_elem(struct bpf_map *map, void *key)
 	return NULL;
 }
 
-int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
+static int htab_percpu_map_copy_value(struct bpf_map *map, void *key,
+				      void *value)
 {
 	struct htab_elem *l;
 	void __percpu *pptr;
@@ -1749,6 +1750,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_percpu_map_lookup_elem,
+	.map_copy_value = htab_percpu_map_copy_value,
 	.map_update_elem = htab_percpu_map_update_elem,
 	.map_delete_elem = htab_map_delete_elem,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
@@ -1761,6 +1763,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
+	.map_copy_value = htab_percpu_map_copy_value,
 	.map_update_elem = htab_lru_percpu_map_update_elem,
 	.map_delete_elem = htab_lru_map_delete_elem,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
@@ -1796,7 +1799,7 @@ static void fd_htab_map_free(struct bpf_map *map)
 }
 
 /* only called from syscall */
-int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
+static int fd_htab_map_copy_value(struct bpf_map *map, void *key, void *value)
 {
 	void **ptr;
 	int ret = 0;
@@ -1807,7 +1810,7 @@ int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
 	rcu_read_lock();
 	ptr = htab_map_lookup_elem(map, key);
 	if (ptr)
-		*value = map->ops->map_fd_sys_lookup_elem(READ_ONCE(*ptr));
+		*(u32 *)value = map->ops->map_fd_sys_lookup_elem(READ_ONCE(*ptr));
 	else
 		ret = -ENOENT;
 	rcu_read_unlock();
@@ -1893,6 +1896,7 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_free = htab_of_map_free,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_of_map_lookup_elem,
+	.map_copy_value = fd_htab_map_copy_value,
 	.map_delete_elem = htab_map_delete_elem,
 	.map_fd_get_ptr = bpf_map_fd_get_ptr,
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 33d01866bcc2..fcc0b168dad2 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -167,7 +167,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 	return 0;
 }
 
-int bpf_percpu_cgroup_storage_copy(struct bpf_map *_map, void *_key,
+static int percpu_cgroup_storage_copy(struct bpf_map *_map, void *_key,
 				      void *value)
 {
 	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
@@ -420,6 +420,18 @@ const struct bpf_map_ops cgroup_storage_map_ops = {
 	.map_seq_show_elem = cgroup_storage_seq_show_elem,
 };
 
+const struct bpf_map_ops percpu_cgroup_storage_map_ops = {
+	.map_alloc = cgroup_storage_map_alloc,
+	.map_free = cgroup_storage_map_free,
+	.map_get_next_key = cgroup_storage_get_next_key,
+	.map_lookup_elem = cgroup_storage_lookup_elem,
+	.map_copy_value = percpu_cgroup_storage_copy,
+	.map_update_elem = cgroup_storage_update_elem,
+	.map_delete_elem = cgroup_storage_delete_elem,
+	.map_check_btf = cgroup_storage_check_btf,
+	.map_seq_show_elem = cgroup_storage_seq_show_elem,
+};
+
 int bpf_cgroup_storage_assign(struct bpf_prog_aux *aux, struct bpf_map *_map)
 {
 	enum bpf_cgroup_storage_type stype = cgroup_storage_type(_map);
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 01badd3eda7a..f36ccbf2612e 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -178,8 +178,8 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 	return &array->map;
 }
 
-int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, void *key,
-				       void *value)
+static int reuseport_array_copy_value(struct bpf_map *map, void *key,
+				      void *value)
 {
 	struct sock *sk;
 	int err;
@@ -350,6 +350,7 @@ const struct bpf_map_ops reuseport_array_ops = {
 	.map_alloc = reuseport_array_alloc,
 	.map_free = reuseport_array_free,
 	.map_lookup_elem = reuseport_array_lookup_elem,
+	.map_copy_value = reuseport_array_copy_value,
 	.map_get_next_key = reuseport_array_get_next_key,
 	.map_delete_elem = reuseport_array_delete_elem,
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7ce0815793dd..6503824e81e9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -218,27 +218,11 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 		return bpf_map_offload_lookup_elem(map, key, value);
 
 	bpf_disable_instrumentation();
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		err = bpf_percpu_hash_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
-		err = bpf_percpu_array_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
-		err = bpf_percpu_cgroup_storage_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
-		err = bpf_stackmap_copy(map, key, value);
-	} else if (IS_FD_ARRAY(map) || IS_FD_PROG_ARRAY(map)) {
-		err = bpf_fd_array_map_lookup_elem(map, key, value);
-	} else if (IS_FD_HASH(map)) {
-		err = bpf_fd_htab_map_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
-		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
+	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+	    map->map_type == BPF_MAP_TYPE_STACK) {
 		err = map->ops->map_peek_elem(map, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
-		/* struct_ops map requires directly updating "value" */
-		err = bpf_struct_ops_map_sys_lookup_elem(map, key, value);
+	} else if (map->ops->map_copy_value) {
+		err = map->ops->map_copy_value(map, key, value);
 	} else {
 		rcu_read_lock();
 		if (map->ops->map_lookup_elem_sys_only)
--
2.20.1
* [PATCH 2/5] bpf: convert queue and stack map to map_copy_value

From: Lorenz Bauer @ 2020-03-10 17:47 UTC
To: Alexei Starovoitov, Daniel Borkmann
Cc: kernel-team, Lorenz Bauer, netdev, bpf, linux-kernel

Migrate BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK to map_copy_value,
by introducing small wrappers that discard the (unused) key argument.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
---
 kernel/bpf/queue_stack_maps.c | 18 ++++++++++++++++++
 kernel/bpf/syscall.c          |  5 +----
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f697647ceb54..5c89b7583cd2 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -262,11 +262,28 @@ static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
 	return -EINVAL;
 }
 
+/* Called from syscall */
+static int queue_map_copy_value(struct bpf_map *map, void *key, void *value)
+{
+	(void)key;
+
+	return queue_map_peek_elem(map, value);
+}
+
+/* Called from syscall */
+static int stack_map_copy_value(struct bpf_map *map, void *key, void *value)
+{
+	(void)key;
+
+	return stack_map_peek_elem(map, value);
+}
+
 const struct bpf_map_ops queue_map_ops = {
 	.map_alloc_check = queue_stack_map_alloc_check,
 	.map_alloc = queue_stack_map_alloc,
 	.map_free = queue_stack_map_free,
 	.map_lookup_elem = queue_stack_map_lookup_elem,
+	.map_copy_value = queue_map_copy_value,
 	.map_update_elem = queue_stack_map_update_elem,
 	.map_delete_elem = queue_stack_map_delete_elem,
 	.map_push_elem = queue_stack_map_push_elem,
@@ -280,6 +297,7 @@ const struct bpf_map_ops stack_map_ops = {
 	.map_alloc = queue_stack_map_alloc,
 	.map_free = queue_stack_map_free,
 	.map_lookup_elem = queue_stack_map_lookup_elem,
+	.map_copy_value = stack_map_copy_value,
 	.map_update_elem = queue_stack_map_update_elem,
 	.map_delete_elem = queue_stack_map_delete_elem,
 	.map_push_elem = queue_stack_map_push_elem,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6503824e81e9..20c6cdace275 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -218,10 +218,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 		return bpf_map_offload_lookup_elem(map, key, value);
 
 	bpf_disable_instrumentation();
-	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-	    map->map_type == BPF_MAP_TYPE_STACK) {
-		err = map->ops->map_peek_elem(map, value);
-	} else if (map->ops->map_copy_value) {
+	if (map->ops->map_copy_value) {
 		err = map->ops->map_copy_value(map, key, value);
 	} else {
 		rcu_read_lock();
--
2.20.1
* Re: [PATCH 2/5] bpf: convert queue and stack map to map_copy_value

From: Jakub Sitnicki @ 2020-03-11 14:00 UTC
To: Lorenz Bauer
Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, netdev, bpf, linux-kernel

On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote:
> Migrate BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK to map_copy_value,
> by introducing small wrappers that discard the (unused) key argument.
>
> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> ---
>  kernel/bpf/queue_stack_maps.c | 18 ++++++++++++++++++
>  kernel/bpf/syscall.c          |  5 +----
>  2 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
> index f697647ceb54..5c89b7583cd2 100644
> --- a/kernel/bpf/queue_stack_maps.c
> +++ b/kernel/bpf/queue_stack_maps.c
> @@ -262,11 +262,28 @@ static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
>  	return -EINVAL;
>  }
>
> +/* Called from syscall */
> +static int queue_map_copy_value(struct bpf_map *map, void *key, void *value)
> +{
> +	(void)key;

Alternatively, there is the __always_unused compiler attribute from
include/linux/compiler_attributes.h that seems to be in wide use.

> +
> +	return queue_map_peek_elem(map, value);
> +}
> +
> +/* Called from syscall */
> +static int stack_map_copy_value(struct bpf_map *map, void *key, void *value)
> +{
> +	(void)key;
> +
> +	return stack_map_peek_elem(map, value);
> +}
> +
>  const struct bpf_map_ops queue_map_ops = {
>  	.map_alloc_check = queue_stack_map_alloc_check,
>  	.map_alloc = queue_stack_map_alloc,
>  	.map_free = queue_stack_map_free,
>  	.map_lookup_elem = queue_stack_map_lookup_elem,
> +	.map_copy_value = queue_map_copy_value,
>  	.map_update_elem = queue_stack_map_update_elem,
>  	.map_delete_elem = queue_stack_map_delete_elem,
>  	.map_push_elem = queue_stack_map_push_elem,
> @@ -280,6 +297,7 @@ const struct bpf_map_ops stack_map_ops = {
>  	.map_alloc = queue_stack_map_alloc,
>  	.map_free = queue_stack_map_free,
>  	.map_lookup_elem = queue_stack_map_lookup_elem,
> +	.map_copy_value = stack_map_copy_value,
>  	.map_update_elem = queue_stack_map_update_elem,
>  	.map_delete_elem = queue_stack_map_delete_elem,
>  	.map_push_elem = queue_stack_map_push_elem,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 6503824e81e9..20c6cdace275 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -218,10 +218,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
>  		return bpf_map_offload_lookup_elem(map, key, value);
>
>  	bpf_disable_instrumentation();
> -	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> -	    map->map_type == BPF_MAP_TYPE_STACK) {
> -		err = map->ops->map_peek_elem(map, value);
> -	} else if (map->ops->map_copy_value) {
> +	if (map->ops->map_copy_value) {
>  		err = map->ops->map_copy_value(map, key, value);
>  	} else {
>  		rcu_read_lock();
* Re: [PATCH 2/5] bpf: convert queue and stack map to map_copy_value

From: John Fastabend @ 2020-03-11 22:31 UTC
To: Jakub Sitnicki, Lorenz Bauer
Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, netdev, bpf, linux-kernel

Jakub Sitnicki wrote:
> On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote:
> > Migrate BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK to map_copy_value,
> > by introducing small wrappers that discard the (unused) key argument.
> >
> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
> > ---
> >  kernel/bpf/queue_stack_maps.c | 18 ++++++++++++++++++
> >  kernel/bpf/syscall.c          |  5 +----
> >  2 files changed, 19 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
> > index f697647ceb54..5c89b7583cd2 100644
> > --- a/kernel/bpf/queue_stack_maps.c
> > +++ b/kernel/bpf/queue_stack_maps.c
> > @@ -262,11 +262,28 @@ static int queue_stack_map_get_next_key(struct bpf_map *map, void *key,
> >  	return -EINVAL;
> >  }
> >
> > +/* Called from syscall */
> > +static int queue_map_copy_value(struct bpf_map *map, void *key, void *value)
> > +{
> > +	(void)key;
>
> Alternatively, there is the __always_unused compiler attribute from
> include/linux/compiler_attributes.h that seems to be in wide use.
>

+1, use the attribute, it's much nicer imo.

> > +
> > +	return queue_map_peek_elem(map, value);
> > +}
> > +
> > +/* Called from syscall */
> > +static int stack_map_copy_value(struct bpf_map *map, void *key, void *value)
> > +{
> > +	(void)key;
> > +
> > +	return stack_map_peek_elem(map, value);
> > +}
> > +
> >  const struct bpf_map_ops queue_map_ops = {
> >  	.map_alloc_check = queue_stack_map_alloc_check,
> >  	.map_alloc = queue_stack_map_alloc,
> >  	.map_free = queue_stack_map_free,
> >  	.map_lookup_elem = queue_stack_map_lookup_elem,
> > +	.map_copy_value = queue_map_copy_value,
> >  	.map_update_elem = queue_stack_map_update_elem,
> >  	.map_delete_elem = queue_stack_map_delete_elem,
> >  	.map_push_elem = queue_stack_map_push_elem,
> > @@ -280,6 +297,7 @@ const struct bpf_map_ops stack_map_ops = {
> >  	.map_alloc = queue_stack_map_alloc,
> >  	.map_free = queue_stack_map_free,
> >  	.map_lookup_elem = queue_stack_map_lookup_elem,
> > +	.map_copy_value = stack_map_copy_value,
> >  	.map_update_elem = queue_stack_map_update_elem,
> >  	.map_delete_elem = queue_stack_map_delete_elem,
> >  	.map_push_elem = queue_stack_map_push_elem,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 6503824e81e9..20c6cdace275 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -218,10 +218,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
> >  		return bpf_map_offload_lookup_elem(map, key, value);
> >
> >  	bpf_disable_instrumentation();
> > -	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > -	    map->map_type == BPF_MAP_TYPE_STACK) {
> > -		err = map->ops->map_peek_elem(map, value);
> > -	} else if (map->ops->map_copy_value) {
> > +	if (map->ops->map_copy_value) {
> >  		err = map->ops->map_copy_value(map, key, value);
> >  	} else {
> >  		rcu_read_lock();
* [PATCH 3/5] bpf: convert sock map and hash to map_copy_value

From: Lorenz Bauer @ 2020-03-10 17:47 UTC
To: John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer, David S. Miller, Jakub Kicinski, Alexei Starovoitov
Cc: kernel-team, netdev, bpf, linux-kernel

Migrate sockmap and sockhash to use map_copy_value instead of
map_lookup_elem_sys_only.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
---
 net/core/sock_map.c | 48 ++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a7075b3b4489..03e04426cd21 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -344,19 +344,34 @@ static void *sock_map_lookup(struct bpf_map *map, void *key)
 	return __sock_map_lookup_elem(map, *(u32 *)key);
 }
 
-static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
+static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk,
+				 void *value)
+{
+	switch (map->value_size) {
+	case sizeof(u64):
+		sock_gen_cookie(sk);
+		*(u64 *)value = atomic64_read(&sk->sk_cookie);
+		return 0;
+
+	default:
+		return -ENOSPC;
+	}
+}
+
+static int sock_map_copy_value(struct bpf_map *map, void *key, void *value)
 {
 	struct sock *sk;
+	int ret = -ENOENT;
 
-	if (map->value_size != sizeof(u64))
-		return ERR_PTR(-ENOSPC);
-
+	rcu_read_lock();
 	sk = __sock_map_lookup_elem(map, *(u32 *)key);
 	if (!sk)
-		return ERR_PTR(-ENOENT);
+		goto out;
 
-	sock_gen_cookie(sk);
-	return &sk->sk_cookie;
+	ret = __sock_map_copy_value(map, sk, value);
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
@@ -636,7 +651,7 @@ const struct bpf_map_ops sock_map_ops = {
 	.map_alloc = sock_map_alloc,
 	.map_free = sock_map_free,
 	.map_get_next_key = sock_map_get_next_key,
-	.map_lookup_elem_sys_only = sock_map_lookup_sys,
+	.map_copy_value = sock_map_copy_value,
 	.map_update_elem = sock_map_update_elem,
 	.map_delete_elem = sock_map_delete_elem,
 	.map_lookup_elem = sock_map_lookup,
@@ -1030,19 +1045,20 @@ static void sock_hash_free(struct bpf_map *map)
 	kfree(htab);
 }
 
-static void *sock_hash_lookup_sys(struct bpf_map *map, void *key)
+static int sock_hash_copy_value(struct bpf_map *map, void *key, void *value)
 {
 	struct sock *sk;
+	int ret = -ENOENT;
 
-	if (map->value_size != sizeof(u64))
-		return ERR_PTR(-ENOSPC);
-
+	rcu_read_lock();
 	sk = __sock_hash_lookup_elem(map, key);
 	if (!sk)
-		return ERR_PTR(-ENOENT);
+		goto out;
 
-	sock_gen_cookie(sk);
-	return &sk->sk_cookie;
+	ret = __sock_map_copy_value(map, sk, value);
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 static void *sock_hash_lookup(struct bpf_map *map, void *key)
@@ -1139,7 +1155,7 @@ const struct bpf_map_ops sock_hash_ops = {
 	.map_update_elem = sock_hash_update_elem,
 	.map_delete_elem = sock_hash_delete_elem,
 	.map_lookup_elem = sock_hash_lookup,
-	.map_lookup_elem_sys_only = sock_hash_lookup_sys,
+	.map_copy_value = sock_hash_copy_value,
 	.map_release_uref = sock_hash_release_progs,
 	.map_check_btf = map_check_no_btf,
 };
--
2.20.1
* Re: [PATCH 3/5] bpf: convert sock map and hash to map_copy_value 2020-03-10 17:47 ` [PATCH 3/5] bpf: convert sock map and hash " Lorenz Bauer @ 2020-03-11 13:55 ` Jakub Sitnicki 0 siblings, 0 replies; 28+ messages in thread From: Jakub Sitnicki @ 2020-03-11 13:55 UTC (permalink / raw) To: Lorenz Bauer Cc: John Fastabend, Daniel Borkmann, David S. Miller, Jakub Kicinski, Alexei Starovoitov, kernel-team, netdev, bpf, linux-kernel On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > Migrate sockmap and sockhash to use map_copy_value instead of > map_lookup_elem_sys_only. > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > --- > net/core/sock_map.c | 48 ++++++++++++++++++++++++++++++--------------- > 1 file changed, 32 insertions(+), 16 deletions(-) > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index a7075b3b4489..03e04426cd21 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -344,19 +344,34 @@ static void *sock_map_lookup(struct bpf_map *map, void *key) > return __sock_map_lookup_elem(map, *(u32 *)key); > } > > -static void *sock_map_lookup_sys(struct bpf_map *map, void *key) > +static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk, > + void *value) > +{ > + switch (map->value_size) { > + case sizeof(u64): > + sock_gen_cookie(sk); > + *(u64 *)value = atomic64_read(&sk->sk_cookie); You could use sock_gen_cookie return value to merge the two above statements into one. sock_gen_cookie also reads out the value. 
> + return 0; > + > + default: > + return -ENOSPC; > + } > +} > + > +static int sock_map_copy_value(struct bpf_map *map, void *key, void *value) > { > struct sock *sk; > + int ret = -ENOENT; > > - if (map->value_size != sizeof(u64)) > - return ERR_PTR(-ENOSPC); > - > + rcu_read_lock(); > sk = __sock_map_lookup_elem(map, *(u32 *)key); > if (!sk) > - return ERR_PTR(-ENOENT); > + goto out; > > - sock_gen_cookie(sk); > - return &sk->sk_cookie; > + ret = __sock_map_copy_value(map, sk, value); > +out: > + rcu_read_unlock(); > + return ret; > } > > static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test, > @@ -636,7 +651,7 @@ const struct bpf_map_ops sock_map_ops = { > .map_alloc = sock_map_alloc, > .map_free = sock_map_free, > .map_get_next_key = sock_map_get_next_key, > - .map_lookup_elem_sys_only = sock_map_lookup_sys, > + .map_copy_value = sock_map_copy_value, > .map_update_elem = sock_map_update_elem, > .map_delete_elem = sock_map_delete_elem, > .map_lookup_elem = sock_map_lookup, > @@ -1030,19 +1045,20 @@ static void sock_hash_free(struct bpf_map *map) > kfree(htab); > } > > -static void *sock_hash_lookup_sys(struct bpf_map *map, void *key) > +static int sock_hash_copy_value(struct bpf_map *map, void *key, void *value) > { > struct sock *sk; > + int ret = -ENOENT; > > - if (map->value_size != sizeof(u64)) > - return ERR_PTR(-ENOSPC); > - > + rcu_read_lock(); > sk = __sock_hash_lookup_elem(map, key); > if (!sk) > - return ERR_PTR(-ENOENT); > + goto out; > > - sock_gen_cookie(sk); > - return &sk->sk_cookie; > + ret = __sock_map_copy_value(map, sk, value); > +out: > + rcu_read_unlock(); > + return ret; > } > > static void *sock_hash_lookup(struct bpf_map *map, void *key) > @@ -1139,7 +1155,7 @@ const struct bpf_map_ops sock_hash_ops = { > .map_update_elem = sock_hash_update_elem, > .map_delete_elem = sock_hash_delete_elem, > .map_lookup_elem = sock_hash_lookup, > - .map_lookup_elem_sys_only = sock_hash_lookup_sys, > + .map_copy_value = 
sock_hash_copy_value, > .map_release_uref = sock_hash_release_progs, > .map_check_btf = map_check_no_btf, > }; ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup 2020-03-10 17:47 [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Lorenz Bauer ` (2 preceding siblings ...) 2020-03-10 17:47 ` [PATCH 3/5] bpf: convert sock map and hash " Lorenz Bauer @ 2020-03-10 17:47 ` Lorenz Bauer 2020-03-11 23:27 ` John Fastabend 2020-03-17 15:18 ` Jakub Sitnicki 2020-03-10 17:47 ` [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds Lorenz Bauer ` (2 subsequent siblings) 6 siblings, 2 replies; 28+ messages in thread From: Lorenz Bauer @ 2020-03-10 17:47 UTC (permalink / raw) To: John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer, David S. Miller, Jakub Kicinski, Alexei Starovoitov Cc: kernel-team, netdev, bpf, linux-kernel Allow callers with CAP_NET_ADMIN to retrieve file descriptors from a sockmap and sockhash. O_CLOEXEC is enforced on all fds. Without this, it's difficult to resize or otherwise rebuild existing sockmap or sockhashes. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> --- net/core/sock_map.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 03e04426cd21..3228936aa31e 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -347,12 +347,31 @@ static void *sock_map_lookup(struct bpf_map *map, void *key) static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk, void *value) { + struct file *file; + int fd; + switch (map->value_size) { case sizeof(u64): sock_gen_cookie(sk); *(u64 *)value = atomic64_read(&sk->sk_cookie); return 0; + case sizeof(u32): + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + fd = get_unused_fd_flags(O_CLOEXEC); + if (unlikely(fd < 0)) + return fd; + + read_lock_bh(&sk->sk_callback_lock); + file = get_file(sk->sk_socket->file); + read_unlock_bh(&sk->sk_callback_lock); + + fd_install(fd, file); + *(u32 *)value = fd; + return 0; + default: return -ENOSPC; } -- 
2.20.1
* RE: [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup 2020-03-10 17:47 ` [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup Lorenz Bauer @ 2020-03-11 23:27 ` John Fastabend 2020-03-17 10:17 ` Lorenz Bauer 2020-03-17 15:18 ` Jakub Sitnicki 1 sibling, 1 reply; 28+ messages in thread From: John Fastabend @ 2020-03-11 23:27 UTC (permalink / raw) To: Lorenz Bauer, John Fastabend, Daniel Borkmann, Jakub Sitnicki, Lorenz Bauer, David S. Miller, Jakub Kicinski, Alexei Starovoitov Cc: kernel-team, netdev, bpf, linux-kernel Lorenz Bauer wrote: > Allow callers with CAP_NET_ADMIN to retrieve file descriptors from a > sockmap and sockhash. O_CLOEXEC is enforced on all fds. > > Without this, it's difficult to resize or otherwise rebuild existing > sockmap or sockhashes. > > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > --- > net/core/sock_map.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 03e04426cd21..3228936aa31e 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -347,12 +347,31 @@ static void *sock_map_lookup(struct bpf_map *map, void *key) > static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk, > void *value) > { > + struct file *file; > + int fd; > + > switch (map->value_size) { > case sizeof(u64): > sock_gen_cookie(sk); > *(u64 *)value = atomic64_read(&sk->sk_cookie); > return 0; > > + case sizeof(u32): > + if (!capable(CAP_NET_ADMIN)) > + return -EPERM; > + > + fd = get_unused_fd_flags(O_CLOEXEC); > + if (unlikely(fd < 0)) > + return fd; > + > + read_lock_bh(&sk->sk_callback_lock); > + file = get_file(sk->sk_socket->file); > + read_unlock_bh(&sk->sk_callback_lock); > + > + fd_install(fd, file); > + *(u32 *)value = fd; > + return 0; > + Hi Lorenz, Can you say something about what happens if the sk is deleted from the map or the sock is 
closed/unhashed ideally in the commit message so we have it for later reference. I guess because we are in an rcu block here the sk will be OK and psock reference will exist until after the rcu block at least because of call_rcu(). If the psock is destroyed from another path then the fd will still point at the sock. correct? Thanks. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup 2020-03-11 23:27 ` John Fastabend @ 2020-03-17 10:17 ` Lorenz Bauer 0 siblings, 0 replies; 28+ messages in thread From: Lorenz Bauer @ 2020-03-17 10:17 UTC (permalink / raw) To: John Fastabend Cc: Daniel Borkmann, Jakub Sitnicki, David S. Miller, Jakub Kicinski, Alexei Starovoitov, kernel-team, Networking, bpf, linux-kernel On Wed, 11 Mar 2020 at 23:27, John Fastabend <john.fastabend@gmail.com> wrote: > > Lorenz Bauer wrote: > > Allow callers with CAP_NET_ADMIN to retrieve file descriptors from a > > sockmap and sockhash. O_CLOEXEC is enforced on all fds. > > > > Without this, it's difficult to resize or otherwise rebuild existing > > sockmap or sockhashes. > > > > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > > --- > > net/core/sock_map.c | 19 +++++++++++++++++++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 03e04426cd21..3228936aa31e 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -347,12 +347,31 @@ static void *sock_map_lookup(struct bpf_map *map, void *key) > > static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk, > > void *value) > > { > > + struct file *file; > > + int fd; > > + > > switch (map->value_size) { > > case sizeof(u64): > > sock_gen_cookie(sk); > > *(u64 *)value = atomic64_read(&sk->sk_cookie); > > return 0; > > > > + case sizeof(u32): > > + if (!capable(CAP_NET_ADMIN)) > > + return -EPERM; > > + > > + fd = get_unused_fd_flags(O_CLOEXEC); > > + if (unlikely(fd < 0)) > > + return fd; > > + > > + read_lock_bh(&sk->sk_callback_lock); > > + file = get_file(sk->sk_socket->file); > > + read_unlock_bh(&sk->sk_callback_lock); > > + > > + fd_install(fd, file); > > + *(u32 *)value = fd; > > + return 0; > > + > > Hi Lorenz, Can you say something about what happens if the sk > is deleted from the map or the 
sock is closed/unhashed ideally > in the commit message so we have it for later reference. I guess > because we are in an rcu block here the sk will be OK and psock > reference will exist until after the rcu block at least because > of call_rcu(). If the psock is destroyed from another path then > the fd will still point at the sock. correct? This is how I understand it: * sk is protected by rcu_read_lock (as you point out) * sk->sk_callback_lock protects against sk->sk_socket being modified by sock_orphan, sock_graft, etc. via sk_set_socket * get_file increments the refcount on the file I'm not sure how the psock figures into this, maybe you can elaborate a little? -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup 2020-03-10 17:47 ` [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup Lorenz Bauer 2020-03-11 23:27 ` John Fastabend @ 2020-03-17 15:18 ` Jakub Sitnicki 2020-03-17 18:16 ` John Fastabend 1 sibling, 1 reply; 28+ messages in thread From: Jakub Sitnicki @ 2020-03-17 15:18 UTC (permalink / raw) To: Lorenz Bauer Cc: John Fastabend, Daniel Borkmann, David S. Miller, Jakub Kicinski, Alexei Starovoitov, kernel-team, netdev, bpf, linux-kernel On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > Allow callers with CAP_NET_ADMIN to retrieve file descriptors from a > sockmap and sockhash. O_CLOEXEC is enforced on all fds. > > Without this, it's difficult to resize or otherwise rebuild existing > sockmap or sockhashes. > > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > --- > net/core/sock_map.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 03e04426cd21..3228936aa31e 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -347,12 +347,31 @@ static void *sock_map_lookup(struct bpf_map *map, void *key) > static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk, > void *value) > { > + struct file *file; > + int fd; > + > switch (map->value_size) { > case sizeof(u64): > sock_gen_cookie(sk); > *(u64 *)value = atomic64_read(&sk->sk_cookie); > return 0; > > + case sizeof(u32): > + if (!capable(CAP_NET_ADMIN)) > + return -EPERM; > + > + fd = get_unused_fd_flags(O_CLOEXEC); > + if (unlikely(fd < 0)) > + return fd; > + > + read_lock_bh(&sk->sk_callback_lock); > + file = get_file(sk->sk_socket->file); I think this deserves a second look. We don't lock the sock, so what if tcp_close orphans it before we enter this critical section? Looks like sk->sk_socket might be NULL. 
I'd find a test that tries to trigger the race helpful, like: thread A: loop in lookup FD from map thread B: loop in insert FD into map, close FD > + read_unlock_bh(&sk->sk_callback_lock); > + > + fd_install(fd, file); > + *(u32 *)value = fd; > + return 0; > + > default: > return -ENOSPC; > } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup 2020-03-17 15:18 ` Jakub Sitnicki @ 2020-03-17 18:16 ` John Fastabend 0 siblings, 0 replies; 28+ messages in thread From: John Fastabend @ 2020-03-17 18:16 UTC (permalink / raw) To: Jakub Sitnicki, Lorenz Bauer Cc: John Fastabend, Daniel Borkmann, David S. Miller, Jakub Kicinski, Alexei Starovoitov, kernel-team, netdev, bpf, linux-kernel Jakub Sitnicki wrote: > On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > > Allow callers with CAP_NET_ADMIN to retrieve file descriptors from a > > sockmap and sockhash. O_CLOEXEC is enforced on all fds. > > > > Without this, it's difficult to resize or otherwise rebuild existing > > sockmap or sockhashes. > > > > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > > --- > > net/core/sock_map.c | 19 +++++++++++++++++++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 03e04426cd21..3228936aa31e 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -347,12 +347,31 @@ static void *sock_map_lookup(struct bpf_map *map, void *key) > > static int __sock_map_copy_value(struct bpf_map *map, struct sock *sk, > > void *value) > > { > > + struct file *file; > > + int fd; > > + > > switch (map->value_size) { > > case sizeof(u64): > > sock_gen_cookie(sk); > > *(u64 *)value = atomic64_read(&sk->sk_cookie); > > return 0; > > > > + case sizeof(u32): > > + if (!capable(CAP_NET_ADMIN)) > > + return -EPERM; > > + > > + fd = get_unused_fd_flags(O_CLOEXEC); > > + if (unlikely(fd < 0)) > > + return fd; > > + > > + read_lock_bh(&sk->sk_callback_lock); > > + file = get_file(sk->sk_socket->file); > > I think this deserves a second look. > > We don't lock the sock, so what if tcp_close orphans it before we enter > this critical section? Looks like sk->sk_socket might be NULL. 
> > I'd find a test that tries to trigger the race helpful, like: > > thread A: loop in lookup FD from map > thread B: loop in insert FD into map, close FD Agreed, this was essentially my question above as well. When the psock is created we call sock_hold() and will only do a sock_put() after an rcu grace period when its removed. So at least if you have the sock here it should have a sk_refcnt. (Note the user data is set to NULL so if you do reference psock you need to check its non-null.) Is that enough to ensure sk_socket? Seems not to me, tcp_close for example will still happen and call sock_orphan(sk) based on my admittddly quick look. Further, even if you do check sk->sk_socket is non-null what does it mean to return a file with a socket that is closed, deleted from the sock_map and psock removed? At this point is it just a dangling reference? Still a bit confused as well what would or should happen when the sock is closed after you have the file reference? I could probably dig up what exactly would happen but I think we need it in the commiit message so we understand it. I also didn't dig up the details here but if the receiver of the fd crashes or otherwise disappears this hopefully all get cleaned up? > > > + read_unlock_bh(&sk->sk_callback_lock); > > + > > + fd_install(fd, file); > > + *(u32 *)value = fd; > > + return 0; > > + > > default: > > return -ENOSPC; > > } ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds 2020-03-10 17:47 [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Lorenz Bauer ` (3 preceding siblings ...) 2020-03-10 17:47 ` [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup Lorenz Bauer @ 2020-03-10 17:47 ` Lorenz Bauer 2020-03-11 13:52 ` Jakub Sitnicki 2020-03-11 13:44 ` [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Jakub Sitnicki 2020-03-12 1:58 ` Alexei Starovoitov 6 siblings, 1 reply; 28+ messages in thread From: Lorenz Bauer @ 2020-03-10 17:47 UTC (permalink / raw) To: Shuah Khan, Alexei Starovoitov, Daniel Borkmann Cc: kernel-team, Lorenz Bauer, linux-kselftest, netdev, bpf, linux-kernel Make sure that looking up an element from the map succeeds, and that the fd is valid by using it an fcntl call. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> --- .../selftests/bpf/prog_tests/sockmap_listen.c | 26 ++++++++++++++----- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c index 52aa468bdccd..929e1e77ecc6 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c @@ -453,7 +453,7 @@ static void test_lookup_after_delete(int family, int sotype, int mapfd) xclose(s); } -static void test_lookup_32_bit_value(int family, int sotype, int mapfd) +static void test_lookup_fd(int family, int sotype, int mapfd) { u32 key, value32; int err, s; @@ -466,7 +466,7 @@ static void test_lookup_32_bit_value(int family, int sotype, int mapfd) sizeof(value32), 1, 0); if (mapfd < 0) { FAIL_ERRNO("map_create"); - goto close; + goto close_sock; } key = 0; @@ -475,11 +475,25 @@ static void test_lookup_32_bit_value(int family, int sotype, int mapfd) errno = 0; err = bpf_map_lookup_elem(mapfd, &key, &value32); - if (!err || errno != ENOSPC) - FAIL_ERRNO("map_lookup: expected 
ENOSPC"); + if (err) { + FAIL_ERRNO("map_lookup"); + goto close_map; + } + if ((int)value32 == s) { + FAIL("return value is identical"); + goto close; + } + + err = fcntl(value32, F_GETFD); + if (err == -1) + FAIL_ERRNO("fcntl"); + +close: + xclose(value32); +close_map: xclose(mapfd); -close: +close_sock: xclose(s); } @@ -1456,7 +1470,7 @@ static void test_ops(struct test_sockmap_listen *skel, struct bpf_map *map, /* lookup */ TEST(test_lookup_after_insert), TEST(test_lookup_after_delete), - TEST(test_lookup_32_bit_value), + TEST(test_lookup_fd), /* update */ TEST(test_update_existing), /* races with insert/delete */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds 2020-03-10 17:47 ` [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds Lorenz Bauer @ 2020-03-11 13:52 ` Jakub Sitnicki 2020-03-11 17:24 ` Lorenz Bauer 0 siblings, 1 reply; 28+ messages in thread From: Jakub Sitnicki @ 2020-03-11 13:52 UTC (permalink / raw) To: Lorenz Bauer Cc: Shuah Khan, Alexei Starovoitov, Daniel Borkmann, kernel-team, linux-kselftest, netdev, bpf, linux-kernel On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > Make sure that looking up an element from the map succeeds, > and that the fd is valid by using it an fcntl call. > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > --- > .../selftests/bpf/prog_tests/sockmap_listen.c | 26 ++++++++++++++----- > 1 file changed, 20 insertions(+), 6 deletions(-) > > diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > index 52aa468bdccd..929e1e77ecc6 100644 > --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > @@ -453,7 +453,7 @@ static void test_lookup_after_delete(int family, int sotype, int mapfd) > xclose(s); > } > > -static void test_lookup_32_bit_value(int family, int sotype, int mapfd) > +static void test_lookup_fd(int family, int sotype, int mapfd) > { > u32 key, value32; > int err, s; > @@ -466,7 +466,7 @@ static void test_lookup_32_bit_value(int family, int sotype, int mapfd) > sizeof(value32), 1, 0); > if (mapfd < 0) { > FAIL_ERRNO("map_create"); > - goto close; > + goto close_sock; > } > > key = 0; > @@ -475,11 +475,25 @@ static void test_lookup_32_bit_value(int family, int sotype, int mapfd) > > errno = 0; > err = bpf_map_lookup_elem(mapfd, &key, &value32); > - if (!err || errno != ENOSPC) > - FAIL_ERRNO("map_lookup: expected ENOSPC"); > + if (err) { > + FAIL_ERRNO("map_lookup"); > + goto close_map; > + } > > + if ((int)value32 == s) { > + FAIL("return value is 
identical"); > + goto close; > + } > + > + err = fcntl(value32, F_GETFD); > + if (err == -1) > + FAIL_ERRNO("fcntl"); I would call getsockopt()/getsockname() to assert that the FD lookup succeeded. We want to know not only that it's an FD (-EBADFD case), but also that it's associated with a socket (-ENOTSOCK). We can go even further, and compare socket cookies to ensure we got an FD for the expected socket. Also, I'm wondering if we could keep the -ENOSPC case test-covered by temporarily dropping NET_ADMIN capability. > + > +close: > + xclose(value32); > +close_map: > xclose(mapfd); > -close: > +close_sock: > xclose(s); > } > > @@ -1456,7 +1470,7 @@ static void test_ops(struct test_sockmap_listen *skel, struct bpf_map *map, > /* lookup */ > TEST(test_lookup_after_insert), > TEST(test_lookup_after_delete), > - TEST(test_lookup_32_bit_value), > + TEST(test_lookup_fd), > /* update */ > TEST(test_update_existing), > /* races with insert/delete */ ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds 2020-03-11 13:52 ` Jakub Sitnicki @ 2020-03-11 17:24 ` Lorenz Bauer 0 siblings, 0 replies; 28+ messages in thread From: Lorenz Bauer @ 2020-03-11 17:24 UTC (permalink / raw) To: Jakub Sitnicki Cc: Shuah Khan, Alexei Starovoitov, Daniel Borkmann, kernel-team, linux-kselftest, Networking, bpf, linux-kernel On Wed, 11 Mar 2020 at 13:52, Jakub Sitnicki <jakub@cloudflare.com> wrote: > > On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > > Make sure that looking up an element from the map succeeds, > > and that the fd is valid by using it an fcntl call. > > > > Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> > > --- > > .../selftests/bpf/prog_tests/sockmap_listen.c | 26 ++++++++++++++----- > > 1 file changed, 20 insertions(+), 6 deletions(-) > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > > index 52aa468bdccd..929e1e77ecc6 100644 > > --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > > +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c > > @@ -453,7 +453,7 @@ static void test_lookup_after_delete(int family, int sotype, int mapfd) > > xclose(s); > > } > > > > -static void test_lookup_32_bit_value(int family, int sotype, int mapfd) > > +static void test_lookup_fd(int family, int sotype, int mapfd) > > { > > u32 key, value32; > > int err, s; > > @@ -466,7 +466,7 @@ static void test_lookup_32_bit_value(int family, int sotype, int mapfd) > > sizeof(value32), 1, 0); > > if (mapfd < 0) { > > FAIL_ERRNO("map_create"); > > - goto close; > > + goto close_sock; > > } > > > > key = 0; > > @@ -475,11 +475,25 @@ static void test_lookup_32_bit_value(int family, int sotype, int mapfd) > > > > errno = 0; > > err = bpf_map_lookup_elem(mapfd, &key, &value32); > > - if (!err || errno != ENOSPC) > > - FAIL_ERRNO("map_lookup: expected ENOSPC"); > > + if (err) { > > + FAIL_ERRNO("map_lookup"); > > + goto 
close_map; > > + } > > > > + if ((int)value32 == s) { > > + FAIL("return value is identical"); > > + goto close; > > + } > > + > > + err = fcntl(value32, F_GETFD); > > + if (err == -1) > > + FAIL_ERRNO("fcntl"); > > I would call getsockopt()/getsockname() to assert that the FD lookup > succeeded. We want to know not only that it's an FD (-EBADFD case), but > also that it's associated with a socket (-ENOTSOCK). > > We can go even further, and compare socket cookies to ensure we got an > FD for the expected socket. Good idea, thanks! > Also, I'm wondering if we could keep the -ENOSPC case test-covered by > temporarily dropping NET_ADMIN capability. You mean EPERM? ENOSPC isn't reachable, since the map can only be created with a map_value of 4 or 8. > > > + > > +close: > > + xclose(value32); > > +close_map: > > xclose(mapfd); > > -close: > > +close_sock: > > xclose(s); > > } > > > > @@ -1456,7 +1470,7 @@ static void test_ops(struct test_sockmap_listen *skel, struct bpf_map *map, > > /* lookup */ > > TEST(test_lookup_after_insert), > > TEST(test_lookup_after_delete), > > - TEST(test_lookup_32_bit_value), > > + TEST(test_lookup_fd), > > /* update */ > > TEST(test_update_existing), > > /* races with insert/delete */ -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-10 17:47 [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Lorenz Bauer ` (4 preceding siblings ...) 2020-03-10 17:47 ` [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds Lorenz Bauer @ 2020-03-11 13:44 ` Jakub Sitnicki 2020-03-11 22:40 ` John Fastabend 2020-03-12 1:58 ` Alexei Starovoitov 6 siblings, 1 reply; 28+ messages in thread From: Jakub Sitnicki @ 2020-03-11 13:44 UTC (permalink / raw) To: Daniel Borkmann, John Fastabend Cc: Alexei Starovoitov, Lorenz Bauer, kernel-team, netdev, bpf On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > We want to use sockhash and sockmap to build the control plane for > our upcoming BPF socket dispatch work. We realised that it's > difficult to resize or otherwise rebuild these maps if needed, > because there is no way to get at their contents. This patch set > allows a privileged user to retrieve fds from these map types, > which removes this obstacle. Since it takes just a few lines of code to get an FD for a sock: fd = get_unused_fd_flags(O_CLOEXEC); if (unlikely(fd < 0)) return fd; fd_install(fd, get_file(sk->sk_socket->file)); ... I can't help but wonder where's the catch? IOW, why wasn't this needed so far? How does Cilium avoid resizing & rebuilding sockmaps? Just asking out of curiosity. [...] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-11 13:44 ` [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Jakub Sitnicki @ 2020-03-11 22:40 ` John Fastabend 0 siblings, 0 replies; 28+ messages in thread From: John Fastabend @ 2020-03-11 22:40 UTC (permalink / raw) To: Jakub Sitnicki, Daniel Borkmann, John Fastabend Cc: Alexei Starovoitov, Lorenz Bauer, kernel-team, netdev, bpf Jakub Sitnicki wrote: > On Tue, Mar 10, 2020 at 06:47 PM CET, Lorenz Bauer wrote: > > We want to use sockhash and sockmap to build the control plane for > > our upcoming BPF socket dispatch work. We realised that it's > > difficult to resize or otherwise rebuild these maps if needed, > > because there is no way to get at their contents. This patch set > > allows a privileged user to retrieve fds from these map types, > > which removes this obstacle. > > Since it takes just a few lines of code to get an FD for a sock: > > fd = get_unused_fd_flags(O_CLOEXEC); > if (unlikely(fd < 0)) > return fd; > fd_install(fd, get_file(sk->sk_socket->file)); > > ... I can't help but wonder where's the catch? > > IOW, why wasn't this needed so far? > How does Cilium avoid resizing & rebuilding sockmaps? I build a map at init time and pin it for the lifetime of the daemon. If we overrun the sockmap we can always fall back to the normal case so there has never been a reason to resize. I guess being able to change the map size at runtime would be a nice to have but we don't do this with any other maps, e.g. connection tracking, load balancing, etc. We expect good-sizing upfront. @Lorenz, Would it be possible to provide some more details where a resize would be used? I guess if the map is nearly full you could rebuild a bigger one and migrate? One thing I explored at one point is to just create a new map and use multiple maps in the datapath but that required extra lookups and for hashing might not be ideal. > > Just asking out of curiosity. > > [...] 
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-10 17:47 [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Lorenz Bauer ` (5 preceding siblings ...) 2020-03-11 13:44 ` [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Jakub Sitnicki @ 2020-03-12 1:58 ` Alexei Starovoitov 2020-03-12 9:16 ` Lorenz Bauer 6 siblings, 1 reply; 28+ messages in thread From: Alexei Starovoitov @ 2020-03-12 1:58 UTC (permalink / raw) To: Lorenz Bauer Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, netdev, bpf On Tue, Mar 10, 2020 at 05:47:06PM +0000, Lorenz Bauer wrote: > We want to use sockhash and sockmap to build the control plane for > our upcoming BPF socket dispatch work. We realised that it's > difficult to resize or otherwise rebuild these maps if needed, > because there is no way to get at their contents. This patch set > allows a privileged user to retrieve fds from these map types, > which removes this obstacle. > > The approach here is different than that of program arrays and > nested maps, which return an ID that can be turned into an fd > using the BPF_*_GET_FD_BY_ID syscall. Sockets have IDs in the > form of cookies, however there seems to be no way to go from > a socket cookie to struct socket or struct file. Hence we > return an fd directly. we do store the socket FD into a sockmap, but returning new FD to that socket feels weird. The user space suppose to hold those sockets. If it was bpf prog that stored a socket then what does user space want to do with that foreign socket? It likely belongs to some other process. Stealing it from other process doesn't feel right. Sounds like the use case is to take sockets one by one from one map, allocate another map and store them there? The whole process has plenty of races. I think it's better to tackle the problem from resize perspective. imo making it something like sk_local_storage (which is already resizable pseudo map of sockets) is a better way forward. 
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-12 1:58 ` Alexei Starovoitov @ 2020-03-12 9:16 ` Lorenz Bauer 2020-03-12 17:58 ` Alexei Starovoitov 0 siblings, 1 reply; 28+ messages in thread From: Lorenz Bauer @ 2020-03-12 9:16 UTC (permalink / raw) To: Alexei Starovoitov Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Networking, bpf On Thu, 12 Mar 2020 at 01:58, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > we do store the socket FD into a sockmap, but returning new FD to that socket > feels weird. The user space suppose to hold those sockets. If it was bpf prog > that stored a socket then what does user space want to do with that foreign > socket? It likely belongs to some other process. Stealing it from other process > doesn't feel right. For our BPF socket dispatch control plane this is true by design: all sockets belong to another process. The privileged user space is the steward of these, and needs to make sure traffic is steered to them. I agree that stealing them is weird, but after all this is CAP_NET_ADMIN only. pidfd_getfd allows you to really steal an fd from another process, so that cat is out of the bag ;) Marek wrote a PoC control plane: https://github.com/majek/inet-tool It is a CLI tool and not a service, so it can't hold on to any sockets. You can argue that we should turn it into a service, but that leads to another problem: there is no way of recovering these fds if the service crashes for some reason. The only solution would be to restart all services, which in our set up is the same as rebooting a machine really. > Sounds like the use case is to take sockets one by one from one map, allocate > another map and store them there? The whole process has plenty of races. It doesn't have to race. Our user space can do the appropriate locking to ensure that operations are atomic wrt. 
dispatching to sockets: - lock - read sockets from sockmap - write sockets into new sockmap - create new instance of BPF socket dispatch program - attach BPF socket dispatch program - remove old map - unlock > I think it's better to tackle the problem from resize perspective. imo making it > something like sk_local_storage (which is already resizable pseudo map of > sockets) is a better way forward. Resizing is only one aspect. We may also need to shuffle services around, think "defragmentation", and I think there will be other cases as we gain more experience with the control plane. Being able to recover fds from the sockmap will make it more resilient. Adding a special API for every one of these cases seems cumbersome. Lorenz -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-12 9:16 ` Lorenz Bauer @ 2020-03-12 17:58 ` Alexei Starovoitov 2020-03-12 19:32 ` John Fastabend 2020-03-13 10:48 ` Lorenz Bauer 0 siblings, 2 replies; 28+ messages in thread From: Alexei Starovoitov @ 2020-03-12 17:58 UTC (permalink / raw) To: Lorenz Bauer Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Networking, bpf On Thu, Mar 12, 2020 at 09:16:34AM +0000, Lorenz Bauer wrote: > On Thu, 12 Mar 2020 at 01:58, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > we do store the socket FD into a sockmap, but returning new FD to that socket > > feels weird. The user space suppose to hold those sockets. If it was bpf prog > > that stored a socket then what does user space want to do with that foreign > > socket? It likely belongs to some other process. Stealing it from other process > > doesn't feel right. > > For our BPF socket dispatch control plane this is true by design: all sockets > belong to another process. The privileged user space is the steward of these, > and needs to make sure traffic is steered to them. I agree that stealing them is > weird, but after all this is CAP_NET_ADMIN only. pidfd_getfd allows you to > really steal an fd from another process, so that cat is out of the bag ;) but there it goes through ptrace checks and lsm hooks, whereas here a similar security model cannot be enforced. A bpf prog can put any socket into a sockmap, and from the bpf_lookup_elem side there is no way to figure out the owner task of the socket to do ptrace checks. Just doing it all under CAP_NET_ADMIN is not a great security answer. > Marek wrote a PoC control plane: https://github.com/majek/inet-tool > It is a CLI tool and not a service, so it can't hold on to any sockets. > > You can argue that we should turn it into a service, but that leads to another > problem: there is no way of recovering these fds if the service crashes for > some reason.
The only solution would be to restart all services, which in > our set up is the same as rebooting a machine really. > > > Sounds like the use case is to take sockets one by one from one map, allocate > > another map and store them there? The whole process has plenty of races. > > It doesn't have to race. Our user space can do the appropriate locking to ensure > that operations are atomic wrt. dispatching to sockets: > > - lock > - read sockets from sockmap > - write sockets into new sockmap but the bpf side may still need to insert them into the old one. Are you going to solve it with a flag for the prog to stop doing its job? Or will the prog know that it needs to put sockets into the second map now? It's really the same problem as with classic so_reuseport, which was solved with BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. > > I think it's better to tackle the problem from resize perspective. imo making it > > something like sk_local_storage (which is already resizable pseudo map of > > sockets) is a better way forward. > > Resizing is only one aspect. We may also need to shuffle services around, > think "defragmentation", and I think there will be other cases as we gain more > experience with the control plane. Being able to recover fds from the sockmap > will make it more resilient. Adding a special API for every one of these cases > seems cumbersome. I think sockmap needs a redesign. Consider that today all sockets can be in any number of sk_local_storage pseudo maps. They are 'defragmented' and resizable. I think plugging socket redirect to use sk_local_storage-like infra is the answer. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-12 17:58 ` Alexei Starovoitov @ 2020-03-12 19:32 ` John Fastabend 2020-03-13 11:03 ` Lorenz Bauer 2020-03-13 10:48 ` Lorenz Bauer 1 sibling, 1 reply; 28+ messages in thread From: John Fastabend @ 2020-03-12 19:32 UTC (permalink / raw) To: Alexei Starovoitov, Lorenz Bauer Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Networking, bpf Alexei Starovoitov wrote: > On Thu, Mar 12, 2020 at 09:16:34AM +0000, Lorenz Bauer wrote: > > On Thu, 12 Mar 2020 at 01:58, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > we do store the socket FD into a sockmap, but returning new FD to that socket > > > feels weird. The user space suppose to hold those sockets. If it was bpf prog > > > that stored a socket then what does user space want to do with that foreign > > > socket? It likely belongs to some other process. Stealing it from other process > > > doesn't feel right. > > > > For our BPF socket dispatch control plane this is true by design: all sockets > > belong to another process. The privileged user space is the steward of these, > > and needs to make sure traffic is steered to them. I agree that stealing them is > > weird, but after all this is CAP_NET_ADMIN only. pidfd_getfd allows you to > > really steal an fd from another process, so that cat is out of the bag ;) > > but there it goes through ptrace checks and lsm hoooks, whereas here similar > security model cannot be enforced. bpf prog can put any socket into sockmap and > from bpf_lookup_elem side there is no way to figure out the owner task of the > socket to do ptrace checks. Just doing it all under CAP_NET_ADMIN is not a > great security answer. > > > Marek wrote a PoC control plane: https://github.com/majek/inet-tool > > It is a CLI tool and not a service, so it can't hold on to any sockets. 
> > > > You can argue that we should turn it into a service, but that leads to another > > problem: there is no way of recovering these fds if the service crashes for > > some reason. The only solution would be to restart all services, which in > > our set up is the same as rebooting a machine really. > > > > > Sounds like the use case is to take sockets one by one from one map, allocate > > > another map and store them there? The whole process has plenty of races. > > > > It doesn't have to race. Our user space can do the appropriate locking to ensure > > that operations are atomic wrt. dispatching to sockets: > > > > - lock > > - read sockets from sockmap > > - write sockets into new sockmap > > but bpf side may still need to insert them into old. > you gonna solve it with a flag for the prog to stop doing its job? > Or the prog will know that it needs to put sockets into second map now? > It's really the same problem as with classic so_reuseport > which was solved with BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. > > > > I think it's better to tackle the problem from resize perspective. imo making it > > > something like sk_local_storage (which is already resizable pseudo map of > > > sockets) is a better way forward. > > > > Resizing is only one aspect. We may also need to shuffle services around, > > think "defragmentation", and I think there will be other cases as we gain more > > experience with the control plane. Being able to recover fds from the sockmap > > will make it more resilient. Adding a special API for every one of these cases > > seems cumbersome. > > I think sockmap needs a redesign. Consider that today all sockets can be in any > number of sk_local_storage pseudo maps. They are 'defragmented' and resizable. > I think plugging socket redirect to use sk_local_storage-like infra is the > answer. socket redirect today can use any number of maps and redirect to any sock in any map. 
There is no restriction on only being able to redirect to socks in the same map. Further, the same sock can be in multiple maps or even the same map in multiple slots. I think it's fairly similar to sk local storage in this way. The restriction that the maps cannot grow/shrink is perhaps limiting a bit. I can see how resizing might be useful. In my original load balancer case a single application owned all the socks so there was no need to ever pull them back out of the map. We "knew" where they were. I think resize ops could be added without too much redesign. Or a CREATE flag could be used to add it as a new entry if needed. At some point I guess someone will request it as a feature for Cilium for example. OTOH I'm not sure off-hand how to use a dynamically sized table for load balancing. I should know the size because I want to say something about the hash distribution and if the size is changing do I still know this? I really haven't considered it much. As an aside the redirect helper could work with anything of sock type not just socks from maps. Now that we have BTF infra to do it we could just type check that we have a sock and do the redirect regardless of whether the sock is in a map or not. The map really provides two functions, first a way to attach programs to the socks and second a stable array to hash over if needed. Rather than expose the fds to user space, would a map copy API be useful? I could imagine some useful cases where copy might be used: map_copy(map *A, map *B, map_key *key) would need to sort out what to do with key/value size changes. But I can imagine for upgrades this might be useful. Another option I've been considering is a garbage collection thread triggered at regular intervals. This BPF program could do the copy from map to map in kernel space, never exposing fds out of the kernel. Thanks. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-12 19:32 ` John Fastabend @ 2020-03-13 11:03 ` Lorenz Bauer 0 siblings, 0 replies; 28+ messages in thread From: Lorenz Bauer @ 2020-03-13 11:03 UTC (permalink / raw) To: John Fastabend Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann, kernel-team, Networking, bpf On Thu, 12 Mar 2020 at 19:32, John Fastabend <john.fastabend@gmail.com> wrote: > > The restriction that the maps can not grow/shrink is perhaps limiting a > bit. I can see how resizing might be useful. In my original load balancer > case a single application owned all the socks so there was no need to > ever pull them back out of the map. We "knew" where they were. I think > resize ops could be added without to much redesign. Or a CREATE flag could > be used to add it as a new entry if needed. At some point I guess someone > will request it as a feature for Cilium for example. OTOH I'm not sure > off-hand how to use a dynamically sized table for load balancing. I > should know the size because I want to say something about the hash > distribution and if the size is changing do I still know this? I really > haven't considered it much. I agree, magically changing the size of a sockmap isn't useful. We don't want to do load-balancing, but still need stable indices into the map: - derive some sort of ID from the skb - look up the ID in the sockmap - return the socket as the result of the program If the ID changes we need to coordinate this with the eBPF, or at least update some other map in a race-free way. [...] > > Rather than expose the fd's to user space would a map copy api be > useful? I could imagine some useful cases where copy might be used > > map_copy(map *A, map *B, map_key *key) > > would need to sort out what to do with key/value size changes. But > I can imagine for upgrades this might be useful. I guess that would be a way to approach it. 
I'd probably find a primitive to copy a whole map atomically more useful, but haven't really thought about it much. > > Another option I've been considering the need for a garbage collection > thread trigger at regular intervals. This BPF program could do the > copy from map to map in kernel space never exposing fds out of kernel So, have a dummy prog that has both maps, and copies from old to new. Invoke that from user space via BPF_PROG_TEST_RUN? I guess that would work, but falls back to being "protected" by CAP_SYS_ADMIN. It's just more cumbersome than doing it in user space! Lorenz -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-12 17:58 ` Alexei Starovoitov 2020-03-12 19:32 ` John Fastabend @ 2020-03-13 10:48 ` Lorenz Bauer 2020-03-14 2:58 ` Alexei Starovoitov 1 sibling, 1 reply; 28+ messages in thread From: Lorenz Bauer @ 2020-03-13 10:48 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, kernel-team, Networking, bpf, Jakub Sitnicki On Thu, 12 Mar 2020 at 17:58, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > but there it goes through ptrace checks and lsm hoooks, whereas here similar > security model cannot be enforced. bpf prog can put any socket into sockmap and > from bpf_lookup_elem side there is no way to figure out the owner task of the > socket to do ptrace checks. Just doing it all under CAP_NET_ADMIN is not a > great security answer. Reading between the lines, you're concerned about something like a sock ops program "stealing" the socket and putting it in a sockmap, to be retrieved by an attacker later on? How is that different than BPF_MAP_GET_FD_BY_ID, except that it's CAP_SYS_ADMIN? > but bpf side may still need to insert them into old. > you gonna solve it with a flag for the prog to stop doing its job? > Or the prog will know that it needs to put sockets into second map now? > It's really the same problem as with classic so_reuseport > which was solved with BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. We don't modify the sockmap from eBPF: receive a packet -> lookup sk in sockmap based on packet -> redirect Why do you think we'll have to insert sockets from BPF? > I think sockmap needs a redesign. Consider that today all sockets can be in any > number of sk_local_storage pseudo maps. They are 'defragmented' and resizable. > I think plugging socket redirect to use sk_local_storage-like infra is the > answer. Maybe Jakub can speak more to this but I don't see how this solves our problem. 
We need a way to get at struct sk * from an eBPF program that runs on an skb context, to make BPF socket dispatch feasible. How would we use sk_local_storage if we don't have a sk? -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-13 10:48 ` Lorenz Bauer @ 2020-03-14 2:58 ` Alexei Starovoitov 2020-03-17 9:55 ` Lorenz Bauer 0 siblings, 1 reply; 28+ messages in thread From: Alexei Starovoitov @ 2020-03-14 2:58 UTC (permalink / raw) To: Lorenz Bauer Cc: Daniel Borkmann, kernel-team, Networking, bpf, Jakub Sitnicki On Fri, Mar 13, 2020 at 10:48:57AM +0000, Lorenz Bauer wrote: > On Thu, 12 Mar 2020 at 17:58, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > but there it goes through ptrace checks and lsm hoooks, whereas here similar > > security model cannot be enforced. bpf prog can put any socket into sockmap and > > from bpf_lookup_elem side there is no way to figure out the owner task of the > > socket to do ptrace checks. Just doing it all under CAP_NET_ADMIN is not a > > great security answer. > > Reading between the lines, you're concerned about something like a sock ops > program "stealing" the socket and putting it in a sockmap, to be retrieved by an > attacker later on? > > How is that different than BPF_MAP_GET_FD_BY_ID, except that it's CAP_SYS_ADMIN? It's different because it's crossing domains. FD_BY_ID returns FD for bpf objects. Whereas here you're proposing bpf lookup to return FD from different domain. If lookup was returning a socket cookie and separate api on the networking side would convert cookie into FD I would be fine with that. > > but bpf side may still need to insert them into old. > > you gonna solve it with a flag for the prog to stop doing its job? > > Or the prog will know that it needs to put sockets into second map now? > > It's really the same problem as with classic so_reuseport > > which was solved with BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. > > We don't modify the sockmap from eBPF: > receive a packet -> lookup sk in sockmap based on packet -> redirect > > Why do you think we'll have to insert sockets from BPF? sure, but sockmap allows socket insertion. 
Hence it's part of the considerations. > > > I think sockmap needs a redesign. Consider that today all sockets can be in any > > number of sk_local_storage pseudo maps. They are 'defragmented' and resizable. > > I think plugging socket redirect to use sk_local_storage-like infra is the > > answer. > > Maybe Jakub can speak more to this but I don't see how this solves our problem. > We need a way to get at struct sk * from an eBPF program that runs on > an skb context, > to make BPF socket dispatch feasible. How would we use > sk_local_storage if we don't > have a sk? I'm not following. There is skb->sk. Why do you need to look up sk? Because your hook is before demux and skb->sk is not set? Then move your hook to after? I think we're arguing in circles because in this thread I haven't seen the explanation of the problem you're trying to solve. We argued about your proposed solution and got stuck. Can we restart from the beginning with all details? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-14 2:58 ` Alexei Starovoitov @ 2020-03-17 9:55 ` Lorenz Bauer 2020-03-17 19:05 ` John Fastabend 0 siblings, 1 reply; 28+ messages in thread From: Lorenz Bauer @ 2020-03-17 9:55 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, kernel-team, Networking, bpf, Jakub Sitnicki On Sat, 14 Mar 2020 at 02:58, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > I'm not following. There is skb->sk. Why do you need to lookup sk ? Because > your hook is before demux and skb->sk is not set? Then move your hook to after? > > I think we're arguing in circles because in this thread I haven't seen the > explanation of the problem you're trying to solve. We argued about your > proposed solution and got stuck. Can we restart from the beginning with all > details? Yes, that's a good idea. I mentioned this in passing in my cover letter, but should have provided more context. Jakub is working on a patch series to add a BPF hook to socket dispatch [1] aka the inet_lookup function. The core idea is to control skb->sk via a BPF program. Hence, we can't use skb->sk. Introducing this hook poses another problem: we need to get the struct sk from somewhere. The canonical way in BPF is to use the lookup_sk helpers. Of course that doesn't work, since our hook would invoke itself. So we need a data structure that can hold sockets, to be used by programs attached on the new hook. Jakub's RFC patch set used REUSEPORT_SOCKARRAY for this. During LPC '19 we got feedback that sockmap is probably the better choice. As a result, Jakub started working on extending sockmap TCP support and after a while I joined to add UDP. Now, we are looking at what our control plane could look like. 
Based on the inet-tool work that Marek Majkowski has done [2], we currently have the following set up: * An LPM map that goes from IP prefix and port to an index in a sockmap * A sockmap that holds sockets * A BPF program that performs the business logic inet-tool is used to update the two maps to add and remove mappings on the fly. Essentially, services donate their sockets either via fork+exec or SCM_RIGHTS on a Unix socket. Once we have inserted a socket in the sockmap, it's not possible to retrieve it again. This makes it impossible to change the position of a socket in the map, to resize the map, etc. with our current design. One way to work around this is to add a persistent component to our control plane: a process can hold on to the sockets and re-build the map when necessary. The downsides are that upgrading the service is non-trivial (since we need to pass the socket fds) and that a failure of this service is catastrophic. Once it happens, we probably have to reboot the machine to get it into a workable state again. We'd like to avoid a persistent service if we can. By allowing to look up fds from the sockmap, we could make this part of our control plane more robust. 1: https://www.youtube.com/watch?v=qRDoUpqvYjY 2: https://github.com/majek/inet-tool I hope this explanation helps, sorry for not being more thorough in the original cover letter! Lorenz -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-17 9:55 ` Lorenz Bauer @ 2020-03-17 19:05 ` John Fastabend 2020-03-20 15:12 ` Lorenz Bauer 0 siblings, 1 reply; 28+ messages in thread From: John Fastabend @ 2020-03-17 19:05 UTC (permalink / raw) To: Lorenz Bauer, Alexei Starovoitov Cc: Daniel Borkmann, kernel-team, Networking, bpf, Jakub Sitnicki Lorenz Bauer wrote: > On Sat, 14 Mar 2020 at 02:58, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > I'm not following. There is skb->sk. Why do you need to lookup sk ? Because > > your hook is before demux and skb->sk is not set? Then move your hook to after? > > > > I think we're arguing in circles because in this thread I haven't seen the > > explanation of the problem you're trying to solve. We argued about your > > proposed solution and got stuck. Can we restart from the beginning with all > > details? > > Yes, that's a good idea. I mentioned this in passing in my cover > letter, but should > have provided more context. > > Jakub is working on a patch series to add a BPF hook to socket dispatch [1] aka > the inet_lookup function. The core idea is to control skb->sk via a BPF program. > Hence, we can't use skb->sk. > > Introducing this hook poses another problem: we need to get the struct sk from > somewhere. The canonical way in BPF is to use the lookup_sk helpers. Of course > that doesn't work, since our hook would invoke itself. So we need a > data structure > that can hold sockets, to be used by programs attached on the new hook. > > Jakub's RFC patch set used REUSEPORT_SOCKARRAY for this. During LPC '19 > we got feedback that sockmap is probably the better choice. As a > result, Jakub started > working on extending sockmap TCP support and after a while I joined to add UDP. > > Now, we are looking at what our control plane could look like. 
Based > on the inet-tool > work that Marek Majkowski has done [2], we currently have the following set up: > > * An LPM map that goes from IP prefix and port to an index in a sockmap As an aside we could do an LPM version of sockmap to avoid the extra lookup, but that's just an optimization for later. > * A sockmap that holds sockets > * A BPF program that performs the business logic > > inet-tool is used to update the two maps to add and remove mappings on the fly. > Essentially, services donate their sockets either via fork+exec or SCM_RIGHTS on > a Unix socket. This looks a lot like one of the LBs we prototyped early on. > > > Once we have inserted a socket in the sockmap, it's not possible to > retrieve it again. > This makes it impossible to change the position of a socket in the > map, to resize the > map, etc. with our current design. Is it fair to say then that you don't actually need/care about the fd; it just happens to be something stable you could grab relatively easily from the sockmap side and push back at a sockmap? > > One way to work around this is to add a persistent component to our > control plane: > a process can hold on to the sockets and re-build the map when necessary. The > downsides are that upgrading the service is non-trivial (since we need > to pass the > socket fds) and that a failure of this service is catastrophic. Once > it happens, we > probably have to reboot the machine to get it into a workable state again. Agreed this is not a good place to be in. We use the kernel maps for persistence in many cases today, such as updates or application crashes; we have the nice property that the datapath keeps working without interruption. > > We'd like to avoid a persistent service if we can. By allowing to look > up fds from the > sockmap, we could make this part of our control plane more robust.
> > 1: https://www.youtube.com/watch?v=qRDoUpqvYjY > 2: https://github.com/majek/inet-tool > > I hope this explanation helps, sorry for not being more thorough in the original > cover letter! Helps a lot for me at least. So instead of fd how about, sock_map_lookup returns bpf_sock sock_map_update can consume an fd or a bpf_sock Userland can do a dump of the sock_map then get a set of bpf_socks and push them into another map via updates. Nothing too special compared to other maps. In cilium for example I could plug this into our normal flows and we would get rid of the current corner case where upgrades and crashes lose sockmap state. The update hooks in sock_map already know how to deal with socks so the trick would be to do the lookup from bpf_sock to a real sock. For that I think we can just use sk_lookup(). Maybe bpf_sock needs to additionally include the cookie? Including the cookie in bpf_sock seems generally useful as well. I would probably use it outside of sock_map for example. Thoughts? I think it helps with Alexei's concern around passing fds. > > Lorenz > > -- > Lorenz Bauer | Systems Engineer > 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK > > www.cloudflare.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup 2020-03-17 19:05 ` John Fastabend @ 2020-03-20 15:12 ` Lorenz Bauer 2020-04-07 3:08 ` Alexei Starovoitov 0 siblings, 1 reply; 28+ messages in thread From: Lorenz Bauer @ 2020-03-20 15:12 UTC (permalink / raw) To: John Fastabend Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Networking, bpf, Jakub Sitnicki On Tue, 17 Mar 2020 at 19:05, John Fastabend <john.fastabend@gmail.com> wrote: > > Lorenz Bauer wrote: > > On Sat, 14 Mar 2020 at 02:58, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > I'm not following. There is skb->sk. Why do you need to lookup sk ? Because > > > your hook is before demux and skb->sk is not set? Then move your hook to after? > > > > > > I think we're arguing in circles because in this thread I haven't seen the > > > explanation of the problem you're trying to solve. We argued about your > > > proposed solution and got stuck. Can we restart from the beginning with all > > > details? > > > > Yes, that's a good idea. I mentioned this in passing in my cover > > letter, but should > > have provided more context. > > > > Jakub is working on a patch series to add a BPF hook to socket dispatch [1] aka > > the inet_lookup function. The core idea is to control skb->sk via a BPF program. > > Hence, we can't use skb->sk. > > > > Introducing this hook poses another problem: we need to get the struct sk from > > somewhere. The canonical way in BPF is to use the lookup_sk helpers. Of course > > that doesn't work, since our hook would invoke itself. So we need a > > data structure > > that can hold sockets, to be used by programs attached on the new hook. > > > > Jakub's RFC patch set used REUSEPORT_SOCKARRAY for this. During LPC '19 > > we got feedback that sockmap is probably the better choice. As a > > result, Jakub started > > working on extending sockmap TCP support and after a while I joined to add UDP. 
> >
> > Now, we are looking at what our control plane could look like. Based
> > on the inet-tool work that Marek Majkowski has done [2], we currently
> > have the following set up:
> >
> > * An LPM map that goes from IP prefix and port to an index in a sockmap
>
> As an aside, we could do an LPM version of sockmap to avoid the extra
> lookup, but that's just an optimization for later.
>
> > * A sockmap that holds sockets
> > * A BPF program that performs the business logic
> >
> > inet-tool is used to update the two maps to add and remove mappings on
> > the fly. Essentially, services donate their sockets either via
> > fork+exec or SCM_RIGHTS on a Unix socket.
>
> This looks a lot like one of the LBs we prototyped early on.
>
> > Once we have inserted a socket in the sockmap, it's not possible to
> > retrieve it again. This makes it impossible to change the position of
> > a socket in the map, to resize the map, etc. with our current design.
>
> Is it fair to say then that you don't actually need/care about the fd, it
> just happens to be something stable you could grab relatively easily from
> the sockmap side and push back at a sockmap?

I think that falls a bit short. I'd like to have an fd because it is the
user space representation of a socket. I can do useful things like get
the address family, etc. from it. I can use this to provide meaningful
semantics to users: you can't redirect UDP traffic into a SOCK_STREAM,
etc. So we'd definitely have to add some sort of getsockopt wrapper for
this bpf_sock, which sounds kind of complex?

> > One way to work around this is to add a persistent component to our
> > control plane: a process can hold on to the sockets and re-build the
> > map when necessary. The downsides are that upgrading the service is
> > non-trivial (since we need to pass the socket fds) and that a failure
> > of this service is catastrophic. Once it happens, we probably have to
> > reboot the machine to get it into a workable state again.
>
> Agreed, this is not a good place to be in. We use the kernel maps for
> persistence in many cases today, such as updates, or when the application
> crashes, and we have the nice property that the datapath keeps working
> without interruption.
>
> > We'd like to avoid a persistent service if we can. By allowing to look
> > up fds from the sockmap, we could make this part of our control plane
> > more robust.
> >
> > 1: https://www.youtube.com/watch?v=qRDoUpqvYjY
> > 2: https://github.com/majek/inet-tool
> >
> > I hope this explanation helps, sorry for not being more thorough in
> > the original cover letter!
>
> Helps a lot for me at least.
>
> So instead of fd how about:
>
>   sock_map_lookup returns bpf_sock
>   sock_map_update can consume an fd or a bpf_sock
>
> Userland can do a dump of the sock_map, get a set of bpf_socks, and
> push them into another map via updates. Nothing too special compared
> to other maps. In Cilium, for example, I could plug this into our normal
> flows, and we would get rid of the current corner case where upgrades
> and crashes lose sockmap state.

I thought that bpf_sock is just used to document the BPF UAPI? How would
you turn this into a refcountable thing?

> The update hooks in sock_map already know how to deal with socks, so
> the trick would be to do the lookup from bpf_sock to a real sock. For
> that I think we can just use sk_lookup(). Maybe bpf_sock needs to
> additionally include the cookie? Including the cookie in bpf_sock
> seems generally useful as well. I would probably use it outside
> of sock_map, for example.
>
> Thoughts? I think it helps with Alexei's concern around passing fds.

I like that it's a tangible way forward, but I'm worried that adding a
new "thing" that's kind of like a socket (but not really) is going to be
quite complex.

Another way around this might be to add a way to go from socket cookie
to socket, and not touch sockmap. I suspect that is an even steeper hill
to climb, especially if it means adding a new syscall.

Thank you for your thoughts, I'll need to go and ruminate on this some
more. Specifically, I'll flesh out the control plane some more, which
should help me get a more focused understanding of what I need.

> > Lorenz
> >
> > --
> > Lorenz Bauer  |  Systems Engineer
> > 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
> >
> > www.cloudflare.com

--
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
* Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup
  2020-03-20 15:12               ` Lorenz Bauer
@ 2020-04-07  3:08                 ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2020-04-07  3:08 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: John Fastabend, Daniel Borkmann, kernel-team, Networking, bpf,
      Jakub Sitnicki

On Fri, Mar 20, 2020 at 8:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> Another way around this might be to add a way to go from socket cookie to
> socket, and not touch sockmap. I suspect that is an even steeper hill to climb,
> especially if it means adding a new syscall.

I think receiving a socket FD from a socket cookie would be a much better
approach. It's clean layering, and security folks will enforce that bit of
API individually. A new syscall is not hard to add. man pages and docs are
needed regardless.
end of thread, other threads:[~2020-04-07  3:08 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-10 17:47 [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Lorenz Bauer
2020-03-10 17:47 ` [PATCH 1/5] bpf: add map_copy_value hook Lorenz Bauer
2020-03-10 17:47 ` [PATCH 2/5] bpf: convert queue and stack map to map_copy_value Lorenz Bauer
2020-03-11 14:00   ` Jakub Sitnicki
2020-03-11 22:31   ` John Fastabend
2020-03-10 17:47 ` [PATCH 3/5] bpf: convert sock map and hash " Lorenz Bauer
2020-03-11 13:55   ` Jakub Sitnicki
2020-03-10 17:47 ` [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup Lorenz Bauer
2020-03-11 23:27   ` John Fastabend
2020-03-17 10:17     ` Lorenz Bauer
2020-03-17 15:18       ` Jakub Sitnicki
2020-03-17 18:16         ` John Fastabend
2020-03-10 17:47 ` [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds Lorenz Bauer
2020-03-11 13:52   ` Jakub Sitnicki
2020-03-11 17:24     ` Lorenz Bauer
2020-03-11 13:44 ` [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Jakub Sitnicki
2020-03-11 22:40 ` John Fastabend
2020-03-12  1:58   ` Alexei Starovoitov
2020-03-12  9:16     ` Lorenz Bauer
2020-03-12 17:58       ` Alexei Starovoitov
2020-03-12 19:32         ` John Fastabend
2020-03-13 11:03           ` Lorenz Bauer
2020-03-13 10:48         ` Lorenz Bauer
2020-03-14  2:58           ` Alexei Starovoitov
2020-03-17  9:55             ` Lorenz Bauer
2020-03-17 19:05               ` John Fastabend
2020-03-20 15:12                 ` Lorenz Bauer
2020-04-07  3:08                   ` Alexei Starovoitov