* [PATCH bpf-next v1 1/3] bpf: implement relay map basis
2023-12-27 10:01 [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Philo Lu
@ 2023-12-27 10:01 ` Philo Lu
2023-12-27 13:41 ` Yafang Shao
2023-12-27 10:01 ` [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc Philo Lu
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Philo Lu @ 2023-12-27 10:01 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
joannelkoong, laoar.shao, kuifeng, houtao, shung-hsi.yu,
xuanzhuo, dust.li, alibuda, guwen, hengqi
BPF_MAP_TYPE_RELAY is implemented based on relay interface, which
creates per-cpu buffer to transfer data. Each buffer is essentially a
list of fix-sized sub-buffers, and is exposed to user space as files in
debugfs. All of these compose a "relay chanel", which is the kernel of a
relay map.
Currently, attr->max_entries is used as subbuf size and attr->map_extra
is used as subbuf num, and the default value of subbuf num is 8. A new
map flag named BPF_F_OVERWRITE is also introduced to set overwrite mode
of relay map.
A relay map is represented as a directory in debugfs, and the per-cpu
buffers are files in this directory. Users can get the data through read
or mmap.
To avoid directory name conflicting, relay_map_update_elem is provided
to set the name. In fact, we create the relay channel and buffers with
BPF_MAP_CREATE, and create relay files and bind them with the channel
using BPF_MAP_UPDATE_ELEM. Generally, map_update_elem should be called
once and only once.
Here is an example:
```
struct {
__uint(type, BPF_MAP_TYPE_RELAY);
__uint(max_entries, 4096);
} my_relay SEC(".maps");
...
char dir_name[] = "relay_test";
bpf_map_update_elem(map_fd, NULL, dir_name, 0);
```
Assume there are 2 cpus, we will have 2 files:
```
/sys/kerenl/debug/relay_test/my_relay0
/sys/kerenl/debug/relay_test/my_relay1
```
Each represents a per-cpu buffer with size 8 * 4096 B (there are 8
subbufs by default, each with size 4096B).
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
include/linux/bpf_types.h | 3 +
include/uapi/linux/bpf.h | 7 ++
kernel/bpf/Makefile | 3 +
kernel/bpf/relaymap.c | 199 ++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 2 +
5 files changed, 214 insertions(+)
create mode 100644 kernel/bpf/relaymap.c
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fc0d6f32c687..c122d7b494c5 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -132,6 +132,9 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
+#ifdef CONFIG_RELAY
+BPF_MAP_TYPE(BPF_MAP_TYPE_RELAY, relay_map_ops)
+#endif
BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 754e68ca8744..143b75676bd3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -951,6 +951,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_BLOOM_FILTER,
BPF_MAP_TYPE_USER_RINGBUF,
BPF_MAP_TYPE_CGRP_STORAGE,
+ BPF_MAP_TYPE_RELAY,
};
/* Note that tracing related programs such as
@@ -1330,6 +1331,9 @@ enum {
/* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */
BPF_F_PATH_FD = (1U << 14),
+
+/* Enable overwrite for relay map */
+ BPF_F_OVERWRITE = (1U << 15),
};
/* Flags for BPF_PROG_QUERY. */
@@ -1401,6 +1405,9 @@ union bpf_attr {
* BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
* number of hash functions (if 0, the bloom filter will default
* to using 5 hash functions).
+ *
+ * BPF_MAP_TYPE_RELAY - the lowest 32 bits indicate the number of
+ * relay subbufs (if 0, the number will be set to 8 by default).
*/
__u64 map_extra;
};
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..45b35bb0e572 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -10,6 +10,9 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+ifeq ($(CONFIG_RELAY),y)
+obj-$(CONFIG_BPF_SYSCALL) += relaymap.o
+endif
obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
new file mode 100644
index 000000000000..02b33a8e6b6c
--- /dev/null
+++ b/kernel/bpf/relaymap.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpumask.h>
+#include <linux/debugfs.h>
+#include <linux/filter.h>
+#include <linux/relay.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/err.h>
+
+#define RELAY_CREATE_FLAG_MASK (BPF_F_OVERWRITE)
+
+struct bpf_relay_map {
+ struct bpf_map map;
+ struct rchan *relay_chan;
+ struct rchan_callbacks relay_cb;
+};
+
+static struct dentry *create_buf_file_handler(const char *filename,
+ struct dentry *parent, umode_t mode,
+ struct rchan_buf *buf, int *is_global)
+{
+ /* Because we do relay_late_setup_files(), create_buf_file(NULL, NULL, ...)
+ * will be called by relay_open.
+ */
+ if (!filename)
+ return NULL;
+
+ return debugfs_create_file(filename, mode, parent, buf,
+ &relay_file_operations);
+}
+
+static int remove_buf_file_handler(struct dentry *dentry)
+{
+ debugfs_remove(dentry);
+ return 0;
+}
+
+/* For non-overwrite, use default subbuf_start cb */
+static int subbuf_start_overwrite(struct rchan_buf *buf, void *subbuf,
+ void *prev_subbuf, size_t prev_padding)
+{
+ return 1;
+}
+
+/* bpf_attr is used as follows:
+ * - key size: must be 0
+ * - value size: value will be used as directory name by map_update_elem
+ * (to create relay files). If passed as 0, it will be set to NAME_MAX as
+ * default
+ *
+ * - max_entries: subbuf size
+ * - map_extra: subbuf num, default as 8
+ *
+ * When alloc, we do not set up relay files considering dir_name conflicts.
+ * Instead we use relay_late_setup_files() in map_update_elem(), and thus the
+ * value is used as dir_name, and map->name is used as base_filename.
+ */
+static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
+{
+ struct bpf_relay_map *rmap;
+
+ if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
+ return ERR_PTR(-EINVAL);
+
+ /* key size must be 0 in relay map */
+ if (unlikely(attr->key_size))
+ return ERR_PTR(-EINVAL);
+
+ /* value size is used as directory name length */
+ if (unlikely(attr->value_size > NAME_MAX)) {
+ pr_warn("value_size should be no more than %d\n", NAME_MAX);
+ return ERR_PTR(-EINVAL);
+ } else if (attr->value_size == 0)
+ attr->value_size = NAME_MAX;
+
+ /* set default subbuf num */
+ if (unlikely(attr->map_extra & ~UINT_MAX))
+ return ERR_PTR(-EINVAL);
+ attr->map_extra = attr->map_extra & UINT_MAX;
+ if (!attr->map_extra)
+ attr->map_extra = 8;
+
+ if (strlen(attr->map_name) == 0)
+ return ERR_PTR(-EINVAL);
+
+ rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
+ if (!rmap)
+ return ERR_PTR(-ENOMEM);
+
+ bpf_map_init_from_attr(&rmap->map, attr);
+
+ rmap->relay_cb.create_buf_file = create_buf_file_handler;
+ rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
+ if (attr->map_flags & BPF_F_OVERWRITE)
+ rmap->relay_cb.subbuf_start = subbuf_start_overwrite;
+
+ rmap->relay_chan = relay_open(NULL, NULL,
+ attr->max_entries, attr->map_extra,
+ &rmap->relay_cb, NULL);
+ if (!rmap->relay_chan) {
+ bpf_map_area_free(rmap);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return &rmap->map;
+}
+
+static void relay_map_free(struct bpf_map *map)
+{
+ struct bpf_relay_map *rmap;
+ struct dentry *parent;
+
+ rmap = container_of(map, struct bpf_relay_map, map);
+
+ parent = rmap->relay_chan->parent;
+ relay_close(rmap->relay_chan);
+ /* relay_chan->parent should be removed mannually if exists. */
+ debugfs_remove_recursive(parent);
+ bpf_map_area_free(rmap);
+}
+
+static void *relay_map_lookup_elem(struct bpf_map *map, void *key)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static long relay_map_update_elem(struct bpf_map *map, void *key, void *value,
+ u64 flags)
+{
+ struct bpf_relay_map *rmap;
+ struct dentry *parent;
+ int err;
+
+ if (unlikely(flags))
+ return -EINVAL;
+
+ if (unlikely(key))
+ return -EINVAL;
+
+ /* If the directory already exists, debugfs_create_dir will fail. It could
+ * have been created by map_update_elem before, or another system that uses
+ * debugfs.
+ *
+ * Note that the directory name passed as value should not be longer than
+ * map->value_size, including the '\0' at the end.
+ */
+ ((char *)value)[map->value_size - 1] = '\0';
+ parent = debugfs_create_dir(value, NULL);
+ if (IS_ERR_OR_NULL(parent))
+ return PTR_ERR(parent);
+
+ /* We don't need a lock here, because the relay channel is protected in
+ * relay_late_setup_files() with a mutex.
+ */
+ rmap = container_of(map, struct bpf_relay_map, map);
+ err = relay_late_setup_files(rmap->relay_chan, map->name, parent);
+ if (err) {
+ debugfs_remove_recursive(parent);
+ return err;
+ }
+
+ return 0;
+}
+
+static long relay_map_delete_elem(struct bpf_map *map, void *key)
+{
+ return -EOPNOTSUPP;
+}
+
+static int relay_map_get_next_key(struct bpf_map *map, void *key,
+ void *next_key)
+{
+ return -EOPNOTSUPP;
+}
+
+static u64 relay_map_mem_usage(const struct bpf_map *map)
+{
+ struct bpf_relay_map *rmap;
+ u64 usage = sizeof(struct bpf_relay_map);
+
+ rmap = container_of(map, struct bpf_relay_map, map);
+ usage += sizeof(struct rchan);
+ usage += (sizeof(struct rchan_buf) + rmap->relay_chan->alloc_size)
+ * num_online_cpus();
+ return usage;
+}
+
+BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
+const struct bpf_map_ops relay_map_ops = {
+ .map_meta_equal = bpf_map_meta_equal,
+ .map_alloc = relay_map_alloc,
+ .map_free = relay_map_free,
+ .map_lookup_elem = relay_map_lookup_elem,
+ .map_update_elem = relay_map_update_elem,
+ .map_delete_elem = relay_map_delete_elem,
+ .map_get_next_key = relay_map_get_next_key,
+ .map_mem_usage = relay_map_mem_usage,
+ .map_btf_id = &relay_map_btf_ids[0],
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1bf9805ee185..d6b7949e29c7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1147,6 +1147,7 @@ static int map_create(union bpf_attr *attr)
}
if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
+ attr->map_type != BPF_MAP_TYPE_RELAY &&
attr->map_extra != 0)
return -EINVAL;
@@ -1202,6 +1203,7 @@ static int map_create(union bpf_attr *attr)
case BPF_MAP_TYPE_USER_RINGBUF:
case BPF_MAP_TYPE_CGROUP_STORAGE:
case BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE:
+ case BPF_MAP_TYPE_RELAY:
/* unprivileged */
break;
case BPF_MAP_TYPE_SK_STORAGE:
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 1/3] bpf: implement relay map basis
2023-12-27 10:01 ` [PATCH bpf-next v1 1/3] bpf: implement relay map basis Philo Lu
@ 2023-12-27 13:41 ` Yafang Shao
0 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2023-12-27 13:41 UTC (permalink / raw)
To: Philo Lu
Cc: bpf, ast, daniel, john.fastabend, andrii, martin.lau, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
joannelkoong, kuifeng, houtao, shung-hsi.yu, xuanzhuo, dust.li,
alibuda, guwen, hengqi
On Wed, Dec 27, 2023 at 6:01 PM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> BPF_MAP_TYPE_RELAY is implemented based on relay interface, which
> creates per-cpu buffer to transfer data. Each buffer is essentially a
> list of fix-sized sub-buffers, and is exposed to user space as files in
> debugfs. All of these compose a "relay chanel", which is the kernel of a
> relay map.
>
> Currently, attr->max_entries is used as subbuf size and attr->map_extra
> is used as subbuf num, and the default value of subbuf num is 8. A new
> map flag named BPF_F_OVERWRITE is also introduced to set overwrite mode
> of relay map.
>
> A relay map is represented as a directory in debugfs, and the per-cpu
> buffers are files in this directory. Users can get the data through read
> or mmap.
>
> To avoid directory name conflicting, relay_map_update_elem is provided
> to set the name. In fact, we create the relay channel and buffers with
> BPF_MAP_CREATE, and create relay files and bind them with the channel
> using BPF_MAP_UPDATE_ELEM. Generally, map_update_elem should be called
> once and only once.
>
> Here is an example:
> ```
> struct {
> __uint(type, BPF_MAP_TYPE_RELAY);
> __uint(max_entries, 4096);
> } my_relay SEC(".maps");
> ...
> char dir_name[] = "relay_test";
> bpf_map_update_elem(map_fd, NULL, dir_name, 0);
> ```
>
> Assume there are 2 cpus, we will have 2 files:
> ```
> /sys/kerenl/debug/relay_test/my_relay0
> /sys/kerenl/debug/relay_test/my_relay1
> ```
Is there a specific reason why relayfs necessitates creating an
individual file for each CPU? Alternatively, are there any approaches
available to collectively expose all CPUs using a single file?
When dealing with a large number of available CPUs, such as 236,
reading the buffer using the command `cat
/sys/kernel/debug/relay_test/my_relay{0...236} | awk '{}' ` can become
a bit cumbersome and tedious.
> Each represents a per-cpu buffer with size 8 * 4096 B (there are 8
> subbufs by default, each with size 4096B).
>
> Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
> ---
> include/linux/bpf_types.h | 3 +
> include/uapi/linux/bpf.h | 7 ++
> kernel/bpf/Makefile | 3 +
> kernel/bpf/relaymap.c | 199 ++++++++++++++++++++++++++++++++++++++
> kernel/bpf/syscall.c | 2 +
> 5 files changed, 214 insertions(+)
> create mode 100644 kernel/bpf/relaymap.c
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index fc0d6f32c687..c122d7b494c5 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -132,6 +132,9 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +#ifdef CONFIG_RELAY
> +BPF_MAP_TYPE(BPF_MAP_TYPE_RELAY, relay_map_ops)
> +#endif
>
> BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
> BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 754e68ca8744..143b75676bd3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -951,6 +951,7 @@ enum bpf_map_type {
> BPF_MAP_TYPE_BLOOM_FILTER,
> BPF_MAP_TYPE_USER_RINGBUF,
> BPF_MAP_TYPE_CGRP_STORAGE,
> + BPF_MAP_TYPE_RELAY,
> };
>
> /* Note that tracing related programs such as
> @@ -1330,6 +1331,9 @@ enum {
>
> /* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */
> BPF_F_PATH_FD = (1U << 14),
> +
> +/* Enable overwrite for relay map */
> + BPF_F_OVERWRITE = (1U << 15),
> };
>
> /* Flags for BPF_PROG_QUERY. */
> @@ -1401,6 +1405,9 @@ union bpf_attr {
> * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
> * number of hash functions (if 0, the bloom filter will default
> * to using 5 hash functions).
> + *
> + * BPF_MAP_TYPE_RELAY - the lowest 32 bits indicate the number of
> + * relay subbufs (if 0, the number will be set to 8 by default).
> */
> __u64 map_extra;
> };
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f526b7573e97..45b35bb0e572 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -10,6 +10,9 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
> obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
> obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> +ifeq ($(CONFIG_RELAY),y)
> +obj-$(CONFIG_BPF_SYSCALL) += relaymap.o
> +endif
> obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
> obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
> obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
> diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
> new file mode 100644
> index 000000000000..02b33a8e6b6c
> --- /dev/null
> +++ b/kernel/bpf/relaymap.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpumask.h>
> +#include <linux/debugfs.h>
> +#include <linux/filter.h>
> +#include <linux/relay.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/err.h>
> +
> +#define RELAY_CREATE_FLAG_MASK (BPF_F_OVERWRITE)
> +
> +struct bpf_relay_map {
> + struct bpf_map map;
> + struct rchan *relay_chan;
> + struct rchan_callbacks relay_cb;
> +};
> +
> +static struct dentry *create_buf_file_handler(const char *filename,
> + struct dentry *parent, umode_t mode,
> + struct rchan_buf *buf, int *is_global)
> +{
> + /* Because we do relay_late_setup_files(), create_buf_file(NULL, NULL, ...)
> + * will be called by relay_open.
> + */
> + if (!filename)
> + return NULL;
> +
> + return debugfs_create_file(filename, mode, parent, buf,
> + &relay_file_operations);
> +}
> +
> +static int remove_buf_file_handler(struct dentry *dentry)
> +{
> + debugfs_remove(dentry);
> + return 0;
> +}
> +
> +/* For non-overwrite, use default subbuf_start cb */
> +static int subbuf_start_overwrite(struct rchan_buf *buf, void *subbuf,
> + void *prev_subbuf, size_t prev_padding)
> +{
> + return 1;
> +}
> +
> +/* bpf_attr is used as follows:
> + * - key size: must be 0
> + * - value size: value will be used as directory name by map_update_elem
> + * (to create relay files). If passed as 0, it will be set to NAME_MAX as
> + * default
> + *
> + * - max_entries: subbuf size
> + * - map_extra: subbuf num, default as 8
> + *
> + * When alloc, we do not set up relay files considering dir_name conflicts.
> + * Instead we use relay_late_setup_files() in map_update_elem(), and thus the
> + * value is used as dir_name, and map->name is used as base_filename.
> + */
> +static struct bpf_map *relay_map_alloc(union bpf_attr *attr)
> +{
> + struct bpf_relay_map *rmap;
> +
> + if (unlikely(attr->map_flags & ~RELAY_CREATE_FLAG_MASK))
> + return ERR_PTR(-EINVAL);
> +
> + /* key size must be 0 in relay map */
> + if (unlikely(attr->key_size))
> + return ERR_PTR(-EINVAL);
> +
> + /* value size is used as directory name length */
> + if (unlikely(attr->value_size > NAME_MAX)) {
> + pr_warn("value_size should be no more than %d\n", NAME_MAX);
> + return ERR_PTR(-EINVAL);
> + } else if (attr->value_size == 0)
> + attr->value_size = NAME_MAX;
> +
> + /* set default subbuf num */
> + if (unlikely(attr->map_extra & ~UINT_MAX))
> + return ERR_PTR(-EINVAL);
> + attr->map_extra = attr->map_extra & UINT_MAX;
> + if (!attr->map_extra)
> + attr->map_extra = 8;
> +
> + if (strlen(attr->map_name) == 0)
> + return ERR_PTR(-EINVAL);
> +
> + rmap = bpf_map_area_alloc(sizeof(*rmap), NUMA_NO_NODE);
> + if (!rmap)
> + return ERR_PTR(-ENOMEM);
> +
> + bpf_map_init_from_attr(&rmap->map, attr);
> +
> + rmap->relay_cb.create_buf_file = create_buf_file_handler;
> + rmap->relay_cb.remove_buf_file = remove_buf_file_handler;
> + if (attr->map_flags & BPF_F_OVERWRITE)
> + rmap->relay_cb.subbuf_start = subbuf_start_overwrite;
> +
> + rmap->relay_chan = relay_open(NULL, NULL,
> + attr->max_entries, attr->map_extra,
> + &rmap->relay_cb, NULL);
> + if (!rmap->relay_chan) {
> + bpf_map_area_free(rmap);
> + return ERR_PTR(-EINVAL);
> + }
> +
> + return &rmap->map;
> +}
> +
> +static void relay_map_free(struct bpf_map *map)
> +{
> + struct bpf_relay_map *rmap;
> + struct dentry *parent;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> +
> + parent = rmap->relay_chan->parent;
> + relay_close(rmap->relay_chan);
> + /* relay_chan->parent should be removed mannually if exists. */
> + debugfs_remove_recursive(parent);
> + bpf_map_area_free(rmap);
> +}
> +
> +static void *relay_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> + return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +static long relay_map_update_elem(struct bpf_map *map, void *key, void *value,
> + u64 flags)
> +{
> + struct bpf_relay_map *rmap;
> + struct dentry *parent;
> + int err;
> +
> + if (unlikely(flags))
> + return -EINVAL;
> +
> + if (unlikely(key))
> + return -EINVAL;
> +
> + /* If the directory already exists, debugfs_create_dir will fail. It could
> + * have been created by map_update_elem before, or another system that uses
> + * debugfs.
> + *
> + * Note that the directory name passed as value should not be longer than
> + * map->value_size, including the '\0' at the end.
> + */
> + ((char *)value)[map->value_size - 1] = '\0';
> + parent = debugfs_create_dir(value, NULL);
> + if (IS_ERR_OR_NULL(parent))
> + return PTR_ERR(parent);
> +
> + /* We don't need a lock here, because the relay channel is protected in
> + * relay_late_setup_files() with a mutex.
> + */
> + rmap = container_of(map, struct bpf_relay_map, map);
> + err = relay_late_setup_files(rmap->relay_chan, map->name, parent);
> + if (err) {
> + debugfs_remove_recursive(parent);
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +static long relay_map_delete_elem(struct bpf_map *map, void *key)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int relay_map_get_next_key(struct bpf_map *map, void *key,
> + void *next_key)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static u64 relay_map_mem_usage(const struct bpf_map *map)
> +{
> + struct bpf_relay_map *rmap;
> + u64 usage = sizeof(struct bpf_relay_map);
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> + usage += sizeof(struct rchan);
> + usage += (sizeof(struct rchan_buf) + rmap->relay_chan->alloc_size)
> + * num_online_cpus();
> + return usage;
> +}
> +
> +BTF_ID_LIST_SINGLE(relay_map_btf_ids, struct, bpf_relay_map)
> +const struct bpf_map_ops relay_map_ops = {
> + .map_meta_equal = bpf_map_meta_equal,
> + .map_alloc = relay_map_alloc,
> + .map_free = relay_map_free,
> + .map_lookup_elem = relay_map_lookup_elem,
> + .map_update_elem = relay_map_update_elem,
> + .map_delete_elem = relay_map_delete_elem,
> + .map_get_next_key = relay_map_get_next_key,
> + .map_mem_usage = relay_map_mem_usage,
> + .map_btf_id = &relay_map_btf_ids[0],
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 1bf9805ee185..d6b7949e29c7 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1147,6 +1147,7 @@ static int map_create(union bpf_attr *attr)
> }
>
> if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
> + attr->map_type != BPF_MAP_TYPE_RELAY &&
> attr->map_extra != 0)
> return -EINVAL;
>
> @@ -1202,6 +1203,7 @@ static int map_create(union bpf_attr *attr)
> case BPF_MAP_TYPE_USER_RINGBUF:
> case BPF_MAP_TYPE_CGROUP_STORAGE:
> case BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE:
> + case BPF_MAP_TYPE_RELAY:
> /* unprivileged */
> break;
> case BPF_MAP_TYPE_SK_STORAGE:
> --
> 2.32.0.3.g01195cf9f
>
--
Regards
Yafang
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc
2023-12-27 10:01 [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Philo Lu
2023-12-27 10:01 ` [PATCH bpf-next v1 1/3] bpf: implement relay map basis Philo Lu
@ 2023-12-27 10:01 ` Philo Lu
2023-12-27 14:23 ` Hou Tao
2023-12-27 18:07 ` Alexei Starovoitov
2023-12-27 10:01 ` [PATCH bpf-next v1 3/3] selftests/bpf: add bpf relay map selftests Philo Lu
2023-12-27 18:02 ` [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Alexei Starovoitov
3 siblings, 2 replies; 11+ messages in thread
From: Philo Lu @ 2023-12-27 10:01 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
joannelkoong, laoar.shao, kuifeng, houtao, shung-hsi.yu,
xuanzhuo, dust.li, alibuda, guwen, hengqi
A kfunc is needed to write into the relay channel, named
bpf_relay_output. The usage is same as bpf_ringbuf_output helper. It
only works after relay files are set, i.e., after calling
map_update_elem for the created relay map.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
kernel/bpf/helpers.c | 3 +++
kernel/bpf/relaymap.c | 22 ++++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index be72824f32b2..22480b69ff27 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2617,6 +2617,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_null)
BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
BTF_ID_FLAGS(func, bpf_dynptr_size)
BTF_ID_FLAGS(func, bpf_dynptr_clone)
+#ifdef CONFIG_RELAY
+BTF_ID_FLAGS(func, bpf_relay_output)
+#endif
BTF_SET8_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
index 02b33a8e6b6c..37280d60133c 100644
--- a/kernel/bpf/relaymap.c
+++ b/kernel/bpf/relaymap.c
@@ -6,6 +6,7 @@
#include <linux/slab.h>
#include <linux/bpf.h>
#include <linux/err.h>
+#include <linux/btf.h>
#define RELAY_CREATE_FLAG_MASK (BPF_F_OVERWRITE)
@@ -197,3 +198,24 @@ const struct bpf_map_ops relay_map_ops = {
.map_mem_usage = relay_map_mem_usage,
.map_btf_id = &relay_map_btf_ids[0],
};
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int bpf_relay_output(struct bpf_map *map,
+ void *data, u64 data__sz, u32 flags)
+{
+ struct bpf_relay_map *rmap;
+
+ /* not support any flag now */
+ if (unlikely(!map || flags))
+ return -EINVAL;
+
+ rmap = container_of(map, struct bpf_relay_map, map);
+ if (!rmap->relay_chan->has_base_filename)
+ return -ENOENT;
+
+ relay_write(rmap->relay_chan, data, data__sz);
+ return 0;
+}
+
+__bpf_kfunc_end_defs();
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc
2023-12-27 10:01 ` [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc Philo Lu
@ 2023-12-27 14:23 ` Hou Tao
2023-12-27 18:07 ` Alexei Starovoitov
1 sibling, 0 replies; 11+ messages in thread
From: Hou Tao @ 2023-12-27 14:23 UTC (permalink / raw)
To: Philo Lu, bpf
Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
joannelkoong, laoar.shao, kuifeng, shung-hsi.yu, xuanzhuo,
dust.li, alibuda, guwen, hengqi
Hi,
On 12/27/2023 6:01 PM, Philo Lu wrote:
> A kfunc is needed to write into the relay channel, named
> bpf_relay_output. The usage is same as bpf_ringbuf_output helper. It
> only works after relay files are set, i.e., after calling
> map_update_elem for the created relay map.
>
> Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
> ---
> kernel/bpf/helpers.c | 3 +++
> kernel/bpf/relaymap.c | 22 ++++++++++++++++++++++
> 2 files changed, 25 insertions(+)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index be72824f32b2..22480b69ff27 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2617,6 +2617,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_null)
> BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
> BTF_ID_FLAGS(func, bpf_dynptr_size)
> BTF_ID_FLAGS(func, bpf_dynptr_clone)
> +#ifdef CONFIG_RELAY
> +BTF_ID_FLAGS(func, bpf_relay_output)
> +#endif
> BTF_SET8_END(common_btf_ids)
Could you explain the reason why bpf_relay_out is placed in
common_btf_ids instead of generic_kfunc_set ?
>
> static const struct btf_kfunc_id_set common_kfunc_set = {
> diff --git a/kernel/bpf/relaymap.c b/kernel/bpf/relaymap.c
> index 02b33a8e6b6c..37280d60133c 100644
> --- a/kernel/bpf/relaymap.c
> +++ b/kernel/bpf/relaymap.c
> @@ -6,6 +6,7 @@
> #include <linux/slab.h>
> #include <linux/bpf.h>
> #include <linux/err.h>
> +#include <linux/btf.h>
>
> #define RELAY_CREATE_FLAG_MASK (BPF_F_OVERWRITE)
>
> @@ -197,3 +198,24 @@ const struct bpf_map_ops relay_map_ops = {
> .map_mem_usage = relay_map_mem_usage,
> .map_btf_id = &relay_map_btf_ids[0],
> };
> +
> +__bpf_kfunc_start_defs();
> +
> +__bpf_kfunc int bpf_relay_output(struct bpf_map *map,
> + void *data, u64 data__sz, u32 flags)
> +{
> + struct bpf_relay_map *rmap;
> +
> + /* not support any flag now */
> + if (unlikely(!map || flags))
> + return -EINVAL;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
How does bpf_relay_out() guarantee the passed map is a relay map ? And
just like bpf_map_sum_elem_count(), I think KF_TRUSTED_ARGS is also
necessary for bpf_relay_output().
> + if (!rmap->relay_chan->has_base_filename)
> + return -ENOENT;
> +
I think a comment is needed here. It needs to explain why checking
->has_base_filename is enough to guarantee the concurrently running of
bpf_relay_output() and .map_update_elem() is safe.
> + relay_write(rmap->relay_chan, data, data__sz);
> + return 0;
> +}
> +
> +__bpf_kfunc_end_defs();
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc
2023-12-27 10:01 ` [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc Philo Lu
2023-12-27 14:23 ` Hou Tao
@ 2023-12-27 18:07 ` Alexei Starovoitov
1 sibling, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2023-12-27 18:07 UTC (permalink / raw)
To: Philo Lu
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
Shuah Khan, Joanne Koong, Yafang Shao, Kui-Feng Lee, Hou Tao,
Shung-Hsi Yu, Xuan Zhuo, Dust Li, D. Wythe, guwen, hengqi
On Wed, Dec 27, 2023 at 2:01 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> +__bpf_kfunc int bpf_relay_output(struct bpf_map *map,
> + void *data, u64 data__sz, u32 flags)
> +{
> + struct bpf_relay_map *rmap;
> +
> + /* not support any flag now */
> + if (unlikely(!map || flags))
> + return -EINVAL;
> +
> + rmap = container_of(map, struct bpf_relay_map, map);
> + if (!rmap->relay_chan->has_base_filename)
> + return -ENOENT;
> +
> + relay_write(rmap->relay_chan, data, data__sz);
> + return 0;
This just opens a can of worms.
Above is not nmi safe. relay_write() can be used only out of
known context which effectively makes it unusable out of bpf tracing
progs that can kprobe attach anywhere in the kernel.
perf_event buffer is the only sure way to deliver events to user
space with overwrite.
bpf ringbuf is a best effort due to
if (in_nmi()) if (!spin_trylock_irqsave
Sorry, but it's a nack to allow bpf progs interface with relay.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH bpf-next v1 3/3] selftests/bpf: add bpf relay map selftests
2023-12-27 10:01 [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Philo Lu
2023-12-27 10:01 ` [PATCH bpf-next v1 1/3] bpf: implement relay map basis Philo Lu
2023-12-27 10:01 ` [PATCH bpf-next v1 2/3] bpf: add bpf_relay_output kfunc Philo Lu
@ 2023-12-27 10:01 ` Philo Lu
2023-12-27 13:42 ` Yafang Shao
2023-12-27 18:02 ` [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Alexei Starovoitov
3 siblings, 1 reply; 11+ messages in thread
From: Philo Lu @ 2023-12-27 10:01 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, john.fastabend, andrii, martin.lau, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
joannelkoong, laoar.shao, kuifeng, houtao, shung-hsi.yu,
xuanzhuo, dust.li, alibuda, guwen, hengqi
The operations of relay map create, update_elem, and output are tested.
The test is borrowed from ringbuf tests, where 2 samples are written
into the relay channel, and we get the samples by reading the files.
Overwriting mode is also tested, where the size of relay buffer equals
sample size and just the last sample can be seen.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
tools/include/uapi/linux/bpf.h | 7 +
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/relay_map.c | 197 ++++++++++++++++++
.../selftests/bpf/progs/test_relay_map.c | 69 ++++++
5 files changed, 275 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/relay_map.c
create mode 100644 tools/testing/selftests/bpf/progs/test_relay_map.c
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f24d898efbb..1e545bfe701f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -951,6 +951,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_BLOOM_FILTER,
BPF_MAP_TYPE_USER_RINGBUF,
BPF_MAP_TYPE_CGRP_STORAGE,
+ BPF_MAP_TYPE_RELAY,
};
/* Note that tracing related programs such as
@@ -1330,6 +1331,9 @@ enum {
/* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */
BPF_F_PATH_FD = (1U << 14),
+
+/* Enable overwrite for relay map */
+ BPF_F_OVERWRITE = (1U << 15),
};
/* Flags for BPF_PROG_QUERY. */
@@ -1401,6 +1405,9 @@ union bpf_attr {
* BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
* number of hash functions (if 0, the bloom filter will default
* to using 5 hash functions).
+ *
+ * BPF_MAP_TYPE_RELAY - the lowest 32 bits indicate the number of
+ * relay subbufs (if 0, the number will be set to 8 by default).
*/
__u64 map_extra;
};
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 617ae55c3bb5..8cebb3810d50 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -427,7 +427,7 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h \
LSKELS := fentry_test.c fexit_test.c fexit_sleep.c atomics.c \
trace_printk.c trace_vprintk.c map_ptr_kern.c \
core_kern.c core_kern_overflow.c test_ringbuf.c \
- test_ringbuf_map_key.c
+ test_ringbuf_map_key.c test_relay_map.c
# Generate both light skeleton and libbpf skeleton for these
LSKELS_EXTRA := test_ksyms_module.c test_ksyms_weak.c kfunc_call_test.c \
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c125c441abc7..8de1adf587f0 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -87,3 +87,4 @@ CONFIG_VSOCKETS=y
CONFIG_VXLAN=y
CONFIG_XDP_SOCKETS=y
CONFIG_XFRM_INTERFACE=y
+CONFIG_RELAY=y
diff --git a/tools/testing/selftests/bpf/prog_tests/relay_map.c b/tools/testing/selftests/bpf/prog_tests/relay_map.c
new file mode 100644
index 000000000000..bd9c1e62ca78
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/relay_map.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <linux/compiler.h>
+#include <linux/bpf.h>
+#include <sys/sysinfo.h>
+#include <test_progs.h>
+#include <sched.h>
+
+#include "test_relay_map.lskel.h"
+
+static int duration;
+
+/* file names in debugfs */
+static const char dirname[] = "relay_map_selftest";
+static const char mapname[] = "relay_map";
+static const char mapname_ow[] = "relay_map_ow";
+struct relay_sample {
+ int pid;
+ int seq;
+ long value;
+ char comm[16];
+};
+
+static int sample_cnt;
+static int overwrite;
+
+static void process_sample(struct relay_sample *s)
+{
+ ++sample_cnt;
+
+ switch (s->seq) {
+ case 0:
+ /* sample1 will not appear in overwrite mode */
+ CHECK(overwrite != 0, "overwrite_mode",
+ "sample1 appears in overwrite mode\n");
+ CHECK(s->value != 333, "sample1_value", "exp %ld, got %ld\n",
+ 333L, s->value);
+ break;
+ case 1:
+ CHECK(s->value != 777, "sample2_value", "exp %ld, got %ld\n",
+ 777L, s->value);
+ break;
+ default:
+ break;
+ }
+}
+
+static int relaymap_read(const char *mapname)
+{
+ int cpu = libbpf_num_possible_cpus();
+ char name[NAME_MAX];
+ struct relay_sample data;
+ int maxloop;
+ FILE *fp;
+
+ for (int i = 0; i < cpu; ++i) {
+ sprintf(name, "/sys/kernel/debug/%s/%s%d", dirname, mapname, i);
+ fp = fopen(name, "r");
+ if (CHECK(!fp, "fopen", "relay file open failed\n"))
+ return -1;
+
+ maxloop = 0;
+ while (fread(&data, sizeof(data), 1, fp)) {
+ process_sample(&data);
+
+ /* just 2 samples output */
+ if (++maxloop > 2)
+ return -1;
+ }
+ }
+ return 0;
+}
+
+static struct test_relay_map_lskel *skel;
+
+static void trigger_samples(void)
+{
+ skel->bss->dropped = 0;
+ skel->bss->total = 0;
+ skel->bss->seq = 0;
+
+ /* trigger exactly two samples */
+ skel->bss->value = 333;
+ syscall(__NR_getpgid);
+ skel->bss->value = 777;
+ syscall(__NR_getpgid);
+}
+
+static void relaymap_subtest(void)
+{
+ int err, map_fd;
+
+ skel = test_relay_map_lskel__open();
+ if (CHECK(!skel, "skel_open", "skeleton open failed\n"))
+ return;
+
+ /* setup relay param */
+ skel->maps.relay_map.max_entries = 1024;
+
+ err = test_relay_map_lskel__load(skel);
+ if (CHECK(err, "skel_load", "skeleton load failed\n"))
+ goto cleanup;
+
+ /* only trigger BPF program for current process */
+ skel->bss->pid = getpid();
+
+ /* turn off overwrite */
+ skel->bss->overwrite_enable = 0;
+ overwrite = skel->bss->overwrite_enable;
+
+ err = test_relay_map_lskel__attach(skel);
+ if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err))
+ goto cleanup;
+
+ /* before file setup - output failed */
+ trigger_samples();
+ CHECK(skel->bss->dropped != 2, "err_dropped", "exp %ld, got %ld\n",
+ 0L, skel->bss->dropped);
+ CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
+ 2L, skel->bss->total);
+
+ /* after file setup - output succ */
+ map_fd = skel->maps.relay_map.map_fd;
+ err = bpf_map_update_elem(map_fd, NULL, dirname, 0);
+ if (CHECK(err, "map_update", "map update failed: %d\n", err))
+ goto cleanup;
+ trigger_samples();
+ CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n",
+ 0L, skel->bss->dropped);
+ CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
+ 2L, skel->bss->total);
+
+ sample_cnt = 0;
+ err = relaymap_read(mapname);
+ CHECK(sample_cnt != 2, "sample_cnt", "exp %d samples, got %d\n",
+ 2, sample_cnt);
+
+ test_relay_map_lskel__detach(skel);
+cleanup:
+ test_relay_map_lskel__destroy(skel);
+}
+
+static void relaymap_overwrite_subtest(void)
+{
+ int err, map_fd;
+
+ skel = test_relay_map_lskel__open();
+ if (CHECK(!skel, "skel_open", "skeleton open failed\n"))
+ return;
+
+ /* To test overwrite mode, we create subbuf of one-sample size */
+ skel->maps.relay_map_ow.max_entries = sizeof(struct relay_sample);
+
+ err = test_relay_map_lskel__load(skel);
+ if (CHECK(err, "skel_load", "skeleton load failed\n"))
+ goto cleanup;
+
+ /* only trigger BPF program for current process */
+ skel->bss->pid = getpid();
+
+ /* turn on overwrite */
+ skel->bss->overwrite_enable = 1;
+ overwrite = skel->bss->overwrite_enable;
+
+ err = test_relay_map_lskel__attach(skel);
+ if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err))
+ goto cleanup;
+
+ map_fd = skel->maps.relay_map_ow.map_fd;
+ err = bpf_map_update_elem(map_fd, NULL, dirname, 0);
+ if (CHECK(err, "map_update", "map update failed: %d\n", err))
+ goto cleanup;
+ trigger_samples();
+ /* relay_write never fails whether overwriting or not */
+ CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n",
+ 0L, skel->bss->dropped);
+ CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
+ 2L, skel->bss->total);
+
+ /* 2 samples are output, but only the last (val=777) could be seen */
+ sample_cnt = 0;
+ err = relaymap_read(mapname_ow);
+ CHECK(sample_cnt != 1, "sample_cnt", "exp %d samples, got %d\n",
+ 1, sample_cnt);
+
+ test_relay_map_lskel__detach(skel);
+cleanup:
+ test_relay_map_lskel__destroy(skel);
+}
+
+void test_relaymap(void)
+{
+ if (test__start_subtest("relaymap"))
+ relaymap_subtest();
+ if (test__start_subtest("relaymap_overwrite"))
+ relaymap_overwrite_subtest();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_relay_map.c b/tools/testing/selftests/bpf/progs/test_relay_map.c
new file mode 100644
index 000000000000..1adf1be8e125
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_relay_map.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+extern int bpf_relay_output(struct bpf_map *map, void *data,
+ __u64 data__sz, __u32 flags) __ksym;
+
+struct relay_sample {
+ int pid;
+ int seq;
+ long value;
+ char comm[16];
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_RELAY);
+ __uint(max_entries, 1024);
+} relay_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_RELAY);
+ __uint(map_flags, BPF_F_OVERWRITE);
+ __uint(max_entries, 1024);
+ __uint(map_extra, 1);
+} relay_map_ow SEC(".maps");
+
+/* inputs */
+int pid = 0;
+long value = 0;
+int overwrite_enable = 0;
+
+/* outputs */
+long total = 0;
+long dropped = 0;
+
+/* inner state */
+long seq = 0;
+
+SEC("fentry/" SYS_PREFIX "sys_getpgid")
+int test_bpf_relaymap(void *ctx)
+{
+ int cur_pid = bpf_get_current_pid_tgid() >> 32;
+ struct relay_sample sample;
+ int ret = 0;
+
+ if (cur_pid != pid)
+ return 0;
+
+ sample.pid = pid;
+ bpf_get_current_comm(sample.comm, sizeof(sample.comm));
+ sample.value = value;
+ sample.seq = seq++;
+ __sync_fetch_and_add(&total, 1);
+
+ if (overwrite_enable)
+ ret = bpf_relay_output((struct bpf_map *)&relay_map_ow,
+ &sample, sizeof(sample), 0);
+ else
+ ret = bpf_relay_output((struct bpf_map *)&relay_map,
+ &sample, sizeof(sample), 0);
+
+ if (ret)
+ __sync_fetch_and_add(&dropped, 1);
+
+ return 0;
+}
--
2.32.0.3.g01195cf9f
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 3/3] selftests/bpf: add bpf relay map selftests
2023-12-27 10:01 ` [PATCH bpf-next v1 3/3] selftests/bpf: add bpf relay map selftests Philo Lu
@ 2023-12-27 13:42 ` Yafang Shao
0 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2023-12-27 13:42 UTC (permalink / raw)
To: Philo Lu
Cc: bpf, ast, daniel, john.fastabend, andrii, martin.lau, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, mykolal, shuah,
joannelkoong, kuifeng, houtao, shung-hsi.yu, xuanzhuo, dust.li,
alibuda, guwen, hengqi
On Wed, Dec 27, 2023 at 6:01 PM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> The operations of relay map create, update_elem, and output are tested.
> The test is borrowed from ringbuf tests, where 2 samples are written
> into the relay channel, and we get the samples by reading the files.
> Overwriting mode is also tested, where the size of relay buffer equals
> sample size and just the last sample can be seen.
>
> Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
> ---
> tools/include/uapi/linux/bpf.h | 7 +
> tools/testing/selftests/bpf/Makefile | 2 +-
> tools/testing/selftests/bpf/config | 1 +
> .../selftests/bpf/prog_tests/relay_map.c | 197 ++++++++++++++++++
> .../selftests/bpf/progs/test_relay_map.c | 69 ++++++
> 5 files changed, 275 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/relay_map.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_relay_map.c
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 7f24d898efbb..1e545bfe701f 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -951,6 +951,7 @@ enum bpf_map_type {
> BPF_MAP_TYPE_BLOOM_FILTER,
> BPF_MAP_TYPE_USER_RINGBUF,
> BPF_MAP_TYPE_CGRP_STORAGE,
> + BPF_MAP_TYPE_RELAY,
> };
>
> /* Note that tracing related programs such as
> @@ -1330,6 +1331,9 @@ enum {
>
> /* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */
> BPF_F_PATH_FD = (1U << 14),
> +
> +/* Enable overwrite for relay map */
> + BPF_F_OVERWRITE = (1U << 15),
> };
>
> /* Flags for BPF_PROG_QUERY. */
> @@ -1401,6 +1405,9 @@ union bpf_attr {
> * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
> * number of hash functions (if 0, the bloom filter will default
> * to using 5 hash functions).
> + *
> + * BPF_MAP_TYPE_RELAY - the lowest 32 bits indicate the number of
> + * relay subbufs (if 0, the number will be set to 8 by default).
> */
> __u64 map_extra;
> };
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 617ae55c3bb5..8cebb3810d50 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -427,7 +427,7 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h \
> LSKELS := fentry_test.c fexit_test.c fexit_sleep.c atomics.c \
> trace_printk.c trace_vprintk.c map_ptr_kern.c \
> core_kern.c core_kern_overflow.c test_ringbuf.c \
> - test_ringbuf_map_key.c
> + test_ringbuf_map_key.c test_relay_map.c
>
> # Generate both light skeleton and libbpf skeleton for these
> LSKELS_EXTRA := test_ksyms_module.c test_ksyms_weak.c kfunc_call_test.c \
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index c125c441abc7..8de1adf587f0 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -87,3 +87,4 @@ CONFIG_VSOCKETS=y
> CONFIG_VXLAN=y
> CONFIG_XDP_SOCKETS=y
> CONFIG_XFRM_INTERFACE=y
> +CONFIG_RELAY=y
> diff --git a/tools/testing/selftests/bpf/prog_tests/relay_map.c b/tools/testing/selftests/bpf/prog_tests/relay_map.c
> new file mode 100644
> index 000000000000..bd9c1e62ca78
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/relay_map.c
> @@ -0,0 +1,197 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <linux/compiler.h>
> +#include <linux/bpf.h>
> +#include <sys/sysinfo.h>
> +#include <test_progs.h>
> +#include <sched.h>
> +
> +#include "test_relay_map.lskel.h"
> +
> +static int duration;
> +
> +/* file names in debugfs */
> +static const char dirname[] = "relay_map_selftest";
> +static const char mapname[] = "relay_map";
> +static const char mapname_ow[] = "relay_map_ow";
> +struct relay_sample {
> + int pid;
> + int seq;
> + long value;
> + char comm[16];
> +};
> +
> +static int sample_cnt;
> +static int overwrite;
> +
> +static void process_sample(struct relay_sample *s)
> +{
> + ++sample_cnt;
> +
> + switch (s->seq) {
> + case 0:
> + /* sample1 will not appear in overwrite mode */
> + CHECK(overwrite != 0, "overwrite_mode",
> + "sample1 appears in overwrite mode\n");
> + CHECK(s->value != 333, "sample1_value", "exp %ld, got %ld\n",
> + 333L, s->value);
> + break;
> + case 1:
> + CHECK(s->value != 777, "sample2_value", "exp %ld, got %ld\n",
> + 777L, s->value);
> + break;
> + default:
> + break;
> + }
> +}
> +
> +static int relaymap_read(const char *mapname)
> +{
> + int cpu = libbpf_num_possible_cpus();
> + char name[NAME_MAX];
> + struct relay_sample data;
> + int maxloop;
> + FILE *fp;
> +
> + for (int i = 0; i < cpu; ++i) {
> + sprintf(name, "/sys/kernel/debug/%s/%s%d", dirname, mapname, i);
> + fp = fopen(name, "r");
fclose() is missed.
> + if (CHECK(!fp, "fopen", "relay file open failed\n"))
> + return -1;
> +
> + maxloop = 0;
> + while (fread(&data, sizeof(data), 1, fp)) {
> + process_sample(&data);
> +
> + /* just 2 samples output */
> + if (++maxloop > 2)
> + return -1;
> + }
> + }
> + return 0;
> +}
> +
> +static struct test_relay_map_lskel *skel;
> +
> +static void trigger_samples(void)
> +{
> + skel->bss->dropped = 0;
> + skel->bss->total = 0;
> + skel->bss->seq = 0;
> +
> + /* trigger exactly two samples */
> + skel->bss->value = 333;
> + syscall(__NR_getpgid);
> + skel->bss->value = 777;
> + syscall(__NR_getpgid);
> +}
> +
> +static void relaymap_subtest(void)
> +{
> + int err, map_fd;
> +
> + skel = test_relay_map_lskel__open();
> + if (CHECK(!skel, "skel_open", "skeleton open failed\n"))
> + return;
> +
> + /* setup relay param */
> + skel->maps.relay_map.max_entries = 1024;
> +
> + err = test_relay_map_lskel__load(skel);
> + if (CHECK(err, "skel_load", "skeleton load failed\n"))
> + goto cleanup;
> +
> + /* only trigger BPF program for current process */
> + skel->bss->pid = getpid();
> +
> + /* turn off overwrite */
> + skel->bss->overwrite_enable = 0;
> + overwrite = skel->bss->overwrite_enable;
> +
> + err = test_relay_map_lskel__attach(skel);
> + if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err))
> + goto cleanup;
> +
> + /* before file setup - output failed */
> + trigger_samples();
> + CHECK(skel->bss->dropped != 2, "err_dropped", "exp %ld, got %ld\n",
> + 0L, skel->bss->dropped);
> + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
> + 2L, skel->bss->total);
> +
> + /* after file setup - output succ */
> + map_fd = skel->maps.relay_map.map_fd;
> + err = bpf_map_update_elem(map_fd, NULL, dirname, 0);
> + if (CHECK(err, "map_update", "map update failed: %d\n", err))
> + goto cleanup;
> + trigger_samples();
> + CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n",
> + 0L, skel->bss->dropped);
> + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
> + 2L, skel->bss->total);
> +
> + sample_cnt = 0;
> + err = relaymap_read(mapname);
> + CHECK(sample_cnt != 2, "sample_cnt", "exp %d samples, got %d\n",
> + 2, sample_cnt);
> +
> + test_relay_map_lskel__detach(skel);
> +cleanup:
> + test_relay_map_lskel__destroy(skel);
> +}
> +
> +static void relaymap_overwrite_subtest(void)
> +{
> + int err, map_fd;
> +
> + skel = test_relay_map_lskel__open();
> + if (CHECK(!skel, "skel_open", "skeleton open failed\n"))
> + return;
> +
> + /* To test overwrite mode, we create subbuf of one-sample size */
> + skel->maps.relay_map_ow.max_entries = sizeof(struct relay_sample);
> +
> + err = test_relay_map_lskel__load(skel);
> + if (CHECK(err, "skel_load", "skeleton load failed\n"))
> + goto cleanup;
> +
> + /* only trigger BPF program for current process */
> + skel->bss->pid = getpid();
> +
> + /* turn on overwrite */
> + skel->bss->overwrite_enable = 1;
> + overwrite = skel->bss->overwrite_enable;
> +
> + err = test_relay_map_lskel__attach(skel);
> + if (CHECK(err, "skel_attach", "skeleton attachment failed: %d\n", err))
> + goto cleanup;
> +
> + map_fd = skel->maps.relay_map_ow.map_fd;
> + err = bpf_map_update_elem(map_fd, NULL, dirname, 0);
> + if (CHECK(err, "map_update", "map update failed: %d\n", err))
> + goto cleanup;
> + trigger_samples();
> + /* relay_write never fails whether overwriting or not */
> + CHECK(skel->bss->dropped != 0, "err_dropped", "exp %ld, got %ld\n",
> + 0L, skel->bss->dropped);
> + CHECK(skel->bss->total != 2, "err_total", "exp %ld, got %ld\n",
> + 2L, skel->bss->total);
> +
> + /* 2 samples are output, but only the last (val=777) could be seen */
> + sample_cnt = 0;
> + err = relaymap_read(mapname_ow);
> + CHECK(sample_cnt != 1, "sample_cnt", "exp %d samples, got %d\n",
> + 1, sample_cnt);
> +
> + test_relay_map_lskel__detach(skel);
> +cleanup:
> + test_relay_map_lskel__destroy(skel);
> +}
> +
> +void test_relaymap(void)
> +{
> + if (test__start_subtest("relaymap"))
> + relaymap_subtest();
> + if (test__start_subtest("relaymap_overwrite"))
> + relaymap_overwrite_subtest();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_relay_map.c b/tools/testing/selftests/bpf/progs/test_relay_map.c
> new file mode 100644
> index 000000000000..1adf1be8e125
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_relay_map.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern int bpf_relay_output(struct bpf_map *map, void *data,
> + __u64 data__sz, __u32 flags) __ksym;
> +
> +struct relay_sample {
> + int pid;
> + int seq;
> + long value;
> + char comm[16];
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_RELAY);
> + __uint(max_entries, 1024);
> +} relay_map SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_RELAY);
> + __uint(map_flags, BPF_F_OVERWRITE);
> + __uint(max_entries, 1024);
> + __uint(map_extra, 1);
> +} relay_map_ow SEC(".maps");
> +
> +/* inputs */
> +int pid = 0;
> +long value = 0;
> +int overwrite_enable = 0;
> +
> +/* outputs */
> +long total = 0;
> +long dropped = 0;
> +
> +/* inner state */
> +long seq = 0;
> +
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")
> +int test_bpf_relaymap(void *ctx)
> +{
> + int cur_pid = bpf_get_current_pid_tgid() >> 32;
> + struct relay_sample sample;
> + int ret = 0;
> +
> + if (cur_pid != pid)
> + return 0;
> +
> + sample.pid = pid;
> + bpf_get_current_comm(sample.comm, sizeof(sample.comm));
> + sample.value = value;
> + sample.seq = seq++;
> + __sync_fetch_and_add(&total, 1);
> +
> + if (overwrite_enable)
> + ret = bpf_relay_output((struct bpf_map *)&relay_map_ow,
> + &sample, sizeof(sample), 0);
> + else
> + ret = bpf_relay_output((struct bpf_map *)&relay_map,
> + &sample, sizeof(sample), 0);
> +
> + if (ret)
> + __sync_fetch_and_add(&dropped, 1);
> +
> + return 0;
> +}
> --
> 2.32.0.3.g01195cf9f
>
--
Regards
Yafang
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY
2023-12-27 10:01 [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Philo Lu
` (2 preceding siblings ...)
2023-12-27 10:01 ` [PATCH bpf-next v1 3/3] selftests/bpf: add bpf relay map selftests Philo Lu
@ 2023-12-27 18:02 ` Alexei Starovoitov
2023-12-28 11:19 ` Philo Lu
3 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2023-12-27 18:02 UTC (permalink / raw)
To: Philo Lu
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
Shuah Khan, Joanne Koong, Yafang Shao, Kui-Feng Lee, Hou Tao,
Shung-Hsi Yu, Xuan Zhuo, Dust Li, D. Wythe, guwen, hengqi
On Wed, Dec 27, 2023 at 2:01 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> The patch set introduce a new type of map, BPF_MAP_TYPE_RELAY, based on
> relay interface [0]. It provides a way for persistent and overwritable data
> transfer.
>
> As stated in [0], relay is a efficient method for log and data transfer.
> And the interface is simple enough so that we can implement and use this
> type of map with current map interfaces. Besides we need a kfunc
> bpf_relay_output to output data to user, similar with bpf_ringbuf_output.
>
> We need this map because currently neither ringbuf nor perfbuf satisfies
> the requirements of relatively long-term consistent tracing, where the bpf
> program keeps writing into the buffer without any bundled reader, and the
> buffer supports overwriting. For users, they just run the bpf program to
> collect data, and are able to read as need. The detailed discussion can be
> found at [1].
Hold on.
Earlier I mistakenly assumed that this relayfs is a multi producer
buffer instead of per-cpu.
Since it's actually per-cpu I see no need to introduce another per-cpu
ring buffer. We already have a perf_event buffer.
Earlier you said:
"I can use BPF_F_PRESERVE_ELEMS flag to keep the
perf_events, but I do not know how to get the buffer again in a new process.
"
Looks like the issue is lack of map_fd_sys_lookup_elem callback ?
Solve the latter part.
perf_event_array_map should be pinnable like any other map,
so there is a way to get an FD to a map in a new process.
What's missing is a way to get an FD to perf event itself.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY
2023-12-27 18:02 ` [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY Alexei Starovoitov
@ 2023-12-28 11:19 ` Philo Lu
2024-01-03 19:58 ` Alexei Starovoitov
0 siblings, 1 reply; 11+ messages in thread
From: Philo Lu @ 2023-12-28 11:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
Shuah Khan, Joanne Koong, Yafang Shao, Kui-Feng Lee, Hou Tao,
Shung-Hsi Yu, Xuan Zhuo, Dust Li, D. Wythe, guwen, hengqi
On 2023/12/28 02:02, Alexei Starovoitov wrote:
> On Wed, Dec 27, 2023 at 2:01 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>>
>> The patch set introduce a new type of map, BPF_MAP_TYPE_RELAY, based on
>> relay interface [0]. It provides a way for persistent and overwritable data
>> transfer.
>>
>> As stated in [0], relay is a efficient method for log and data transfer.
>> And the interface is simple enough so that we can implement and use this
>> type of map with current map interfaces. Besides we need a kfunc
>> bpf_relay_output to output data to user, similar with bpf_ringbuf_output.
>>
>> We need this map because currently neither ringbuf nor perfbuf satisfies
>> the requirements of relatively long-term consistent tracing, where the bpf
>> program keeps writing into the buffer without any bundled reader, and the
>> buffer supports overwriting. For users, they just run the bpf program to
>> collect data, and are able to read as need. The detailed discussion can be
>> found at [1].
>
> Hold on.
> Earlier I mistakenly assumed that this relayfs is a multi producer
> buffer instead of per-cpu.
> Since it's actually per-cpu I see no need to introduce another per-cpu
> ring buffer. We already have a perf_event buffer.
>
I think relay map and perfbuf don't conflict with each other, and relay
map could be a better choice in some use cases (e.g., constant tracing).
In our application, we output the tracing records as strings into relay
files, and users just read it through `cat` without any process, which
seems impossible to be implemented even with pinnable perfbuf.
Specifically, the advantages of relay map are summarized as follows:
(1) Read at any time without extra process: As discussed before, with
relay map, bpf programs can keep writing into the buffer and users can
read at any time.
(2) Custom data format: Unlike perfbuf processing data entry by entry
(or event), the data format of relay is up to users. It could be simple
string, or binary struct with a header, which provides users with high
flexibility.
(3) Better performance: Due to the simple design, relay outperforms
perfbuf in current bench_ringbufs (I added a relay map case to
`tools/testing/selftests/bpf/benchs/bench_ringbufs.c` without other
changes). Note that relay outputs data directly without notification,
and the consumer can get a batch of samples using read() at a time.
Single-producer, parallel producer, sampled notification
========================================================
relaymap 51.652 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf 22.773 ± 0.015M/s (drops 0.000 ± 0.000M/s)
rb-custom 23.782 ± 0.004M/s (drops 0.000 ± 0.000M/s)
pb-libbpf 18.506 ± 0.007M/s (drops 0.000 ± 0.000M/s)
pb-custom 19.503 ± 0.007M/s (drops 0.000 ± 0.000M/s)
Single-producer, back-to-back mode
==================================
relaymap 44.771 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-libbpf 25.091 ± 0.013M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled 24.779 ± 0.018M/s (drops 0.000 ± 0.000M/s)
rb-custom 27.784 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-custom-sampled 27.414 ± 0.017M/s (drops 0.000 ± 0.000M/s)
pb-libbpf 1.409 ± 0.000M/s (drops 0.000 ± 0.000M/s)
pb-libbpf-sampled 18.467 ± 0.005M/s (drops 0.000 ± 0.000M/s)
pb-custom 1.415 ± 0.000M/s (drops 0.000 ± 0.000M/s)
pb-custom-sampled 19.913 ± 0.007M/s (drops 0.000 ± 0.000M/s)
Thanks.
> Earlier you said:
> "I can use BPF_F_PRESERVE_ELEMS flag to keep the
> perf_events, but I do not know how to get the buffer again in a new process.
> "
>
> Looks like the issue is lack of map_fd_sys_lookup_elem callback ?
> Solve the latter part.
> perf_event_array_map should be pinnable like any other map,
> so there is a way to get an FD to a map in a new process.
> What's missing is a way to get an FD to perf event itself.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH bpf-next v1 0/3] bpf: introduce BPF_MAP_TYPE_RELAY
2023-12-28 11:19 ` Philo Lu
@ 2024-01-03 19:58 ` Alexei Starovoitov
0 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2024-01-03 19:58 UTC (permalink / raw)
To: Philo Lu
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
Shuah Khan, Joanne Koong, Yafang Shao, Kui-Feng Lee, Hou Tao,
Shung-Hsi Yu, Xuan Zhuo, Dust Li, D. Wythe, guwen, hengqi
On Thu, Dec 28, 2023 at 3:20 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>
>
>
> On 2023/12/28 02:02, Alexei Starovoitov wrote:
> > On Wed, Dec 27, 2023 at 2:01 AM Philo Lu <lulie@linux.alibaba.com> wrote:
> >>
> >> The patch set introduce a new type of map, BPF_MAP_TYPE_RELAY, based on
> >> relay interface [0]. It provides a way for persistent and overwritable data
> >> transfer.
> >>
> >> As stated in [0], relay is a efficient method for log and data transfer.
> >> And the interface is simple enough so that we can implement and use this
> >> type of map with current map interfaces. Besides we need a kfunc
> >> bpf_relay_output to output data to user, similar with bpf_ringbuf_output.
> >>
> >> We need this map because currently neither ringbuf nor perfbuf satisfies
> >> the requirements of relatively long-term consistent tracing, where the bpf
> >> program keeps writing into the buffer without any bundled reader, and the
> >> buffer supports overwriting. For users, they just run the bpf program to
> >> collect data, and are able to read as need. The detailed discussion can be
> >> found at [1].
> >
> > Hold on.
> > Earlier I mistakenly assumed that this relayfs is a multi producer
> > buffer instead of per-cpu.
> > Since it's actually per-cpu I see no need to introduce another per-cpu
> > ring buffer. We already have a perf_event buffer.
> >
> I think relay map and perfbuf don't conflict with each other, and relay
> map could be a better choice in some use cases (e.g., constant tracing).
> In our application, we output the tracing records as strings into relay
> files, and users just read it through `cat` without any process, which
> seems impossible to be implemented even with pinnable perfbuf.
>
> Specifically, the advantages of relay map are summarized as follows:
> (1) Read at any time without extra process: As discussed before, with
> relay map, bpf programs can keep writing into the buffer and users can
> read at any time.
Just add this ability to perf ring buf.
> (2) Custom data format: Unlike perfbuf processing data entry by entry
> (or event), the data format of relay is up to users. It could be simple
> string, or binary struct with a header, which provides users with high
> flexibility.
perf ring buf allows any format as well.
> (3) Better performance: Due to the simple design, relay outperforms
> perfbuf in current bench_ringbufs (I added a relay map case to
> `tools/testing/selftests/bpf/benchs/bench_ringbufs.c` without other
> changes). Note that relay outputs data directly without notification,
> and the consumer can get a batch of samples using read() at a time.
performance is irrelevant, since relay is unusable from bpf due
to dead lock issue I explained in the other email.
It's really a nack.
^ permalink raw reply [flat|nested] 11+ messages in thread