* [PATCH v3 bpf-next 0/3] Add support for memory-mapping BPF array maps
@ 2019-11-13  3:15 Andrii Nakryiko
  2019-11-13  3:15 ` [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13  3:15 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

This patch set adds the ability to memory-map BPF array maps (both single- and
multi-element). The primary use case is memory-mapping the BPF array maps that
libbpf implicitly creates to back global data variables. This greatly improves
usability and avoids syscalls for reading or updating data entirely.

Due to memory-mapping requirements, a BPF array map that is meant to be
memory-mapped has to be created with the special BPF_F_MMAPABLE flag, which
triggers a slightly different memory allocation strategy internally. See
patch 1 for details.

Libbpf is extended to detect kernel support for this flag and, if supported,
to set it automatically for all global data maps.
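
For illustration, a minimal userspace sketch of the intended usage could look
like the following (object/map/variable names are hypothetical and error
handling is trimmed; this is not part of the patches themselves):

  #include <sys/mman.h>
  #include <unistd.h>
  #include <bpf/libbpf.h>

  /* mirrors the global variables declared in the BPF program (illustrative) */
  struct my_globals {
      long counter;
  };

  long read_counter(struct bpf_object *obj)
  {
      /* libbpf derives the internal .bss map name from the object name */
      struct bpf_map *bss = bpf_object__find_map_by_name(obj, "my_prog.bss");
      size_t sz = sysconf(_SC_PAGE_SIZE); /* assume globals fit in one page */
      struct my_globals *g;

      g = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED,
               bpf_map__fd(bss), 0);
      if (g == MAP_FAILED)
          return -1;
      /* direct memory access, no bpf_map_lookup_elem() syscall needed */
      long val = g->counter;

      munmap(g, sz);
      return val;
  }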

v2->v3:
- change allocation strategy to avoid extra pointer dereference (Jakub);

v1->v2:
- fix map lookup code generation for BPF_F_MMAPABLE case;
- prevent BPF_F_MMAPABLE flag for all but plain array map type;
- centralize ref-counting in generic bpf_map_mmap();
- don't use uref counting (Alexei);
- use vfree() directly;
- print flags with %x (Song);
- extend tests to verify bpf_map_{lookup,update}_elem() logic as well.

Andrii Nakryiko (3):
  bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  libbpf: make global data internal arrays mmap()-able, if possible
  selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests

 include/linux/bpf.h                           |   6 +-
 include/uapi/linux/bpf.h                      |   3 +
 kernel/bpf/arraymap.c                         |  93 +++++++++-
 kernel/bpf/syscall.c                          |  47 +++++
 tools/include/uapi/linux/bpf.h                |   3 +
 tools/lib/bpf/libbpf.c                        |  32 +++-
 .../selftests/bpf/prog_tests/core_reloc.c     |  45 +++--
 tools/testing/selftests/bpf/prog_tests/mmap.c | 170 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_mmap.c |  41 +++++
 9 files changed, 413 insertions(+), 27 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_mmap.c

-- 
2.17.1


* [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13  3:15 [PATCH v3 bpf-next 0/3] Add support for memory-mapping BPF array maps Andrii Nakryiko
@ 2019-11-13  3:15 ` Andrii Nakryiko
  2019-11-13 20:06   ` John Fastabend
  2019-11-13 20:38   ` Daniel Borkmann
  2019-11-13  3:15 ` [PATCH v3 bpf-next 2/3] libbpf: make global data internal arrays mmap()-able, if possible Andrii Nakryiko
  2019-11-13  3:15 ` [PATCH v3 bpf-next 3/3] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests Andrii Nakryiko
  2 siblings, 2 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13  3:15 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Rik van Riel,
	Johannes Weiner

Add the ability to memory-map the contents of a BPF array map. This is
extremely useful for working with BPF global data from userspace programs. It
makes it possible to avoid the typical bpf_map_{lookup,update}_elem operations,
improving both performance and usability.

Map freezing required special consideration, to avoid ending up with a writable
memory view into a frozen map. To solve this, map freezing and mmap()-ing now
happen under a mutex:
  - if the map is already frozen, no writable mapping is allowed;
  - if the map has writable memory mappings active (accounted in map->writecnt),
    map freezing will keep failing with -EBUSY;
  - once the number of writable memory mappings drops to zero, map freezing can
    be performed again (see the sketch below).
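
A userspace-visible sketch of this interplay (purely illustrative; map_fd and
sz are assumed to be a BPF_F_MMAPABLE array fd and its page-rounded data size):

  #include <sys/mman.h>
  #include <bpf/bpf.h>

  static void freeze_vs_mmap(int map_fd, size_t sz)
  {
      void *rw = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);

      bpf_map_freeze(map_fd); /* fails with EBUSY: writable mapping is active */
      munmap(rw, sz);         /* writecnt drops back to zero */
      bpf_map_freeze(map_fd); /* succeeds now */

      /* frozen: new writable mappings are rejected, read-only ones still work */
      mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0); /* MAP_FAILED, EPERM */
      mmap(NULL, sz, PROT_READ, MAP_SHARED, map_fd, 0);              /* succeeds */
  }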

Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
can't be memory mapped either.

For a BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
to be mmap()'able. We also need to make sure that the array data memory is
page-sized and page-aligned, so we over-allocate memory in such a way that
struct bpf_array ends at the end of a single page of memory, with array->value
aligned with the start of the second page. On deallocation we need to
accommodate this memory arrangement to free the vmalloc()'ed memory correctly.
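
To make the layout concrete, here is a purely illustrative example, assuming
PAGE_SIZE is 4096 and (just for the sake of the example) that
offsetof(struct bpf_array, value) is 320:

  data  = vzalloc_node(array_size, numa_node); /* page-aligned allocation */
  array = data + 4096 - 320; /* round_up(sizeof(*array), PAGE_SIZE) - offsetof(..., value) */
  /* array->value == array + 320 == data + 4096: exactly the start of page 2 */
  /* on free: round_down((long)array, PAGE_SIZE) == data, the original pointer */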

Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf.h            |  6 ++-
 include/uapi/linux/bpf.h       |  3 ++
 kernel/bpf/arraymap.c          | 93 +++++++++++++++++++++++++++++++---
 kernel/bpf/syscall.c           | 47 +++++++++++++++++
 tools/include/uapi/linux/bpf.h |  3 ++
 5 files changed, 145 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7c7f518811a6..ab9a24b94357 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -12,6 +12,7 @@
 #include <linux/err.h>
 #include <linux/rbtree_latch.h>
 #include <linux/numa.h>
+#include <linux/mm_types.h>
 #include <linux/wait.h>
 #include <linux/u64_stats_sync.h>
 
@@ -66,6 +67,7 @@ struct bpf_map_ops {
 				     u64 *imm, u32 off);
 	int (*map_direct_value_meta)(const struct bpf_map *map,
 				     u64 imm, u32 *off);
+	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
 };
 
 struct bpf_map_memory {
@@ -95,7 +97,7 @@ struct bpf_map {
 	struct btf *btf;
 	struct bpf_map_memory memory;
 	bool unpriv_array;
-	bool frozen; /* write-once */
+	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 48 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
@@ -105,6 +107,8 @@ struct bpf_map {
 	atomic_t usercnt;
 	struct work_struct work;
 	char name[BPF_OBJ_NAME_LEN];
+	struct mutex freeze_mutex;
+	int writecnt; /* writable mmap cnt; protected by freeze_mutex */
 };
 
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df6809a76404..bb39b53622d9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -346,6 +346,9 @@ enum bpf_attach_type {
 /* Clone map from listener for newly accepted socket */
 #define BPF_F_CLONE		(1U << 9)
 
+/* Enable memory-mapping BPF map */
+#define BPF_F_MMAPABLE		(1U << 10)
+
 /* flags for BPF_PROG_QUERY */
 #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..08dc9634c5ed 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -14,7 +14,7 @@
 #include "map_in_map.h"
 
 #define ARRAY_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK)
+	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK)
 
 static void bpf_array_free_percpu(struct bpf_array *array)
 {
@@ -59,6 +59,10 @@ int array_map_alloc_check(union bpf_attr *attr)
 	    (percpu && numa_node != NUMA_NO_NODE))
 		return -EINVAL;
 
+	if (attr->map_type != BPF_MAP_TYPE_ARRAY &&
+	    attr->map_flags & BPF_F_MMAPABLE)
+		return -EINVAL;
+
 	if (attr->value_size > KMALLOC_MAX_SIZE)
 		/* if value_size is bigger, the user space won't be able to
 		 * access the elements.
@@ -102,10 +106,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	}
 
 	array_size = sizeof(*array);
-	if (percpu)
+	if (percpu) {
 		array_size += (u64) max_entries * sizeof(void *);
-	else
-		array_size += (u64) max_entries * elem_size;
+	} else {
+		/* rely on vmalloc() to return page-aligned memory and
+		 * ensure array->value is exactly page-aligned
+		 */
+		if (attr->map_flags & BPF_F_MMAPABLE) {
+			array_size = round_up(array_size, PAGE_SIZE);
+			array_size += (u64) max_entries * elem_size;
+			array_size = round_up(array_size, PAGE_SIZE);
+		} else {
+			array_size += (u64) max_entries * elem_size;
+		}
+	}
 
 	/* make sure there is no u32 overflow later in round_up() */
 	cost = array_size;
@@ -117,7 +131,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(ret);
 
 	/* allocate all map elements and zero-initialize them */
-	array = bpf_map_area_alloc(array_size, numa_node);
+	if (attr->map_flags & BPF_F_MMAPABLE) {
+		void *data;
+
+		/* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
+		data = vzalloc_node(array_size, numa_node);
+		if (!data) {
+			bpf_map_charge_finish(&mem);
+			return ERR_PTR(-ENOMEM);
+		}
+		array = data + round_up(sizeof(struct bpf_array), PAGE_SIZE)
+			- offsetof(struct bpf_array, value);
+	} else {
+		array = bpf_map_area_alloc(array_size, numa_node);
+	}
 	if (!array) {
 		bpf_map_charge_finish(&mem);
 		return ERR_PTR(-ENOMEM);
@@ -365,7 +392,10 @@ static void array_map_free(struct bpf_map *map)
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
 		bpf_array_free_percpu(array);
 
-	bpf_map_area_free(array);
+	if (array->map.map_flags & BPF_F_MMAPABLE)
+		bpf_map_area_free((void *)round_down((long)array, PAGE_SIZE));
+	else
+		bpf_map_area_free(array);
 }
 
 static void array_map_seq_show_elem(struct bpf_map *map, void *key,
@@ -444,6 +474,56 @@ static int array_map_check_btf(const struct bpf_map *map,
 	return 0;
 }
 
+void array_map_mmap_close(struct vm_area_struct *vma)
+{
+	struct bpf_array *array = vma->vm_file->private_data;
+
+	mutex_lock(&array->map.freeze_mutex);
+	if (vma->vm_flags & VM_WRITE)
+		array->map.writecnt--;
+	mutex_unlock(&array->map.freeze_mutex);
+
+	bpf_map_put(&array->map);
+}
+
+static vm_fault_t array_map_mmap_fault(struct vm_fault *vmf)
+{
+	struct bpf_array *array = vmf->vma->vm_file->private_data;
+	void *p = array->value + (vmf->pgoff << PAGE_SHIFT);
+
+	vmf->page = vmalloc_to_page(p);
+	/* bump page refcount, it will be decremented by kernel on unmap */
+	get_page(vmf->page);
+
+	return 0;
+}
+
+static const struct vm_operations_struct array_map_vmops = {
+	.close		= array_map_mmap_close,
+	.fault		= array_map_mmap_fault,
+};
+
+int array_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	u64 data_size, vma_size;
+
+	if (!(map->map_flags & BPF_F_MMAPABLE))
+		return -EINVAL;
+
+	data_size = (u64)array->elem_size * map->max_entries;
+	data_size = round_up(data_size, PAGE_SIZE);
+	vma_size = vma->vm_end - vma->vm_start;
+	if (vma_size != data_size)
+		return -EINVAL;
+
+	vma->vm_ops = &array_map_vmops;
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+	vma->vm_private_data = array;
+
+	return 0;
+}
+
 const struct bpf_map_ops array_map_ops = {
 	.map_alloc_check = array_map_alloc_check,
 	.map_alloc = array_map_alloc,
@@ -455,6 +535,7 @@ const struct bpf_map_ops array_map_ops = {
 	.map_gen_lookup = array_map_gen_lookup,
 	.map_direct_value_addr = array_map_direct_value_addr,
 	.map_direct_value_meta = array_map_direct_value_meta,
+	.map_mmap = array_map_mmap,
 	.map_seq_show_elem = array_map_seq_show_elem,
 	.map_check_btf = array_map_check_btf,
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6d9ce95e5a8d..c6ff1034c2f6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -425,6 +425,43 @@ static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf,
 	return -EINVAL;
 }
 
+static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct bpf_map *map = filp->private_data;
+	int err;
+
+	if (!map->ops->map_mmap || map_value_has_spin_lock(map))
+		return -ENOTSUPP;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	mutex_lock(&map->freeze_mutex);
+
+	if ((vma->vm_flags & VM_WRITE) && map->frozen) {
+		err = -EPERM;
+		goto out;
+	}
+
+	map = bpf_map_inc(map, false);
+	if (IS_ERR(map)) {
+		err = PTR_ERR(map);
+		goto out;
+	}
+
+	err = map->ops->map_mmap(map, vma);
+	if (err) {
+		bpf_map_put(map);
+		goto out;
+	}
+
+	if (vma->vm_flags & VM_WRITE)
+		map->writecnt++;
+out:
+	mutex_unlock(&map->freeze_mutex);
+	return err;
+}
+
 const struct file_operations bpf_map_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= bpf_map_show_fdinfo,
@@ -432,6 +469,7 @@ const struct file_operations bpf_map_fops = {
 	.release	= bpf_map_release,
 	.read		= bpf_dummy_read,
 	.write		= bpf_dummy_write,
+	.mmap		= bpf_map_mmap,
 };
 
 int bpf_map_new_fd(struct bpf_map *map, int flags)
@@ -577,6 +615,7 @@ static int map_create(union bpf_attr *attr)
 
 	atomic_set(&map->refcnt, 1);
 	atomic_set(&map->usercnt, 1);
+	mutex_init(&map->freeze_mutex);
 
 	if (attr->btf_key_type_id || attr->btf_value_type_id) {
 		struct btf *btf;
@@ -1173,6 +1212,13 @@ static int map_freeze(const union bpf_attr *attr)
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+
+	mutex_lock(&map->freeze_mutex);
+
+	if (map->writecnt) {
+		err = -EBUSY;
+		goto err_put;
+	}
 	if (READ_ONCE(map->frozen)) {
 		err = -EBUSY;
 		goto err_put;
@@ -1184,6 +1230,7 @@ static int map_freeze(const union bpf_attr *attr)
 
 	WRITE_ONCE(map->frozen, true);
 err_put:
+	mutex_unlock(&map->freeze_mutex);
 	fdput(f);
 	return err;
 }
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index df6809a76404..bb39b53622d9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -346,6 +346,9 @@ enum bpf_attach_type {
 /* Clone map from listener for newly accepted socket */
 #define BPF_F_CLONE		(1U << 9)
 
+/* Enable memory-mapping BPF map */
+#define BPF_F_MMAPABLE		(1U << 10)
+
 /* flags for BPF_PROG_QUERY */
 #define BPF_F_QUERY_EFFECTIVE	(1U << 0)
 
-- 
2.17.1


* [PATCH v3 bpf-next 2/3] libbpf: make global data internal arrays mmap()-able, if possible
  2019-11-13  3:15 [PATCH v3 bpf-next 0/3] Add support for memory-mapping BPF array maps Andrii Nakryiko
  2019-11-13  3:15 ` [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
@ 2019-11-13  3:15 ` Andrii Nakryiko
  2019-11-13 20:21   ` John Fastabend
  2019-11-13  3:15 ` [PATCH v3 bpf-next 3/3] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests Andrii Nakryiko
  2 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13  3:15 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add detection of BPF_F_MMAPABLE flag support for arrays and set it as an extra
flag on internal global data maps, if supported by the kernel. This allows users
to memory-map global data and use it without BPF map operations, greatly
simplifying the user experience.
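
On older kernels the flag is simply not set, so an application that wants to
mmap() global data can check the effective map flags first and fall back to
regular map operations otherwise. A rough sketch (map name, obj, sz and buf
are placeholders, not part of this patch):

  struct bpf_map *bss = bpf_object__find_map_by_name(obj, "my_prog.bss");
  const struct bpf_map_def *def = bpf_map__def(bss);
  void *data = MAP_FAILED;

  if (def->map_flags & BPF_F_MMAPABLE)
      data = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED,
                  bpf_map__fd(bss), 0);
  if (data == MAP_FAILED) {
      /* kernel too old (or mmap() failed): fall back to map operations */
      int zero = 0;

      bpf_map_lookup_elem(bpf_map__fd(bss), &zero, buf);
  }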

Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/libbpf.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 96ef18cfeffb..7e2be1288fa1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -142,6 +142,8 @@ struct bpf_capabilities {
 	__u32 btf_func:1;
 	/* BTF_KIND_VAR and BTF_KIND_DATASEC support */
 	__u32 btf_datasec:1;
+	/* BPF_F_MMAPABLE is supported for arrays */
+	__u32 array_mmap:1;
 };
 
 /*
@@ -856,8 +858,6 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 		pr_warn("failed to alloc map name\n");
 		return -ENOMEM;
 	}
-	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu.\n",
-		 map_name, map->sec_idx, map->sec_offset);
 
 	def = &map->def;
 	def->type = BPF_MAP_TYPE_ARRAY;
@@ -865,6 +865,12 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 	def->value_size = data->d_size;
 	def->max_entries = 1;
 	def->map_flags = type == LIBBPF_MAP_RODATA ? BPF_F_RDONLY_PROG : 0;
+	if (obj->caps.array_mmap)
+		def->map_flags |= BPF_F_MMAPABLE;
+
+	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
+		 map_name, map->sec_idx, map->sec_offset, def->map_flags);
+
 	if (data_buff) {
 		*data_buff = malloc(data->d_size);
 		if (!*data_buff) {
@@ -2160,6 +2166,27 @@ static int bpf_object__probe_btf_datasec(struct bpf_object *obj)
 	return 0;
 }
 
+static int bpf_object__probe_array_mmap(struct bpf_object *obj)
+{
+	struct bpf_create_map_attr attr = {
+		.map_type = BPF_MAP_TYPE_ARRAY,
+		.map_flags = BPF_F_MMAPABLE,
+		.key_size = sizeof(int),
+		.value_size = sizeof(int),
+		.max_entries = 1,
+	};
+	int fd;
+
+	fd = bpf_create_map_xattr(&attr);
+	if (fd >= 0) {
+		obj->caps.array_mmap = 1;
+		close(fd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static int
 bpf_object__probe_caps(struct bpf_object *obj)
 {
@@ -2168,6 +2195,7 @@ bpf_object__probe_caps(struct bpf_object *obj)
 		bpf_object__probe_global_data,
 		bpf_object__probe_btf_func,
 		bpf_object__probe_btf_datasec,
+		bpf_object__probe_array_mmap,
 	};
 	int i, ret;
 
-- 
2.17.1


* [PATCH v3 bpf-next 3/3] selftests/bpf: add BPF_TYPE_MAP_ARRAY mmap() tests
  2019-11-13  3:15 [PATCH v3 bpf-next 0/3] Add support for memory-mapping BPF array maps Andrii Nakryiko
  2019-11-13  3:15 ` [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
  2019-11-13  3:15 ` [PATCH v3 bpf-next 2/3] libbpf: make global data internal arrays mmap()-able, if possible Andrii Nakryiko
@ 2019-11-13  3:15 ` Andrii Nakryiko
  2 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13  3:15 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add selftests validating mmap()-ing of BPF array maps: both single-element and
multi-element ones. Check that plain bpf_map_update_elem() and
bpf_map_lookup_elem() work correctly with a memory-mapped array. Also convert
the CO-RE relocation tests to use memory-mapped views of global data.

Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../selftests/bpf/prog_tests/core_reloc.c     |  45 +++--
 tools/testing/selftests/bpf/prog_tests/mmap.c | 170 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_mmap.c |  41 +++++
 3 files changed, 238 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mmap.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_mmap.c

diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
index f94bd071536b..ec9e2fdd6b89 100644
--- a/tools/testing/selftests/bpf/prog_tests/core_reloc.c
+++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include "progs/core_reloc_types.h"
+#include <sys/mman.h>
 
 #define STRUCT_TO_CHAR_PTR(struct_name) (const char *)&(struct struct_name)
 
@@ -453,8 +454,15 @@ struct data {
 	char out[256];
 };
 
+static size_t roundup_page(size_t sz)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	return (sz + page_size - 1) / page_size * page_size;
+}
+
 void test_core_reloc(void)
 {
+	const size_t mmap_sz = roundup_page(sizeof(struct data));
 	struct bpf_object_load_attr load_attr = {};
 	struct core_reloc_test_case *test_case;
 	const char *tp_name, *probe_name;
@@ -463,8 +471,8 @@ void test_core_reloc(void)
 	struct bpf_map *data_map;
 	struct bpf_program *prog;
 	struct bpf_object *obj;
-	const int zero = 0;
-	struct data data;
+	struct data *data;
+	void *mmap_data = NULL;
 
 	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
 		test_case = &test_cases[i];
@@ -476,8 +484,7 @@ void test_core_reloc(void)
 		);
 
 		obj = bpf_object__open_file(test_case->bpf_obj_file, &opts);
-		if (CHECK(IS_ERR_OR_NULL(obj), "obj_open",
-			  "failed to open '%s': %ld\n",
+		if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n",
 			  test_case->bpf_obj_file, PTR_ERR(obj)))
 			continue;
 
@@ -519,24 +526,22 @@ void test_core_reloc(void)
 		if (CHECK(!data_map, "find_data_map", "data map not found\n"))
 			goto cleanup;
 
-		memset(&data, 0, sizeof(data));
-		memcpy(data.in, test_case->input, test_case->input_len);
-
-		err = bpf_map_update_elem(bpf_map__fd(data_map),
-					  &zero, &data, 0);
-		if (CHECK(err, "update_data_map",
-			  "failed to update .data map: %d\n", err))
+		mmap_data = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
+				 MAP_SHARED, bpf_map__fd(data_map), 0);
+		if (CHECK(mmap_data == MAP_FAILED, "mmap",
+			  ".bss mmap failed: %d", errno)) {
+			mmap_data = NULL;
 			goto cleanup;
+		}
+		data = mmap_data;
+
+		memset(mmap_data, 0, sizeof(*data));
+		memcpy(data->in, test_case->input, test_case->input_len);
 
 		/* trigger test run */
 		usleep(1);
 
-		err = bpf_map_lookup_elem(bpf_map__fd(data_map), &zero, &data);
-		if (CHECK(err, "get_result",
-			  "failed to get output data: %d\n", err))
-			goto cleanup;
-
-		equal = memcmp(data.out, test_case->output,
+		equal = memcmp(data->out, test_case->output,
 			       test_case->output_len) == 0;
 		if (CHECK(!equal, "check_result",
 			  "input/output data don't match\n")) {
@@ -548,12 +553,16 @@ void test_core_reloc(void)
 			}
 			for (j = 0; j < test_case->output_len; j++) {
 				printf("output byte #%d: EXP 0x%02hhx GOT 0x%02hhx\n",
-				       j, test_case->output[j], data.out[j]);
+				       j, test_case->output[j], data->out[j]);
 			}
 			goto cleanup;
 		}
 
 cleanup:
+		if (mmap_data) {
+			CHECK_FAIL(munmap(mmap_data, mmap_sz));
+			mmap_data = NULL;
+		}
 		if (!IS_ERR_OR_NULL(link)) {
 			bpf_link__destroy(link);
 			link = NULL;
diff --git a/tools/testing/selftests/bpf/prog_tests/mmap.c b/tools/testing/selftests/bpf/prog_tests/mmap.c
new file mode 100644
index 000000000000..ef3a4f926764
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/mmap.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <sys/mman.h>
+
+struct map_data {
+	__u64 val[3];
+};
+
+struct bss_data {
+	__u64 in_val;
+	__u64 out_val;
+};
+
+static size_t roundup_page(size_t sz)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	return (sz + page_size - 1) / page_size * page_size;
+}
+
+void test_mmap(void)
+{
+	const char *file = "test_mmap.o";
+	const char *probe_name = "raw_tracepoint/sys_enter";
+	const char *tp_name = "sys_enter";
+	const size_t bss_sz = roundup_page(sizeof(struct bss_data));
+	const size_t map_sz = roundup_page(sizeof(struct map_data));
+	int err, duration = 0, i, data_map_fd, zero = 0, one = 1, two = 2;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_link *link = NULL;
+	struct bpf_map *data_map, *bss_map;
+	void *bss_mmaped = NULL, *map_mmaped = NULL, *tmp_mmaped;
+	volatile struct bss_data *bss_data;
+	volatile struct map_data *map_data;
+	__u64 val = 0;
+
+	obj = bpf_object__open_file(file, NULL);
+	if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n",
+		  file, PTR_ERR(obj)))
+		return;
+	prog = bpf_object__find_program_by_title(obj, probe_name);
+	if (CHECK(!prog, "find_probe", "prog '%s' not found\n", probe_name))
+		goto cleanup;
+	err = bpf_object__load(obj);
+	if (CHECK(err, "obj_load", "failed to load prog '%s': %d\n",
+		  probe_name, err))
+		goto cleanup;
+
+	bss_map = bpf_object__find_map_by_name(obj, "test_mma.bss");
+	if (CHECK(!bss_map, "find_bss_map", ".bss map not found\n"))
+		goto cleanup;
+	data_map = bpf_object__find_map_by_name(obj, "data_map");
+	if (CHECK(!data_map, "find_data_map", "data_map map not found\n"))
+		goto cleanup;
+	data_map_fd = bpf_map__fd(data_map);
+
+	bss_mmaped = mmap(NULL, bss_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+			  bpf_map__fd(bss_map), 0);
+	if (CHECK(bss_mmaped == MAP_FAILED, "bss_mmap",
+		  ".bss mmap failed: %d\n", errno)) {
+		bss_mmaped = NULL;
+		goto cleanup;
+	}
+	/* map as R/W first */
+	map_mmaped = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+			  data_map_fd, 0);
+	if (CHECK(map_mmaped == MAP_FAILED, "data_mmap",
+		  "data_map mmap failed: %d\n", errno)) {
+		map_mmaped = NULL;
+		goto cleanup;
+	}
+
+	bss_data = bss_mmaped;
+	map_data = map_mmaped;
+
+	CHECK_FAIL(bss_data->in_val);
+	CHECK_FAIL(bss_data->out_val);
+	CHECK_FAIL(map_data->val[0]);
+	CHECK_FAIL(map_data->val[1]);
+	CHECK_FAIL(map_data->val[2]);
+
+	link = bpf_program__attach_raw_tracepoint(prog, tp_name);
+	if (CHECK(IS_ERR(link), "attach_raw_tp", "err %ld\n", PTR_ERR(link)))
+		goto cleanup;
+
+	bss_data->in_val = 123;
+	val = 111;
+	CHECK_FAIL(bpf_map_update_elem(data_map_fd, &zero, &val, 0));
+
+	usleep(1);
+
+	CHECK_FAIL(bss_data->in_val != 123);
+	CHECK_FAIL(bss_data->out_val != 123);
+	CHECK_FAIL(map_data->val[0] != 111);
+	CHECK_FAIL(map_data->val[1] != 222);
+	CHECK_FAIL(map_data->val[2] != 123);
+
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &zero, &val));
+	CHECK_FAIL(val != 111);
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &one, &val));
+	CHECK_FAIL(val != 222);
+	CHECK_FAIL(bpf_map_lookup_elem(data_map_fd, &two, &val));
+	CHECK_FAIL(val != 123);
+
+	/* data_map freeze should fail due to R/W mmap() */
+	err = bpf_map_freeze(data_map_fd);
+	if (CHECK(!err || errno != EBUSY, "no_freeze",
+		  "data_map freeze succeeded: err=%d, errno=%d\n", err, errno))
+		goto cleanup;
+
+	/* unmap R/W mapping */
+	err = munmap(map_mmaped, map_sz);
+	map_mmaped = NULL;
+	if (CHECK(err, "data_map_munmap", "data_map munmap failed: %d\n", errno))
+		goto cleanup;
+
+	/* re-map as R/O now */
+	map_mmaped = mmap(NULL, map_sz, PROT_READ, MAP_SHARED, data_map_fd, 0);
+	if (CHECK(map_mmaped == MAP_FAILED, "data_mmap",
+		  "data_map R/O mmap failed: %d\n", errno)) {
+		map_mmaped = NULL;
+		goto cleanup;
+	}
+	map_data = map_mmaped;
+
+	/* map/unmap in a loop to test ref counting */
+	for (i = 0; i < 10; i++) {
+		int flags = i % 2 ? PROT_READ : PROT_WRITE;
+		void *p;
+
+		p = mmap(NULL, map_sz, flags, MAP_SHARED, data_map_fd, 0);
+		if (CHECK_FAIL(p == MAP_FAILED))
+			goto cleanup;
+		err = munmap(p, map_sz);
+		if (CHECK_FAIL(err))
+			goto cleanup;
+	}
+
+	/* data_map freeze should now succeed due to no R/W mapping */
+	err = bpf_map_freeze(data_map_fd);
+	if (CHECK(err, "freeze", "data_map freeze failed: err=%d, errno=%d\n",
+		  err, errno))
+		goto cleanup;
+
+	/* mapping as R/W now should fail */
+	tmp_mmaped = mmap(NULL, map_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+			  data_map_fd, 0);
+	if (CHECK(tmp_mmaped != MAP_FAILED, "data_mmap",
+		  "data_map mmap succeeded\n")) {
+		munmap(tmp_mmaped, map_sz);
+		goto cleanup;
+	}
+
+	bss_data->in_val = 321;
+	usleep(1);
+	CHECK_FAIL(bss_data->in_val != 321);
+	CHECK_FAIL(bss_data->out_val != 321);
+	CHECK_FAIL(map_data->val[0] != 111);
+	CHECK_FAIL(map_data->val[1] != 222);
+	CHECK_FAIL(map_data->val[2] != 321);
+
+cleanup:
+	if (bss_mmaped)
+		CHECK_FAIL(munmap(bss_mmaped, bss_sz));
+	if (map_mmaped)
+		CHECK_FAIL(munmap(map_mmaped, map_sz));
+	if (!IS_ERR_OR_NULL(link))
+		bpf_link__destroy(link);
+	bpf_object__close(obj);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_mmap.c b/tools/testing/selftests/bpf/progs/test_mmap.c
new file mode 100644
index 000000000000..98e3ea36b88e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_mmap.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <linux/bpf.h>
+#include <stdint.h>
+#include "bpf_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 3);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__type(key, __u32);
+	__type(value, __u64);
+} data_map SEC(".maps");
+
+static volatile __u64 in_val;
+static volatile __u64 out_val;
+
+SEC("raw_tracepoint/sys_enter")
+int test_mmap(void *ctx)
+{
+	int zero = 0, one = 1, two = 2;
+	__u64 val, *p;
+
+	out_val = in_val;
+
+	/* data_map[2] = in_val */
+	bpf_map_update_elem(&data_map, &two, (const void *)&in_val, 0);
+
+	/* data_map[1] = data_map[0] * 2; */
+	p = bpf_map_lookup_elem(&data_map, &zero);
+	if (p) {
+		val = (*p) * 2;
+		bpf_map_update_elem(&data_map, &one, &val, 0);
+	}
+
+	return 0;
+}
+
-- 
2.17.1


* RE: [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13  3:15 ` [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
@ 2019-11-13 20:06   ` John Fastabend
  2019-11-13 20:41     ` Andrii Nakryiko
  2019-11-13 20:38   ` Daniel Borkmann
  1 sibling, 1 reply; 11+ messages in thread
From: John Fastabend @ 2019-11-13 20:06 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Rik van Riel,
	Johannes Weiner

Andrii Nakryiko wrote:
> Add ability to memory-map contents of BPF array map. This is extremely useful
> for working with BPF global data from userspace programs. It allows to avoid
> typical bpf_map_{lookup,update}_elem operations, improving both performance
> and usability.
> 
> There had to be special considerations for map freezing, to avoid having
> writable memory view into a frozen map. To solve this issue, map freezing and
> mmap-ing is happening under mutex now:
>   - if map is already frozen, no writable mapping is allowed;
>   - if map has writable memory mappings active (accounted in map->writecnt),
>     map freezing will keep failing with -EBUSY;
>   - once number of writable memory mappings drops to zero, map freezing can be
>     performed again.
> 
> Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> can't be memory mapped either.
> 
> For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> to be mmap()'able. We also need to make sure that array data memory is
> page-sized and page-aligned, so we over-allocate memory in such a way that
> struct bpf_array is at the end of a single page of memory with array->value
> being aligned with the start of the second page. On deallocation we need to
> accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> 
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

[...]

> @@ -102,10 +106,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
>  	}
>  
>  	array_size = sizeof(*array);
> -	if (percpu)
> +	if (percpu) {
>  		array_size += (u64) max_entries * sizeof(void *);
> -	else
> -		array_size += (u64) max_entries * elem_size;
> +	} else {
> +		/* rely on vmalloc() to return page-aligned memory and
> +		 * ensure array->value is exactly page-aligned
> +		 */
> +		if (attr->map_flags & BPF_F_MMAPABLE) {
> +			array_size = round_up(array_size, PAGE_SIZE);
> +			array_size += (u64) max_entries * elem_size;
> +			array_size = round_up(array_size, PAGE_SIZE);
> +		} else {
> +			array_size += (u64) max_entries * elem_size;
> +		}
> +	}

Thought about this chunk for a bit; assuming we don't end up with lots of
small mmap'ed arrays, it should be OK. So userspace will probably need to try
to optimize this and create as few mmap'ed maps as possible.

[...]

Acked-by: John Fastabend <john.fastabend@gmail.com>

* RE: [PATCH v3 bpf-next 2/3] libbpf: make global data internal arrays mmap()-able, if possible
  2019-11-13  3:15 ` [PATCH v3 bpf-next 2/3] libbpf: make global data internal arrays mmap()-able, if possible Andrii Nakryiko
@ 2019-11-13 20:21   ` John Fastabend
  0 siblings, 0 replies; 11+ messages in thread
From: John Fastabend @ 2019-11-13 20:21 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Andrii Nakryiko wrote:
> Add detection of BPF_F_MMAPABLE flag support for arrays and add it as an extra
> flag to internal global data maps, if supported by kernel. This allows users
> to memory-map global data and use it without BPF map operations, greatly
> simplifying user experience.
> 
> Acked-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---

[...]

>  /*
> @@ -856,8 +858,6 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>  		pr_warn("failed to alloc map name\n");
>  		return -ENOMEM;
>  	}
> -	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu.\n",
> -		 map_name, map->sec_idx, map->sec_offset);
>  
>  	def = &map->def;
>  	def->type = BPF_MAP_TYPE_ARRAY;
> @@ -865,6 +865,12 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>  	def->value_size = data->d_size;
>  	def->max_entries = 1;
>  	def->map_flags = type == LIBBPF_MAP_RODATA ? BPF_F_RDONLY_PROG : 0;
> +	if (obj->caps.array_mmap)
> +		def->map_flags |= BPF_F_MMAPABLE;
> +
> +	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
> +		 map_name, map->sec_idx, map->sec_offset, def->map_flags);
> +
>  	if (data_buff) {
>  		*data_buff = malloc(data->d_size);
>  		if (!*data_buff) {
> @@ -2160,6 +2166,27 @@ static int bpf_object__probe_btf_datasec(struct bpf_object *obj)
>  	return 0;
>  }

I was a bit concerned that we should fall back to making the call without the
BPF_F_MMAPABLE flag set if it fails, but I did a quick walk through the call
path and it seems like it shouldn't fail except on vmalloc/vzalloc failures,
so this seems fine.
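
For reference, the fallback being described would look roughly like this
(hypothetical sketch of what libbpf could do, not what this patch does;
attr/fd as in the map creation attributes above):

  attr.map_flags |= BPF_F_MMAPABLE;
  fd = bpf_create_map_xattr(&attr);
  if (fd < 0) {
      /* e.g. vmalloc() failure: retry without the mmapable flag */
      attr.map_flags &= ~BPF_F_MMAPABLE;
      fd = bpf_create_map_xattr(&attr);
  }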

Acked-by: John Fastabend <john.fastabend@gmail.com>

* Re: [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13  3:15 ` [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY Andrii Nakryiko
  2019-11-13 20:06   ` John Fastabend
@ 2019-11-13 20:38   ` Daniel Borkmann
  2019-11-13 20:50     ` Andrii Nakryiko
  1 sibling, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2019-11-13 20:38 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast
  Cc: andrii.nakryiko, kernel-team, Rik van Riel, Johannes Weiner

On 11/13/19 4:15 AM, Andrii Nakryiko wrote:
> Add ability to memory-map contents of BPF array map. This is extremely useful
> for working with BPF global data from userspace programs. It allows to avoid
> typical bpf_map_{lookup,update}_elem operations, improving both performance
> and usability.
> 
> There had to be special considerations for map freezing, to avoid having
> writable memory view into a frozen map. To solve this issue, map freezing and
> mmap-ing is happening under mutex now:
>    - if map is already frozen, no writable mapping is allowed;
>    - if map has writable memory mappings active (accounted in map->writecnt),
>      map freezing will keep failing with -EBUSY;
>    - once number of writable memory mappings drops to zero, map freezing can be
>      performed again.
> 
> Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> can't be memory mapped either.
> 
> For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> to be mmap()'able. We also need to make sure that array data memory is
> page-sized and page-aligned, so we over-allocate memory in such a way that
> struct bpf_array is at the end of a single page of memory with array->value
> being aligned with the start of the second page. On deallocation we need to
> accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> 
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Overall set looks good to me! One comment below:

[...]
> @@ -117,7 +131,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
>   		return ERR_PTR(ret);
>   
>   	/* allocate all map elements and zero-initialize them */
> -	array = bpf_map_area_alloc(array_size, numa_node);
> +	if (attr->map_flags & BPF_F_MMAPABLE) {
> +		void *data;
> +
> +		/* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
> +		data = vzalloc_node(array_size, numa_node);
> +		if (!data) {
> +			bpf_map_charge_finish(&mem);
> +			return ERR_PTR(-ENOMEM);
> +		}
> +		array = data + round_up(sizeof(struct bpf_array), PAGE_SIZE)
> +			- offsetof(struct bpf_array, value);
> +	} else {
> +		array = bpf_map_area_alloc(array_size, numa_node);
> +	}

Can't we place/extend all this logic inside bpf_map_area_alloc() and
bpf_map_area_free() API instead of hard-coding it here?

Given this is a generic feature of which global data is just one consumer,
my concern is that this reintroduces the kind of issues that the mentioned API
was already trying to solve, i.e. it should fail early instead of trying hard
and triggering the OOM killer if the array is large.

Consolidating this into bpf_map_area_alloc()/bpf_map_area_free() would
make sure all the rest has same semantics.

>   	if (!array) {
>   		bpf_map_charge_finish(&mem);
>   		return ERR_PTR(-ENOMEM);
> @@ -365,7 +392,10 @@ static void array_map_free(struct bpf_map *map)
>   	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
>   		bpf_array_free_percpu(array);
>   
> -	bpf_map_area_free(array);
> +	if (array->map.map_flags & BPF_F_MMAPABLE)
> +		bpf_map_area_free((void *)round_down((long)array, PAGE_SIZE));
> +	else
> +		bpf_map_area_free(array);
>   }
>   
>   static void array_map_seq_show_elem(struct bpf_map *map, void *key,
[...]

* Re: [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13 20:06   ` John Fastabend
@ 2019-11-13 20:41     ` Andrii Nakryiko
  0 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13 20:41 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Rik van Riel, Johannes Weiner

On Wed, Nov 13, 2019 at 12:06 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > Add ability to memory-map contents of BPF array map. This is extremely useful
> > for working with BPF global data from userspace programs. It allows to avoid
> > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > and usability.
> >
> > There had to be special considerations for map freezing, to avoid having
> > writable memory view into a frozen map. To solve this issue, map freezing and
> > mmap-ing is happening under mutex now:
> >   - if map is already frozen, no writable mapping is allowed;
> >   - if map has writable memory mappings active (accounted in map->writecnt),
> >     map freezing will keep failing with -EBUSY;
> >   - once number of writable memory mappings drops to zero, map freezing can be
> >     performed again.
> >
> > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > can't be memory mapped either.
> >
> > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > to be mmap()'able. We also need to make sure that array data memory is
> > page-sized and page-aligned, so we over-allocate memory in such a way that
> > struct bpf_array is at the end of a single page of memory with array->value
> > being aligned with the start of the second page. On deallocation we need to
> > accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> >
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>
> [...]
>
> > @@ -102,10 +106,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> >       }
> >
> >       array_size = sizeof(*array);
> > -     if (percpu)
> > +     if (percpu) {
> >               array_size += (u64) max_entries * sizeof(void *);
> > -     else
> > -             array_size += (u64) max_entries * elem_size;
> > +     } else {
> > +             /* rely on vmalloc() to return page-aligned memory and
> > +              * ensure array->value is exactly page-aligned
> > +              */
> > +             if (attr->map_flags & BPF_F_MMAPABLE) {
> > +                     array_size = round_up(array_size, PAGE_SIZE);
> > +                     array_size += (u64) max_entries * elem_size;
> > +                     array_size = round_up(array_size, PAGE_SIZE);
> > +             } else {
> > +                     array_size += (u64) max_entries * elem_size;
> > +             }
> > +     }
>
> Thought about this chunk for a bit, assuming we don't end up with lots of
> small mmap arrays it should be OK. So userspace will probably need to try and
> optimize this to create as few mmaps as possible.

I think typically most explicitly declared maps won't be
BPF_F_MMAPABLE, unless the user really expects to mmap() them for use
from user-space. For global data, though, the benefits of being able
to mmap() are really great, which is why I'm making those maps
BPF_F_MMAPABLE by default, if possible.

>
> [...]
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

* Re: [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13 20:38   ` Daniel Borkmann
@ 2019-11-13 20:50     ` Andrii Nakryiko
  2019-11-13 21:10       ` Daniel Borkmann
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13 20:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Kernel Team, Rik van Riel, Johannes Weiner

On Wed, Nov 13, 2019 at 12:38 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/13/19 4:15 AM, Andrii Nakryiko wrote:
> > Add ability to memory-map contents of BPF array map. This is extremely useful
> > for working with BPF global data from userspace programs. It allows to avoid
> > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > and usability.
> >
> > There had to be special considerations for map freezing, to avoid having
> > writable memory view into a frozen map. To solve this issue, map freezing and
> > mmap-ing is happening under mutex now:
> >    - if map is already frozen, no writable mapping is allowed;
> >    - if map has writable memory mappings active (accounted in map->writecnt),
> >      map freezing will keep failing with -EBUSY;
> >    - once number of writable memory mappings drops to zero, map freezing can be
> >      performed again.
> >
> > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > can't be memory mapped either.
> >
> > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > to be mmap()'able. We also need to make sure that array data memory is
> > page-sized and page-aligned, so we over-allocate memory in such a way that
> > struct bpf_array is at the end of a single page of memory with array->value
> > being aligned with the start of the second page. On deallocation we need to
> > accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> >
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>
> Overall set looks good to me! One comment below:
>
> [...]
> > @@ -117,7 +131,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> >               return ERR_PTR(ret);
> >
> >       /* allocate all map elements and zero-initialize them */
> > -     array = bpf_map_area_alloc(array_size, numa_node);
> > +     if (attr->map_flags & BPF_F_MMAPABLE) {
> > +             void *data;
> > +
> > +             /* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
> > +             data = vzalloc_node(array_size, numa_node);
> > +             if (!data) {
> > +                     bpf_map_charge_finish(&mem);
> > +                     return ERR_PTR(-ENOMEM);
> > +             }
> > +             array = data + round_up(sizeof(struct bpf_array), PAGE_SIZE)
> > +                     - offsetof(struct bpf_array, value);
> > +     } else {
> > +             array = bpf_map_area_alloc(array_size, numa_node);
> > +     }
>
> Can't we place/extend all this logic inside bpf_map_area_alloc() and
> bpf_map_area_free() API instead of hard-coding it here?
>
> Given this is a generic feature of which global data is just one consumer,
> my concern is that this reintroduces similar issues that mentioned API was
> trying to solve already meaning failing early instead of trying hard and
> triggering OOM if the array is large.
>
> Consolidating this into bpf_map_area_alloc()/bpf_map_area_free() would
> make sure all the rest has same semantics.

So a bunch of this (e.g., the array pointer adjustment in the mmapable case)
depends on the specific layout of bpf_array, while bpf_map_area_alloc() is
called for a multitude of different maps. What we can generalize,
though, is the enforcement of vmalloc() for the mmapable case: enforce
that the size is a multiple of PAGE_SIZE, bypass kmalloc, etc. I can do that
part easily; I refrained because it would require an extra bool mmapable flag
to bpf_map_area_alloc() and a (trivial) update to 13 call sites passing
false, and I wasn't sure people would like the code churn.

As for bpf_map_area_free(), again, the adjustment is specific to
bpf_array and its memory layout w.r.t. data placement, so I don't
think we can generalize it that much.

After talking with Johannes, I'm also adding a new
vmalloc_user_node_flags() API and will specify the same RETRY_MAYFAIL and
NOWARN flags, so the behavior will stay the same.

Let me know if you want a `bool mmapable` argument added to bpf_map_area_alloc().
Also, if I'm missing how you wanted to generalize other parts,
please explain in more detail.

>
> >       if (!array) {
> >               bpf_map_charge_finish(&mem);
> >               return ERR_PTR(-ENOMEM);
> > @@ -365,7 +392,10 @@ static void array_map_free(struct bpf_map *map)
> >       if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
> >               bpf_array_free_percpu(array);
> >
> > -     bpf_map_area_free(array);
> > +     if (array->map.map_flags & BPF_F_MMAPABLE)
> > +             bpf_map_area_free((void *)round_down((long)array, PAGE_SIZE));
> > +     else
> > +             bpf_map_area_free(array);
> >   }
> >
> >   static void array_map_seq_show_elem(struct bpf_map *map, void *key,
> [...]

* Re: [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13 20:50     ` Andrii Nakryiko
@ 2019-11-13 21:10       ` Daniel Borkmann
  2019-11-13 21:14         ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2019-11-13 21:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Kernel Team, Rik van Riel, Johannes Weiner

On Wed, Nov 13, 2019 at 12:50:23PM -0800, Andrii Nakryiko wrote:
> On Wed, Nov 13, 2019 at 12:38 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 11/13/19 4:15 AM, Andrii Nakryiko wrote:
> > > Add ability to memory-map contents of BPF array map. This is extremely useful
> > > for working with BPF global data from userspace programs. It allows to avoid
> > > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > > and usability.
> > >
> > > There had to be special considerations for map freezing, to avoid having
> > > writable memory view into a frozen map. To solve this issue, map freezing and
> > > mmap-ing is happening under mutex now:
> > >    - if map is already frozen, no writable mapping is allowed;
> > >    - if map has writable memory mappings active (accounted in map->writecnt),
> > >      map freezing will keep failing with -EBUSY;
> > >    - once number of writable memory mappings drops to zero, map freezing can be
> > >      performed again.
> > >
> > > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > > can't be memory mapped either.
> > >
> > > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > > to be mmap()'able. We also need to make sure that array data memory is
> > > page-sized and page-aligned, so we over-allocate memory in such a way that
> > > struct bpf_array is at the end of a single page of memory with array->value
> > > being aligned with the start of the second page. On deallocation we need to
> > > accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> > >
> > > Cc: Rik van Riel <riel@surriel.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Acked-by: Song Liu <songliubraving@fb.com>
> > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> >
> > Overall set looks good to me! One comment below:
> >
> > [...]
> > > @@ -117,7 +131,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> > >               return ERR_PTR(ret);
> > >
> > >       /* allocate all map elements and zero-initialize them */
> > > -     array = bpf_map_area_alloc(array_size, numa_node);
> > > +     if (attr->map_flags & BPF_F_MMAPABLE) {
> > > +             void *data;
> > > +
> > > +             /* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
> > > +             data = vzalloc_node(array_size, numa_node);
> > > +             if (!data) {
> > > +                     bpf_map_charge_finish(&mem);
> > > +                     return ERR_PTR(-ENOMEM);
> > > +             }
> > > +             array = data + round_up(sizeof(struct bpf_array), PAGE_SIZE)
> > > +                     - offsetof(struct bpf_array, value);
> > > +     } else {
> > > +             array = bpf_map_area_alloc(array_size, numa_node);
> > > +     }
> >
> > Can't we place/extend all this logic inside bpf_map_area_alloc() and
> > bpf_map_area_free() API instead of hard-coding it here?
> >
> > Given this is a generic feature of which global data is just one consumer,
> > my concern is that this reintroduces similar issues that mentioned API was
> > trying to solve already meaning failing early instead of trying hard and
> > triggering OOM if the array is large.
> >
> > Consolidating this into bpf_map_area_alloc()/bpf_map_area_free() would
> > make sure all the rest has same semantics.
> 
> So a bunch of this (e.g, array pointer adjustment in mmapable case)
> depends on specific layout of bpf_array, while bpf_map_area_alloc() is
> called for multitude of different maps. What we can generalize,
> though, is this enforcement of vmalloc() for mmapable case: enforce
> size is multiple of PAGE_SIZE, bypass kmalloc, etc. I can do that part
> easily, I refrained because it would require extra bool mmapable flag
> to bpf_map_area_alloc() and (trivial) update to 13 call sites passing
> false, I wasn't sure people would like code churn.
> 
> As for bpf_map_areas_free(), again, adjustment is specific to
> bpf_array and its memory layout w.r.t. data placement, so I don't
> think we can generalize it that much.
> 
> After talking with Johannes, I'm also adding new
> vmalloc_user_node_flags() API and will specify same RETRY_MAYFAIL and
> NOWARN flags, so behavior will stay the same.
> 
> Let me know if you want `bool mmapable` added to bpf_map_area_alloc().
> And also if I'm missing how you wanted to generalize other parts,
> please explain in more details.

Why change all call-sites? You could have two pairs of API helpers,
e.g. bpf_map_area_{,mmapable}_alloc() and bpf_map_area_{,mmapable}_free(),
which both call into __bpf_map_area_alloc() and __bpf_map_area_free();
those are private and common to both, so whenever we need to change
internals, they are fixed for all users. Call-sites would remain as-is;
just for the array map you'd select between the two.
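
I.e. roughly along these lines (signatures illustrative only):

  static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable);

  void *bpf_map_area_alloc(u64 size, int numa_node)
  {
      return __bpf_map_area_alloc(size, numa_node, false);
  }

  void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
  {
      return __bpf_map_area_alloc(size, numa_node, true);
  }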

> > >       if (!array) {
> > >               bpf_map_charge_finish(&mem);
> > >               return ERR_PTR(-ENOMEM);
> > > @@ -365,7 +392,10 @@ static void array_map_free(struct bpf_map *map)
> > >       if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
> > >               bpf_array_free_percpu(array);
> > >
> > > -     bpf_map_area_free(array);
> > > +     if (array->map.map_flags & BPF_F_MMAPABLE)
> > > +             bpf_map_area_free((void *)round_down((long)array, PAGE_SIZE));
> > > +     else
> > > +             bpf_map_area_free(array);
> > >   }
> > >
> > >   static void array_map_seq_show_elem(struct bpf_map *map, void *key,
> > [...]

* Re: [PATCH v3 bpf-next 1/3] bpf: add mmap() support for BPF_MAP_TYPE_ARRAY
  2019-11-13 21:10       ` Daniel Borkmann
@ 2019-11-13 21:14         ` Andrii Nakryiko
  0 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2019-11-13 21:14 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Kernel Team, Rik van Riel, Johannes Weiner

On Wed, Nov 13, 2019 at 1:10 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On Wed, Nov 13, 2019 at 12:50:23PM -0800, Andrii Nakryiko wrote:
> > On Wed, Nov 13, 2019 at 12:38 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > On 11/13/19 4:15 AM, Andrii Nakryiko wrote:
> > > > Add ability to memory-map contents of BPF array map. This is extremely useful
> > > > for working with BPF global data from userspace programs. It allows to avoid
> > > > typical bpf_map_{lookup,update}_elem operations, improving both performance
> > > > and usability.
> > > >
> > > > There had to be special considerations for map freezing, to avoid having
> > > > writable memory view into a frozen map. To solve this issue, map freezing and
> > > > mmap-ing is happening under mutex now:
> > > >    - if map is already frozen, no writable mapping is allowed;
> > > >    - if map has writable memory mappings active (accounted in map->writecnt),
> > > >      map freezing will keep failing with -EBUSY;
> > > >    - once number of writable memory mappings drops to zero, map freezing can be
> > > >      performed again.
> > > >
> > > > Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
> > > > can't be memory mapped either.
> > > >
> > > > For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
> > > > to be mmap()'able. We also need to make sure that array data memory is
> > > > page-sized and page-aligned, so we over-allocate memory in such a way that
> > > > struct bpf_array is at the end of a single page of memory with array->value
> > > > being aligned with the start of the second page. On deallocation we need to
> > > > accomodate this memory arrangement to free vmalloc()'ed memory correctly.
> > > >
> > > > Cc: Rik van Riel <riel@surriel.com>
> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > > Acked-by: Song Liu <songliubraving@fb.com>
> > > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > >
> > > Overall set looks good to me! One comment below:
> > >
> > > [...]
> > > > @@ -117,7 +131,20 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> > > >               return ERR_PTR(ret);
> > > >
> > > >       /* allocate all map elements and zero-initialize them */
> > > > -     array = bpf_map_area_alloc(array_size, numa_node);
> > > > +     if (attr->map_flags & BPF_F_MMAPABLE) {
> > > > +             void *data;
> > > > +
> > > > +             /* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */
> > > > +             data = vzalloc_node(array_size, numa_node);
> > > > +             if (!data) {
> > > > +                     bpf_map_charge_finish(&mem);
> > > > +                     return ERR_PTR(-ENOMEM);
> > > > +             }
> > > > +             array = data + round_up(sizeof(struct bpf_array), PAGE_SIZE)
> > > > +                     - offsetof(struct bpf_array, value);
> > > > +     } else {
> > > > +             array = bpf_map_area_alloc(array_size, numa_node);
> > > > +     }
> > >
> > > Can't we place/extend all this logic inside bpf_map_area_alloc() and
> > > bpf_map_area_free() API instead of hard-coding it here?
> > >
> > > Given this is a generic feature of which global data is just one consumer,
> > > my concern is that this reintroduces similar issues that mentioned API was
> > > trying to solve already meaning failing early instead of trying hard and
> > > triggering OOM if the array is large.
> > >
> > > Consolidating this into bpf_map_area_alloc()/bpf_map_area_free() would
> > > make sure all the rest has same semantics.
> >
> > So a bunch of this (e.g, array pointer adjustment in mmapable case)
> > depends on specific layout of bpf_array, while bpf_map_area_alloc() is
> > called for multitude of different maps. What we can generalize,
> > though, is this enforcement of vmalloc() for mmapable case: enforce
> > size is multiple of PAGE_SIZE, bypass kmalloc, etc. I can do that part
> > easily, I refrained because it would require extra bool mmapable flag
> > to bpf_map_area_alloc() and (trivial) update to 13 call sites passing
> > false, I wasn't sure people would like code churn.
> >
> > As for bpf_map_areas_free(), again, adjustment is specific to
> > bpf_array and its memory layout w.r.t. data placement, so I don't
> > think we can generalize it that much.
> >
> > After talking with Johannes, I'm also adding new
> > vmalloc_user_node_flags() API and will specify same RETRY_MAYFAIL and
> > NOWARN flags, so behavior will stay the same.
> >
> > Let me know if you want `bool mmapable` added to bpf_map_area_alloc().
> > And also if I'm missing how you wanted to generalize other parts,
> > please explain in more details.
>
> Why changing all call-sites? You could have two pair of API helpers,
> e.g. bpf_map_area_{,mmapable}_alloc() and bpf_map_area_{,mmapable}_free()
> and they both call into __bpf_map_area_alloc() and __bpf_map_area_free()
> which are private and common to both, so whenever we need to go and change
> internals, they are fixed for all users. Call-sites would remain as-is
> just for array map you'd select between the two.

OK, I can do bpf_map_area_alloc() and bpf_map_area_mmapable_alloc().
I still don't see how I can do bpf_map_area_mmapable_free(), though: the
normal bpf_map_area_free() will be able to free mmapable-allocated memory
if the caller adjusts the pointer correctly, and that adjustment is specific
to each use case, so it can't be generalized.

>
> > > >       if (!array) {
> > > >               bpf_map_charge_finish(&mem);
> > > >               return ERR_PTR(-ENOMEM);
> > > > @@ -365,7 +392,10 @@ static void array_map_free(struct bpf_map *map)
> > > >       if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
> > > >               bpf_array_free_percpu(array);
> > > >
> > > > -     bpf_map_area_free(array);
> > > > +     if (array->map.map_flags & BPF_F_MMAPABLE)
> > > > +             bpf_map_area_free((void *)round_down((long)array, PAGE_SIZE));
> > > > +     else
> > > > +             bpf_map_area_free(array);
> > > >   }
> > > >
> > > >   static void array_map_seq_show_elem(struct bpf_map *map, void *key,
> > > [...]
