All of lore.kernel.org
* [PATCH RFC bpf-next 0/2] Mmapable task local storage.
@ 2022-03-24 23:41 Hao Luo
  2022-03-24 23:41 ` [PATCH RFC bpf-next 1/2] bpf: Mmapable " Hao Luo
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Hao Luo @ 2022-03-24 23:41 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: yhs, KP Singh, Martin KaFai Lau, Song Liu, bpf, linux-kernel, Hao Luo

Some map types support the mmap operation, which allows userspace to
communicate with BPF programs directly. Currently only arraymap
and ringbuf implement mmap.

However, in some use cases, when multiple program instances can
run concurrently, globally mmapable memory can cause races. In that
case, userspace must provide the synchronization needed to
coordinate use of the mapped global data, which can become a
bottleneck.

A mmapable local storage would help in that case. This patch set
adds one.

Mmap isn't a BPF syscall, so unprivileged users can also use it to
interact with maps.

Currently the only way to allocate a mmapable map area is vmalloc(),
and it is used only at map allocation time. vmalloc() may sleep,
so it is not suitable for maps that may allocate memory in an
atomic context, such as local storage. Local storage uses kmalloc()
with GFP_ATOMIC, which doesn't sleep. This patch uses kmalloc()
with GFP_ATOMIC for the mmapable map area as well.

Mmapable memory must be page aligned, so we deliberately allocate
more memory than necessary to obtain an address at which
sdata->data is aligned to a page boundary. The calculation of the
mmapable allocation size and the actual allocation/deallocation
are packaged in three functions:

 - bpf_map_mmapable_alloc_size()
 - bpf_map_mmapable_kzalloc()
 - bpf_map_mmapable_kfree()

BPF local storage uses them to provide a generic mmap API:

 - bpf_local_storage_mmap()

And task local storage adds the mmap callback:

 - task_storage_map_mmap()

When an application calls mmap() on a task local storage map, it
gets its own local storage.

Overall, mmapable local storage trades memory for flexibility and
efficiency. It introduces memory fragmentation but can make
programs stateless, which is useful in some cases.
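The over-allocation trick described above can be modeled in a small
userspace sketch (hedged illustration only, not the kernel code:
PAGE_ALIGN and the helper names are stand-ins, and calloc()/free()
replace bpf_map_kzalloc()/kfree()). The raw allocation pointer is
stashed just below the returned address so the free path can recover it:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Minimal size guaranteeing a return address where addr + offset is
 * page aligned, with room for a back-pointer just below addr. */
static size_t mmapable_alloc_size(size_t offset, size_t size)
{
	return offset + sizeof(void *) + PAGE_ALIGN(size) + PAGE_SIZE;
}

/* Returns an address such that (addr + offset) is page aligned.
 * 'offset' must be aligned to sizeof(void *). */
static void *mmapable_zalloc(size_t offset, size_t size)
{
	void *raw = calloc(1, mmapable_alloc_size(offset, size));
	uintptr_t ret;

	if (!raw)
		return NULL;
	/* Pick ret so that ret + offset lands on a page boundary and
	 * the word below ret still lies inside the allocation. */
	ret = PAGE_ALIGN((uintptr_t)raw + offset + sizeof(void *)) - offset;
	*(void **)(ret - sizeof(void *)) = raw;	/* stash raw pointer */
	return (void *)ret;
}

static void mmapable_free(void *ptr)
{
	/* Recover the stashed raw allocation address and free it. */
	free(*(void **)((uintptr_t)ptr - sizeof(void *)));
}
```

Mapping this back to the series, mmapable_zalloc() plays the role of
bpf_map_mmapable_kzalloc() with offset set to the distance from the
start of the selem to sdata->data.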

Hao Luo (2):
  bpf: Mmapable local storage.
  selftests/bpf: Test mmapable task local storage.

 include/linux/bpf.h                           |  4 +
 include/linux/bpf_local_storage.h             |  5 +-
 kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
 kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
 kernel/bpf/syscall.c                          | 67 +++++++++++++++++
 .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
 .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
 7 files changed, 257 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c

-- 
2.35.1.1021.g381101b075-goog


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC bpf-next 1/2] bpf: Mmapable local storage.
  2022-03-24 23:41 [PATCH RFC bpf-next 0/2] Mmapable task local storage Hao Luo
@ 2022-03-24 23:41 ` Hao Luo
  2022-03-24 23:41 ` [PATCH RFC bpf-next 2/2] selftests/bpf: Test mmapable task " Hao Luo
  2022-03-25 19:16 ` [PATCH RFC bpf-next 0/2] Mmapable " Yonghong Song
  2 siblings, 0 replies; 20+ messages in thread
From: Hao Luo @ 2022-03-24 23:41 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: yhs, KP Singh, Martin KaFai Lau, Song Liu, bpf, linux-kernel, Hao Luo

Some map types support the mmap operation, which allows userspace to
communicate with BPF programs directly. Currently only arraymap
and ringbuf implement mmap.

However, in some use cases, when multiple program instances can
run concurrently, globally mmapable memory can cause races. In that
case, userspace must provide the synchronization needed to
coordinate use of the mapped global data, which can become a
bottleneck.

A mmapable local storage would help in that case. This patch
adds one.

Mmap isn't a BPF syscall, so unprivileged users can also use it to
interact with maps.

Currently the only way to allocate a mmapable map area is vmalloc(),
and it is used only at map allocation time. vmalloc() may sleep,
so it is not suitable for maps that may allocate memory in an
atomic context, such as local storage. Local storage uses kmalloc()
with GFP_ATOMIC, which doesn't sleep. This patch uses kmalloc()
with GFP_ATOMIC for the mmapable map area as well.

Mmapable memory must be page aligned, so we deliberately allocate
more memory than necessary to obtain an address at which
sdata->data is aligned to a page boundary. The calculation of the
mmapable allocation size and the actual allocation/deallocation
are packaged in three functions:

 - bpf_map_mmapable_alloc_size()
 - bpf_map_mmapable_kzalloc()
 - bpf_map_mmapable_kfree()

BPF local storage uses them to provide a generic mmap API:

 - bpf_local_storage_mmap()

And task local storage adds the mmap callback:

 - task_storage_map_mmap()

When an application calls mmap() on a task local storage map, it
gets its own local storage.

Overall, mmapable local storage trades memory for flexibility and
efficiency. It introduces memory fragmentation but can make
programs stateless, which is useful in some cases.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h               |  4 ++
 include/linux/bpf_local_storage.h |  5 ++-
 kernel/bpf/bpf_local_storage.c    | 73 ++++++++++++++++++++++++++++---
 kernel/bpf/bpf_task_storage.c     | 40 +++++++++++++++++
 kernel/bpf/syscall.c              | 67 ++++++++++++++++++++++++++++
 5 files changed, 181 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bdb5298735ce..d76b8d6f91d2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1549,6 +1549,10 @@ bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 	return __alloc_percpu_gfp(size, align, flags);
 }
 #endif
+size_t bpf_map_mmapable_alloc_size(size_t offset, size_t size);
+void *bpf_map_mmapable_kzalloc(const struct bpf_map *map, size_t offset,
+			       size_t size, gfp_t flags);
+void bpf_map_mmapable_kfree(void *ptr);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 493e63258497..36dc1102ec48 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -74,7 +74,8 @@ struct bpf_local_storage_elem {
 	struct hlist_node snode;	/* Linked to bpf_local_storage */
 	struct bpf_local_storage __rcu *local_storage;
 	struct rcu_head rcu;
-	/* 8 bytes hole */
+	u32 map_flags;
+	/* 4 bytes hole */
 	/* The data is stored in another cacheline to minimize
 	 * the number of cachelines access during a cache hit.
 	 */
@@ -168,4 +169,6 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 
 void bpf_local_storage_free_rcu(struct rcu_head *rcu);
 
+int bpf_local_storage_mmap(struct bpf_local_storage_map *smap, void *data,
+			   struct vm_area_struct *vma);
 #endif /* _BPF_LOCAL_STORAGE_H */
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 01aa2b51ec4d..4dd1d7af4451 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -15,7 +15,7 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/rcupdate_wait.h>
 
-#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE)
+#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_MMAPABLE)
 
 static struct bpf_local_storage_map_bucket *
 select_bucket(struct bpf_local_storage_map *smap,
@@ -66,13 +66,26 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
 		void *value, bool charge_mem, gfp_t gfp_flags)
 {
 	struct bpf_local_storage_elem *selem;
+	struct bpf_map *map = &smap->map;
 
 	if (charge_mem && mem_charge(smap, owner, smap->elem_size))
 		return NULL;
 
-	selem = bpf_map_kzalloc(&smap->map, smap->elem_size,
-				gfp_flags | __GFP_NOWARN);
+	if (map->map_flags & BPF_F_MMAPABLE) {
+		size_t offset;
+
+		offset = offsetof(struct bpf_local_storage_elem, sdata) +
+			offsetof(struct bpf_local_storage_data, data);
+		selem = bpf_map_mmapable_kzalloc(&smap->map, offset,
+						 map->value_size,
+						 gfp_flags | __GFP_NOWARN);
+	} else {
+		selem = bpf_map_kzalloc(&smap->map, smap->elem_size,
+					gfp_flags | __GFP_NOWARN);
+	}
+
 	if (selem) {
+		selem->map_flags = map->map_flags;
 		if (value)
 			memcpy(SDATA(selem)->data, value, smap->map.value_size);
 		return selem;
@@ -92,12 +105,24 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
 	kfree_rcu(local_storage, rcu);
 }
 
+static void selem_mmapable_free_rcu(struct rcu_head *rcu)
+{
+	struct bpf_local_storage_elem *selem;
+
+	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
+	bpf_map_mmapable_kfree(selem);
+}
+
 static void bpf_selem_free_rcu(struct rcu_head *rcu)
 {
 	struct bpf_local_storage_elem *selem;
 
 	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
-	kfree_rcu(selem, rcu);
+	if (selem->map_flags & BPF_F_MMAPABLE) {
+		call_rcu(rcu, selem_mmapable_free_rcu);
+	} else {
+		kfree_rcu(selem, rcu);
+	}
 }
 
 /* local_storage->lock must be held and selem->local_storage == local_storage.
@@ -383,7 +408,10 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 
 		err = bpf_local_storage_alloc(owner, smap, selem, gfp_flags);
 		if (err) {
-			kfree(selem);
+			if (map_flags & BPF_F_MMAPABLE)
+				bpf_map_mmapable_kfree(selem);
+			else
+				kfree(selem);
 			mem_uncharge(smap, owner, smap->elem_size);
 			return ERR_PTR(err);
 		}
@@ -623,8 +651,17 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 		raw_spin_lock_init(&smap->buckets[i].lock);
 	}
 
-	smap->elem_size =
-		sizeof(struct bpf_local_storage_elem) + attr->value_size;
+	if (attr->map_flags & BPF_F_MMAPABLE) {
+		size_t offset;
+
+		offset = offsetof(struct bpf_local_storage_elem, sdata) +
+			offsetof(struct bpf_local_storage_data, data);
+		smap->elem_size = bpf_map_mmapable_alloc_size(offset,
+							      attr->value_size);
+	} else {
+		smap->elem_size =
+			sizeof(struct bpf_local_storage_elem) + attr->value_size;
+	}
 
 	return smap;
 }
@@ -645,3 +682,25 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
 
 	return 0;
 }
+
+int bpf_local_storage_mmap(struct bpf_local_storage_map *smap, void *data,
+			   struct vm_area_struct *vma)
+{
+	struct bpf_map *map;
+	unsigned long pfn;
+	unsigned long count;
+	unsigned long size;
+
+	map = &smap->map;
+	size = PAGE_ALIGN(map->value_size);
+	if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > size)
+		return -EINVAL;
+
+	if (!IS_ALIGNED((unsigned long)data, PAGE_SIZE))
+		return -EINVAL;
+
+	pfn = virt_to_phys(data) >> PAGE_SHIFT;
+	count = size >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn + vma->vm_pgoff,
+			       count << PAGE_SHIFT, vma->vm_page_prot);
+}
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 6638a0ecc3d2..9552b84f96db 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -307,6 +307,45 @@ static void task_storage_map_free(struct bpf_map *map)
 	bpf_local_storage_map_free(smap, &bpf_task_storage_busy);
 }
 
+static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+	struct bpf_local_storage_map *smap;
+	struct bpf_local_storage_data *sdata;
+	int err;
+
+	if (!(map->map_flags & BPF_F_MMAPABLE))
+		return -EINVAL;
+
+	rcu_read_lock();
+	if (!bpf_task_storage_trylock()) {
+		rcu_read_unlock();
+		return -EBUSY;
+	}
+
+	smap = (struct bpf_local_storage_map *)map;
+	sdata = task_storage_lookup(current, map, true);
+	if (sdata) {
+		err = bpf_local_storage_mmap(smap, sdata->data, vma);
+		goto unlock;
+	}
+
+	/* only allocate new storage, when the task is refcounted */
+	if (refcount_read(&current->usage)) {
+		sdata = bpf_local_storage_update(current, smap, NULL,
+						 BPF_NOEXIST, GFP_ATOMIC);
+		if (IS_ERR(sdata)) {
+			err = PTR_ERR(sdata);
+			goto unlock;
+		}
+	}
+
+	err = bpf_local_storage_mmap(smap, sdata->data, vma);
+unlock:
+	bpf_task_storage_unlock();
+	rcu_read_unlock();
+	return err;
+}
+
 static int task_storage_map_btf_id;
 const struct bpf_map_ops task_storage_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
@@ -321,6 +360,7 @@ const struct bpf_map_ops task_storage_map_ops = {
 	.map_btf_name = "bpf_local_storage_map",
 	.map_btf_id = &task_storage_map_btf_id,
 	.map_owner_storage_ptr = task_storage_ptr,
+	.map_mmap = &task_storage_map_mmap,
 };
 
 const struct bpf_func_proto bpf_task_storage_get_proto = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cdaa1152436a..facd6918698d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -473,6 +473,73 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 #endif
 
+/* Given an address 'addr', return an address A such that A + offset is
+ * page aligned. The distance between 'addr' and that page boundary
+ * (i.e. A + offset) must be >= offset + sizeof(ptr).
+ */
+static unsigned long mmapable_alloc_ret_addr(void *addr, size_t offset)
+{
+	const size_t ptr_size = sizeof(void *);
+
+	return PAGE_ALIGN((unsigned long)addr + offset + ptr_size) - offset;
+}
+
+/* Given an offset and size, return the minimal allocation size, such that it's
+ * guaranteed to contain an address where address + offset is page aligned and
+ * [address + offset, address + offset + size] is covered by the allocated area.
+ */
+size_t bpf_map_mmapable_alloc_size(size_t offset, size_t size)
+{
+	const size_t ptr_size = sizeof(void *);
+
+	return offset + ptr_size + PAGE_ALIGN(size) + PAGE_SIZE;
+}
+
+/* Allocate a chunk of memory and return an address in the allocated area, such
+ * that address + offset is page aligned and address + offset + PAGE_ALIGN(size)
+ * is within the allocated area.
+ */
+void *bpf_map_mmapable_kzalloc(const struct bpf_map *map, size_t offset,
+			       size_t size, gfp_t flags)
+{
+	const size_t ptr_size = sizeof(void *);
+	size_t alloc_size;
+	void *alloc_ptr;
+	unsigned long addr, ret_addr;
+
+	if (!IS_ALIGNED(offset, ptr_size)) {
+		pr_warn("bpf_map_mmapable_kzalloc: offset (%lx) is not aligned with ptr_size (%lu)\n",
+			offset, ptr_size);
+		return NULL;
+	}
+
+	alloc_size = bpf_map_mmapable_alloc_size(offset, size);
+	alloc_ptr = bpf_map_kzalloc(map, alloc_size, flags);
+	if (!alloc_ptr)
+		return NULL;
+
+	ret_addr = mmapable_alloc_ret_addr(alloc_ptr, offset);
+
+	/* Save the raw allocation address just below the address to be returned. */
+	addr = ret_addr - ptr_size;
+	*(void **)addr = alloc_ptr;
+
+	return (void *)ret_addr;
+}
+
+/* Free the memory allocated from bpf_map_mmapable_kzalloc() */
+void bpf_map_mmapable_kfree(void *ptr)
+{
+	const size_t ptr_size = sizeof(void *);
+	unsigned long addr;
+
+	if (!IS_ALIGNED((unsigned long)ptr, ptr_size))
+		return;
+
+	addr = (unsigned long)ptr - ptr_size;
+	kfree(*(void **)addr);
+}
+
 /* called from workqueue */
 static void bpf_map_free_deferred(struct work_struct *work)
 {
-- 
2.35.1.1021.g381101b075-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC bpf-next 2/2] selftests/bpf: Test mmapable task local storage.
  2022-03-24 23:41 [PATCH RFC bpf-next 0/2] Mmapable task local storage Hao Luo
  2022-03-24 23:41 ` [PATCH RFC bpf-next 1/2] bpf: Mmapable " Hao Luo
@ 2022-03-24 23:41 ` Hao Luo
  2022-03-25 19:16 ` [PATCH RFC bpf-next 0/2] Mmapable " Yonghong Song
  2 siblings, 0 replies; 20+ messages in thread
From: Hao Luo @ 2022-03-24 23:41 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: yhs, KP Singh, Martin KaFai Lau, Song Liu, bpf, linux-kernel, Hao Luo

Tests mmapable task local storage.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../bpf/prog_tests/task_local_storage.c       | 38 +++++++++++++++++++
 .../bpf/progs/task_local_storage_mmapable.c   | 38 +++++++++++++++++++
 2 files changed, 76 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c

diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
index 035c263aab1b..24e6edd32a78 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
@@ -6,8 +6,10 @@
 #include <sys/syscall.h>   /* For SYS_xxx definitions */
 #include <sys/types.h>
 #include <test_progs.h>
+#include <sys/mman.h>
 #include "task_local_storage.skel.h"
 #include "task_local_storage_exit_creds.skel.h"
+#include "task_local_storage_mmapable.skel.h"
 #include "task_ls_recursion.skel.h"
 
 static void test_sys_enter_exit(void)
@@ -81,6 +83,40 @@ static void test_recursion(void)
 	task_ls_recursion__destroy(skel);
 }
 
+#define MAGIC_VALUE 0xabcd1234
+
+static void test_mmapable(void)
+{
+	struct task_local_storage_mmapable *skel;
+	const long page_size = sysconf(_SC_PAGE_SIZE);
+	int fd, err;
+	void *ptr;
+
+	skel = task_local_storage_mmapable__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	fd = bpf_map__fd(skel->maps.mmapable_map);
+	ptr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (!ASSERT_NEQ(ptr, MAP_FAILED, "mmap"))
+		goto out;
+
+	skel->bss->target_pid = syscall(SYS_gettid);
+
+	err = task_local_storage_mmapable__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto unmap;
+
+	syscall(SYS_gettid);
+
+	ASSERT_EQ(*(u64 *)ptr, MAGIC_VALUE, "value");
+
+unmap:
+	munmap(ptr, page_size);
+out:
+	task_local_storage_mmapable__destroy(skel);
+}
+
 void test_task_local_storage(void)
 {
 	if (test__start_subtest("sys_enter_exit"))
@@ -89,4 +125,6 @@ void test_task_local_storage(void)
 		test_exit_creds();
 	if (test__start_subtest("recursion"))
 		test_recursion();
+	if (test__start_subtest("mmapable"))
+		test_mmapable();
 }
diff --git a/tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c b/tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
new file mode 100644
index 000000000000..8cd82bb7336a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_MMAPABLE);
+	__type(key, int);
+	__type(value, long);
+} mmapable_map SEC(".maps");
+
+#define MAGIC_VALUE 0xabcd1234
+
+pid_t target_pid = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	ptr = bpf_task_storage_get(&mmapable_map, task, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	*ptr = MAGIC_VALUE;
+	return 0;
+}
-- 
2.35.1.1021.g381101b075-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC bpf-next 1/2] bpf: Mmapable local storage.
  2022-03-24 23:41 ` [PATCH RFC bpf-next 1/2] bpf: Mmapable " Hao Luo
@ 2022-03-25 12:40 ` Dan Carpenter
  -1 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2022-03-25  7:53 UTC (permalink / raw)
  To: kbuild


CC: kbuild-all(a)lists.01.org
BCC: lkp(a)intel.com
In-Reply-To: <20220324234123.1608337-2-haoluo@google.com>
References: <20220324234123.1608337-2-haoluo@google.com>
TO: Hao Luo <haoluo@google.com>

Hi Hao,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Hao-Luo/Mmapable-task-local-storage/20220325-074313
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
:::::: branch date: 8 hours ago
:::::: commit date: 8 hours ago
config: i386-randconfig-m021 (https://download.01.org/0day-ci/archive/20220325/202203251506.EuMlgJWs-lkp(a)intel.com/config)
compiler: gcc-9 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
kernel/bpf/bpf_task_storage.c:342 task_storage_map_mmap() error: we previously assumed 'sdata' could be null (see line 327)

vim +/sdata +342 kernel/bpf/bpf_task_storage.c

4cf1bc1f104520 KP Singh 2020-11-06  309  
e8fe1745ccea53 Hao Luo  2022-03-24  310  static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
e8fe1745ccea53 Hao Luo  2022-03-24  311  {
e8fe1745ccea53 Hao Luo  2022-03-24  312  	struct bpf_local_storage_map *smap;
e8fe1745ccea53 Hao Luo  2022-03-24  313  	struct bpf_local_storage_data *sdata;
e8fe1745ccea53 Hao Luo  2022-03-24  314  	int err;
e8fe1745ccea53 Hao Luo  2022-03-24  315  
e8fe1745ccea53 Hao Luo  2022-03-24  316  	if (!(map->map_flags & BPF_F_MMAPABLE))
e8fe1745ccea53 Hao Luo  2022-03-24  317  		return -EINVAL;
e8fe1745ccea53 Hao Luo  2022-03-24  318  
e8fe1745ccea53 Hao Luo  2022-03-24  319  	rcu_read_lock();
e8fe1745ccea53 Hao Luo  2022-03-24  320  	if (!bpf_task_storage_trylock()) {
e8fe1745ccea53 Hao Luo  2022-03-24  321  		rcu_read_unlock();
e8fe1745ccea53 Hao Luo  2022-03-24  322  		return -EBUSY;
e8fe1745ccea53 Hao Luo  2022-03-24  323  	}
e8fe1745ccea53 Hao Luo  2022-03-24  324  
e8fe1745ccea53 Hao Luo  2022-03-24  325  	smap = (struct bpf_local_storage_map *)map;
e8fe1745ccea53 Hao Luo  2022-03-24  326  	sdata = task_storage_lookup(current, map, true);
e8fe1745ccea53 Hao Luo  2022-03-24 @327  	if (sdata) {
e8fe1745ccea53 Hao Luo  2022-03-24  328  		err = bpf_local_storage_mmap(smap, sdata->data, vma);
e8fe1745ccea53 Hao Luo  2022-03-24  329  		goto unlock;
e8fe1745ccea53 Hao Luo  2022-03-24  330  	}
e8fe1745ccea53 Hao Luo  2022-03-24  331  
e8fe1745ccea53 Hao Luo  2022-03-24  332  	/* only allocate new storage, when the task is refcounted */
e8fe1745ccea53 Hao Luo  2022-03-24  333  	if (refcount_read(&current->usage)) {
e8fe1745ccea53 Hao Luo  2022-03-24  334  		sdata = bpf_local_storage_update(current, smap, NULL,
e8fe1745ccea53 Hao Luo  2022-03-24  335  						 BPF_NOEXIST, GFP_ATOMIC);
e8fe1745ccea53 Hao Luo  2022-03-24  336  		if (IS_ERR(sdata)) {
e8fe1745ccea53 Hao Luo  2022-03-24  337  			err = PTR_ERR(sdata);
e8fe1745ccea53 Hao Luo  2022-03-24  338  			goto unlock;
e8fe1745ccea53 Hao Luo  2022-03-24  339  		}
e8fe1745ccea53 Hao Luo  2022-03-24  340  	}
e8fe1745ccea53 Hao Luo  2022-03-24  341  
e8fe1745ccea53 Hao Luo  2022-03-24 @342  	err = bpf_local_storage_mmap(smap, sdata->data, vma);
e8fe1745ccea53 Hao Luo  2022-03-24  343  unlock:
e8fe1745ccea53 Hao Luo  2022-03-24  344  	bpf_task_storage_unlock();
e8fe1745ccea53 Hao Luo  2022-03-24  345  	rcu_read_unlock();
e8fe1745ccea53 Hao Luo  2022-03-24  346  	return err;
e8fe1745ccea53 Hao Luo  2022-03-24  347  }
e8fe1745ccea53 Hao Luo  2022-03-24  348  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [kbuild] Re: [PATCH RFC bpf-next 1/2] bpf: Mmapable local storage.
@ 2022-03-25 12:40 ` Dan Carpenter
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Carpenter @ 2022-03-25 12:40 UTC (permalink / raw)
  To: kbuild-all


Hi Hao,

url:    https://github.com/0day-ci/linux/commits/Hao-Luo/Mmapable-task-local-storage/20220325-074313 
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git  master
config: i386-randconfig-m021 (https://download.01.org/0day-ci/archive/20220325/202203251506.EuMlgJWs-lkp(a)intel.com/config )
compiler: gcc-9 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
kernel/bpf/bpf_task_storage.c:342 task_storage_map_mmap() error: we previously assumed 'sdata' could be null (see line 327)

vim +/sdata +342 kernel/bpf/bpf_task_storage.c

e8fe1745ccea53 Hao Luo  2022-03-24  310  static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
e8fe1745ccea53 Hao Luo  2022-03-24  311  {
e8fe1745ccea53 Hao Luo  2022-03-24  312  	struct bpf_local_storage_map *smap;
e8fe1745ccea53 Hao Luo  2022-03-24  313  	struct bpf_local_storage_data *sdata;
e8fe1745ccea53 Hao Luo  2022-03-24  314  	int err;
e8fe1745ccea53 Hao Luo  2022-03-24  315  
e8fe1745ccea53 Hao Luo  2022-03-24  316  	if (!(map->map_flags & BPF_F_MMAPABLE))
e8fe1745ccea53 Hao Luo  2022-03-24  317  		return -EINVAL;
e8fe1745ccea53 Hao Luo  2022-03-24  318  
e8fe1745ccea53 Hao Luo  2022-03-24  319  	rcu_read_lock();
e8fe1745ccea53 Hao Luo  2022-03-24  320  	if (!bpf_task_storage_trylock()) {
e8fe1745ccea53 Hao Luo  2022-03-24  321  		rcu_read_unlock();
e8fe1745ccea53 Hao Luo  2022-03-24  322  		return -EBUSY;
e8fe1745ccea53 Hao Luo  2022-03-24  323  	}
e8fe1745ccea53 Hao Luo  2022-03-24  324  
e8fe1745ccea53 Hao Luo  2022-03-24  325  	smap = (struct bpf_local_storage_map *)map;
e8fe1745ccea53 Hao Luo  2022-03-24  326  	sdata = task_storage_lookup(current, map, true);
e8fe1745ccea53 Hao Luo  2022-03-24 @327  	if (sdata) {
                                                    ^^^^^

e8fe1745ccea53 Hao Luo  2022-03-24  328  		err = bpf_local_storage_mmap(smap, sdata->data, vma);
e8fe1745ccea53 Hao Luo  2022-03-24  329  		goto unlock;
e8fe1745ccea53 Hao Luo  2022-03-24  330  	}
e8fe1745ccea53 Hao Luo  2022-03-24  331  
e8fe1745ccea53 Hao Luo  2022-03-24  332  	/* only allocate new storage, when the task is refcounted */
e8fe1745ccea53 Hao Luo  2022-03-24  333  	if (refcount_read(&current->usage)) {
e8fe1745ccea53 Hao Luo  2022-03-24  334  		sdata = bpf_local_storage_update(current, smap, NULL,
e8fe1745ccea53 Hao Luo  2022-03-24  335  						 BPF_NOEXIST, GFP_ATOMIC);
e8fe1745ccea53 Hao Luo  2022-03-24  336  		if (IS_ERR(sdata)) {
e8fe1745ccea53 Hao Luo  2022-03-24  337  			err = PTR_ERR(sdata);
e8fe1745ccea53 Hao Luo  2022-03-24  338  			goto unlock;
e8fe1745ccea53 Hao Luo  2022-03-24  339  		}
e8fe1745ccea53 Hao Luo  2022-03-24  340  	}

"sdata" is NULL if refcount_read() returns zero

e8fe1745ccea53 Hao Luo  2022-03-24  341  
e8fe1745ccea53 Hao Luo  2022-03-24 @342  	err = bpf_local_storage_mmap(smap, sdata->data, vma);
                                                                                   ^^^^^^^^^^^

e8fe1745ccea53 Hao Luo  2022-03-24  343  unlock:
e8fe1745ccea53 Hao Luo  2022-03-24  344  	bpf_task_storage_unlock();
e8fe1745ccea53 Hao Luo  2022-03-24  345  	rcu_read_unlock();
e8fe1745ccea53 Hao Luo  2022-03-24  346  	return err;
e8fe1745ccea53 Hao Luo  2022-03-24  347  }

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp 
_______________________________________________
kbuild mailing list -- kbuild(a)lists.01.org
To unsubscribe send an email to kbuild-leave(a)lists.01.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [kbuild] Re: [PATCH RFC bpf-next 1/2] bpf: Mmapable local storage.
  2022-03-25 12:40 ` [kbuild] " Dan Carpenter
  (?)
@ 2022-03-25 17:36 ` Hao Luo
  -1 siblings, 0 replies; 20+ messages in thread
From: Hao Luo @ 2022-03-25 17:36 UTC (permalink / raw)
  To: kbuild-all


Hello Dan,

On Fri, Mar 25, 2022 at 5:41 AM Dan Carpenter <dan.carpenter@oracle.com> wrote:
>
> Hi Hao,
>
> url:    https://github.com/0day-ci/linux/commits/Hao-Luo/Mmapable-task-local-storage/20220325-074313
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git  master
> config: i386-randconfig-m021 (https://download.01.org/0day-ci/archive/20220325/202203251506.EuMlgJWs-lkp(a)intel.com/config )
> compiler: gcc-9 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
>
> smatch warnings:
> kernel/bpf/bpf_task_storage.c:342 task_storage_map_mmap() error: we previously assumed 'sdata' could be null (see line 327)
>
> vim +/sdata +342 kernel/bpf/bpf_task_storage.c
>
> e8fe1745ccea53 Hao Luo  2022-03-24  310  static int task_storage_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> e8fe1745ccea53 Hao Luo  2022-03-24  311  {
> e8fe1745ccea53 Hao Luo  2022-03-24  312         struct bpf_local_storage_map *smap;
> e8fe1745ccea53 Hao Luo  2022-03-24  313         struct bpf_local_storage_data *sdata;
> e8fe1745ccea53 Hao Luo  2022-03-24  314         int err;
> e8fe1745ccea53 Hao Luo  2022-03-24  315
> e8fe1745ccea53 Hao Luo  2022-03-24  316         if (!(map->map_flags & BPF_F_MMAPABLE))
> e8fe1745ccea53 Hao Luo  2022-03-24  317                 return -EINVAL;
> e8fe1745ccea53 Hao Luo  2022-03-24  318
> e8fe1745ccea53 Hao Luo  2022-03-24  319         rcu_read_lock();
> e8fe1745ccea53 Hao Luo  2022-03-24  320         if (!bpf_task_storage_trylock()) {
> e8fe1745ccea53 Hao Luo  2022-03-24  321                 rcu_read_unlock();
> e8fe1745ccea53 Hao Luo  2022-03-24  322                 return -EBUSY;
> e8fe1745ccea53 Hao Luo  2022-03-24  323         }
> e8fe1745ccea53 Hao Luo  2022-03-24  324
> e8fe1745ccea53 Hao Luo  2022-03-24  325         smap = (struct bpf_local_storage_map *)map;
> e8fe1745ccea53 Hao Luo  2022-03-24  326         sdata = task_storage_lookup(current, map, true);
> e8fe1745ccea53 Hao Luo  2022-03-24 @327         if (sdata) {
>                                                     ^^^^^
>
> e8fe1745ccea53 Hao Luo  2022-03-24  328                 err = bpf_local_storage_mmap(smap, sdata->data, vma);
> e8fe1745ccea53 Hao Luo  2022-03-24  329                 goto unlock;
> e8fe1745ccea53 Hao Luo  2022-03-24  330         }
> e8fe1745ccea53 Hao Luo  2022-03-24  331
> e8fe1745ccea53 Hao Luo  2022-03-24  332         /* only allocate new storage, when the task is refcounted */
> e8fe1745ccea53 Hao Luo  2022-03-24  333         if (refcount_read(&current->usage)) {
> e8fe1745ccea53 Hao Luo  2022-03-24  334                 sdata = bpf_local_storage_update(current, smap, NULL,
> e8fe1745ccea53 Hao Luo  2022-03-24  335                                                  BPF_NOEXIST, GFP_ATOMIC);
> e8fe1745ccea53 Hao Luo  2022-03-24  336                 if (IS_ERR(sdata)) {
> e8fe1745ccea53 Hao Luo  2022-03-24  337                         err = PTR_ERR(sdata);
> e8fe1745ccea53 Hao Luo  2022-03-24  338                         goto unlock;
> e8fe1745ccea53 Hao Luo  2022-03-24  339                 }
> e8fe1745ccea53 Hao Luo  2022-03-24  340         }
>
> "sdata" is NULL if refcount_read() returns zero
>

Good catch! I will fix it in my next iteration.

Thanks,
Hao
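The fix Hao promises presumably has to bail out when sdata is still
NULL after the refcount check, rather than falling through to the
dereference smatch flagged. The repaired control flow can be modeled
in plain C (a hypothetical sketch: the stub helpers stand in for the
kernel functions, and the actual fix in a later revision may differ):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sdata { char data[8]; };

/* Stubs modeling the unhappy path: cache miss, task not refcounted. */
static struct sdata *stub_task_storage_lookup(void) { return NULL; }
static int stub_task_refcounted(void) { return 0; }

/* Modeled fix: return an error instead of dereferencing NULL. */
static int map_mmap_fixed(void)
{
	struct sdata *sdata = stub_task_storage_lookup();

	if (sdata)
		return 0;	/* would call bpf_local_storage_mmap() */

	/* The missing bail-out: without storage and without a
	 * refcounted task there is nothing to mmap. */
	if (!stub_task_refcounted())
		return -EBUSY;

	/* would allocate new storage here, then mmap it */
	return 0;
}
```

With both stubs returning "miss", the function now fails cleanly with
-EBUSY instead of crashing on sdata->data.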

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-24 23:41 [PATCH RFC bpf-next 0/2] Mmapable task local storage Hao Luo
  2022-03-24 23:41 ` [PATCH RFC bpf-next 1/2] bpf: Mmapable " Hao Luo
  2022-03-24 23:41 ` [PATCH RFC bpf-next 2/2] selftests/bpf: Test mmapable task " Hao Luo
@ 2022-03-25 19:16 ` Yonghong Song
  2022-03-28 17:39   ` Hao Luo
  2 siblings, 1 reply; 20+ messages in thread
From: Yonghong Song @ 2022-03-25 19:16 UTC (permalink / raw)
  To: Hao Luo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: KP Singh, Martin KaFai Lau, Song Liu, bpf, linux-kernel



On 3/24/22 4:41 PM, Hao Luo wrote:
> Some map types support mmap operation, which allows userspace to
> communicate with BPF programs directly. Currently only arraymap
> and ringbuf have mmap implemented.
> 
> However, in some use cases, when multiple program instances can
> run concurrently, global mmapable memory can cause races. In that
> case, userspace needs to provide necessary synchronizations to
> coordinate the usage of mapped global data. This can be a source
> of bottleneck.

I can see your use case here. Each calling process can get the
corresponding bpf program's task local storage data through the
mmap interface. As you mentioned, there is a tradeoff
between more memory and non-global synchronization.

I am thinking that another bpf_iter approach can retrieve
a similar result. We could implement a bpf_iter
for task local storage map, optionally it can provide
a tid to retrieve the data for that particular tid.
This way, user space needs an explicit syscall, but
does not need to allocate more memory than necessary.

WDYT?

> 
> It would be great to have a mmapable local storage in that case.
> This patch adds that.
> 
> Mmap isn't a BPF syscall, so unpriv users can also use it to
> interact with maps.
> 
> Currently the only way of allocating mmapable map area is using
> vmalloc() and it's only used at map allocation time. Vmalloc()
> may sleep, therefore it's not suitable for maps that may allocate
> memory in an atomic context such as local storage. Local storage
> uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> 
> Allocating mmapable memory has a requirement on page alignment. So we
> have to deliberately allocate more memory than necessary to obtain
> an address that has sdata->data aligned at page boundary. The
> calculations for mmapable allocation size, and the actual
> allocation/deallocation are packaged in three functions:
> 
>   - bpf_map_mmapable_alloc_size()
>   - bpf_map_mmapable_kzalloc()
>   - bpf_map_mmapable_kfree()
> 
> BPF local storage uses them to provide generic mmap API:
> 
>   - bpf_local_storage_mmap()
> 
> And task local storage adds the mmap callback:
> 
>   - task_storage_map_mmap()
> 
> When application calls mmap on a task local storage, it gets its
> own local storage.
> 
> Overall, mmapable local storage trades off memory with flexibility
> and efficiency. It brings memory fragmentation but can make programs
> stateless. Therefore useful in some cases.
> 
> Hao Luo (2):
>    bpf: Mmapable local storage.
>    selftests/bpf: Test mmapable task local storage.
> 
>   include/linux/bpf.h                           |  4 +
>   include/linux/bpf_local_storage.h             |  5 +-
>   kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
>   kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
>   kernel/bpf/syscall.c                          | 67 +++++++++++++++++
>   .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
>   .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
>   7 files changed, 257 insertions(+), 8 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> 
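The over-allocation arithmetic described in the quoted cover letter can be sketched in plain userspace C (the names and hard-coded page size are illustrative, with malloc() standing in for kmalloc(); the kernel's bpf_map_mmapable_* helpers are the real implementation):

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL	/* illustrative; the kernel uses the real PAGE_SIZE */

/* Idea behind bpf_map_mmapable_alloc_size(): round the data size up to
 * whole pages, then add almost one extra page of slack so a page-aligned
 * region of the rounded size is guaranteed to fit inside an allocation
 * that carries no alignment guarantee. */
static size_t mmapable_alloc_size(size_t data_size)
{
	size_t pages = (data_size + PAGE_SIZE - 1) / PAGE_SIZE;

	return pages * PAGE_SIZE + PAGE_SIZE - 1;
}

/* First page boundary at or after 'raw'. */
static void *align_up_to_page(void *raw)
{
	uintptr_t p = (uintptr_t)raw;

	return (void *)((p + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1));
}

/* Demonstration: returns 1 iff the page-aligned region of 'data_size'
 * (rounded up to pages) fits entirely inside the over-sized raw
 * allocation. */
static int aligned_region_fits(size_t data_size)
{
	size_t raw_size = mmapable_alloc_size(data_size);
	size_t pages = (data_size + PAGE_SIZE - 1) / PAGE_SIZE;
	char *raw = malloc(raw_size);
	char *data = align_up_to_page(raw);
	int ok = ((uintptr_t)data % PAGE_SIZE == 0) &&
		 (data + pages * PAGE_SIZE <= raw + raw_size);

	free(raw);
	return ok;
}
```

Since kmalloc() gives no page-alignment guarantee, the PAGE_SIZE - 1 bytes of slack are the price paid for an aligned sdata->data, which is the fragmentation tradeoff the cover letter mentions.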


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-25 19:16 ` [PATCH RFC bpf-next 0/2] Mmapable " Yonghong Song
@ 2022-03-28 17:39   ` Hao Luo
  2022-03-28 17:46     ` Hao Luo
  0 siblings, 1 reply; 20+ messages in thread
From: Hao Luo @ 2022-03-28 17:39 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, KP Singh,
	Martin KaFai Lau, Song Liu, bpf, linux-kernel

Hi Yonghong,

On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
>
> On 3/24/22 4:41 PM, Hao Luo wrote:
> > Some map types support mmap operation, which allows userspace to
> > communicate with BPF programs directly. Currently only arraymap
> > and ringbuf have mmap implemented.
> >
> > However, in some use cases, when multiple program instances can
> > run concurrently, global mmapable memory can cause races. In that
> > case, userspace needs to provide necessary synchronizations to
> > coordinate the usage of mapped global data. This can be a source
> > of bottleneck.
>
> I can see your use case here. Each calling process can get the
> corresponding bpf program task local storage data through
> mmap interface. As you mentioned, there is a tradeoff
> between more memory vs. non-global synchronization.
>
> I am thinking that another bpf_iter approach can retrieve
> a similar result. We could implement a bpf_iter
> for task local storage map, optionally it can provide
> a tid to retrieve the data for that particular tid.
> This way, user space needs an explicit syscall, but
> does not need to allocate more memory than necessary.
>
> WDYT?
>

Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:

- mmap prevents the calling task from reading other tasks' values.
Using bpf_iter, one can pass another task's tid to get its values. I
assume there are two potential ways of passing a tid to bpf_iter: one
is to use global data in the bpf prog, the other is adding a
tid-parameterized iter_link. The first is not easy for unpriv tasks to
use. For the second, we need to create one iter_link object for each
tid of interest. Neither may be easy to use.

- Regarding adding an explicit syscall: I thought about adding
read/write syscalls for task local storage maps, similar to reading
values from an iter_link. Writing to or reading from the task local
storage map would update or read the current task's value. I think
this could achieve the same effect as mmap.

Hao

> >
> > It would be great to have a mmapable local storage in that case.
> > This patch adds that.
> >
> > Mmap isn't a BPF syscall, so unpriv users can also use it to
> > interact with maps.
> >
> > Currently the only way of allocating mmapable map area is using
> > vmalloc() and it's only used at map allocation time. Vmalloc()
> > may sleep, therefore it's not suitable for maps that may allocate
> > memory in an atomic context such as local storage. Local storage
> > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> >
> > Allocating mmapable memory has a requirement on page alignment. So we
> > have to deliberately allocate more memory than necessary to obtain
> > an address that has sdata->data aligned at page boundary. The
> > calculations for mmapable allocation size, and the actual
> > allocation/deallocation are packaged in three functions:
> >
> >   - bpf_map_mmapable_alloc_size()
> >   - bpf_map_mmapable_kzalloc()
> >   - bpf_map_mmapable_kfree()
> >
> > BPF local storage uses them to provide generic mmap API:
> >
> >   - bpf_local_storage_mmap()
> >
> > And task local storage adds the mmap callback:
> >
> >   - task_storage_map_mmap()
> >
> > When application calls mmap on a task local storage, it gets its
> > own local storage.
> >
> > Overall, mmapable local storage trades off memory with flexibility
> > and efficiency. It brings memory fragmentation but can make programs
> > stateless. Therefore useful in some cases.
> >
> > Hao Luo (2):
> >    bpf: Mmapable local storage.
> >    selftests/bpf: Test mmapable task local storage.
> >
> >   include/linux/bpf.h                           |  4 +
> >   include/linux/bpf_local_storage.h             |  5 +-
> >   kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
> >   kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
> >   kernel/bpf/syscall.c                          | 67 +++++++++++++++++
> >   .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
> >   .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
> >   7 files changed, 257 insertions(+), 8 deletions(-)
> >   create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> >


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-28 17:39   ` Hao Luo
@ 2022-03-28 17:46     ` Hao Luo
  2022-03-29  9:37       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 20+ messages in thread
From: Hao Luo @ 2022-03-28 17:46 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, KP Singh,
	Martin KaFai Lau, Song Liu, bpf, linux-kernel

On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
>
> Hi Yonghong,
>
> On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> >
> > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > Some map types support mmap operation, which allows userspace to
> > > communicate with BPF programs directly. Currently only arraymap
> > > and ringbuf have mmap implemented.
> > >
> > > However, in some use cases, when multiple program instances can
> > > run concurrently, global mmapable memory can cause races. In that
> > > case, userspace needs to provide necessary synchronizations to
> > > coordinate the usage of mapped global data. This can be a source
> > > of bottleneck.
> >
> > I can see your use case here. Each calling process can get the
> > corresponding bpf program task local storage data through
> > mmap interface. As you mentioned, there is a tradeoff
> > between more memory vs. non-global synchronization.
> >
> > I am thinking that another bpf_iter approach can retrieve
> > a similar result. We could implement a bpf_iter
> > for task local storage map, optionally it can provide
> > a tid to retrieve the data for that particular tid.
> > This way, user space needs an explicit syscall, but
> > does not need to allocate more memory than necessary.
> >
> > WDYT?
> >
>
> Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
>
> - mmap prevents the calling task from reading other task's value.
> Using bpf_iter, one can pass other task's tid to get their values. I
> assume there are two potential ways of passing tid to bpf_iter: one is
> to use global data in bpf prog, the other is adding tid parameterized
> iter_link. For the first, it's not easy for unpriv tasks to use. For
> the second, we need to create one iter_link object for each interested
> tid. It may not be easy to use either.
>
> - Regarding adding an explicit syscall. I thought about adding
> write/read syscalls for task local storage maps, just like reading
> values from iter_link. Writing or reading task local storage map
> updates/reads the current task's value. I think this could achieve the
> same effect as mmap.
>

Actually, my use case for mmap on task local storage is to allow
userspace to pass FDs into the bpf prog. Some of the helpers I want to
add need to take an FD as a parameter, and the bpf progs can run
concurrently, so using global data is racy. Mmapable task local
storage is the best solution I can find for this purpose.

Song also mentioned to me offline, that mmapable task local storage
may be useful for his use case.

I am actually open to other proposals.

> > >
> > > It would be great to have a mmapable local storage in that case.
> > > This patch adds that.
> > >
> > > Mmap isn't a BPF syscall, so unpriv users can also use it to
> > > interact with maps.
> > >
> > > Currently the only way of allocating mmapable map area is using
> > > vmalloc() and it's only used at map allocation time. Vmalloc()
> > > may sleep, therefore it's not suitable for maps that may allocate
> > > memory in an atomic context such as local storage. Local storage
> > > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> > >
> > > Allocating mmapable memory has a requirement on page alignment. So we
> > > have to deliberately allocate more memory than necessary to obtain
> > > an address that has sdata->data aligned at page boundary. The
> > > calculations for mmapable allocation size, and the actual
> > > allocation/deallocation are packaged in three functions:
> > >
> > >   - bpf_map_mmapable_alloc_size()
> > >   - bpf_map_mmapable_kzalloc()
> > >   - bpf_map_mmapable_kfree()
> > >
> > > BPF local storage uses them to provide generic mmap API:
> > >
> > >   - bpf_local_storage_mmap()
> > >
> > > And task local storage adds the mmap callback:
> > >
> > >   - task_storage_map_mmap()
> > >
> > > When application calls mmap on a task local storage, it gets its
> > > own local storage.
> > >
> > > Overall, mmapable local storage trades off memory with flexibility
> > > and efficiency. It brings memory fragmentation but can make programs
> > > stateless. Therefore useful in some cases.
> > >
> > > Hao Luo (2):
> > >    bpf: Mmapable local storage.
> > >    selftests/bpf: Test mmapable task local storage.
> > >
> > >   include/linux/bpf.h                           |  4 +
> > >   include/linux/bpf_local_storage.h             |  5 +-
> > >   kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
> > >   kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
> > >   kernel/bpf/syscall.c                          | 67 +++++++++++++++++
> > >   .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
> > >   .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
> > >   7 files changed, 257 insertions(+), 8 deletions(-)
> > >   create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > >


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-28 17:46     ` Hao Luo
@ 2022-03-29  9:37       ` Kumar Kartikeya Dwivedi
  2022-03-29 17:43         ` Hao Luo
  0 siblings, 1 reply; 20+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-03-29  9:37 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, KP Singh, Martin KaFai Lau, Song Liu, bpf,
	linux-kernel

On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> >
> > Hi Yonghong,
> >
> > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > >
> > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > Some map types support mmap operation, which allows userspace to
> > > > communicate with BPF programs directly. Currently only arraymap
> > > > and ringbuf have mmap implemented.
> > > >
> > > > However, in some use cases, when multiple program instances can
> > > > run concurrently, global mmapable memory can cause races. In that
> > > > case, userspace needs to provide necessary synchronizations to
> > > > coordinate the usage of mapped global data. This can be a source
> > > > of bottleneck.
> > >
> > > I can see your use case here. Each calling process can get the
> > > corresponding bpf program task local storage data through
> > > mmap interface. As you mentioned, there is a tradeoff
> > > between more memory vs. non-global synchronization.
> > >
> > > I am thinking that another bpf_iter approach can retrieve
> > > a similar result. We could implement a bpf_iter
> > > for task local storage map, optionally it can provide
> > > a tid to retrieve the data for that particular tid.
> > > This way, user space needs an explicit syscall, but
> > > does not need to allocate more memory than necessary.
> > >
> > > WDYT?
> > >
> >
> > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> >
> > - mmap prevents the calling task from reading other task's value.
> > Using bpf_iter, one can pass other task's tid to get their values. I
> > assume there are two potential ways of passing tid to bpf_iter: one is
> > to use global data in bpf prog, the other is adding tid parameterized
> > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > the second, we need to create one iter_link object for each interested
> > tid. It may not be easy to use either.
> >
> > - Regarding adding an explicit syscall. I thought about adding
> > write/read syscalls for task local storage maps, just like reading
> > values from iter_link. Writing or reading task local storage map
> > updates/reads the current task's value. I think this could achieve the
> > same effect as mmap.
> >
>
> Actually, my use case of using mmap on task local storage is to allow
> userspace to pass FDs into bpf prog. Some of the helpers I want to add
> need to take an FD as parameter and the bpf progs can run
> concurrently, thus using global data is racy. Mmapable task local
> storage is the best solution I can find for this purpose.
>
> Song also mentioned to me offline, that mmapable task local storage
> may be useful for his use case.
>
> I am actually open to other proposals.
>

You could also use a syscall prog, and use bpf_prog_test_run to update local
storage for current. Data can be passed for that specific prog invocation using
ctx. You might have to enable bpf_task_storage helpers in it though, since they
are not allowed to be called right now.

> > > >
> > > > It would be great to have a mmapable local storage in that case.
> > > > This patch adds that.
> > > >
> > > > Mmap isn't a BPF syscall, so unpriv users can also use it to
> > > > interact with maps.
> > > >
> > > > Currently the only way of allocating mmapable map area is using
> > > > vmalloc() and it's only used at map allocation time. Vmalloc()
> > > > may sleep, therefore it's not suitable for maps that may allocate
> > > > memory in an atomic context such as local storage. Local storage
> > > > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > > > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> > > >
> > > > Allocating mmapable memory has a requirement on page alignment. So we
> > > > have to deliberately allocate more memory than necessary to obtain
> > > > an address that has sdata->data aligned at page boundary. The
> > > > calculations for mmapable allocation size, and the actual
> > > > allocation/deallocation are packaged in three functions:
> > > >
> > > >   - bpf_map_mmapable_alloc_size()
> > > >   - bpf_map_mmapable_kzalloc()
> > > >   - bpf_map_mmapable_kfree()
> > > >
> > > > BPF local storage uses them to provide generic mmap API:
> > > >
> > > >   - bpf_local_storage_mmap()
> > > >
> > > > And task local storage adds the mmap callback:
> > > >
> > > >   - task_storage_map_mmap()
> > > >
> > > > When application calls mmap on a task local storage, it gets its
> > > > own local storage.
> > > >
> > > > Overall, mmapable local storage trades off memory with flexibility
> > > > and efficiency. It brings memory fragmentation but can make programs
> > > > stateless. Therefore useful in some cases.
> > > >
> > > > Hao Luo (2):
> > > >    bpf: Mmapable local storage.
> > > >    selftests/bpf: Test mmapable task local storage.
> > > >
> > > >   include/linux/bpf.h                           |  4 +
> > > >   include/linux/bpf_local_storage.h             |  5 +-
> > > >   kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
> > > >   kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
> > > >   kernel/bpf/syscall.c                          | 67 +++++++++++++++++
> > > >   .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
> > > >   .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
> > > >   7 files changed, 257 insertions(+), 8 deletions(-)
> > > >   create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > > >

--
Kartikeya


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-29  9:37       ` Kumar Kartikeya Dwivedi
@ 2022-03-29 17:43         ` Hao Luo
  2022-03-29 21:45           ` Martin KaFai Lau
  2022-03-29 23:29           ` Alexei Starovoitov
  0 siblings, 2 replies; 20+ messages in thread
From: Hao Luo @ 2022-03-29 17:43 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, KP Singh, Martin KaFai Lau, Song Liu, bpf,
	linux-kernel

On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > >
> > > Hi Yonghong,
> > >
> > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > >
> > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > Some map types support mmap operation, which allows userspace to
> > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > and ringbuf have mmap implemented.
> > > > >
> > > > > However, in some use cases, when multiple program instances can
> > > > > run concurrently, global mmapable memory can cause races. In that
> > > > > case, userspace needs to provide necessary synchronizations to
> > > > > coordinate the usage of mapped global data. This can be a source
> > > > > of bottleneck.
> > > >
> > > > I can see your use case here. Each calling process can get the
> > > > corresponding bpf program task local storage data through
> > > > mmap interface. As you mentioned, there is a tradeoff
> > > > between more memory vs. non-global synchronization.
> > > >
> > > > I am thinking that another bpf_iter approach can retrieve
> > > > a similar result. We could implement a bpf_iter
> > > > for task local storage map, optionally it can provide
> > > > a tid to retrieve the data for that particular tid.
> > > > This way, user space needs an explicit syscall, but
> > > > does not need to allocate more memory than necessary.
> > > >
> > > > WDYT?
> > > >
> > >
> > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > >
> > > - mmap prevents the calling task from reading other task's value.
> > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > to use global data in bpf prog, the other is adding tid parameterized
> > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > the second, we need to create one iter_link object for each interested
> > > tid. It may not be easy to use either.
> > >
> > > - Regarding adding an explicit syscall. I thought about adding
> > > write/read syscalls for task local storage maps, just like reading
> > > values from iter_link. Writing or reading task local storage map
> > > updates/reads the current task's value. I think this could achieve the
> > > same effect as mmap.
> > >
> >
> > Actually, my use case of using mmap on task local storage is to allow
> > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > need to take an FD as parameter and the bpf progs can run
> > concurrently, thus using global data is racy. Mmapable task local
> > storage is the best solution I can find for this purpose.
> >
> > Song also mentioned to me offline, that mmapable task local storage
> > may be useful for his use case.
> >
> > I am actually open to other proposals.
> >
>
> You could also use a syscall prog, and use bpf_prog_test_run to update local
> storage for current. Data can be passed for that specific prog invocation using
> ctx. You might have to enable bpf_task_storage helpers in it though, since they
> are not allowed to be called right now.
>

The loading process needs CAP_BPF to load a prog that can be run via
bpf_prog_test_run. I'm thinking of allowing any thread, including
unpriv ones, to pass data to the prog and update its own storage.

> > > > >
> > > > > It would be great to have a mmapable local storage in that case.
> > > > > This patch adds that.
> > > > >
> > > > > Mmap isn't a BPF syscall, so unpriv users can also use it to
> > > > > interact with maps.
> > > > >
> > > > > Currently the only way of allocating mmapable map area is using
> > > > > vmalloc() and it's only used at map allocation time. Vmalloc()
> > > > > may sleep, therefore it's not suitable for maps that may allocate
> > > > > memory in an atomic context such as local storage. Local storage
> > > > > uses kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch
> > > > > uses kmalloc() with GFP_ATOMIC as well for mmapable map area.
> > > > >
> > > > > Allocating mmapable memory has a requirement on page alignment. So we
> > > > > have to deliberately allocate more memory than necessary to obtain
> > > > > an address that has sdata->data aligned at page boundary. The
> > > > > calculations for mmapable allocation size, and the actual
> > > > > allocation/deallocation are packaged in three functions:
> > > > >
> > > > >   - bpf_map_mmapable_alloc_size()
> > > > >   - bpf_map_mmapable_kzalloc()
> > > > >   - bpf_map_mmapable_kfree()
> > > > >
> > > > > BPF local storage uses them to provide generic mmap API:
> > > > >
> > > > >   - bpf_local_storage_mmap()
> > > > >
> > > > > And task local storage adds the mmap callback:
> > > > >
> > > > >   - task_storage_map_mmap()
> > > > >
> > > > > When application calls mmap on a task local storage, it gets its
> > > > > own local storage.
> > > > >
> > > > > Overall, mmapable local storage trades off memory with flexibility
> > > > > and efficiency. It brings memory fragmentation but can make programs
> > > > > stateless. Therefore useful in some cases.
> > > > >
> > > > > Hao Luo (2):
> > > > >    bpf: Mmapable local storage.
> > > > >    selftests/bpf: Test mmapable task local storage.
> > > > >
> > > > >   include/linux/bpf.h                           |  4 +
> > > > >   include/linux/bpf_local_storage.h             |  5 +-
> > > > >   kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
> > > > >   kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
> > > > >   kernel/bpf/syscall.c                          | 67 +++++++++++++++++
> > > > >   .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
> > > > >   .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
> > > > >   7 files changed, 257 insertions(+), 8 deletions(-)
> > > > >   create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > > > >
>
> --
> Kartikeya


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-29 17:43         ` Hao Luo
@ 2022-03-29 21:45           ` Martin KaFai Lau
  2022-03-30 18:05             ` Hao Luo
  2022-03-29 23:29           ` Alexei Starovoitov
  1 sibling, 1 reply; 20+ messages in thread
From: Martin KaFai Lau @ 2022-03-29 21:45 UTC (permalink / raw)
  To: Hao Luo
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, KP Singh, Song Liu, bpf,
	linux-kernel

On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > Hi Yonghong,
> > > >
> > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > >
> > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > and ringbuf have mmap implemented.
> > > > > >
> > > > > > However, in some use cases, when multiple program instances can
> > > > > > run concurrently, global mmapable memory can cause races. In that
> > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > of bottleneck.
> > > > >
> > > > > I can see your use case here. Each calling process can get the
> > > > > corresponding bpf program task local storage data through
> > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > between more memory vs. non-global synchronization.
> > > > >
> > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > a similar result. We could implement a bpf_iter
> > > > > for task local storage map, optionally it can provide
> > > > > a tid to retrieve the data for that particular tid.
> > > > > This way, user space needs an explicit syscall, but
> > > > > does not need to allocate more memory than necessary.
> > > > >
> > > > > WDYT?
> > > > >
> > > >
> > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > >
> > > > - mmap prevents the calling task from reading other task's value.
> > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > the second, we need to create one iter_link object for each interested
> > > > tid. It may not be easy to use either.
> > > >
> > > > - Regarding adding an explicit syscall. I thought about adding
> > > > write/read syscalls for task local storage maps, just like reading
> > > > values from iter_link. Writing or reading task local storage map
> > > > updates/reads the current task's value. I think this could achieve the
> > > > same effect as mmap.
> > > >
> > >
> > > Actually, my use case of using mmap on task local storage is to allow
> > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > need to take an FD as parameter and the bpf progs can run
> > > concurrently, thus using global data is racy. Mmapable task local
> > > storage is the best solution I can find for this purpose.
Some more details are needed about the use case.  As long as there is
storage local to an individual task, racing within this one task's
specific storage is a non-issue?

The patch 2 example is doable with the current api and is pretty far from
the above use case description.  Can the existing bpf_map_update_elem() and
bpf_map_lookup_elem() not solve your use case?

Or do the current bpf_map_{update,lookup}_elem() work, but you would
prefer a direct data read/write interface?

btw, what is delete going to look like?

and do you see the mmap could be used with sk and inode storage instead
of the 'current' task?

> > >
> > > Song also mentioned to me offline, that mmapable task local storage
> > > may be useful for his use case.
> > >
> > > I am actually open to other proposals.
If the arraymap is local to an individual task, does it solve
your use case?  Have you thought about storing an arraymap (which is mmap-able)
in the local storage?  That could then be used to store a ringbuf map and
other bpf maps in local storage.  It is logically similar to map-in-map.
The outer map is the pseudo local storage map here.

> > >
> >
> > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > storage for current. Data can be passed for that specific prog invocation using
> > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > are not allowed to be called right now.
> >
> 
> The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> thinking of allowing any thread including unpriv ones to be able to
> pass data to the prog and update their own storage.


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-29 17:43         ` Hao Luo
  2022-03-29 21:45           ` Martin KaFai Lau
@ 2022-03-29 23:29           ` Alexei Starovoitov
  2022-03-30 18:06             ` Hao Luo
  1 sibling, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2022-03-29 23:29 UTC (permalink / raw)
  To: Hao Luo
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, KP Singh, Martin KaFai Lau,
	Song Liu, bpf, linux-kernel

On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > Hi Yonghong,
> > > >
> > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > >
> > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > and ringbuf have mmap implemented.
> > > > > >
> > > > > > However, in some use cases, when multiple program instances can
> > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > of bottleneck.
> > > > >
> > > > > I can see your use case here. Each calling process can get the
> > > > > corresponding bpf program task local storage data through
> > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > between more memory vs. non-global synchronization.
> > > > >
> > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > the similar result. We could implement a bpf_iter
> > > > > for task local storage map, optionally it can provide
> > > > > a tid to retrieve the data for that particular tid.
> > > > > This way, user space needs an explicit syscall, but
> > > > > does not need to allocate more memory than necessary.
> > > > >
> > > > > WDYT?
> > > > >
> > > >
> > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > >
> > > > - mmap prevents the calling task from reading other task's value.
> > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > the second, we need to create one iter_link object for each interested
> > > > tid. It may not be easy to use either.
> > > >
> > > > - Regarding adding an explicit syscall. I thought about adding
> > > > write/read syscalls for task local storage maps, just like reading
> > > > values from iter_link. Writing or reading task local storage map
> > > > updates/reads the current task's value. I think this could achieve the
> > > > same effect as mmap.
> > > >
> > >
> > > Actually, my use case of using mmap on task local storage is to allow
> > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > need to take an FD as parameter and the bpf progs can run
> > > concurrently, thus using global data is racy. Mmapable task local
> > > storage is the best solution I can find for this purpose.
> > >
> > > Song also mentioned to me offline, that mmapable task local storage
> > > may be useful for his use case.
> > >
> > > I am actually open to other proposals.
> > >
> >
> > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > storage for current. Data can be passed for that specific prog invocation using
> > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > are not allowed to be called right now.
> >
> 
> The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> thinking of allowing any thread including unpriv ones to be able to
> pass data to the prog and update their own storage.

If I understand the use case correctly all of this mmap-ing is only to
allow unpriv userspace to access a priv map via unpriv mmap() syscall.
But the map can be accessed as unpriv already.
Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
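For illustration, the suggested flow might look like this from the
unprivileged reader's side (a sketch only: the pin path and value type
are hypothetical, and it assumes a privileged process has already
pinned the map with world-readable permissions):

```c
/* Sketch only: assumes a privileged process already did
 *   bpf_obj_pin(map_fd, "/sys/fs/bpf/my_map");
 *   chmod("/sys/fs/bpf/my_map", 0644);
 * The pin path and value layout here are placeholders. */
#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static long sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void)
{
	union bpf_attr attr;
	int map_fd, key = 0;
	long value;

	/* BPF_OBJ_GET only needs read permission on the pinned path. */
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (unsigned long)"/sys/fs/bpf/my_map";
	map_fd = sys_bpf(BPF_OBJ_GET, &attr);
	if (map_fd < 0)
		return 1;

	/* The map_lookup sys_bpf cmd mentioned above. */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (unsigned long)&key;
	attr.value = (unsigned long)&value;
	if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr))
		return 1;
	printf("value = %ld\n", value);
	return 0;
}
```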


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-29 21:45           ` Martin KaFai Lau
@ 2022-03-30 18:05             ` Hao Luo
  0 siblings, 0 replies; 20+ messages in thread
From: Hao Luo @ 2022-03-30 18:05 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, KP Singh, Song Liu, bpf,
	linux-kernel

On Tue, Mar 29, 2022 at 2:45 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > Hi Yonghong,
> > > > >
> > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > >
> > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > and ringbuf have mmap implemented.
> > > > > > >
> > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > of bottleneck.
> > > > > >
> > > > > > I can see your use case here. Each calling process can get the
> > > > > > corresponding bpf program task local storage data through
> > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > between more memory vs. non-global synchronization.
> > > > > >
> > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > the similar result. We could implement a bpf_iter
> > > > > > for task local storage map, optionally it can provide
> > > > > > a tid to retrieve the data for that particular tid.
> > > > > > This way, user space needs an explicit syscall, but
> > > > > > does not need to allocate more memory than necessary.
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > >
> > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > >
> > > > > - mmap prevents the calling task from reading other task's value.
> > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > the second, we need to create one iter_link object for each interested
> > > > > tid. It may not be easy to use either.
> > > > >
> > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > write/read syscalls for task local storage maps, just like reading
> > > > > values from iter_link. Writing or reading task local storage map
> > > > > updates/reads the current task's value. I think this could achieve the
> > > > > same effect as mmap.
> > > > >
> > > >
> > > > Actually, my use case of using mmap on task local storage is to allow
> > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > need to take an FD as parameter and the bpf progs can run
> > > > concurrently, thus using global data is racy. Mmapable task local
> > > > storage is the best solution I can find for this purpose.
> Some more details are needed about the use case.  As long as there is
> storage local to an individual task, is racing within this one task's
> specific storage a non-issue?
>

Race is still possible. In the use case I was thinking of, the workflow is:

1. Current task mmaps a local storage map, writes a value to its local storage.
2. Current task then makes a syscall, which triggers a bpf prog.
3. The bpf prog reads the current task's local storage and fetches the
value stored by the task.

The steps above are sequential, so there is no race between the task
and the bpf prog. If a task accesses another task's local storage,
there can still be a race.
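
The steps above could be sketched from the task's side as follows
(hypothetical: it relies on the mmap support proposed in this RFC,
and map_fd / struct fd_storage are placeholders for whatever map and
value type the program uses):

```c
/* Sketch only: assumes the RFC's mmapable task local storage. */
struct fd_storage *val;

/* Step 1: mmap the task local storage map; this creates the
 * current task's storage, and writes go straight to it. */
val = mmap(NULL, sizeof(*val), PROT_READ | PROT_WRITE,
	   MAP_SHARED, map_fd, 0);
if (val == MAP_FAILED)
	err(1, "mmap task local storage");
val->fd = some_fd;	/* pass an FD without any bpf() syscall */

/* Steps 2-3: the syscall below triggers the attached bpf prog,
 * which reads current's storage and sees the value just written. */
openat(AT_FDCWD, "/some/path", O_RDONLY);
```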

> The patch 2 example is doable with the current api and is pretty far from
> the above use case description.  The existing bpf_map_update_elem() and
> bpf_map_lookup_elem() can not solve your use case?
>

With sysctl_unprivileged_bpf_disabled, tasks without CAP_BPF are not
able to make use of __sys_bpf(). bpf_map_update_elem() and
bpf_map_lookup_elem() go through __sys_bpf() underneath IIRC, so
unpriv tasks can't use these two commands.

> or does the current bpf_map_{update,lookup}_elem() work, but you
> prefer a direct data read/write interface?
>
> btw, how is delete going to look?
>

Good question. Deletion is not done from userspace; it's done on the
bpf prog side. The task mmaps its storage (which creates the storage)
and writes a value. The bpf prog reads the value and deletes the storage.
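
On the bpf prog side, the read-then-delete could look like this (a
sketch: the section name and task_fd_storage_map / struct fd_storage
are placeholders, but bpf_task_storage_get() and
bpf_task_storage_delete() are the existing task storage helpers):

```c
/* Sketch only: map and value type are placeholders. */
SEC("tp_btf/sys_enter")
int BPF_PROG(consume_and_delete)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct fd_storage *val;

	/* Look up current's existing storage; NULL if the task
	 * never mmap-ed (created) it. */
	val = bpf_task_storage_get(&task_fd_storage_map, task, NULL, 0);
	if (!val)
		return 0;
	/* ... use val->fd ... */
	bpf_task_storage_delete(&task_fd_storage_map, task);
	return 0;
}
```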

> and do you see the mmap could be used with sk and inode storage instead
> of the 'current' task?
>

Yes. I think this patch could certainly be extended to sk, inode or
cgroup local storage, but I don't have a use case yet.

> > > >
> > > > Song also mentioned to me offline, that mmapable task local storage
> > > > may be useful for his use case.
> > > >
> > > > I am actually open to other proposals.
> If the arraymap is local to an individual task, does it solve
> your use case?  Have you thought about storing an arraymap (which is mmap-able)
> in the local storage?  That could then be used to store a ringbuf map and
> other bpf maps in local storage.  It is logically similar to map-in-map.
> The outer map is the pseudo local storage map here.
>

I haven't thought about this. IMHO, it might be complicated to
properly set up the set of maps.


> > > >
> > >
> > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > storage for current. Data can be passed for that specific prog invocation using
> > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > are not allowed to be called right now.
> > >
> >
> > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > thinking of allowing any thread including unpriv ones to be able to
> > pass data to the prog and update their own storage.


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-29 23:29           ` Alexei Starovoitov
@ 2022-03-30 18:06             ` Hao Luo
  2022-03-30 18:16               ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Hao Luo @ 2022-03-30 18:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, KP Singh, Martin KaFai Lau,
	Song Liu, bpf, linux-kernel

On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > Hi Yonghong,
> > > > >
> > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > >
> > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > and ringbuf have mmap implemented.
> > > > > > >
> > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > of bottleneck.
> > > > > >
> > > > > > I can see your use case here. Each calling process can get the
> > > > > > corresponding bpf program task local storage data through
> > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > between more memory vs. non-global synchronization.
> > > > > >
> > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > the similar result. We could implement a bpf_iter
> > > > > > for task local storage map, optionally it can provide
> > > > > > a tid to retrieve the data for that particular tid.
> > > > > > This way, user space needs an explicit syscall, but
> > > > > > does not need to allocate more memory than necessary.
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > >
> > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > >
> > > > > - mmap prevents the calling task from reading other task's value.
> > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > the second, we need to create one iter_link object for each interested
> > > > > tid. It may not be easy to use either.
> > > > >
> > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > write/read syscalls for task local storage maps, just like reading
> > > > > values from iter_link. Writing or reading task local storage map
> > > > > updates/reads the current task's value. I think this could achieve the
> > > > > same effect as mmap.
> > > > >
> > > >
> > > > Actually, my use case of using mmap on task local storage is to allow
> > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > need to take an FD as parameter and the bpf progs can run
> > > > concurrently, thus using global data is racy. Mmapable task local
> > > > storage is the best solution I can find for this purpose.
> > > >
> > > > Song also mentioned to me offline, that mmapable task local storage
> > > > may be useful for his use case.
> > > >
> > > > I am actually open to other proposals.
> > > >
> > >
> > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > storage for current. Data can be passed for that specific prog invocation using
> > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > are not allowed to be called right now.
> > >
> >
> > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > thinking of allowing any thread including unpriv ones to be able to
> > pass data to the prog and update their own storage.
>
> If I understand the use case correctly all of this mmap-ing is only to
> allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> But the map can be accessed as unpriv already.
> Pin it with the world read creds and do map_lookup sys_bpf cmd on it.

Right, but, if I understand correctly, with
sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
use of __sys_bpf(). Is there anything I missed?


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-30 18:06             ` Hao Luo
@ 2022-03-30 18:16               ` Alexei Starovoitov
  2022-03-30 18:26                 ` Hao Luo
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2022-03-30 18:16 UTC (permalink / raw)
  To: Hao Luo
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, KP Singh, Martin KaFai Lau,
	Song Liu, bpf, LKML

On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <haoluo@google.com> wrote:
>
> On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > > >
> > > > > > Hi Yonghong,
> > > > > >
> > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > > >
> > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > and ringbuf have mmap implemented.
> > > > > > > >
> > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > of bottleneck.
> > > > > > >
> > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > corresponding bpf program task local storage data through
> > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > between more memory vs. non-global synchronization.
> > > > > > >
> > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > for task local storage map, optionally it can provide
> > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > does not need to allocate more memory than necessary.
> > > > > > >
> > > > > > > WDYT?
> > > > > > >
> > > > > >
> > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > >
> > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > the second, we need to create one iter_link object for each interested
> > > > > > tid. It may not be easy to use either.
> > > > > >
> > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > same effect as mmap.
> > > > > >
> > > > >
> > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > need to take an FD as parameter and the bpf progs can run
> > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > storage is the best solution I can find for this purpose.
> > > > >
> > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > may be useful for his use case.
> > > > >
> > > > > I am actually open to other proposals.
> > > > >
> > > >
> > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > storage for current. Data can be passed for that specific prog invocation using
> > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > are not allowed to be called right now.
> > > >
> > >
> > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > thinking of allowing any thread including unpriv ones to be able to
> > > pass data to the prog and update their own storage.
> >
> > If I understand the use case correctly all of this mmap-ing is only to
> > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > But the map can be accessed as unpriv already.
> > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
>
> Right, but, if I understand correctly, with
> sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> use of __sys_bpf(). Is there anything I missed?

That sysctl is a heavy hammer. Let's fix it instead.
map lookup/update/delete can be allowed for unpriv for certain map types.
There are permissions checks in corresponding lookup/update calls already.


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-30 18:16               ` Alexei Starovoitov
@ 2022-03-30 18:26                 ` Hao Luo
  2022-03-31 22:32                   ` KP Singh
  0 siblings, 1 reply; 20+ messages in thread
From: Hao Luo @ 2022-03-30 18:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, KP Singh, Martin KaFai Lau,
	Song Liu, bpf, LKML

On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <haoluo@google.com> wrote:
> >
> > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > >
> > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > > > >
> > > > > > > Hi Yonghong,
> > > > > > >
> > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > > > >
> > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > >
> > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > of bottleneck.
> > > > > > > >
> > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > >
> > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > >
> > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > tid. It may not be easy to use either.
> > > > > > >
> > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > same effect as mmap.
> > > > > > >
> > > > > >
> > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > storage is the best solution I can find for this purpose.
> > > > > >
> > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > may be useful for his use case.
> > > > > >
> > > > > > I am actually open to other proposals.
> > > > > >
> > > > >
> > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > are not allowed to be called right now.
> > > > >
> > > >
> > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > thinking of allowing any thread including unpriv ones to be able to
> > > > pass data to the prog and update their own storage.
> > >
> > > If I understand the use case correctly all of this mmap-ing is only to
> > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > But the map can be accessed as unpriv already.
> > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> >
> > Right, but, if I understand correctly, with
> > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > use of __sys_bpf(). Is there anything I missed?
>
> That sysctl is a heavy hammer. Let's fix it instead.
> map lookup/update/delete can be allowed for unpriv for certain map types.
> There are permissions checks in corresponding lookup/update calls already.

This sounds great. If we can allow basic map operations for some map
types, it will change many use cases I'm looking at. Let me take a
look and report back.


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-30 18:26                 ` Hao Luo
@ 2022-03-31 22:32                   ` KP Singh
  2022-03-31 23:06                     ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: KP Singh @ 2022-03-31 22:32 UTC (permalink / raw)
  To: Hao Luo, Jann Horn
  Cc: Alexei Starovoitov, Kumar Kartikeya Dwivedi, Yonghong Song,
	Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, bpf, LKML

On Wed, Mar 30, 2022 at 8:26 PM Hao Luo <haoluo@google.com> wrote:
>
> On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > > <memxor@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > > > > >
> > > > > > > > Hi Yonghong,
> > > > > > > >
> > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > > > > >
> > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > >
> > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > of bottleneck.
> > > > > > > > >
> > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > >
> > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > >
> > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > tid. It may not be easy to use either.
> > > > > > > >
> > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > same effect as mmap.
> > > > > > > >
> > > > > > >
> > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > storage is the best solution I can find for this purpose.
> > > > > > >
> > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > may be useful for his use case.
> > > > > > >
> > > > > > > I am actually open to other proposals.
> > > > > > >
> > > > > >
> > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > are not allowed to be called right now.
> > > > > >
> > > > >
> > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > pass data to the prog and update their own storage.
> > > >
> > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > But the map can be accessed as unpriv already.
> > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > >
> > > Right, but, if I understand correctly, with
> > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > use of __sys_bpf(). Is there anything I missed?
> >
> > That sysctl is a heavy hammer. Let's fix it instead.
> > map lookup/update/delete can be allowed for unpriv for certain map types.
> > There are permissions checks in corresponding lookup/update calls already.
>

(Adding Jann)

I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow writes only
to maps that are explicitly marked as writable by unprivileged processes.

We will have task local storage in LSM programs that we
wouldn't want unprivileged processes to write to either.

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
        __type(key, int);
        __type(value, struct fd_storage);
} task_fd_storage_map SEC(".maps");

- KP

> This sounds great. If we can allow basic map operations for some map
> types, it will change many use cases I'm looking at. Let me take a
> look and report back.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-31 22:32                   ` KP Singh
@ 2022-03-31 23:06                     ` Alexei Starovoitov
  2022-04-02  0:48                       ` KP Singh
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2022-03-31 23:06 UTC (permalink / raw)
  To: KP Singh
  Cc: Hao Luo, Jann Horn, Kumar Kartikeya Dwivedi, Yonghong Song,
	Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, bpf, LKML

On Thu, Mar 31, 2022 at 3:32 PM KP Singh <kpsingh@kernel.org> wrote:
>
> On Wed, Mar 30, 2022 at 8:26 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > > > <memxor@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Yonghong,
> > > > > > > > >
> > > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > > >
> > > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > > of bottleneck.
> > > > > > > > > >
> > > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > > >
> > > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > > >
> > > > > > > > > > WDYT?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > > >
> > > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > > tid. It may not be easy to use either.
> > > > > > > > >
> > > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > > same effect as mmap.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > > storage is the best solution I can find for this purpose.
> > > > > > > >
> > > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > > may be useful for his use case.
> > > > > > > >
> > > > > > > > I am actually open to other proposals.
> > > > > > > >
> > > > > > >
> > > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > > are not allowed to be called right now.
> > > > > > >
> > > > > >
> > > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > > pass data to the prog and update their own storage.
> > > > >
> > > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > > But the map can be accessed as unpriv already.
> > > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > > >
> > > > Right, but, if I understand correctly, with
> > > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > > use of __sys_bpf(). Is there anything I missed?
> > >
> > > That sysctl is a heavy hammer. Let's fix it instead.
> > > map lookup/update/delete can be allowed for unpriv for certain map types.
> > > There are permissions checks in corresponding lookup/update calls already.
> >
>
> (Adding Jann)
>
> I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow writes only
> to maps that are explicitly marked as writable by unprivileged processes.

I think it's overkill for existing unpriv maps like hash and array.
These maps by themselves don't pose a security threat.
The sysctl was/is in the wrong place.

> We will have task local storage in LSM programs that we
> wouldn't want unprivileged processes to write to either.
>
> struct {
>         __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
>         __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
>         __type(key, int);
>         __type(value, struct fd_storage);
> } task_fd_storage_map SEC(".maps");

The local storage map was not exposed to unpriv before.
That would be a different consideration.
But even in such a case the extra flag looks unnecessary.


* Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
  2022-03-31 23:06                     ` Alexei Starovoitov
@ 2022-04-02  0:48                       ` KP Singh
  0 siblings, 0 replies; 20+ messages in thread
From: KP Singh @ 2022-04-02  0:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hao Luo, Jann Horn, Kumar Kartikeya Dwivedi, Yonghong Song,
	Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, bpf, LKML

On Fri, Apr 1, 2022 at 1:06 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 31, 2022 at 3:32 PM KP Singh <kpsingh@kernel.org> wrote:
> >
> > On Wed, Mar 30, 2022 at 8:26 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi
> > > > > > > <memxor@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo <haoluo@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Yonghong,
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song <yhs@fb.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > > > >
> > > > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > > > of bottleneck.
> > > > > > > > > > >
> > > > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > > > >
> > > > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > > > >
> > > > > > > > > > > WDYT?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > > > >
> > > > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > > > tid. It may not be easy to use either.
> > > > > > > > > >
> > > > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > > > same effect as mmap.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > > > storage is the best solution I can find for this purpose.
> > > > > > > > >
> > > > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > > > may be useful for his use case.
> > > > > > > > >
> > > > > > > > > I am actually open to other proposals.
> > > > > > > > >
> > > > > > > >
> > > > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > > > are not allowed to be called right now.
> > > > > > > >
> > > > > > >
> > > > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > > > pass data to the prog and update their own storage.
> > > > > >
> > > > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > > > But the map can be accessed as unpriv already.
> > > > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > > > >
> > > > > Right, but, if I understand correctly, with
> > > > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > > > use of __sys_bpf(). Is there anything I missed?
> > > >
> > > > That sysctl is a heavy hammer. Let's fix it instead.
> > > > map lookup/update/delete can be allowed for unpriv for certain map types.
> > > > There are permissions checks in corresponding lookup/update calls already.
> > >
> >
> > (Adding Jann)
> >
> > I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow writes only
> > to maps that are explicitly marked as writable by unprivileged processes.
>
> I think it's overkill for existing unpriv maps like hash and array.
> These maps by themselves don't pose a security threat.
> The sysctl was/is in the wrong place.
>
> > We will have task local storage in LSM programs that we
> > wouldn't want unprivileged processes to write to either.
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> >         __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
> >         __type(key, int);
> >         __type(value, struct fd_storage);
> > } task_fd_storage_map SEC(".maps");
>
> The local storage map was not exposed to unpriv before.
> That would be a different consideration.
> But even in such a case the extra flag looks unnecessary.

I took a look at the code and it makes sense to allow maps like hash and
array, which are already mmapable. These should not really need
BPF_F_UNPRIVILEGED.

Hao, I think you can send a patch that removes these map operations from
the scope of the sysctl. Regarding the local storages, let's do them
separately since they would be a new access surface.


end of thread, other threads:[~2022-04-02  0:48 UTC | newest]

Thread overview: 20+ messages
2022-03-24 23:41 [PATCH RFC bpf-next 0/2] Mmapable task local storage Hao Luo
2022-03-24 23:41 ` [PATCH RFC bpf-next 1/2] bpf: Mmapable " Hao Luo
2022-03-24 23:41 ` [PATCH RFC bpf-next 2/2] selftests/bpf: Test mmapable task " Hao Luo
2022-03-25 19:16 ` [PATCH RFC bpf-next 0/2] Mmapable " Yonghong Song
2022-03-28 17:39   ` Hao Luo
2022-03-28 17:46     ` Hao Luo
2022-03-29  9:37       ` Kumar Kartikeya Dwivedi
2022-03-29 17:43         ` Hao Luo
2022-03-29 21:45           ` Martin KaFai Lau
2022-03-30 18:05             ` Hao Luo
2022-03-29 23:29           ` Alexei Starovoitov
2022-03-30 18:06             ` Hao Luo
2022-03-30 18:16               ` Alexei Starovoitov
2022-03-30 18:26                 ` Hao Luo
2022-03-31 22:32                   ` KP Singh
2022-03-31 23:06                     ` Alexei Starovoitov
2022-04-02  0:48                       ` KP Singh
2022-03-25  7:53 [PATCH RFC bpf-next 1/2] bpf: Mmapable " kernel test robot
2022-03-25 12:40 ` [kbuild] " Dan Carpenter
2022-03-25 17:36 ` Hao Luo
