* [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena.
@ 2024-02-09  4:05 Alexei Starovoitov
  2024-02-09  4:05 ` [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
                   ` (21 more replies)
  0 siblings, 22 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

v1->v2:
- Improved commit log with reasons for using vmap_pages_range() in bpf_arena.
  Thanks to Johannes
- Added support for __arena global variables in bpf programs
- Fixed race conditions spotted by Barret
- Fixed wrap32 issue spotted by Barret
- Fixed bpf_map_mmap_sz() the way Andrii suggested

The work on bpf_arena was inspired by Barret's work:
https://github.com/google/ghost-userspace/blob/main/lib/queue.bpf.h
that implements queues, lists and AVL trees completely as bpf programs
using a giant bpf array map and integer indices instead of pointers.
bpf_arena is a sparse array that makes it possible to use normal C pointers to
build such data structures. The last few patches implement a page_frag
allocator, a linked list and a hash table as bpf programs.

v1:
bpf programs have multiple options to communicate with user space:
- Various ring buffers (perf, ftrace, bpf): The data is streamed
  unidirectionally from bpf to user space.
- Hash map: The bpf program populates elements, and user space consumes them
  via bpf syscall.
- mmap()-ed array map: Libbpf creates an array map that is directly accessed by
  the bpf program and mmap-ed to user space. It's the fastest way. Its
  disadvantage is that memory for the whole array is reserved at the start.

Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.

Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
   region, like memcached or any key/value storage. The bpf program implements an
   in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
   value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
   rb-trees, sparse arrays), while user space consumes them.
3. bpf_arena is a "heap" of memory from the bpf program's point of view.
   User space may mmap it, but the bpf program will not convert pointers
   to the user base at run-time, in order to keep the bpf program fast.
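
For reference, a minimal arena map definition in a bpf program could look like
this (a sketch following the conventions used later in this series; max_entries
is the number of pages backing the arena and BPF_F_MMAPABLE must be set):

struct {
	__uint(type, BPF_MAP_TYPE_ARENA);
	__uint(map_flags, BPF_F_MMAPABLE);
	__uint(max_entries, 1000); /* number of pages backing the arena */
} arena SEC(".maps");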

Initially, the kernel vm_area and user vma are not populated. User space can
fault in pages within the range. While servicing a page fault, bpf_arena logic
will insert a new page into the kernel and user vmas. The bpf program can
allocate pages from that region via bpf_arena_alloc_pages(). This kernel
function will insert pages into the kernel vm_area. The subsequent fault-in
from user space will populate that page into the user vma. The
BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
from user space. In such a case, if a page is not allocated by the bpf program
and not present in the kernel vm_area, the user process will segfault. This is
useful for use cases 2 and 3 above.
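
On the user space side, mapping the arena is a plain shared mmap() of the map
fd (a sketch; error handling omitted, 'arena_map_fd' and 'arena_size' are
illustrative names):

	void *base = mmap(NULL, arena_size, PROT_READ | PROT_WRITE,
			  MAP_SHARED, arena_map_fd, 0);

Touching pages in that range then triggers the fault-in path described above,
unless BPF_F_SEGV_ON_FAULT was set.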

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
and removes the range from the kernel vm_area and from user process vmas.
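
On the bpf program side a minimal sketch of the same operations is (the kfunc
signatures are the ones added in the bpf_arena patch; NUMA_NO_NODE means any
node):

	/* allocate one page at any free spot in the arena */
	void *page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);

	if (page)
		bpf_arena_free_pages(&arena, page, 1);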

bpf_arena can be used as a bpf program "heap" of up to 4GB when the speed of
the bpf program is more important than ease of sharing with user space (use
case 3 above). In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
of user vm_start (if the pointer is not NULL) to arena pointers before they are
stored into memory. This way, user space sees them as valid 64-bit pointers.

The LLVM change https://github.com/llvm/llvm-project/pull/79902 taught the LLVM
BPF backend to generate the bpf_cast_kern() instruction before a dereference of
an arena pointer and the bpf_cast_user() instruction when an arena pointer is
formed. In a typical bpf program there will be very few bpf_cast_user()
instructions.

From LLVM's point of view, arena pointers are tagged as
__attribute__((address_space(1))). Hence, clang provides helpful diagnostics
when pointers cross address space. Libbpf and the kernel support only
address_space == 1. All other address space identifiers are reserved.
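
A convenient convention, used by the selftests later in this series, is to hide
the attribute behind a short macro, e.g.:

#define __arena __attribute__((address_space(1)))

int __arena *counters; /* 'counters' is an illustrative arena pointer */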

rX = bpf_cast_kern(rY, addr_space) tells the verifier that
rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
to be in the 32-bit domain. The verifier will mark load/store through
PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
copy_from_kernel_nofault() except that no address checks are necessary. The
address is guaranteed to be in the 4GB range. If the page is not present, the
destination register is zeroed on read, and the operation is ignored on write.
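
In other words, a load through a PTR_TO_ARENA register is JITed roughly as the
following pseudo code (shown for a 64-bit load; the fault handling is the
zeroing behavior described above):
rX = *(u64 *)(kern_vm_start + (u32)rY + off);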

rX = bpf_cast_user(rY, addr_space) tells the verifier that
rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
equivalent to:
rX = (u32)rY;
if (rY)
  rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */

After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list built
by a bpf program can be walked natively by user space. The last two patches
demonstrate how algorithms in the C language can be compiled as a bpf program
and as native code.
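
For instance, if a bpf program links elements through arena pointers, user
space can traverse them with plain loads (a sketch; 'struct elem' and 'first'
are illustrative names, not part of this series):

struct elem { long value; struct elem *next; };
long total = 0;

for (struct elem *e = first; e; e = e->next)
	total += e->value;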

Followup patches are planned:
. support bpf_spin_lock in arena
  bpf programs running on different CPUs can synchronize access to the arena via
  existing bpf_spin_lock mechanisms (spin_locks in bpf_array or in bpf hash map).
  It will be more convenient to allow spin_locks inside the arena too.

Patch set overview:
..
- patch 4: export vmap_pages_range() to be used outside of mm directory
- patch 5: main patch that introduces bpf_arena map type. See commit log
..
- patch 9: main verifier patch to support bpf_arena
..
- patch 11-14: libbpf support for arena
..
- patch 17-20: tests

Alexei Starovoitov (20):
  bpf: Allow kfuncs return 'void *'
  bpf: Recognize '__map' suffix in kfunc arguments
  bpf: Plumb get_unmapped_area() callback into bpf_map_ops
  mm: Expose vmap_pages_range() to the rest of the kernel.
  bpf: Introduce bpf_arena.
  bpf: Disasm support for cast_kern/user instructions.
  bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  bpf: Recognize cast_kern/user instructions in the verifier.
  bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  libbpf: Add __arg_arena to bpf_helpers.h
  libbpf: Add support for bpf_arena.
  libbpf: Allow specifying 64-bit integers in map BTF.
  libbpf: Recognize __arena global variables.
  bpf: Tell bpf programs kernel's PAGE_SIZE
  bpf: Add helper macro bpf_arena_cast()
  selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  selftests/bpf: Add bpf_arena_list test.
  selftests/bpf: Add bpf_arena_htab test.
  selftests/bpf: Convert simple page_frag allocator to per-cpu.

 arch/x86/net/bpf_jit_comp.c                   | 222 ++++++-
 include/linux/bpf.h                           |  11 +-
 include/linux/bpf_types.h                     |   1 +
 include/linux/bpf_verifier.h                  |   1 +
 include/linux/filter.h                        |   4 +
 include/linux/vmalloc.h                       |   2 +
 include/uapi/linux/bpf.h                      |  12 +
 kernel/bpf/Makefile                           |   3 +
 kernel/bpf/arena.c                            | 557 ++++++++++++++++++
 kernel/bpf/btf.c                              |  19 +-
 kernel/bpf/core.c                             |  23 +-
 kernel/bpf/disasm.c                           |  11 +
 kernel/bpf/log.c                              |   3 +
 kernel/bpf/syscall.c                          |  15 +
 kernel/bpf/verifier.c                         | 135 ++++-
 mm/vmalloc.c                                  |   4 +-
 tools/bpf/bpftool/gen.c                       |  13 +-
 tools/include/uapi/linux/bpf.h                |  12 +
 tools/lib/bpf/bpf_helpers.h                   |   6 +
 tools/lib/bpf/libbpf.c                        | 189 +++++-
 tools/lib/bpf/libbpf_probes.c                 |   7 +
 tools/testing/selftests/bpf/DENYLIST.aarch64  |   2 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |   2 +
 tools/testing/selftests/bpf/bpf_arena_alloc.h |  67 +++
 .../testing/selftests/bpf/bpf_arena_common.h  |  70 +++
 tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++
 tools/testing/selftests/bpf/bpf_arena_list.h  |  95 +++
 .../testing/selftests/bpf/bpf_experimental.h  |  41 ++
 .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++
 .../selftests/bpf/prog_tests/arena_list.c     |  68 +++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../testing/selftests/bpf/progs/arena_htab.c  |  46 ++
 .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
 .../testing/selftests/bpf/progs/arena_list.c  |  76 +++
 .../selftests/bpf/progs/verifier_arena.c      |  91 +++
 tools/testing/selftests/bpf/test_loader.c     |   9 +-
 36 files changed, 1969 insertions(+), 43 deletions(-)
 create mode 100644 kernel/bpf/arena.c
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_arena.c

-- 
2.34.1


* [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *'
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  6:49   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:05 ` [PATCH v2 bpf-next 02/20] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Recognize return of 'void *' from kfunc as returning unknown scalar.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ddaf09db1175..d9c2dbb3939f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 					meta.func_name);
 				return -EFAULT;
 			}
+		} else if (btf_type_is_void(ptr_type)) {
+			/* kfunc returning 'void *' is equivalent to returning scalar */
+			mark_reg_unknown(env, regs, BPF_REG_0);
 		} else if (!__btf_type_is_struct(ptr_type)) {
 			if (!meta.r0_size) {
 				__u32 sz;
-- 
2.34.1


* [PATCH v2 bpf-next 02/20] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
  2024-02-09  4:05 ` [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  6:52   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Recognize a 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
It allows a kfunc to have a 'void *' argument for maps, since bpf progs
will call it as:
struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
	...
} arena SEC(".maps");

bpf_kfunc_with_map(... &arena ...);

Underneath, libbpf will load CONST_PTR_TO_MAP into the register via a ld_imm64 insn.
If the kfunc was defined with 'struct bpf_map *' it would pass
the verifier, but the bpf prog would need to use '(void *)&arena',
which is not clean.
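
On the kernel side the kfunc is then declared with a 'void *' map argument
carrying the '__map' suffix, e.g. the arena kfunc added later in this series:

__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign,
					u32 page_cnt, int node_id, u64 flags);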

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d9c2dbb3939f..db569ce89fb1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10741,6 +10741,11 @@ static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *a
 	return __kfunc_param_match_suffix(btf, arg, "__ign");
 }
 
+static bool is_kfunc_arg_map(const struct btf *btf, const struct btf_param *arg)
+{
+	return __kfunc_param_match_suffix(btf, arg, "__map");
+}
+
 static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg)
 {
 	return __kfunc_param_match_suffix(btf, arg, "__alloc");
@@ -11064,7 +11069,7 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 		return KF_ARG_PTR_TO_CONST_STR;
 
 	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
-		if (!btf_type_is_struct(ref_t)) {
+		if (!btf_type_is_struct(ref_t) && !btf_type_is_void(ref_t)) {
 			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
 				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
 			return -EINVAL;
@@ -11660,6 +11665,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		if (kf_arg_type < 0)
 			return kf_arg_type;
 
+		if (is_kfunc_arg_map(btf, &args[i])) {
+			/* If argument has '__map' suffix expect 'struct bpf_map *' */
+			ref_id = *reg2btf_ids[CONST_PTR_TO_MAP];
+			ref_t = btf_type_by_id(btf_vmlinux, ref_id);
+			ref_tname = btf_name_by_offset(btf, ref_t->name_off);
+		}
+
 		switch (kf_arg_type) {
 		case KF_ARG_PTR_TO_NULL:
 			continue;
-- 
2.34.1


* [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
  2024-02-09  4:05 ` [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
  2024-02-09  4:05 ` [PATCH v2 bpf-next 02/20] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  0:06   ` kernel test robot
                     ` (3 more replies)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
                   ` (18 subsequent siblings)
  21 siblings, 4 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Subsequent patches introduce bpf_arena, which imposes special alignment
requirements on address selection, hence the need for a per-map
get_unmapped_area() callback.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h  |  3 +++
 kernel/bpf/syscall.c | 12 ++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1ebbee1d648e..8b0dcb66eb33 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -139,6 +139,9 @@ struct bpf_map_ops {
 	int (*map_mmap)(struct bpf_map *map, struct vm_area_struct *vma);
 	__poll_t (*map_poll)(struct bpf_map *map, struct file *filp,
 			     struct poll_table_struct *pts);
+	unsigned long (*map_get_unmapped_area)(struct file *filep, unsigned long addr,
+					       unsigned long len, unsigned long pgoff,
+					       unsigned long flags);
 
 	/* Functions called by bpf_local_storage maps */
 	int (*map_local_storage_charge)(struct bpf_local_storage_map *smap,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b2750b79ac80..8dd9814a0e14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -937,6 +937,17 @@ static __poll_t bpf_map_poll(struct file *filp, struct poll_table_struct *pts)
 	return EPOLLERR;
 }
 
+static unsigned long bpf_get_unmapped_area(struct file *filp, unsigned long addr,
+					   unsigned long len, unsigned long pgoff,
+					   unsigned long flags)
+{
+	struct bpf_map *map = filp->private_data;
+
+	if (map->ops->map_get_unmapped_area)
+		return map->ops->map_get_unmapped_area(filp, addr, len, pgoff, flags);
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
 const struct file_operations bpf_map_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= bpf_map_show_fdinfo,
@@ -946,6 +957,7 @@ const struct file_operations bpf_map_fops = {
 	.write		= bpf_dummy_write,
 	.mmap		= bpf_map_mmap,
 	.poll		= bpf_map_poll,
+	.get_unmapped_area = bpf_get_unmapped_area,
 };
 
 int bpf_map_new_fd(struct bpf_map *map, int flags)
-- 
2.34.1


* [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
  2024-02-14  8:36   ` Christoph Hellwig
  2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
                   ` (17 subsequent siblings)
  21 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

BPF would like to use the vmap API to implement a lazily-populated
memory space which can be shared by multiple userspace threads.

The vmap API is generally public and has functions to request and
release areas of kernel address space, as well as functions to map
various types of backing memory into that space.

For example, there is the public ioremap_page_range(), which is used
to map device memory into addressable kernel space.

The new BPF code needs the functionality of vmap_pages_range() in
order to incrementally map privately managed arrays of pages into its
vmap area. Indeed, this function used to be public, but became private
when use cases other than vmalloc disappeared.

Make it public again for the new external user.

The next commits will introduce bpf_arena, which is a sparsely populated shared
memory region between a bpf program and a user space process. It will map
privately-managed pages into an existing vm area. It's the same pattern and
layer of abstraction as ioremap_page_range().
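
A minimal sketch of that pattern, assuming the kernel VA range was reserved
earlier with get_vm_area() as the arena patch does ('map_one_arena_page' is an
illustrative name):

static int map_one_arena_page(struct vm_struct *area, unsigned long off,
			      struct page *page)
{
	unsigned long addr = (unsigned long)area->addr + off;

	return vmap_pages_range(addr, addr + PAGE_SIZE, PAGE_KERNEL,
				&page, PAGE_SHIFT);
}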

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/vmalloc.h | 2 ++
 mm/vmalloc.c            | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..bafb87c69e3d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)
 
 #ifdef CONFIG_MMU
 void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..eae93d575d1b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,8 +625,8 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-static int vmap_pages_range(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages, unsigned int page_shift)
+int vmap_pages_range(unsigned long addr, unsigned long end,
+		     pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
 	int err;
 
-- 
2.34.1


* [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-09 20:36   ` David Vernet
                     ` (3 more replies)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 06/20] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
                   ` (16 subsequent siblings)
  21 siblings, 4 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.

Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
   region, like memcached or any key/value storage. The bpf program implements an
   in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
   value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
   rb-trees, sparse arrays), while user space consumes them.
3. bpf_arena is a "heap" of memory from the bpf program's point of view.
   User space may mmap it, but the bpf program will not convert pointers
   to the user base at run-time, in order to keep the bpf program fast.

Initially, the kernel vm_area and user vma are not populated. User space can
fault in pages within the range. While servicing a page fault, bpf_arena logic
will insert a new page into the kernel and user vmas. The bpf program can
allocate pages from that region via bpf_arena_alloc_pages(). This kernel
function will insert pages into the kernel vm_area. The subsequent fault-in
from user space will populate that page into the user vma. The
BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
from user space. In such a case, if a page is not allocated by the bpf program
and not present in the kernel vm_area, the user process will segfault. This is
useful for use cases 2 and 3 above.

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
and removes the range from the kernel vm_area and from user process vmas.

bpf_arena can be used as a bpf program "heap" of up to 4GB when the speed of
the bpf program is more important than ease of sharing with user space (use
case 3 above). In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
of user vm_start (if the pointer is not NULL) to arena pointers before they are
stored into memory. This way, user space sees them as valid 64-bit pointers.

The LLVM change https://github.com/llvm/llvm-project/pull/79902 taught the LLVM
BPF backend to generate the bpf_cast_kern() instruction before a dereference of
an arena pointer and the bpf_cast_user() instruction when an arena pointer is
formed. In a typical bpf program there will be very few bpf_cast_user()
instructions.

From LLVM's point of view, arena pointers are tagged as
__attribute__((address_space(1))). Hence, clang provides helpful diagnostics
when pointers cross address space. Libbpf and the kernel support only
address_space == 1. All other address space identifiers are reserved.

rX = bpf_cast_kern(rY, addr_space) tells the verifier that
rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
to be in the 32-bit domain. The verifier will mark load/store through
PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
copy_from_kernel_nofault() except that no address checks are necessary. The
address is guaranteed to be in the 4GB range. If the page is not present, the
destination register is zeroed on read, and the operation is ignored on write.

rX = bpf_cast_user(rY, addr_space) tells the verifier that
rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
equivalent to:
rX = (u32)rY;
if (rY)
  rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */

After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list built
by a bpf program can be walked natively by user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h            |   5 +-
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   7 +
 kernel/bpf/Makefile            |   3 +
 kernel/bpf/arena.c             | 557 +++++++++++++++++++++++++++++++++
 kernel/bpf/core.c              |  11 +
 kernel/bpf/syscall.c           |   3 +
 kernel/bpf/verifier.c          |   1 +
 tools/include/uapi/linux/bpf.h |   7 +
 9 files changed, 593 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/arena.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8b0dcb66eb33..de557c6c42e0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -37,6 +37,7 @@ struct perf_event;
 struct bpf_prog;
 struct bpf_prog_aux;
 struct bpf_map;
+struct bpf_arena;
 struct sock;
 struct seq_file;
 struct btf;
@@ -534,8 +535,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
 			struct bpf_spin_lock *spin_lock);
 void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
 		      struct bpf_spin_lock *spin_lock);
-
-
+u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
+u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 94baced5a1ad..9f2a6b83b49e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d96708380e52..f6648851eae6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -983,6 +983,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	BPF_MAP_TYPE_ARENA,
 	__MAX_BPF_MAP_TYPE
 };
 
@@ -1370,6 +1371,12 @@ enum {
 
 /* BPF token FD is passed in a corresponding command's token_fd field */
 	BPF_F_TOKEN_FD          = (1U << 16),
+
+/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
+	BPF_F_SEGV_ON_FAULT	= (1U << 17),
+
+/* Do not translate kernel bpf_arena pointers to user pointers */
+	BPF_F_NO_USER_CONV	= (1U << 18),
 };
 
 /* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 4ce95acfcaa7..368c5d86b5b7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
+ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
+obj-$(CONFIG_BPF_SYSCALL) += arena.o
+endif
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
new file mode 100644
index 000000000000..5c1014471740
--- /dev/null
+++ b/kernel/bpf/arena.c
@@ -0,0 +1,557 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/err.h>
+#include <linux/btf_ids.h>
+#include <linux/vmalloc.h>
+#include <linux/pagemap.h>
+
+/*
+ * bpf_arena is a sparsely populated shared memory region between bpf program and
+ * user space process.
+ *
+ * For example on x86-64 the values could be:
+ * user_vm_start 7f7d26200000     // picked by mmap()
+ * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
+ * For user space all pointers within the arena are normal 8-byte addresses.
+ * In this example 7f7d26200000 is the address of the first page (pgoff=0).
+ * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
+ * (u32)7f7d26200000 -> 26200000
+ * hence
+ * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
+ * kernel memory region.
+ *
+ * BPF JITs generate the following code to access arena:
+ *   mov eax, eax  // eax has lower 32-bit of user pointer
+ *   mov word ptr [rax + r12 + off], bx
+ * where r12 == kern_vm_start and off is s16.
+ * Hence allocate 4Gb + GUARD_SZ/2 on each side.
+ *
+ * Initially kernel vm_area and user vma are not populated.
+ * User space can fault-in any address which will insert the page
+ * into kernel and user vma.
+ * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
+ * which will insert it into kernel vm_area.
+ * The later fault-in from user space will populate that page into user vma.
+ */
+
+/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
+#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
+#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)
+
+struct bpf_arena {
+	struct bpf_map map;
+	u64 user_vm_start;
+	u64 user_vm_end;
+	struct vm_struct *kern_vm;
+	struct maple_tree mt;
+	struct list_head vma_list;
+	struct mutex lock;
+};
+
+u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
+{
+	return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
+}
+
+u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
+{
+	return arena ? arena->user_vm_start : 0;
+}
+
+static long arena_map_peek_elem(struct bpf_map *map, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static long arena_map_pop_elem(struct bpf_map *map, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static long arena_map_delete_elem(struct bpf_map *map, void *value)
+{
+	return -EOPNOTSUPP;
+}
+
+static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -EOPNOTSUPP;
+}
+
+static long compute_pgoff(struct bpf_arena *arena, long uaddr)
+{
+	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
+}
+
+static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
+{
+	struct vm_struct *kern_vm;
+	int numa_node = bpf_map_attr_numa_node(attr);
+	struct bpf_arena *arena;
+	u64 vm_range;
+	int err = -ENOMEM;
+
+	if (attr->key_size || attr->value_size || attr->max_entries == 0 ||
+	    /* BPF_F_MMAPABLE must be set */
+	    !(attr->map_flags & BPF_F_MMAPABLE) ||
+	    /* No unsupported flags present */
+	    (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
+		return ERR_PTR(-EINVAL);
+
+	if (attr->map_extra & ~PAGE_MASK)
+		/* If non-zero the map_extra is an expected user VMA start address */
+		return ERR_PTR(-EINVAL);
+
+	vm_range = (u64)attr->max_entries * PAGE_SIZE;
+	if (vm_range > (1ull << 32))
+		return ERR_PTR(-E2BIG);
+
+	if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
+		/* user vma must not cross 32-bit boundary */
+		return ERR_PTR(-ERANGE);
+
+	kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
+	if (!kern_vm)
+		return ERR_PTR(-ENOMEM);
+
+	arena = bpf_map_area_alloc(sizeof(*arena), numa_node);
+	if (!arena)
+		goto err;
+
+	arena->kern_vm = kern_vm;
+	arena->user_vm_start = attr->map_extra;
+	if (arena->user_vm_start)
+		arena->user_vm_end = arena->user_vm_start + vm_range;
+
+	INIT_LIST_HEAD(&arena->vma_list);
+	bpf_map_init_from_attr(&arena->map, attr);
+	mt_init_flags(&arena->mt, MT_FLAGS_ALLOC_RANGE);
+	mutex_init(&arena->lock);
+
+	return &arena->map;
+err:
+	free_vm_area(kern_vm);
+	return ERR_PTR(err);
+}
+
+static int for_each_pte(pte_t *ptep, unsigned long addr, void *data)
+{
+	struct page *page;
+	pte_t pte;
+
+	pte = ptep_get(ptep);
+	if (!pte_present(pte))
+		return 0;
+	page = pte_page(pte);
+	/*
+	 * We do not update pte here:
+	 * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
+	 * 2. TLB flushing is batched or deferred. Even if we clear pte,
+	 * the TLB entries can stick around and continue to permit access to
+	 * the freed page. So it all relies on 1.
+	 */
+	__free_page(page);
+	return 0;
+}
+
+static void arena_map_free(struct bpf_map *map)
+{
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	/*
+	 * Check that user vma-s are not around when bpf map is freed.
+	 * mmap() holds vm_file which holds bpf_map refcnt.
+	 * munmap() must have happened on vma followed by arena_vm_close()
+	 * which would clear arena->vma_list.
+	 */
+	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
+		return;
+
+	/*
+	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
+	 * It unmaps everything from vmalloc area and clears pgtables.
+	 * Call apply_to_existing_page_range() first to find populated ptes and
+	 * free those pages.
+	 */
+	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				     KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);
+	free_vm_area(arena->kern_vm);
+	mtree_destroy(&arena->mt);
+	bpf_map_area_free(arena);
+}
+
+static void *arena_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static long arena_map_update_elem(struct bpf_map *map, void *key,
+				  void *value, u64 flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static int arena_map_check_btf(const struct bpf_map *map, const struct btf *btf,
+			       const struct btf_type *key_type, const struct btf_type *value_type)
+{
+	return 0;
+}
+
+static u64 arena_map_mem_usage(const struct bpf_map *map)
+{
+	return 0;
+}
+
+struct vma_list {
+	struct vm_area_struct *vma;
+	struct list_head head;
+};
+
+static int remember_vma(struct bpf_arena *arena, struct vm_area_struct *vma)
+{
+	struct vma_list *vml;
+
+	vml = kmalloc(sizeof(*vml), GFP_KERNEL);
+	if (!vml)
+		return -ENOMEM;
+	vma->vm_private_data = vml;
+	vml->vma = vma;
+	list_add(&vml->head, &arena->vma_list);
+	return 0;
+}
+
+static void arena_vm_close(struct vm_area_struct *vma)
+{
+	struct bpf_map *map = vma->vm_file->private_data;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+	struct vma_list *vml;
+
+	guard(mutex)(&arena->lock);
+	vml = vma->vm_private_data;
+	list_del(&vml->head);
+	vma->vm_private_data = NULL;
+	kfree(vml);
+}
+
+#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
+
+static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
+{
+	struct bpf_map *map = vmf->vma->vm_file->private_data;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+	struct page *page;
+	long kbase, kaddr;
+	int ret;
+
+	kbase = bpf_arena_get_kern_vm_start(arena);
+	kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
+
+	guard(mutex)(&arena->lock);
+	page = vmalloc_to_page((void *)kaddr);
+	if (page)
+		/* already have a page vmap-ed */
+		goto out;
+
+	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
+		/* User space requested to segfault when page is not allocated by bpf prog */
+		return VM_FAULT_SIGSEGV;
+
+	ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
+	if (ret)
+		return VM_FAULT_SIGSEGV;
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page) {
+		mtree_erase(&arena->mt, vmf->pgoff);
+		return VM_FAULT_SIGSEGV;
+	}
+
+	ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);
+	if (ret) {
+		mtree_erase(&arena->mt, vmf->pgoff);
+		__free_page(page);
+		return VM_FAULT_SIGSEGV;
+	}
+out:
+	page_ref_add(page, 1);
+	vmf->page = page;
+	return 0;
+}
+
+static const struct vm_operations_struct arena_vm_ops = {
+	.close		= arena_vm_close,
+	.fault          = arena_vm_fault,
+};
+
+static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr,
+					     unsigned long len, unsigned long pgoff,
+					     unsigned long flags)
+{
+	struct bpf_map *map = filp->private_data;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+	long ret;
+
+	if (pgoff)
+		return -EINVAL;
+	if (len > (1ull << 32))
+		return -E2BIG;
+
+	/* if user_vm_start was specified at arena creation time */
+	if (arena->user_vm_start) {
+		if (len > arena->user_vm_end - arena->user_vm_start)
+			return -E2BIG;
+		if (len != arena->user_vm_end - arena->user_vm_start)
+			return -EINVAL;
+		if (addr != arena->user_vm_start)
+			return -EINVAL;
+	}
+
+	ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
+	if (IS_ERR_VALUE(ret))
+                return 0;
+	if ((ret >> 32) == ((ret + len - 1) >> 32))
+		return ret;
+	if (WARN_ON_ONCE(arena->user_vm_start))
+		/* checks at map creation time should prevent this */
+		return -EFAULT;
+	return round_up(ret, 1ull << 32);
+}
+
+static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
+{
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	guard(mutex)(&arena->lock);
+	if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
+		/*
+		 * If map_extra was not specified at arena creation time then
+		 * 1st user process can do mmap(NULL, ...) to pick user_vm_start
+		 * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
+		 *   or
+		 * specify addr in map_extra and
+		 * use the same addr later with mmap(addr, MAP_FIXED..);
+		 */
+		return -EBUSY;
+
+	if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
+		/* all user processes must have the same size of mmap-ed region */
+		return -EBUSY;
+
+	/* Earlier checks should prevent this */
+	if (WARN_ON_ONCE(vma->vm_end - vma->vm_start > (1ull << 32) || vma->vm_pgoff))
+		return -EFAULT;
+
+	if (remember_vma(arena, vma))
+		return -ENOMEM;
+
+	arena->user_vm_start = vma->vm_start;
+	arena->user_vm_end = vma->vm_end;
+	/*
+	 * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
+	 * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
+	 * potential change of user_vm_start.
+	 */
+	vm_flags_set(vma, VM_DONTEXPAND);
+	vma->vm_ops = &arena_vm_ops;
+	return 0;
+}
+
+static int arena_map_direct_value_addr(const struct bpf_map *map, u64 *imm, u32 off)
+{
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if ((u64)off > arena->user_vm_end - arena->user_vm_start)
+		return -ERANGE;
+	*imm = (unsigned long)arena->user_vm_start;
+	return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_arena_map_btf_ids, struct, bpf_arena)
+const struct bpf_map_ops arena_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = arena_map_alloc,
+	.map_free = arena_map_free,
+	.map_direct_value_addr = arena_map_direct_value_addr,
+	.map_mmap = arena_map_mmap,
+	.map_get_unmapped_area = arena_get_unmapped_area,
+	.map_get_next_key = arena_map_get_next_key,
+	.map_push_elem = arena_map_push_elem,
+	.map_peek_elem = arena_map_peek_elem,
+	.map_pop_elem = arena_map_pop_elem,
+	.map_lookup_elem = arena_map_lookup_elem,
+	.map_update_elem = arena_map_update_elem,
+	.map_delete_elem = arena_map_delete_elem,
+	.map_check_btf = arena_map_check_btf,
+	.map_mem_usage = arena_map_mem_usage,
+	.map_btf_id = &bpf_arena_map_btf_ids[0],
+};
+
+static u64 clear_lo32(u64 val)
+{
+	return val & ~(u64)~0U;
+}
+
+/*
+ * Allocate pages and vmap them into kernel vmalloc area.
+ * Later the pages will be mmaped into user space vma.
+ */
+static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
+{
+	/* user_vm_end/start are fixed before bpf prog runs */
+	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+	long pgoff = 0, nr_pages = 0;
+	struct page **pages;
+	u32 uaddr32;
+	int ret, i;
+
+	if (page_cnt > page_cnt_max)
+		return 0;
+
+	if (uaddr) {
+		if (uaddr & ~PAGE_MASK)
+			return 0;
+		pgoff = compute_pgoff(arena, uaddr);
+		if (pgoff + page_cnt > page_cnt_max)
+			/* requested address will be outside of user VMA */
+			return 0;
+	}
+
+	/* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
+	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return 0;
+
+	guard(mutex)(&arena->lock);
+
+	if (uaddr)
+		ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt - 1,
+					 MT_ENTRY, GFP_KERNEL);
+	else
+		ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
+					page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);
+	if (ret)
+		goto out_free_pages;
+
+	nr_pages = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_ZERO, node_id, page_cnt, pages);
+	if (nr_pages != page_cnt)
+		goto out;
+
+	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
+	/* Earlier checks make sure that uaddr32 + page_cnt * PAGE_SIZE will not overflow 32-bit */
+	ret = vmap_pages_range(kern_vm_start + uaddr32,
+			       kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE,
+			       PAGE_KERNEL, pages, PAGE_SHIFT);
+	if (ret)
+		goto out;
+	kvfree(pages);
+	return clear_lo32(arena->user_vm_start) + uaddr32;
+out:
+	mtree_erase(&arena->mt, pgoff);
+out_free_pages:
+	if (pages)
+		for (i = 0; i < nr_pages; i++)
+			__free_page(pages[i]);
+	kvfree(pages);
+	return 0;
+}
+
+/*
+ * If page is present in vmalloc area, unmap it from vmalloc area,
+ * unmap it from all user space vma-s,
+ * and free it.
+ */
+static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+{
+	struct vma_list *vml;
+
+	list_for_each_entry(vml, &arena->vma_list, head)
+		zap_page_range_single(vml->vma, uaddr,
+				      PAGE_SIZE * page_cnt, NULL);
+}
+
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+{
+	u64 full_uaddr, uaddr_end;
+	long kaddr, pgoff, i;
+	struct page *page;
+
+	/* only aligned lower 32-bit are relevant */
+	uaddr = (u32)uaddr;
+	uaddr &= PAGE_MASK;
+	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
+	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
+	if (full_uaddr >= uaddr_end)
+		return;
+
+	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+
+	guard(mutex)(&arena->lock);
+
+	pgoff = compute_pgoff(arena, uaddr);
+	/* clear range */
+	mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt - 1, NULL, GFP_KERNEL);
+
+	if (page_cnt > 1)
+		/* bulk zap if multiple pages being freed */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
+	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
+		page = vmalloc_to_page((void *)kaddr);
+		if (!page)
+			continue;
+		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
+			zap_pages(arena, full_uaddr, 1);
+		vunmap_range(kaddr, kaddr + PAGE_SIZE);
+		__free_page(page);
+	}
+}
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
+					int node_id, u64 flags)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
+		return NULL;
+
+	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
+}
+
+__bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
+		return;
+	arena_free_pages(arena, (long)ptr__ign, page_cnt);
+}
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(arena_kfuncs)
+BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE)
+BTF_KFUNCS_END(arena_kfuncs)
+
+static const struct btf_kfunc_id_set common_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &arena_kfuncs,
+};
+
+static int __init kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &common_kfunc_set);
+}
+late_initcall(kfunc_init);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 71c459a51d9e..2539d9bfe369 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2970,6 +2970,17 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
 {
 }
 
+/* for configs without MMU or 32-bit */
+__weak const struct bpf_map_ops arena_map_ops;
+__weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
+{
+	return 0;
+}
+__weak u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
+{
+	return 0;
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 static int __init bpf_global_ma_init(void)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8dd9814a0e14..6b9efb3f79dd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -164,6 +164,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	if (bpf_map_is_offloaded(map)) {
 		return bpf_map_offload_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
+		   map->map_type == BPF_MAP_TYPE_ARENA ||
 		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		return map->ops->map_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH ||
@@ -1172,6 +1173,7 @@ static int map_create(union bpf_attr *attr)
 	}
 
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
+	    attr->map_type != BPF_MAP_TYPE_ARENA &&
 	    attr->map_extra != 0)
 		return -EINVAL;
 
@@ -1261,6 +1263,7 @@ static int map_create(union bpf_attr *attr)
 	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 	case BPF_MAP_TYPE_STRUCT_OPS:
 	case BPF_MAP_TYPE_CPUMAP:
+	case BPF_MAP_TYPE_ARENA:
 		if (!bpf_token_capable(token, CAP_BPF))
 			goto put_token;
 		break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index db569ce89fb1..3c77a3ab1192 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18047,6 +18047,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		case BPF_MAP_TYPE_SK_STORAGE:
 		case BPF_MAP_TYPE_TASK_STORAGE:
 		case BPF_MAP_TYPE_CGRP_STORAGE:
+		case BPF_MAP_TYPE_ARENA:
 			break;
 		default:
 			verbose(env,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d96708380e52..f6648851eae6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -983,6 +983,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	BPF_MAP_TYPE_ARENA,
 	__MAX_BPF_MAP_TYPE
 };
 
@@ -1370,6 +1371,12 @@ enum {
 
 /* BPF token FD is passed in a corresponding command's token_fd field */
 	BPF_F_TOKEN_FD          = (1U << 16),
+
+/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
+	BPF_F_SEGV_ON_FAULT	= (1U << 17),
+
+/* Do not translate kernel bpf_arena pointers to user pointers */
+	BPF_F_NO_USER_CONV	= (1U << 18),
 };
 
 /* Flags for BPF_PROG_QUERY. */
-- 
2.34.1


* [PATCH v2 bpf-next 06/20] bpf: Disasm support for cast_kern/user instructions.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  7:41   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:05 ` [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

LLVM generates rX = bpf_cast_kern/_user(rY, address_space) instructions
when pointers in a non-zero address space are used by the bpf program.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h       |  5 +++++
 kernel/bpf/disasm.c            | 11 +++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++++
 3 files changed, 21 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f6648851eae6..3de1581379d4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1313,6 +1313,11 @@ enum {
  */
 #define BPF_PSEUDO_KFUNC_CALL	2
 
+enum bpf_arena_cast_kinds {
+	BPF_ARENA_CAST_KERN = 1,
+	BPF_ARENA_CAST_USER = 2,
+};
+
 /* flags for BPF_MAP_UPDATE_ELEM command */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index 49940c26a227..37d9b37b34f7 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c
@@ -166,6 +166,12 @@ static bool is_movsx(const struct bpf_insn *insn)
 	       (insn->off == 8 || insn->off == 16 || insn->off == 32);
 }
 
+static bool is_arena_cast(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) &&
+		(insn->off == BPF_ARENA_CAST_KERN || insn->off == BPF_ARENA_CAST_USER);
+}
+
 void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 		    const struct bpf_insn *insn,
 		    bool allow_ptr_leaks)
@@ -184,6 +190,11 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 				insn->code, class == BPF_ALU ? 'w' : 'r',
 				insn->dst_reg, class == BPF_ALU ? 'w' : 'r',
 				insn->dst_reg);
+		} else if (is_arena_cast(insn)) {
+			verbose(cbs->private_data, "(%02x) r%d = cast_%s(r%d, %d)\n",
+				insn->code, insn->dst_reg,
+				insn->off == BPF_ARENA_CAST_KERN ? "kern" : "user",
+				insn->src_reg, insn->imm);
 		} else if (BPF_SRC(insn->code) == BPF_X) {
 			verbose(cbs->private_data, "(%02x) %c%d %s %s%c%d\n",
 				insn->code, class == BPF_ALU ? 'w' : 'r',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f6648851eae6..3de1581379d4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1313,6 +1313,11 @@ enum {
  */
 #define BPF_PSEUDO_KFUNC_CALL	2
 
+enum bpf_arena_cast_kinds {
+	BPF_ARENA_CAST_KERN = 1,
+	BPF_ARENA_CAST_USER = 2,
+};
+
 /* flags for BPF_MAP_UPDATE_ELEM command */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
-- 
2.34.1


* [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 06/20] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-09 17:20   ` Eduard Zingerman
  2024-02-10  6:48   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:05 ` [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
They are similar to PROBE_MEM instructions with the following differences:
- PROBE_MEM has to check that the address is in the kernel range with
  src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
- PROBE_MEM doesn't support store
- PROBE_MEM32 relies on the verifier to clear upper 32-bit in the register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in %r12 in the prologue)
  Due to the bpf_arena construction, such a %r12 + %reg + off16 access is guaranteed
  to be within the arena virtual range, so no address check is needed at run-time.
- PROBE_MEM32 allows STX and ST. If they fault, the store is a nop.
  When LDX faults, the destination register is zeroed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 183 +++++++++++++++++++++++++++++++++++-
 include/linux/bpf.h         |   1 +
 include/linux/filter.h      |   3 +
 3 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e1390d1e331b..883b7f604b9a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -113,6 +113,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
 /* Pick a register outside of BPF range for JIT internal work */
 #define AUX_REG (MAX_BPF_JIT_REG + 1)
 #define X86_REG_R9 (MAX_BPF_JIT_REG + 2)
+#define X86_REG_R12 (MAX_BPF_JIT_REG + 3)
 
 /*
  * The following table maps BPF registers to x86-64 registers.
@@ -139,6 +140,7 @@ static const int reg2hex[] = {
 	[BPF_REG_AX] = 2, /* R10 temp register */
 	[AUX_REG] = 3,    /* R11 temp register */
 	[X86_REG_R9] = 1, /* R9 register, 6th function argument */
+	[X86_REG_R12] = 4, /* R12 callee saved */
 };
 
 static const int reg2pt_regs[] = {
@@ -167,6 +169,7 @@ static bool is_ereg(u32 reg)
 			     BIT(BPF_REG_8) |
 			     BIT(BPF_REG_9) |
 			     BIT(X86_REG_R9) |
+			     BIT(X86_REG_R12) |
 			     BIT(BPF_REG_AX));
 }
 
@@ -205,6 +208,17 @@ static u8 add_2mod(u8 byte, u32 r1, u32 r2)
 	return byte;
 }
 
+static u8 add_3mod(u8 byte, u32 r1, u32 r2, u32 index)
+{
+	if (is_ereg(r1))
+		byte |= 1;
+	if (is_ereg(index))
+		byte |= 2;
+	if (is_ereg(r2))
+		byte |= 4;
+	return byte;
+}
+
 /* Encode 'dst_reg' register into x86-64 opcode 'byte' */
 static u8 add_1reg(u8 byte, u32 dst_reg)
 {
@@ -887,6 +901,18 @@ static void emit_insn_suffix(u8 **pprog, u32 ptr_reg, u32 val_reg, int off)
 	*pprog = prog;
 }
 
+static void emit_insn_suffix_SIB(u8 **pprog, u32 ptr_reg, u32 val_reg, u32 index_reg, int off)
+{
+	u8 *prog = *pprog;
+
+	if (is_imm8(off)) {
+		EMIT3(add_2reg(0x44, BPF_REG_0, val_reg), add_2reg(0, ptr_reg, index_reg) /* SIB */, off);
+	} else {
+		EMIT2_off32(add_2reg(0x84, BPF_REG_0, val_reg), add_2reg(0, ptr_reg, index_reg) /* SIB */, off);
+	}
+	*pprog = prog;
+}
+
 /*
  * Emit a REX byte if it will be necessary to address these registers
  */
@@ -968,6 +994,37 @@ static void emit_ldsx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 	*pprog = prog;
 }
 
+static void emit_ldx_index(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, u32 index_reg, int off)
+{
+	u8 *prog = *pprog;
+
+	switch (size) {
+	case BPF_B:
+		/* movzx rax, byte ptr [rax + r12 + off] */
+		EMIT3(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x0F, 0xB6);
+		break;
+	case BPF_H:
+		/* movzx rax, word ptr [rax + r12 + off] */
+		EMIT3(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x0F, 0xB7);
+		break;
+	case BPF_W:
+		/* mov eax, dword ptr [rax + r12 + off] */
+		EMIT2(add_3mod(0x40, src_reg, dst_reg, index_reg), 0x8B);
+		break;
+	case BPF_DW:
+		/* mov rax, qword ptr [rax + r12 + off] */
+		EMIT2(add_3mod(0x48, src_reg, dst_reg, index_reg), 0x8B);
+		break;
+	}
+	emit_insn_suffix_SIB(&prog, src_reg, dst_reg, index_reg, off);
+	*pprog = prog;
+}
+
+static void emit_ldx_r12(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
+{
+	emit_ldx_index(pprog, size, dst_reg, src_reg, X86_REG_R12, off);
+}
+
 /* STX: *(u8*)(dst_reg + off) = src_reg */
 static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 {
@@ -1002,6 +1059,71 @@ static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 	*pprog = prog;
 }
 
+/* STX: *(u8*)(dst_reg + index_reg + off) = src_reg */
+static void emit_stx_index(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, u32 index_reg, int off)
+{
+	u8 *prog = *pprog;
+
+	switch (size) {
+	case BPF_B:
+		/* mov byte ptr [rax + r12 + off], al */
+		EMIT2(add_3mod(0x40, dst_reg, src_reg, index_reg), 0x88);
+		break;
+	case BPF_H:
+		/* mov word ptr [rax + r12 + off], ax */
+		EMIT3(0x66, add_3mod(0x40, dst_reg, src_reg, index_reg), 0x89);
+		break;
+	case BPF_W:
+		/* mov dword ptr [rax + r12 + off], eax */
+		EMIT2(add_3mod(0x40, dst_reg, src_reg, index_reg), 0x89);
+		break;
+	case BPF_DW:
+		/* mov qword ptr [rax + r12 + off], rax */
+		EMIT2(add_3mod(0x48, dst_reg, src_reg, index_reg), 0x89);
+		break;
+	}
+	emit_insn_suffix_SIB(&prog, dst_reg, src_reg, index_reg, off);
+	*pprog = prog;
+}
+
+static void emit_stx_r12(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
+{
+	emit_stx_index(pprog, size, dst_reg, src_reg, X86_REG_R12, off);
+}
+
+/* ST: *(u8*)(dst_reg + index_reg + off) = imm32 */
+static void emit_st_index(u8 **pprog, u32 size, u32 dst_reg, u32 index_reg, int off, int imm)
+{
+	u8 *prog = *pprog;
+
+	switch (size) {
+	case BPF_B:
+		/* mov byte ptr [rax + r12 + off], imm8 */
+		EMIT2(add_3mod(0x40, dst_reg, 0, index_reg), 0xC6);
+		break;
+	case BPF_H:
+		/* mov word ptr [rax + r12 + off], imm16 */
+		EMIT3(0x66, add_3mod(0x40, dst_reg, 0, index_reg), 0xC7);
+		break;
+	case BPF_W:
+		/* mov dword ptr [rax + r12 + off], imm32 */
+		EMIT2(add_3mod(0x40, dst_reg, 0, index_reg), 0xC7);
+		break;
+	case BPF_DW:
+		/* mov qword ptr [rax + r12 + off], imm32 */
+		EMIT2(add_3mod(0x48, dst_reg, 0, index_reg), 0xC7);
+		break;
+	}
+	emit_insn_suffix_SIB(&prog, dst_reg, 0, index_reg, off);
+	EMIT(imm, bpf_size_to_x86_bytes(size));
+	*pprog = prog;
+}
+
+static void emit_st_r12(u8 **pprog, u32 size, u32 dst_reg, int off, int imm)
+{
+	emit_st_index(pprog, size, dst_reg, X86_REG_R12, off, imm);
+}
+
 static int emit_atomic(u8 **pprog, u8 atomic_op,
 		       u32 dst_reg, u32 src_reg, s16 off, u8 bpf_size)
 {
@@ -1043,12 +1165,15 @@ static int emit_atomic(u8 **pprog, u8 atomic_op,
 	return 0;
 }
 
+#define DONT_CLEAR 1
+
 bool ex_handler_bpf(const struct exception_table_entry *x, struct pt_regs *regs)
 {
 	u32 reg = x->fixup >> 8;
 
 	/* jump over faulting load and clear dest register */
-	*(unsigned long *)((void *)regs + reg) = 0;
+	if (reg != DONT_CLEAR)
+		*(unsigned long *)((void *)regs + reg) = 0;
 	regs->ip += x->fixup & 0xff;
 	return true;
 }
@@ -1147,11 +1272,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+	u64 arena_vm_start;
 	int i, excnt = 0;
 	int ilen, proglen = 0;
 	u8 *prog = temp;
 	int err;
 
+	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
+
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
 
@@ -1172,8 +1300,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		push_r12(&prog);
 		push_callee_regs(&prog, all_callee_regs_used);
 	} else {
+		if (arena_vm_start)
+			push_r12(&prog);
 		push_callee_regs(&prog, callee_regs_used);
 	}
+	if (arena_vm_start)
+		emit_mov_imm64(&prog, X86_REG_R12,
+			       arena_vm_start >> 32, (u32) arena_vm_start);
 
 	ilen = prog - temp;
 	if (rw_image)
@@ -1564,6 +1697,52 @@ st:			if (is_imm8(insn->off))
 			emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
 			break;
 
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
+		case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
+			start_of_ldx = prog;
+			emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
+			goto populate_extable;
+
+			/* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
+		case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
+		case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
+			start_of_ldx = prog;
+			if (BPF_CLASS(insn->code) == BPF_LDX)
+				emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
+			else
+				emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
+populate_extable:
+			{
+				struct exception_table_entry *ex;
+				u8 *_insn = image + proglen + (start_of_ldx - temp);
+				s64 delta;
+
+				if (!bpf_prog->aux->extable)
+					break;
+
+				ex = &bpf_prog->aux->extable[excnt++];
+
+				delta = _insn - (u8 *)&ex->insn;
+				/* switch ex to rw buffer for writes */
+				ex = (void *)rw_image + ((void *)ex - (void *)image);
+
+				ex->insn = delta;
+
+				ex->data = EX_TYPE_BPF;
+
+				ex->fixup = (prog - start_of_ldx) |
+					((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 8);
+			}
+			break;
+
 			/* LDX: dst_reg = *(u8*)(src_reg + off) */
 		case BPF_LDX | BPF_MEM | BPF_B:
 		case BPF_LDX | BPF_PROBE_MEM | BPF_B:
@@ -2036,6 +2215,8 @@ st:			if (is_imm8(insn->off))
 				pop_r12(&prog);
 			} else {
 				pop_callee_regs(&prog, callee_regs_used);
+				if (arena_vm_start)
+					pop_r12(&prog);
 			}
 			EMIT1(0xC9);         /* leave */
 			emit_return(&prog, image + addrs[i - 1] + (prog - temp));
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index de557c6c42e0..26419a57bf9f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1463,6 +1463,7 @@ struct bpf_prog_aux {
 	bool xdp_has_frags;
 	bool exception_cb;
 	bool exception_boundary;
+	struct bpf_arena *arena;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
 	/* function name for valid attach_btf_id */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fee070b9826e..cd76d43412d0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -72,6 +72,9 @@ struct ctl_table_header;
 /* unused opcode to mark special ldsx instruction. Same as BPF_IND */
 #define BPF_PROBE_MEMSX	0x40
 
+/* unused opcode to mark special load instruction. Same as BPF_MSH */
+#define BPF_PROBE_MEM32	0xa0
+
 /* unused opcode to mark call to interpreter with arguments */
 #define BPF_CALL_ARGS	0xe0
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  1:15   ` Eduard Zingerman
  2024-02-10  8:40   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:05 ` [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

LLVM generates bpf_cast_kern and bpf_cast_user instructions while translating
pointers with __attribute__((address_space(1))).

rX = cast_kern(rY) is processed by the verifier and converted to a
normal 32-bit move: wX = wY

bpf_cast_user has to be converted by the JIT.

rX = cast_user(rY) is

aux_reg = upper_32_bits of arena->user_vm_start
aux_reg <<= 32
wX = wY // clear upper 32 bits of dst register
if (wX) // if not zero add upper bits of user_vm_start
  wX |= aux_reg

JIT can do it more efficiently:

mov dst_reg32, src_reg32  // 32-bit move
shl dst_reg, 32
or dst_reg, user_vm_start
rol dst_reg, 32
xor r11, r11
test dst_reg32, dst_reg32 // check if lower 32-bit are zero
cmove r11, dst_reg	  // if so, set dst_reg to zero
			  // Intel swapped src/dst register encoding in CMOVcc
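
The same transformation expressed as a C sketch (illustrative only, not code
from this patch):

  #include <stdint.h>

  /* replace the upper 32 bits with those of user_vm_start, but keep NULL as NULL */
  static inline uint64_t cast_user(uint64_t ptr, uint64_t user_vm_start)
  {
          uint32_t lo = (uint32_t)ptr;

          return lo ? ((user_vm_start >> 32) << 32) | lo : 0;
  }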

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 41 ++++++++++++++++++++++++++++++++++++-
 include/linux/filter.h      |  1 +
 kernel/bpf/core.c           |  5 +++++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 883b7f604b9a..a042ed57af7b 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1272,13 +1272,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
-	u64 arena_vm_start;
+	u64 arena_vm_start, user_vm_start;
 	int i, excnt = 0;
 	int ilen, proglen = 0;
 	u8 *prog = temp;
 	int err;
 
 	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
+	user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
 
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
@@ -1346,6 +1347,39 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 			break;
 
 		case BPF_ALU64 | BPF_MOV | BPF_X:
+			if (insn->off == BPF_ARENA_CAST_USER) {
+				if (dst_reg != src_reg)
+					/* 32-bit mov */
+					emit_mov_reg(&prog, false, dst_reg, src_reg);
+				/* shl dst_reg, 32 */
+				maybe_emit_1mod(&prog, dst_reg, true);
+				EMIT3(0xC1, add_1reg(0xE0, dst_reg), 32);
+
+				/* or dst_reg, user_vm_start */
+				maybe_emit_1mod(&prog, dst_reg, true);
+				if (is_axreg(dst_reg))
+					EMIT1_off32(0x0D,  user_vm_start >> 32);
+				else
+					EMIT2_off32(0x81, add_1reg(0xC8, dst_reg),  user_vm_start >> 32);
+
+				/* rol dst_reg, 32 */
+				maybe_emit_1mod(&prog, dst_reg, true);
+				EMIT3(0xC1, add_1reg(0xC0, dst_reg), 32);
+
+				/* xor r11, r11 */
+				EMIT3(0x4D, 0x31, 0xDB);
+
+				/* test dst_reg32, dst_reg32; check if lower 32-bit are zero */
+				maybe_emit_mod(&prog, dst_reg, dst_reg, false);
+				EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
+
+				/* cmove r11, dst_reg; if so, set dst_reg to zero */
+				/* WARNING: Intel swapped src/dst register encoding in CMOVcc !!! */
+				maybe_emit_mod(&prog, AUX_REG, dst_reg, true);
+				EMIT3(0x0F, 0x44, add_2reg(0xC0, AUX_REG, dst_reg));
+				break;
+			}
+			fallthrough;
 		case BPF_ALU | BPF_MOV | BPF_X:
 			if (insn->off == 0)
 				emit_mov_reg(&prog,
@@ -3424,6 +3458,11 @@ void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
 	}
 }
 
+bool bpf_jit_supports_arena(void)
+{
+	return true;
+}
+
 bool bpf_jit_supports_ptr_xchg(void)
 {
 	return true;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cd76d43412d0..78ea63002531 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -959,6 +959,7 @@ bool bpf_jit_supports_kfunc_call(void);
 bool bpf_jit_supports_far_kfunc_call(void);
 bool bpf_jit_supports_exceptions(void);
 bool bpf_jit_supports_ptr_xchg(void);
+bool bpf_jit_supports_arena(void);
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
 bool bpf_helper_changes_pkt_data(void *func);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2539d9bfe369..2829077f0461 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2926,6 +2926,11 @@ bool __weak bpf_jit_supports_far_kfunc_call(void)
 	return false;
 }
 
+bool __weak bpf_jit_supports_arena(void)
+{
+	return false;
+}
+
 /* Return TRUE if the JIT backend satisfies the following two conditions:
  * 1) JIT backend supports atomic_xchg() on pointer-sized words.
  * 2) Under the specific arch, the implementation of xchg() is the same
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (7 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  1:13   ` Eduard Zingerman
  2024-02-09  4:05 ` [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

rX = bpf_cast_kern(rY, addr_space) tells the verifier that rX->type = PTR_TO_ARENA.
Any further operations on a PTR_TO_ARENA register have to be in the 32-bit domain.

The verifier will mark loads/stores through PTR_TO_ARENA with PROBE_MEM32.
The JIT will generate them as kern_vm_start + 32bit_addr memory accesses.

rX = bpf_cast_user(rY, addr_space) tells the verifier that rX->type = unknown scalar.
If arena->map_flags has BPF_F_NO_USER_CONV set, then cast_user is converted to a mov32 as well.
Otherwise JIT will convert it to:
  rX = (u32)rY;
  if (rX)
     rX |= arena->user_vm_start & ~(u64)~0U;

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h          |   1 +
 include/linux/bpf_verifier.h |   1 +
 kernel/bpf/log.c             |   3 ++
 kernel/bpf/verifier.c        | 102 ++++++++++++++++++++++++++++++++---
 4 files changed, 100 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 26419a57bf9f..70d5351427e6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -889,6 +889,7 @@ enum bpf_reg_type {
 	 * an explicit null check is required for this struct.
 	 */
 	PTR_TO_MEM,		 /* reg points to valid memory region */
+	PTR_TO_ARENA,
 	PTR_TO_BUF,		 /* reg points to a read/write buffer */
 	PTR_TO_FUNC,		 /* reg points to a bpf program function */
 	CONST_PTR_TO_DYNPTR,	 /* reg points to a const struct bpf_dynptr */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 84365e6dd85d..43c95e3e2a3c 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -547,6 +547,7 @@ struct bpf_insn_aux_data {
 	u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
 	bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */
 	bool zext_dst; /* this insn zero extends dst reg */
+	bool needs_zext; /* alu op needs to clear upper bits */
 	bool storage_get_func_atomic; /* bpf_*_storage_get() with atomic memory alloc */
 	bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
 	bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
index 594a234f122b..677076c760ff 100644
--- a/kernel/bpf/log.c
+++ b/kernel/bpf/log.c
@@ -416,6 +416,7 @@ const char *reg_type_str(struct bpf_verifier_env *env, enum bpf_reg_type type)
 		[PTR_TO_XDP_SOCK]	= "xdp_sock",
 		[PTR_TO_BTF_ID]		= "ptr_",
 		[PTR_TO_MEM]		= "mem",
+		[PTR_TO_ARENA]		= "arena",
 		[PTR_TO_BUF]		= "buf",
 		[PTR_TO_FUNC]		= "func",
 		[PTR_TO_MAP_KEY]	= "map_key",
@@ -651,6 +652,8 @@ static void print_reg_state(struct bpf_verifier_env *env,
 	}
 
 	verbose(env, "%s", reg_type_str(env, t));
+	if (t == PTR_TO_ARENA)
+		return;
 	if (t == PTR_TO_STACK) {
 		if (state->frameno != reg->frameno)
 			verbose(env, "[%d]", reg->frameno);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c77a3ab1192..5eeb9bf7e324 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4370,6 +4370,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_MEM:
 	case PTR_TO_FUNC:
 	case PTR_TO_MAP_KEY:
+	case PTR_TO_ARENA:
 		return true;
 	default:
 		return false;
@@ -5805,6 +5806,8 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_XDP_SOCK:
 		pointer_desc = "xdp_sock ";
 		break;
+	case PTR_TO_ARENA:
+		return 0;
 	default:
 		break;
 	}
@@ -6906,6 +6909,9 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 
 		if (!err && value_regno >= 0 && (rdonly_mem || t == BPF_READ))
 			mark_reg_unknown(env, regs, value_regno);
+	} else if (reg->type == PTR_TO_ARENA) {
+		if (t == BPF_READ && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
 	} else {
 		verbose(env, "R%d invalid mem access '%s'\n", regno,
 			reg_type_str(env, reg->type));
@@ -8377,6 +8383,7 @@ static int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	case PTR_TO_MEM | MEM_RINGBUF:
 	case PTR_TO_BUF:
 	case PTR_TO_BUF | MEM_RDONLY:
+	case PTR_TO_ARENA:
 	case SCALAR_VALUE:
 		return 0;
 	/* All the rest must be rejected, except PTR_TO_BTF_ID which allows
@@ -13837,6 +13844,21 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 
 	dst_reg = &regs[insn->dst_reg];
 	src_reg = NULL;
+
+	if (dst_reg->type == PTR_TO_ARENA) {
+		struct bpf_insn_aux_data *aux = cur_aux(env);
+
+		if (BPF_CLASS(insn->code) == BPF_ALU64)
+			/*
+			 * 32-bit operations zero upper bits automatically.
+			 * 64-bit operations need to be converted to 32.
+			 */
+			aux->needs_zext = true;
+
+		/* Any arithmetic operations are allowed on arena pointers */
+		return 0;
+	}
+
 	if (dst_reg->type != SCALAR_VALUE)
 		ptr_reg = dst_reg;
 	else
@@ -13954,16 +13976,17 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	} else if (opcode == BPF_MOV) {
 
 		if (BPF_SRC(insn->code) == BPF_X) {
-			if (insn->imm != 0) {
-				verbose(env, "BPF_MOV uses reserved fields\n");
-				return -EINVAL;
-			}
-
 			if (BPF_CLASS(insn->code) == BPF_ALU) {
-				if (insn->off != 0 && insn->off != 8 && insn->off != 16) {
+				if ((insn->off != 0 && insn->off != 8 && insn->off != 16) ||
+				    insn->imm) {
 					verbose(env, "BPF_MOV uses reserved fields\n");
 					return -EINVAL;
 				}
+			} else if (insn->off == BPF_ARENA_CAST_KERN || insn->off == BPF_ARENA_CAST_USER) {
+				if (!insn->imm) {
+					verbose(env, "cast_kern/user insn must have non zero imm32\n");
+					return -EINVAL;
+				}
 			} else {
 				if (insn->off != 0 && insn->off != 8 && insn->off != 16 &&
 				    insn->off != 32) {
@@ -13993,7 +14016,12 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 			struct bpf_reg_state *dst_reg = regs + insn->dst_reg;
 
 			if (BPF_CLASS(insn->code) == BPF_ALU64) {
-				if (insn->off == 0) {
+				if (insn->imm) {
+					/* off == BPF_ARENA_CAST_KERN || off == BPF_ARENA_CAST_USER */
+					mark_reg_unknown(env, regs, insn->dst_reg);
+					if (insn->off == BPF_ARENA_CAST_KERN)
+						dst_reg->type = PTR_TO_ARENA;
+				} else if (insn->off == 0) {
 					/* case: R1 = R2
 					 * copy register state to dest reg
 					 */
@@ -14059,6 +14087,9 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 						dst_reg->subreg_def = env->insn_idx + 1;
 						coerce_subreg_to_size_sx(dst_reg, insn->off >> 3);
 					}
+				} else if (src_reg->type == PTR_TO_ARENA) {
+					mark_reg_unknown(env, regs, insn->dst_reg);
+					dst_reg->type = PTR_TO_ARENA;
 				} else {
 					mark_reg_unknown(env, regs,
 							 insn->dst_reg);
@@ -15142,6 +15173,10 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
 
 	if (insn->src_reg == BPF_PSEUDO_MAP_VALUE ||
 	    insn->src_reg == BPF_PSEUDO_MAP_IDX_VALUE) {
+		if (map->map_type == BPF_MAP_TYPE_ARENA) {
+			__mark_reg_unknown(env, dst_reg);
+			return 0;
+		}
 		dst_reg->type = PTR_TO_MAP_VALUE;
 		dst_reg->off = aux->map_off;
 		WARN_ON_ONCE(map->max_entries != 1);
@@ -16519,6 +16554,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
 		 * the same stack frame, since fp-8 in foo != fp-8 in bar
 		 */
 		return regs_exact(rold, rcur, idmap) && rold->frameno == rcur->frameno;
+	case PTR_TO_ARENA:
+		return true;
 	default:
 		return regs_exact(rold, rcur, idmap);
 	}
@@ -18235,6 +18272,31 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
 				fdput(f);
 				return -EBUSY;
 			}
+			if (map->map_type == BPF_MAP_TYPE_ARENA) {
+				if (env->prog->aux->arena) {
+					verbose(env, "Only one arena per program\n");
+					fdput(f);
+					return -EBUSY;
+				}
+				if (!env->allow_ptr_leaks || !env->bpf_capable) {
+					verbose(env, "CAP_BPF and CAP_PERFMON are required to use arena\n");
+					fdput(f);
+					return -EPERM;
+				}
+				if (!env->prog->jit_requested) {
+					verbose(env, "JIT is required to use arena\n");
+					return -EOPNOTSUPP;
+				}
+				if (!bpf_jit_supports_arena()) {
+					verbose(env, "JIT doesn't support arena\n");
+					return -EOPNOTSUPP;
+				}
+				env->prog->aux->arena = (void *)map;
+				if (!bpf_arena_get_user_vm_start(env->prog->aux->arena)) {
+					verbose(env, "arena's user address must be set via map_extra or mmap()\n");
+					return -EINVAL;
+				}
+			}
 
 			fdput(f);
 next_insn:
@@ -18799,6 +18861,18 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			   insn->code == (BPF_ST | BPF_MEM | BPF_W) ||
 			   insn->code == (BPF_ST | BPF_MEM | BPF_DW)) {
 			type = BPF_WRITE;
+		} else if (insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) && insn->imm) {
+			if (insn->off == BPF_ARENA_CAST_KERN ||
+			    (((struct bpf_map *)env->prog->aux->arena)->map_flags & BPF_F_NO_USER_CONV)) {
+				/* convert to 32-bit mov that clears upper 32-bit */
+				insn->code = BPF_ALU | BPF_MOV | BPF_X;
+				/* clear off, so it's a normal 'wX = wY' from JIT pov */
+				insn->off = 0;
+			} /* else insn->off == BPF_ARENA_CAST_USER should be handled by JIT */
+			continue;
+		} else if (env->insn_aux_data[i + delta].needs_zext) {
+			/* Convert BPF_CLASS(insn->code) == BPF_ALU64 to 32-bit ALU */
+			insn->code = BPF_ALU | BPF_OP(insn->code) | BPF_SRC(insn->code);
 		} else {
 			continue;
 		}
@@ -18856,6 +18930,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 				env->prog->aux->num_exentries++;
 			}
 			continue;
+		case PTR_TO_ARENA:
+			if (BPF_MODE(insn->code) == BPF_MEMSX) {
+				verbose(env, "sign extending loads from arena are not supported yet\n");
+				return -EOPNOTSUPP;
+			}
+			insn->code = BPF_CLASS(insn->code) | BPF_PROBE_MEM32 | BPF_SIZE(insn->code);
+			env->prog->aux->num_exentries++;
+			continue;
 		default:
 			continue;
 		}
@@ -19041,13 +19123,19 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->aux->nr_linfo = prog->aux->nr_linfo;
 		func[i]->aux->jited_linfo = prog->aux->jited_linfo;
 		func[i]->aux->linfo_idx = env->subprog_info[i].linfo_idx;
+		func[i]->aux->arena = prog->aux->arena;
 		num_exentries = 0;
 		insn = func[i]->insnsi;
 		for (j = 0; j < func[i]->len; j++, insn++) {
 			if (BPF_CLASS(insn->code) == BPF_LDX &&
 			    (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
+			     BPF_MODE(insn->code) == BPF_PROBE_MEM32 ||
 			     BPF_MODE(insn->code) == BPF_PROBE_MEMSX))
 				num_exentries++;
+			if ((BPF_CLASS(insn->code) == BPF_STX ||
+			     BPF_CLASS(insn->code) == BPF_ST) &&
+			     BPF_MODE(insn->code) == BPF_PROBE_MEM32)
+				num_exentries++;
 		}
 		func[i]->aux->num_exentries = num_exentries;
 		func[i]->aux->tail_call_reachable = env->subprog_info[i].tail_call_reachable;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (8 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
  2024-02-13 23:14   ` Andrii Nakryiko
  2024-02-09  4:05 ` [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.

Note, when the verifier sees:

__weak void foo(struct bar *p)

it recognizes 'p' as PTR_TO_MEM, and 'struct bar' has to be a struct consisting of scalars.
Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".
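
A hypothetical usage sketch (the struct and function names are made up;
__arg_arena is added to bpf_helpers.h in the next patch and __arena stands for
the address_space(1) attribute used by the selftests later in the series):

  struct bar { int val; };

  __weak int process_bar(struct bar __arena *p __arg_arena)
  {
          /* the caller may pass either a PTR_TO_ARENA or a scalar */
          return p ? p->val : 0;
  }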

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/btf.c      | 19 +++++++++++++++----
 kernel/bpf/verifier.c | 15 +++++++++++++++
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 70d5351427e6..46a92e41b9d5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -718,6 +718,7 @@ enum bpf_arg_type {
 	 * on eBPF program stack
 	 */
 	ARG_PTR_TO_MEM,		/* pointer to valid memory (stack, packet, map value) */
+	ARG_PTR_TO_ARENA,
 
 	ARG_CONST_SIZE,		/* number of bytes accessed from memory */
 	ARG_CONST_SIZE_OR_ZERO,	/* number of bytes accessed from memory or 0 */
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8e06d29961f1..857059c8d56c 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -7053,10 +7053,11 @@ static int btf_get_ptr_to_btf_id(struct bpf_verifier_log *log, int arg_idx,
 }
 
 enum btf_arg_tag {
-	ARG_TAG_CTX = 0x1,
-	ARG_TAG_NONNULL = 0x2,
-	ARG_TAG_TRUSTED = 0x4,
-	ARG_TAG_NULLABLE = 0x8,
+	ARG_TAG_CTX	 = BIT_ULL(0),
+	ARG_TAG_NONNULL  = BIT_ULL(1),
+	ARG_TAG_TRUSTED  = BIT_ULL(2),
+	ARG_TAG_NULLABLE = BIT_ULL(3),
+	ARG_TAG_ARENA	 = BIT_ULL(4),
 };
 
 /* Process BTF of a function to produce high-level expectation of function
@@ -7168,6 +7169,8 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog)
 				tags |= ARG_TAG_NONNULL;
 			} else if (strcmp(tag, "nullable") == 0) {
 				tags |= ARG_TAG_NULLABLE;
+			} else if (strcmp(tag, "arena") == 0) {
+				tags |= ARG_TAG_ARENA;
 			} else {
 				bpf_log(log, "arg#%d has unsupported set of tags\n", i);
 				return -EOPNOTSUPP;
@@ -7222,6 +7225,14 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog)
 			sub->args[i].btf_id = kern_type_id;
 			continue;
 		}
+		if (tags & ARG_TAG_ARENA) {
+			if (tags & ~ARG_TAG_ARENA) {
+				bpf_log(log, "arg#%d arena cannot be combined with any other tags\n", i);
+				return -EINVAL;
+			}
+			sub->args[i].arg_type = ARG_PTR_TO_ARENA;
+			continue;
+		}
 		if (is_global) { /* generic user data pointer */
 			u32 mem_size;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5eeb9bf7e324..fa49602194d5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9348,6 +9348,18 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
 				bpf_log(log, "arg#%d is expected to be non-NULL\n", i);
 				return -EINVAL;
 			}
+		} else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
+			/*
+			 * Can pass any value and the kernel won't crash, but
+			 * only PTR_TO_ARENA or SCALAR make sense. Everything
+			 * else is a bug in the bpf program. Point it out to
+			 * the user at verification time instead of a
+			 * run-time debug nightmare.
+			 */
+			if (reg->type != PTR_TO_ARENA && reg->type != SCALAR_VALUE) {
+				bpf_log(log, "R%d is not a pointer to arena or scalar.\n", regno);
+				return -EINVAL;
+			}
 		} else if (arg->arg_type == (ARG_PTR_TO_DYNPTR | MEM_RDONLY)) {
 			ret = process_dynptr_func(env, regno, -1, arg->arg_type, 0);
 			if (ret)
@@ -20329,6 +20341,9 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
 				reg->btf = bpf_get_btf_vmlinux(); /* can't fail at this point */
 				reg->btf_id = arg->btf_id;
 				reg->id = ++env->id_gen;
+			} else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
+				/* caller can pass either PTR_TO_ARENA or SCALAR */
+				mark_reg_unknown(env, regs, i);
 			} else {
 				WARN_ONCE(1, "BUG: unhandled arg#%d type %d\n",
 					  i - BPF_REG_1, arg->arg_type);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (9 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
@ 2024-02-09  4:05 ` Alexei Starovoitov
  2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
  2024-02-13 23:14   ` Andrii Nakryiko
  2024-02-09  4:06 ` [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena Alexei Starovoitov
                   ` (10 subsequent siblings)
  21 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:05 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add __arg_arena to bpf_helpers.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/bpf_helpers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 79eaa581be98..9c777c21da28 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -192,6 +192,7 @@ enum libbpf_tristate {
 #define __arg_nonnull __attribute((btf_decl_tag("arg:nonnull")))
 #define __arg_nullable __attribute((btf_decl_tag("arg:nullable")))
 #define __arg_trusted __attribute((btf_decl_tag("arg:trusted")))
+#define __arg_arena __attribute((btf_decl_tag("arg:arena")))
 
 #ifndef ___bpf_concat
 #define ___bpf_concat(a, b) a ## b
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (10 preceding siblings ...)
  2024-02-09  4:05 ` [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-10  7:16   ` Kumar Kartikeya Dwivedi
                     ` (2 more replies)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
                   ` (9 subsequent siblings)
  21 siblings, 3 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

mmap() bpf_arena right after creation, since the kernel needs to
remember the address returned from mmap. This is user_vm_start.
LLVM will generate bpf_arena_cast_user() instructions where
necessary and the JIT will add the upper 32 bits of user_vm_start
to such pointers.

Fix up bpf_map_mmap_sz() to compute mmap size as
map->value_size * map->max_entries for arrays and
PAGE_SIZE * map->max_entries for arena.

Don't set BTF at arena creation time, since the arena map type doesn't support it.
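
For context, a bpf-side arena map definition that this libbpf support enables
might look like the sketch below (field values are illustrative; with map_extra
left at 0 the mmap() above picks the user address):

  struct {
          __uint(type, BPF_MAP_TYPE_ARENA);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 100); /* number of pages, not elements */
  } arena SEC(".maps");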

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/libbpf.c        | 43 ++++++++++++++++++++++++++++++-----
 tools/lib/bpf/libbpf_probes.c |  7 ++++++
 2 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 01f407591a92..4880d623098d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_BLOOM_FILTER]		= "bloom_filter",
 	[BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
 	[BPF_MAP_TYPE_CGRP_STORAGE]		= "cgrp_storage",
+	[BPF_MAP_TYPE_ARENA]			= "arena",
 };
 
 static const char * const prog_type_name[] = {
@@ -1577,7 +1578,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
 	return map;
 }
 
-static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
+static size_t __bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
 {
 	const long page_sz = sysconf(_SC_PAGE_SIZE);
 	size_t map_sz;
@@ -1587,6 +1588,20 @@ static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
 	return map_sz;
 }
 
+static size_t bpf_map_mmap_sz(const struct bpf_map *map)
+{
+	const long page_sz = sysconf(_SC_PAGE_SIZE);
+
+	switch (map->def.type) {
+	case BPF_MAP_TYPE_ARRAY:
+		return __bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
+	case BPF_MAP_TYPE_ARENA:
+		return page_sz * map->def.max_entries;
+	default:
+		return 0; /* not supported */
+	}
+}
+
 static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
 {
 	void *mmaped;
@@ -1740,7 +1755,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
 		 map->name, map->sec_idx, map->sec_offset, def->map_flags);
 
-	mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
+	mmap_sz = bpf_map_mmap_sz(map);
 	map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
 			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 	if (map->mmaped == MAP_FAILED) {
@@ -4852,6 +4867,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_QUEUE:
 	case BPF_MAP_TYPE_STACK:
+	case BPF_MAP_TYPE_ARENA:
 		create_attr.btf_fd = 0;
 		create_attr.btf_key_type_id = 0;
 		create_attr.btf_value_type_id = 0;
@@ -4908,6 +4924,21 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 	if (map->fd == map_fd)
 		return 0;
 
+	if (def->type == BPF_MAP_TYPE_ARENA) {
+		map->mmaped = mmap((void *)map->map_extra, bpf_map_mmap_sz(map),
+				   PROT_READ | PROT_WRITE,
+				   map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
+				   map_fd, 0);
+		if (map->mmaped == MAP_FAILED) {
+			err = -errno;
+			map->mmaped = NULL;
+			close(map_fd);
+			pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
+				bpf_map__name(map), err);
+			return err;
+		}
+	}
+
 	/* Keep placeholder FD value but now point it to the BPF map object.
 	 * This way everything that relied on this map's FD (e.g., relocated
 	 * ldimm64 instructions) will stay valid and won't need adjustments.
@@ -8582,7 +8613,7 @@ static void bpf_map__destroy(struct bpf_map *map)
 	if (map->mmaped) {
 		size_t mmap_sz;
 
-		mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
+		mmap_sz = bpf_map_mmap_sz(map);
 		munmap(map->mmaped, mmap_sz);
 		map->mmaped = NULL;
 	}
@@ -9830,8 +9861,8 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
 		int err;
 		size_t mmap_old_sz, mmap_new_sz;
 
-		mmap_old_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
-		mmap_new_sz = bpf_map_mmap_sz(size, map->def.max_entries);
+		mmap_old_sz = bpf_map_mmap_sz(map);
+		mmap_new_sz = __bpf_map_mmap_sz(size, map->def.max_entries);
 		err = bpf_map_mmap_resize(map, mmap_old_sz, mmap_new_sz);
 		if (err) {
 			pr_warn("map '%s': failed to resize memory-mapped region: %d\n",
@@ -13356,7 +13387,7 @@ int bpf_object__load_skeleton(struct bpf_object_skeleton *s)
 
 	for (i = 0; i < s->map_cnt; i++) {
 		struct bpf_map *map = *s->maps[i].map;
-		size_t mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
+		size_t mmap_sz = bpf_map_mmap_sz(map);
 		int prot, map_fd = map->fd;
 		void **mmaped = s->maps[i].mmaped;
 
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index ee9b1dbea9eb..302188122439 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -338,6 +338,13 @@ static int probe_map_create(enum bpf_map_type map_type)
 		key_size = 0;
 		max_entries = 1;
 		break;
+	case BPF_MAP_TYPE_ARENA:
+		key_size	= 0;
+		value_size	= 0;
+		max_entries	= 1; /* one page */
+		opts.map_extra	= 0; /* can mmap() at any address */
+		opts.map_flags	= BPF_F_MMAPABLE;
+		break;
 	case BPF_MAP_TYPE_HASH:
 	case BPF_MAP_TYPE_ARRAY:
 	case BPF_MAP_TYPE_PROG_ARRAY:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (11 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-12 18:58   ` Eduard Zingerman
  2024-02-13 23:15   ` Andrii Nakryiko
  2024-02-09  4:06 ` [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables Alexei Starovoitov
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

The __uint() macro that is used to specify map attributes like:
  __uint(type, BPF_MAP_TYPE_ARRAY);
  __uint(map_flags, BPF_F_MMAPABLE);
is limited to 32-bit values, since BTF_KIND_ARRAY has a u32 "number of elements" field.

Introduce __ulong() macro that allows specifying values bigger than 32-bit.
In the map definition, "map_extra" is the only u64 field.
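
A usage sketch (the address is an arbitrary example, mirroring the selftests
later in the series):

  struct {
          __uint(type, BPF_MAP_TYPE_ARENA);
          __uint(map_flags, BPF_F_MMAPABLE);
          __uint(max_entries, 1000);      /* number of pages */
          __ulong(map_extra, 2ull << 44); /* start of mmap() region */
  } arena SEC(".maps");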

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/bpf_helpers.h |  5 +++++
 tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 9c777c21da28..0aeac8ea7af2 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -13,6 +13,11 @@
 #define __uint(name, val) int (*name)[val]
 #define __type(name, val) typeof(val) *name
 #define __array(name, val) typeof(val) *name[]
+#ifndef __PASTE
+#define ___PASTE(a,b) a##b
+#define __PASTE(a,b) ___PASTE(a,b)
+#endif
+#define __ulong(name, val) enum { __PASTE(__unique_value, __COUNTER__) = val } name
 
 /*
  * Helper macro to place programs, maps, license in
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4880d623098d..f8158e250327 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2243,6 +2243,39 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf,
 	return true;
 }
 
+static bool get_map_field_long(const char *map_name, const struct btf *btf,
+			       const struct btf_member *m, __u64 *res)
+{
+	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
+	const char *name = btf__name_by_offset(btf, m->name_off);
+
+	if (btf_is_ptr(t))
+		return false;
+
+	if (!btf_is_enum(t) && !btf_is_enum64(t)) {
+		pr_warn("map '%s': attr '%s': expected enum or enum64, got %s.\n",
+			map_name, name, btf_kind_str(t));
+		return false;
+	}
+
+	if (btf_vlen(t) != 1) {
+		pr_warn("map '%s': attr '%s': invalid __ulong\n",
+			map_name, name);
+		return false;
+	}
+
+	if (btf_is_enum(t)) {
+		const struct btf_enum *e = btf_enum(t);
+
+		*res = e->val;
+	} else {
+		const struct btf_enum64 *e = btf_enum64(t);
+
+		*res = btf_enum64_value(e);
+	}
+	return true;
+}
+
 static int pathname_concat(char *buf, size_t buf_sz, const char *path, const char *name)
 {
 	int len;
@@ -2476,10 +2509,15 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
 			map_def->pinning = val;
 			map_def->parts |= MAP_DEF_PINNING;
 		} else if (strcmp(name, "map_extra") == 0) {
-			__u32 map_extra;
+			__u64 map_extra;
 
-			if (!get_map_field_int(map_name, btf, m, &map_extra))
-				return -EINVAL;
+			if (!get_map_field_long(map_name, btf, m, &map_extra)) {
+				__u32 map_extra_u32;
+
+				if (!get_map_field_int(map_name, btf, m, &map_extra_u32))
+					return -EINVAL;
+				map_extra = map_extra_u32;
+			}
 			map_def->map_extra = map_extra;
 			map_def->parts |= MAP_DEF_MAP_EXTRA;
 		} else {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (12 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-13  0:34   ` Eduard Zingerman
                     ` (2 more replies)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 15/20] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
                   ` (7 subsequent siblings)
  21 siblings, 3 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

LLVM automatically places __arena variables into the ".arena.1" ELF section.
When libbpf sees such a section it creates an internal 'struct bpf_map' LIBBPF_MAP_ARENA
that is connected to the actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
They share the same kernel-side bpf map and a single map_fd.
Both are emitted into the skeleton: the real arena with the name given by the bpf
program in SEC(".maps") and another one named "__arena_internal".
All global variables from the ".arena.1" section are accessible from user space
via skel->arena->name_of_var.

For bss/data/rodata the skeleton/libbpf perform the following sequence:
1. addr = mmap(MAP_ANONYMOUS)
2. user space optionally modifies global vars
3. map_fd = bpf_create_map()
4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
5. mmap(addr, MAP_FIXED, map_fd)
after step 5 user space sees the values it wrote at step 2 at the same addresses

arena doesn't support update_map_elem. Hence skeleton/libbpf do:
1. addr = mmap(MAP_ANONYMOUS)
2. user space optionally modifies global vars
3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
5. memcpy(real_addr, addr) // this will fault-in and allocate pages
6. munmap(addr)

In the end, the look and feel of global data vs __arena global data is the same from the bpf prog's pov.

Another complication is:
struct {
  __uint(type, BPF_MAP_TYPE_ARENA);
} arena SEC(".maps");

int __arena foo;
int bar;

  ptr1 = &foo;   // relocation against ".arena.1" section
  ptr2 = &arena; // relocation against ".maps" section
  ptr3 = &bar;   // relocation against ".bss" section

For the kernel, ptr1 and ptr2 point to the same arena's map_fd,
while ptr3 points to a different global array's map_fd.
For the verifier:
ptr1->type == unknown_scalar
ptr2->type == const_ptr_to_map
ptr3->type == ptr_to_map_value

After the verifier, and for the JIT, all 3 pointers are normal ld_imm64 insns.
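
A hypothetical user-space snippet showing the resulting skeleton access (the
skeleton and variable names are made up):

  struct foo_bpf *skel = foo_bpf__open_and_load();

  if (skel) {
          skel->arena->foo = 1; /* 'int __arena foo;' lives in the arena */
          skel->bss->bar = 2;   /* 'int bar;' is regular global data */
  }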

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/bpf/bpftool/gen.c |  13 ++++-
 tools/lib/bpf/libbpf.c  | 102 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 101 insertions(+), 14 deletions(-)

diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c
index a9334c57e859..74fabbdbad2b 100644
--- a/tools/bpf/bpftool/gen.c
+++ b/tools/bpf/bpftool/gen.c
@@ -82,7 +82,7 @@ static bool get_map_ident(const struct bpf_map *map, char *buf, size_t buf_sz)
 	const char *name = bpf_map__name(map);
 	int i, n;
 
-	if (!bpf_map__is_internal(map)) {
+	if (!bpf_map__is_internal(map) || bpf_map__type(map) == BPF_MAP_TYPE_ARENA) {
 		snprintf(buf, buf_sz, "%s", name);
 		return true;
 	}
@@ -106,6 +106,12 @@ static bool get_datasec_ident(const char *sec_name, char *buf, size_t buf_sz)
 	static const char *pfxs[] = { ".data", ".rodata", ".bss", ".kconfig" };
 	int i, n;
 
+	/* recognize hard coded LLVM section name */
+	if (strcmp(sec_name, ".arena.1") == 0) {
+		/* this is the name to use in skeleton */
+		strncpy(buf, "arena", buf_sz);
+		return true;
+	}
 	for  (i = 0, n = ARRAY_SIZE(pfxs); i < n; i++) {
 		const char *pfx = pfxs[i];
 
@@ -239,6 +245,11 @@ static bool is_internal_mmapable_map(const struct bpf_map *map, char *buf, size_
 	if (!bpf_map__is_internal(map) || !(bpf_map__map_flags(map) & BPF_F_MMAPABLE))
 		return false;
 
+	if (bpf_map__type(map) == BPF_MAP_TYPE_ARENA) {
+		strncpy(buf, "arena", sz);
+		return true;
+	}
+
 	if (!get_map_ident(map, buf, sz))
 		return false;
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index f8158e250327..d5364280a06c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -498,6 +498,7 @@ struct bpf_struct_ops {
 #define KSYMS_SEC ".ksyms"
 #define STRUCT_OPS_SEC ".struct_ops"
 #define STRUCT_OPS_LINK_SEC ".struct_ops.link"
+#define ARENA_SEC ".arena.1"
 
 enum libbpf_map_type {
 	LIBBPF_MAP_UNSPEC,
@@ -505,6 +506,7 @@ enum libbpf_map_type {
 	LIBBPF_MAP_BSS,
 	LIBBPF_MAP_RODATA,
 	LIBBPF_MAP_KCONFIG,
+	LIBBPF_MAP_ARENA,
 };
 
 struct bpf_map_def {
@@ -547,6 +549,7 @@ struct bpf_map {
 	bool reused;
 	bool autocreate;
 	__u64 map_extra;
+	struct bpf_map *arena;
 };
 
 enum extern_type {
@@ -613,6 +616,7 @@ enum sec_type {
 	SEC_BSS,
 	SEC_DATA,
 	SEC_RODATA,
+	SEC_ARENA,
 };
 
 struct elf_sec_desc {
@@ -1718,10 +1722,34 @@ static int
 bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 			      const char *real_name, int sec_idx, void *data, size_t data_sz)
 {
+	const long page_sz = sysconf(_SC_PAGE_SIZE);
+	struct bpf_map *map, *arena = NULL;
 	struct bpf_map_def *def;
-	struct bpf_map *map;
 	size_t mmap_sz;
-	int err;
+	int err, i;
+
+	if (type == LIBBPF_MAP_ARENA) {
+		for (i = 0; i < obj->nr_maps; i++) {
+			map = &obj->maps[i];
+			if (map->def.type != BPF_MAP_TYPE_ARENA)
+				continue;
+			arena = map;
+			real_name = "__arena_internal";
+			mmap_sz = bpf_map_mmap_sz(map);
+			if (roundup(data_sz, page_sz) > mmap_sz) {
+				pr_warn("Declared arena map size %zd is too small to hold "
+					"global __arena variables of size %zd\n",
+					mmap_sz, data_sz);
+				return -E2BIG;
+			}
+			break;
+		}
+		if (!arena) {
+			pr_warn("To use global __arena variables the arena map should "
+				"be declared explicitly in SEC(\".maps\")\n");
+			return -ENOENT;
+		}
+	}
 
 	map = bpf_object__add_map(obj);
 	if (IS_ERR(map))
@@ -1732,6 +1760,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 	map->sec_offset = 0;
 	map->real_name = strdup(real_name);
 	map->name = internal_map_name(obj, real_name);
+	map->arena = arena;
 	if (!map->real_name || !map->name) {
 		zfree(&map->real_name);
 		zfree(&map->name);
@@ -1739,18 +1768,32 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
 	}
 
 	def = &map->def;
-	def->type = BPF_MAP_TYPE_ARRAY;
-	def->key_size = sizeof(int);
-	def->value_size = data_sz;
-	def->max_entries = 1;
-	def->map_flags = type == LIBBPF_MAP_RODATA || type == LIBBPF_MAP_KCONFIG
-			 ? BPF_F_RDONLY_PROG : 0;
+	if (type == LIBBPF_MAP_ARENA) {
+		/* bpf_object will contain two arena maps:
+		 * LIBBPF_MAP_ARENA & BPF_MAP_TYPE_ARENA
+		 * and
+		 * LIBBPF_MAP_UNSPEC & BPF_MAP_TYPE_ARENA.
+		 * The former's map->arena will point to the latter.
+		 */
+		def->type = BPF_MAP_TYPE_ARENA;
+		def->key_size = 0;
+		def->value_size = 0;
+		def->max_entries = roundup(data_sz, page_sz) / page_sz;
+		def->map_flags = BPF_F_MMAPABLE;
+	} else {
+		def->type = BPF_MAP_TYPE_ARRAY;
+		def->key_size = sizeof(int);
+		def->value_size = data_sz;
+		def->max_entries = 1;
+		def->map_flags = type == LIBBPF_MAP_RODATA || type == LIBBPF_MAP_KCONFIG
+			? BPF_F_RDONLY_PROG : 0;
 
-	/* failures are fine because of maps like .rodata.str1.1 */
-	(void) map_fill_btf_type_info(obj, map);
+		/* failures are fine because of maps like .rodata.str1.1 */
+		(void) map_fill_btf_type_info(obj, map);
 
-	if (map_is_mmapable(obj, map))
-		def->map_flags |= BPF_F_MMAPABLE;
+		if (map_is_mmapable(obj, map))
+			def->map_flags |= BPF_F_MMAPABLE;
+	}
 
 	pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
 		 map->name, map->sec_idx, map->sec_offset, def->map_flags);
@@ -1814,6 +1857,13 @@ static int bpf_object__init_global_data_maps(struct bpf_object *obj)
 							    NULL,
 							    sec_desc->data->d_size);
 			break;
+		case SEC_ARENA:
+			sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, sec_idx));
+			err = bpf_object__init_internal_map(obj, LIBBPF_MAP_ARENA,
+							    sec_name, sec_idx,
+							    sec_desc->data->d_buf,
+							    sec_desc->data->d_size);
+			break;
 		default:
 			/* skip */
 			break;
@@ -3646,6 +3696,10 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 			} else if (strcmp(name, STRUCT_OPS_LINK_SEC) == 0) {
 				obj->efile.st_ops_link_data = data;
 				obj->efile.st_ops_link_shndx = idx;
+			} else if (strcmp(name, ARENA_SEC) == 0) {
+				sec_desc->sec_type = SEC_ARENA;
+				sec_desc->shdr = sh;
+				sec_desc->data = data;
 			} else {
 				pr_info("elf: skipping unrecognized data section(%d) %s\n",
 					idx, name);
@@ -4148,6 +4202,7 @@ static bool bpf_object__shndx_is_data(const struct bpf_object *obj,
 	case SEC_BSS:
 	case SEC_DATA:
 	case SEC_RODATA:
+	case SEC_ARENA:
 		return true;
 	default:
 		return false;
@@ -4173,6 +4228,8 @@ bpf_object__section_to_libbpf_map_type(const struct bpf_object *obj, int shndx)
 		return LIBBPF_MAP_DATA;
 	case SEC_RODATA:
 		return LIBBPF_MAP_RODATA;
+	case SEC_ARENA:
+		return LIBBPF_MAP_ARENA;
 	default:
 		return LIBBPF_MAP_UNSPEC;
 	}
@@ -4326,7 +4383,7 @@ static int bpf_program__record_reloc(struct bpf_program *prog,
 
 	reloc_desc->type = RELO_DATA;
 	reloc_desc->insn_idx = insn_idx;
-	reloc_desc->map_idx = map_idx;
+	reloc_desc->map_idx = map->arena ? map->arena - obj->maps : map_idx;
 	reloc_desc->sym_off = sym->st_value;
 	return 0;
 }
@@ -4813,6 +4870,9 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)
 			bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);
 		return 0;
 	}
+	if (map_type == LIBBPF_MAP_ARENA)
+		return 0;
+
 	err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);
 	if (err) {
 		err = -errno;
@@ -5119,6 +5179,15 @@ bpf_object__create_maps(struct bpf_object *obj)
 		if (bpf_map__is_internal(map) && !kernel_supports(obj, FEAT_GLOBAL_DATA))
 			map->autocreate = false;
 
+		if (map->libbpf_type == LIBBPF_MAP_ARENA) {
+			size_t len = bpf_map_mmap_sz(map);
+
+			memcpy(map->arena->mmaped, map->mmaped, len);
+			map->autocreate = false;
+			munmap(map->mmaped, len);
+			map->mmaped = NULL;
+		}
+
 		if (!map->autocreate) {
 			pr_debug("map '%s': skipped auto-creating...\n", map->name);
 			continue;
@@ -9735,6 +9804,8 @@ static bool map_uses_real_name(const struct bpf_map *map)
 		return true;
 	if (map->libbpf_type == LIBBPF_MAP_RODATA && strcmp(map->real_name, RODATA_SEC) != 0)
 		return true;
+	if (map->libbpf_type == LIBBPF_MAP_ARENA)
+		return true;
 	return false;
 }
 
@@ -13437,6 +13508,11 @@ int bpf_object__load_skeleton(struct bpf_object_skeleton *s)
 			continue;
 		}
 
+		if (map->arena) {
+			*mmaped = map->arena->mmaped;
+			continue;
+		}
+
 		if (map->def.map_flags & BPF_F_RDONLY_PROG)
 			prot = PROT_READ;
 		else
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 15/20] bpf: Tell bpf programs kernel's PAGE_SIZE
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (13 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-10  8:52   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:06 ` [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

vmlinux BTF includes all kernel enums.
Add a __PAGE_SIZE = PAGE_SIZE enum, so that bpf programs
that include vmlinux.h can easily access it.
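
A minimal usage sketch (the program name and section are illustrative):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  SEC("syscall")
  int get_page_size(void *ctx)
  {
          return __PAGE_SIZE; /* e.g. 4096 on x86-64 */
  }

  char _license[] SEC("license") = "GPL";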

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2829077f0461..3aa3f56a4310 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -88,13 +88,18 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns
 	return NULL;
 }
 
+/* tell bpf programs that include vmlinux.h kernel's PAGE_SIZE */
+enum page_size_enum {
+	__PAGE_SIZE = PAGE_SIZE
+};
+
 struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags)
 {
 	gfp_t gfp_flags = bpf_memcg_flags(GFP_KERNEL | __GFP_ZERO | gfp_extra_flags);
 	struct bpf_prog_aux *aux;
 	struct bpf_prog *fp;
 
-	size = round_up(size, PAGE_SIZE);
+	size = round_up(size, __PAGE_SIZE);
 	fp = __vmalloc(size, gfp_flags);
 	if (fp == NULL)
 		return NULL;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast()
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (14 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 15/20] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-10  8:54   ` Kumar Kartikeya Dwivedi
  2024-02-09  4:06 ` [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages Alexei Starovoitov
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce the helper macro bpf_arena_cast() that emits an
rX = rX
instruction with off = BPF_ARENA_CAST_KERN or off = BPF_ARENA_CAST_USER
and encodes the address_space into imm32.

It's useful with older LLVM that doesn't emit this insn automatically.
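
A rough usage sketch (hypothetical; it assumes an 'arena' map and the
bpf_arena_alloc_pages() kfunc from earlier patches, with __arena standing for
the address_space(1) attribute):

  int __arena *page;

  page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
  if (!page)
          return 0;
  bpf_arena_cast(page, BPF_ARENA_CAST_KERN, 1); /* retype the register as PTR_TO_ARENA */
  *page = 1;                                    /* the store becomes PROBE_MEM32 */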

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/bpf_experimental.h  | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 0d749006d107..e73b7d48439f 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -331,6 +331,47 @@ l_true:												\
 	asm volatile("%[reg]=%[reg]"::[reg]"r"((short)var))
 #endif
 
+/* emit instruction: rX=rX .off = mode .imm32 = address_space */
+#ifndef bpf_arena_cast
+#define bpf_arena_cast(var, mode, addr_space)	\
+	({					\
+	typeof(var) __var = var;		\
+	asm volatile(".byte 0xBF;		\
+		     .ifc %[reg], r0;		\
+		     .byte 0x00;		\
+		     .endif;			\
+		     .ifc %[reg], r1;		\
+		     .byte 0x11;		\
+		     .endif;			\
+		     .ifc %[reg], r2;		\
+		     .byte 0x22;		\
+		     .endif;			\
+		     .ifc %[reg], r3;		\
+		     .byte 0x33;		\
+		     .endif;			\
+		     .ifc %[reg], r4;		\
+		     .byte 0x44;		\
+		     .endif;			\
+		     .ifc %[reg], r5;		\
+		     .byte 0x55;		\
+		     .endif;			\
+		     .ifc %[reg], r6;		\
+		     .byte 0x66;		\
+		     .endif;			\
+		     .ifc %[reg], r7;		\
+		     .byte 0x77;		\
+		     .endif;			\
+		     .ifc %[reg], r8;		\
+		     .byte 0x88;		\
+		     .endif;			\
+		     .ifc %[reg], r9;		\
+		     .byte 0x99;		\
+		     .endif;			\
+		     .short %[off]; .long %[as]"	\
+		     :: [reg]"r"(__var), [off]"i"(mode), [as]"i"(addr_space)); __var; \
+	})
+#endif
+
 /* Description
  *	Assert that a conditional expression is true.
  * Returns
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (15 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-09 23:14   ` David Vernet
  2024-02-09  4:06 ` [PATCH v2 bpf-next 18/20] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Add unit tests for the bpf_arena_alloc/free_pages() functionality
and bpf_arena_common.h with a set of common helpers and macros that
are used in this test and the following patches.

Also modify test_loader, which didn't support running BPF_PROG_TYPE_SYSCALL
programs.
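
For reference, SEC("syscall") programs are driven via bpf_prog_test_run_opts();
a minimal user space sketch (only the libbpf API names are real, the rest is
illustrative):

  LIBBPF_OPTS(bpf_test_run_opts, opts);
  int err;

  err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
  /* err is the syscall result; opts.retval is the bpf program's return value */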

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/DENYLIST.aarch64  |  1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 .../testing/selftests/bpf/bpf_arena_common.h  | 70 ++++++++++++++
 .../selftests/bpf/prog_tests/verifier.c       |  2 +
 .../selftests/bpf/progs/verifier_arena.c      | 91 +++++++++++++++++++
 tools/testing/selftests/bpf/test_loader.c     |  9 +-
 6 files changed, 172 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_arena.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
index 5c2cc7e8c5d0..8e70af386e52 100644
--- a/tools/testing/selftests/bpf/DENYLIST.aarch64
+++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
@@ -11,3 +11,4 @@ fill_link_info/kprobe_multi_link_info            # bpf_program__attach_kprobe_mu
 fill_link_info/kretprobe_multi_link_info         # bpf_program__attach_kprobe_multi_opts unexpected error: -95
 fill_link_info/kprobe_multi_invalid_ubuff        # bpf_program__attach_kprobe_multi_opts unexpected error: -95
 missed/kprobe_recursion                          # missed_kprobe_recursion__attach unexpected error: -95 (errno 95)
+verifier_arena                                   # JIT does not support arena
diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 1a63996c0304..ded440277f6e 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -3,3 +3,4 @@
 exceptions				 # JIT does not support calling kfunc bpf_throw				       (exceptions)
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
 stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
+verifier_arena                           # JIT does not support arena
diff --git a/tools/testing/selftests/bpf/bpf_arena_common.h b/tools/testing/selftests/bpf/bpf_arena_common.h
new file mode 100644
index 000000000000..07849d502f40
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_common.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+
+#ifndef WRITE_ONCE
+#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *) &(x)) = (val))
+#endif
+
+#ifndef NUMA_NO_NODE
+#define	NUMA_NO_NODE	(-1)
+#endif
+
+#ifndef arena_container_of
+#define arena_container_of(ptr, type, member)			\
+	({							\
+		void __arena *__mptr = (void __arena *)(ptr);	\
+		((type *)(__mptr - offsetof(type, member)));	\
+	})
+#endif
+
+#ifdef __BPF__ /* when compiled as bpf program */
+
+#ifndef PAGE_SIZE
+#define PAGE_SIZE __PAGE_SIZE
+/*
+ * for older kernels try sizeof(struct genradix_node)
+ * or flexible:
+ * static inline long __bpf_page_size(void) {
+ *   return bpf_core_enum_value(enum page_size_enum___l, __PAGE_SIZE___l) ?: sizeof(struct genradix_node);
+ * }
+ * but generated code is not great.
+ */
+#endif
+
+#if defined(__BPF_FEATURE_ARENA_CAST) && !defined(BPF_ARENA_FORCE_ASM)
+#define __arena __attribute__((address_space(1)))
+#define cast_kern(ptr) /* nop for bpf prog. emitted by LLVM */
+#define cast_user(ptr) /* nop for bpf prog. emitted by LLVM */
+#else
+#define __arena
+#define cast_kern(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_KERN, 1)
+#define cast_user(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_USER, 1)
+#endif
+
+void __arena* bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt,
+				    int node_id, __u64 flags) __ksym __weak;
+void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak;
+
+#else /* when compiled as user space code */
+
+#define __arena
+#define __arg_arena
+#define cast_kern(ptr) /* nop for user space */
+#define cast_user(ptr) /* nop for user space */
+__weak char arena[1];
+
+#ifndef offsetof
+#define offsetof(type, member)  ((unsigned long)&((type *)0)->member)
+#endif
+
+static inline void __arena* bpf_arena_alloc_pages(void *map, void *addr, __u32 page_cnt,
+						  int node_id, __u64 flags)
+{
+	return NULL;
+}
+static inline void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt)
+{
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index 9c6072a19745..985273832f89 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -4,6 +4,7 @@
 
 #include "cap_helpers.h"
 #include "verifier_and.skel.h"
+#include "verifier_arena.skel.h"
 #include "verifier_array_access.skel.h"
 #include "verifier_basic_stack.skel.h"
 #include "verifier_bitfield_write.skel.h"
@@ -118,6 +119,7 @@ static void run_tests_aux(const char *skel_name,
 #define RUN(skel) run_tests_aux(#skel, skel##__elf_bytes, NULL)
 
 void test_verifier_and(void)                  { RUN(verifier_and); }
+void test_verifier_arena(void)                { RUN(verifier_arena); }
 void test_verifier_basic_stack(void)          { RUN(verifier_basic_stack); }
 void test_verifier_bitfield_write(void)       { RUN(verifier_bitfield_write); }
 void test_verifier_bounds(void)               { RUN(verifier_bounds); }
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
new file mode 100644
index 000000000000..0e667132ef92
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+#include "bpf_arena_common.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARENA);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__uint(max_entries, 2); /* arena of two pages close to 32-bit boundary*/
+	__ulong(map_extra, (1ull << 44) | (~0u - __PAGE_SIZE * 2 + 1)); /* start of mmap() region */
+} arena SEC(".maps");
+
+SEC("syscall")
+__success __retval(0)
+int basic_alloc1(void *ctx)
+{
+	volatile int __arena *page1, *page2, *no_page, *page3;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	*page1 = 1;
+	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page2)
+		return 2;
+	*page2 = 2;
+	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (no_page)
+		return 3;
+	if (*page1 != 1)
+		return 4;
+	if (*page2 != 2)
+		return 5;
+	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+	if (*page1 != 1)
+		return 6;
+	if (*page2 != 0) /* use-after-free should return 0 */
+		return 7;
+	page3 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page3)
+		return 8;
+	*page3 = 3;
+	if (page2 != page3)
+		return 9;
+	if (*page1 != 1)
+		return 10;
+	return 0;
+}
+
+SEC("syscall")
+__success __retval(0)
+int basic_alloc2(void *ctx)
+{
+	volatile char __arena *page1, *page2, *page3, *page4;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	page2 = page1 + __PAGE_SIZE;
+	page3 = page1 + __PAGE_SIZE * 2;
+	page4 = page1 - __PAGE_SIZE;
+	*page1 = 1;
+	*page2 = 2;
+	*page3 = 3;
+	*page4 = 4;
+	if (*page1 != 1)
+		return 1;
+	if (*page2 != 2)
+		return 2;
+	if (*page3 != 0)
+		return 3;
+	if (*page4 != 0)
+		return 4;
+	bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+	if (*page1 != 0)
+		return 5;
+	if (*page2 != 0)
+		return 6;
+	if (*page3 != 0)
+		return 7;
+	if (*page4 != 0)
+		return 8;
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_loader.c b/tools/testing/selftests/bpf/test_loader.c
index ba57601c2a4d..524c38e9cde4 100644
--- a/tools/testing/selftests/bpf/test_loader.c
+++ b/tools/testing/selftests/bpf/test_loader.c
@@ -501,7 +501,7 @@ static bool is_unpriv_capable_map(struct bpf_map *map)
 	}
 }
 
-static int do_prog_test_run(int fd_prog, int *retval)
+static int do_prog_test_run(int fd_prog, int *retval, bool empty_opts)
 {
 	__u8 tmp_out[TEST_DATA_LEN << 2] = {};
 	__u8 tmp_in[TEST_DATA_LEN] = {};
@@ -514,6 +514,10 @@ static int do_prog_test_run(int fd_prog, int *retval)
 		.repeat = 1,
 	);
 
+	if (empty_opts) {
+		memset(&topts, 0, sizeof(struct bpf_test_run_opts));
+		topts.sz = sizeof(struct bpf_test_run_opts);
+	}
 	err = bpf_prog_test_run_opts(fd_prog, &topts);
 	saved_errno = errno;
 
@@ -649,7 +653,8 @@ void run_subtest(struct test_loader *tester,
 			}
 		}
 
-		do_prog_test_run(bpf_program__fd(tprog), &retval);
+		do_prog_test_run(bpf_program__fd(tprog), &retval,
+				 bpf_program__type(tprog) == BPF_PROG_TYPE_SYSCALL ? true : false);
 		if (retval != subspec->retval && subspec->retval != POINTER_VALUE) {
 			PRINT_FAIL("Unexpected retval: %d != %d\n", retval, subspec->retval);
 			goto tobj_cleanup;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 18/20] selftests/bpf: Add bpf_arena_list test.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (16 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-09  4:06 ` [PATCH v2 bpf-next 19/20] selftests/bpf: Add bpf_arena_htab test Alexei Starovoitov
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_arena_alloc.h - implements a page_frag allocator as a bpf program.
bpf_arena_list.h - implements a doubly linked list as a bpf program.

Both are compiled as a bpf program and as native C code.
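
A rough usage sketch of the two headers together (types and calls mirror the
arena_list selftest below; this is not additional test code):

  struct elem {
  	struct arena_list_node node;
  	__u64 value;
  };

  struct elem __arena *n = bpf_alloc(sizeof(*n));	/* page_frag allocation */

  n->value = 42;
  list_add_head(&n->node, list_head);

  list_for_each_entry(n, list_head, node)	/* also usable from user space */
  	sum += n->value;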

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/DENYLIST.aarch64  |  1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 tools/testing/selftests/bpf/bpf_arena_alloc.h | 58 +++++++++++
 tools/testing/selftests/bpf/bpf_arena_list.h  | 95 +++++++++++++++++++
 .../selftests/bpf/prog_tests/arena_list.c     | 68 +++++++++++++
 .../testing/selftests/bpf/progs/arena_list.c  | 76 +++++++++++++++
 6 files changed, 299 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c

diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
index 8e70af386e52..83a3d9bee59c 100644
--- a/tools/testing/selftests/bpf/DENYLIST.aarch64
+++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
@@ -12,3 +12,4 @@ fill_link_info/kretprobe_multi_link_info         # bpf_program__attach_kprobe_mu
 fill_link_info/kprobe_multi_invalid_ubuff        # bpf_program__attach_kprobe_multi_opts unexpected error: -95
 missed/kprobe_recursion                          # missed_kprobe_recursion__attach unexpected error: -95 (errno 95)
 verifier_arena                                   # JIT does not support arena
+arena						 # JIT does not support arena
diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index ded440277f6e..9293b88a327e 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -4,3 +4,4 @@ exceptions				 # JIT does not support calling kfunc bpf_throw				       (excepti
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
 stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
 verifier_arena                           # JIT does not support arena
+arena					 # JIT does not support arena
diff --git a/tools/testing/selftests/bpf/bpf_arena_alloc.h b/tools/testing/selftests/bpf/bpf_arena_alloc.h
new file mode 100644
index 000000000000..0f4cb399b4c7
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_alloc.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+#include "bpf_arena_common.h"
+
+#ifndef __round_mask
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+#endif
+#ifndef round_up
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+#endif
+
+void __arena *cur_page;
+int cur_offset;
+
+/* Simple page_frag allocator */
+static inline void __arena* bpf_alloc(unsigned int size)
+{
+	__u64 __arena *obj_cnt;
+	void __arena *page = cur_page;
+	int offset;
+
+	size = round_up(size, 8);
+	if (size >= PAGE_SIZE - 8)
+		return NULL;
+	if (!page) {
+refill:
+		page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+		if (!page)
+			return NULL;
+		cast_kern(page);
+		cur_page = page;
+		cur_offset = PAGE_SIZE - 8;
+		obj_cnt = page + PAGE_SIZE - 8;
+		*obj_cnt = 0;
+	} else {
+		cast_kern(page);
+		obj_cnt = page + PAGE_SIZE - 8;
+	}
+
+	offset = cur_offset - size;
+	if (offset < 0)
+		goto refill;
+
+	(*obj_cnt)++;
+	cur_offset = offset;
+	return page + offset;
+}
+
+static inline void bpf_free(void __arena *addr)
+{
+	__u64 __arena *obj_cnt;
+
+	addr = (void __arena *)(((long)addr) & ~(PAGE_SIZE - 1));
+	obj_cnt = addr + PAGE_SIZE - 8;
+	if (--(*obj_cnt) == 0)
+		bpf_arena_free_pages(&arena, addr, 1);
+}
diff --git a/tools/testing/selftests/bpf/bpf_arena_list.h b/tools/testing/selftests/bpf/bpf_arena_list.h
new file mode 100644
index 000000000000..31fd744dfb72
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_list.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+#include "bpf_arena_common.h"
+
+struct arena_list_node;
+
+typedef struct arena_list_node __arena arena_list_node_t;
+
+struct arena_list_node {
+	arena_list_node_t *next;
+	arena_list_node_t * __arena *pprev;
+};
+
+struct arena_list_head {
+	struct arena_list_node __arena *first;
+};
+typedef struct arena_list_head __arena arena_list_head_t;
+
+#define list_entry(ptr, type, member) arena_container_of(ptr, type, member)
+
+#define list_entry_safe(ptr, type, member) \
+	({ typeof(*ptr) * ___ptr = (ptr); \
+	 ___ptr ? ({ cast_kern(___ptr); list_entry(___ptr, type, member); }) : NULL; \
+	 })
+
+#ifndef __BPF__
+static inline void *bpf_iter_num_new(struct bpf_iter_num *it, int i, int j) { return NULL; }
+static inline void bpf_iter_num_destroy(struct bpf_iter_num *it) {}
+static inline bool bpf_iter_num_next(struct bpf_iter_num *it) { return true; }
+#endif
+
+/* Safely walk link list of up to 1M elements. Deletion of elements is allowed. */
+#define list_for_each_entry(pos, head, member)						\
+	for (struct bpf_iter_num ___it __attribute__((aligned(8),			\
+						      cleanup(bpf_iter_num_destroy))),	\
+			* ___tmp = (			\
+				bpf_iter_num_new(&___it, 0, (1000000)),			\
+				pos = list_entry_safe((head)->first,			\
+						      typeof(*(pos)), member),		\
+				(void)bpf_iter_num_destroy, (void *)0);			\
+	     bpf_iter_num_next(&___it) && pos &&				\
+		({ ___tmp = (void *)pos->member.next; 1; });			\
+	     pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))
+
+static inline void list_add_head(arena_list_node_t *n, arena_list_head_t *h)
+{
+	arena_list_node_t *first = h->first, * __arena *tmp;
+
+	cast_user(first);
+	cast_kern(n);
+	WRITE_ONCE(n->next, first);
+	cast_kern(first);
+	if (first) {
+		tmp = &n->next;
+		cast_user(tmp);
+		WRITE_ONCE(first->pprev, tmp);
+	}
+	cast_user(n);
+	WRITE_ONCE(h->first, n);
+
+	tmp = &h->first;
+	cast_user(tmp);
+	cast_kern(n);
+	WRITE_ONCE(n->pprev, tmp);
+}
+
+static inline void __list_del(arena_list_node_t *n)
+{
+	arena_list_node_t *next = n->next, *tmp;
+	arena_list_node_t * __arena *pprev = n->pprev;
+
+	cast_user(next);
+	cast_kern(pprev);
+	tmp = *pprev;
+	cast_kern(tmp);
+	WRITE_ONCE(tmp, next);
+	if (next) {
+		cast_user(pprev);
+		cast_kern(next);
+		WRITE_ONCE(next->pprev, pprev);
+	}
+}
+
+#define POISON_POINTER_DELTA 0
+
+#define LIST_POISON1  ((void __arena *) 0x100 + POISON_POINTER_DELTA)
+#define LIST_POISON2  ((void __arena *) 0x122 + POISON_POINTER_DELTA)
+
+static inline void list_del(arena_list_node_t *n)
+{
+	__list_del(n);
+	n->next = LIST_POISON1;
+	n->pprev = LIST_POISON2;
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
new file mode 100644
index 000000000000..e61886debab1
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <sys/mman.h>
+#include <network_helpers.h>
+
+#define PAGE_SIZE 4096
+
+#include "bpf_arena_list.h"
+#include "arena_list.skel.h"
+
+struct elem {
+	struct arena_list_node node;
+	__u64 value;
+};
+
+static int list_sum(struct arena_list_head *head)
+{
+	struct elem __arena *n;
+	int sum = 0;
+
+	list_for_each_entry(n, head, node)
+		sum += n->value;
+	return sum;
+}
+
+static void test_arena_list_add_del(int cnt)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct arena_list *skel;
+	int expected_sum = (u64)cnt * (cnt - 1) / 2;
+	int ret, sum;
+
+	skel = arena_list__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+		return;
+
+	skel->bss->cnt = cnt;
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
+	ASSERT_OK(ret, "ret_add");
+	ASSERT_OK(opts.retval, "retval");
+	if (skel->bss->skip) {
+		printf("%s:SKIP:compiler doesn't support arena_cast\n", __func__);
+		test__skip();
+		goto out;
+	}
+	sum = list_sum(skel->bss->list_head);
+	ASSERT_EQ(sum, expected_sum, "sum of elems");
+	ASSERT_EQ(skel->arena->arena_sum, expected_sum, "__arena sum of elems");
+	ASSERT_EQ(skel->arena->test_val, cnt + 1, "num of elems");
+
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_del), &opts);
+	ASSERT_OK(ret, "ret_del");
+	sum = list_sum(skel->bss->list_head);
+	ASSERT_EQ(sum, 0, "sum of list elems after del");
+	ASSERT_EQ(skel->bss->list_sum, expected_sum, "sum of list elems computed by prog");
+	ASSERT_EQ(skel->arena->arena_sum, expected_sum, "__arena sum of elems");
+out:
+	arena_list__destroy(skel);
+}
+
+void test_arena_list(void)
+{
+	if (test__start_subtest("arena_list_1"))
+		test_arena_list_add_del(1);
+	if (test__start_subtest("arena_list_1000"))
+		test_arena_list_add_del(1000);
+}
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
new file mode 100644
index 000000000000..04ebcdd98f10
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARENA);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__uint(max_entries, 1000); /* number of pages */
+	__ulong(map_extra, 2ull << 44); /* start of mmap() region */
+} arena SEC(".maps");
+
+#include "bpf_arena_alloc.h"
+#include "bpf_arena_list.h"
+
+struct elem {
+	struct arena_list_node node;
+	__u64 value;
+};
+
+struct arena_list_head __arena *list_head;
+int list_sum;
+int cnt;
+bool skip = false;
+
+long __arena arena_sum;
+int __arena test_val = 1;
+struct arena_list_head __arena global_head;
+
+SEC("syscall")
+int arena_list_add(void *ctx)
+{
+#ifdef __BPF_FEATURE_ARENA_CAST
+	__u64 i;
+
+	list_head = &global_head;
+
+	bpf_for(i, 0, cnt) {
+		struct elem __arena *n = bpf_alloc(sizeof(*n));
+
+		test_val++;
+		n->value = i;
+		arena_sum += i;
+		list_add_head(&n->node, list_head);
+	}
+#else
+	skip = true;
+#endif
+	return 0;
+}
+
+SEC("syscall")
+int arena_list_del(void *ctx)
+{
+#ifdef __BPF_FEATURE_ARENA_CAST
+	struct elem __arena *n;
+	int sum = 0;
+
+	arena_sum = 0;
+	list_for_each_entry(n, list_head, node) {
+		sum += n->value;
+		arena_sum += n->value;
+		list_del(&n->node);
+		bpf_free(n);
+	}
+	list_sum = sum;
+#else
+	skip = true;
+#endif
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 19/20] selftests/bpf: Add bpf_arena_htab test.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (17 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 18/20] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-09  4:06 ` [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu Alexei Starovoitov
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

bpf_arena_htab.h - a hash table implemented as a bpf program.
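
A rough usage sketch of its API (mirrors the arena_htab selftest below;
'key', 'value' and 'val' are illustrative):

  struct htab __arena *htab = bpf_alloc(sizeof(*htab));

  cast_kern(htab);
  htab_init(htab);
  htab_update_elem(htab, key, value);	/* insert or replace */
  val = htab_lookup_elem(htab, key);	/* returns 0 if not found */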

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++++++++++++++++
 .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++++++++++++++
 .../testing/selftests/bpf/progs/arena_htab.c  |  46 ++++++++
 .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
 4 files changed, 239 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
 create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c

diff --git a/tools/testing/selftests/bpf/bpf_arena_htab.h b/tools/testing/selftests/bpf/bpf_arena_htab.h
new file mode 100644
index 000000000000..acc01a876668
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_arena_htab.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#pragma once
+#include <errno.h>
+#include "bpf_arena_alloc.h"
+#include "bpf_arena_list.h"
+
+struct htab_bucket {
+	struct arena_list_head head;
+};
+typedef struct htab_bucket __arena htab_bucket_t;
+
+struct htab {
+	htab_bucket_t *buckets;
+	int n_buckets;
+};
+typedef struct htab __arena htab_t;
+
+static inline htab_bucket_t *__select_bucket(htab_t *htab, __u32 hash)
+{
+	htab_bucket_t *b = htab->buckets;
+
+	cast_kern(b);
+	return &b[hash & (htab->n_buckets - 1)];
+}
+
+static inline arena_list_head_t *select_bucket(htab_t *htab, __u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
+struct hashtab_elem {
+	int hash;
+	int key;
+	int value;
+	struct arena_list_node hash_node;
+};
+typedef struct hashtab_elem __arena hashtab_elem_t;
+
+static hashtab_elem_t *lookup_elem_raw(arena_list_head_t *head, __u32 hash, int key)
+{
+	hashtab_elem_t *l;
+
+	list_for_each_entry(l, head, hash_node)
+		if (l->hash == hash && l->key == key)
+			return l;
+
+	return NULL;
+}
+
+static int htab_hash(int key)
+{
+	return key;
+}
+
+__weak int htab_lookup_elem(htab_t *htab __arg_arena, int key)
+{
+	hashtab_elem_t *l_old;
+	arena_list_head_t *head;
+
+	cast_kern(htab);
+	head = select_bucket(htab, key);
+	l_old = lookup_elem_raw(head, htab_hash(key), key);
+	if (l_old)
+		return l_old->value;
+	return 0;
+}
+
+__weak int htab_update_elem(htab_t *htab __arg_arena, int key, int value)
+{
+	hashtab_elem_t *l_new = NULL, *l_old;
+	arena_list_head_t *head;
+
+	cast_kern(htab);
+	head = select_bucket(htab, key);
+	l_old = lookup_elem_raw(head, htab_hash(key), key);
+
+	l_new = bpf_alloc(sizeof(*l_new));
+	if (!l_new)
+		return -ENOMEM;
+	l_new->key = key;
+	l_new->hash = htab_hash(key);
+	l_new->value = value;
+
+	list_add_head(&l_new->hash_node, head);
+	if (l_old) {
+		list_del(&l_old->hash_node);
+		bpf_free(l_old);
+	}
+	return 0;
+}
+
+void htab_init(htab_t *htab)
+{
+	void __arena *buckets = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+
+	cast_user(buckets);
+	htab->buckets = buckets;
+	htab->n_buckets = 2 * PAGE_SIZE / sizeof(struct htab_bucket);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/arena_htab.c b/tools/testing/selftests/bpf/prog_tests/arena_htab.c
new file mode 100644
index 000000000000..0766702de846
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/arena_htab.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <sys/mman.h>
+#include <network_helpers.h>
+
+#include "arena_htab_asm.skel.h"
+#include "arena_htab.skel.h"
+
+#define PAGE_SIZE 4096
+
+#include "bpf_arena_htab.h"
+
+static void test_arena_htab_common(struct htab *htab)
+{
+	int i;
+
+	printf("htab %p buckets %p n_buckets %d\n", htab, htab->buckets, htab->n_buckets);
+	ASSERT_OK_PTR(htab->buckets, "htab->buckets shouldn't be NULL");
+	for (i = 0; htab->buckets && i < 16; i += 4) {
+		/*
+		 * Walk htab buckets and link lists since all pointers are correct,
+		 * though they were written by bpf program.
+		 */
+		int val = htab_lookup_elem(htab, i);
+
+		ASSERT_EQ(i, val, "key == value");
+	}
+}
+
+static void test_arena_htab_llvm(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct arena_htab *skel;
+	struct htab *htab;
+	size_t arena_sz;
+	void *area;
+	int ret;
+
+	skel = arena_htab__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_htab__open_and_load"))
+		return;
+
+	area = bpf_map__initial_value(skel->maps.arena, &arena_sz);
+	/* fault-in a page with pgoff == 0 as sanity check */
+	*(volatile int *)area = 0x55aa;
+
+	/* bpf prog will allocate more pages */
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_htab_llvm), &opts);
+	ASSERT_OK(ret, "ret");
+	ASSERT_OK(opts.retval, "retval");
+	if (skel->bss->skip) {
+		printf("%s:SKIP:compiler doesn't support arena_cast\n", __func__);
+		test__skip();
+		goto out;
+	}
+	htab = skel->bss->htab_for_user;
+	test_arena_htab_common(htab);
+out:
+	arena_htab__destroy(skel);
+}
+
+static void test_arena_htab_asm(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct arena_htab_asm *skel;
+	struct htab *htab;
+	int ret;
+
+	skel = arena_htab_asm__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "arena_htab_asm__open_and_load"))
+		return;
+
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_htab_asm), &opts);
+	ASSERT_OK(ret, "ret");
+	ASSERT_OK(opts.retval, "retval");
+	htab = skel->bss->htab_for_user;
+	test_arena_htab_common(htab);
+	arena_htab_asm__destroy(skel);
+}
+
+void test_arena_htab(void)
+{
+	if (test__start_subtest("arena_htab_llvm"))
+		test_arena_htab_llvm();
+	if (test__start_subtest("arena_htab_asm"))
+		test_arena_htab_asm();
+}
diff --git a/tools/testing/selftests/bpf/progs/arena_htab.c b/tools/testing/selftests/bpf/progs/arena_htab.c
new file mode 100644
index 000000000000..441fc502312f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_htab.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARENA);
+	__uint(map_flags, BPF_F_MMAPABLE);
+	__uint(max_entries, 100); /* number of pages */
+} arena SEC(".maps");
+
+#include "bpf_arena_htab.h"
+
+void __arena *htab_for_user;
+bool skip = false;
+
+SEC("syscall")
+int arena_htab_llvm(void *ctx)
+{
+#if defined(__BPF_FEATURE_ARENA_CAST) || defined(BPF_ARENA_FORCE_ASM)
+	struct htab __arena *htab;
+	__u64 i;
+
+	htab = bpf_alloc(sizeof(*htab));
+	cast_kern(htab);
+	htab_init(htab);
+
+	/* first run. No old elems in the table */
+	bpf_for(i, 0, 1000)
+		htab_update_elem(htab, i, i);
+
+	/* should replace all elems with new ones */
+	bpf_for(i, 0, 1000)
+		htab_update_elem(htab, i, i);
+	cast_user(htab);
+	htab_for_user = htab;
+#else
+	skip = true;
+#endif
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/arena_htab_asm.c b/tools/testing/selftests/bpf/progs/arena_htab_asm.c
new file mode 100644
index 000000000000..6cd70ea12f0d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/arena_htab_asm.c
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#define BPF_ARENA_FORCE_ASM
+#define arena_htab_llvm arena_htab_asm
+#include "arena_htab.c"
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (18 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 19/20] selftests/bpf: Add bpf_arena_htab test Alexei Starovoitov
@ 2024-02-09  4:06 ` Alexei Starovoitov
  2024-02-10  7:05   ` Kumar Kartikeya Dwivedi
  2024-02-12 14:14 ` [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena David Hildenbrand
  2024-02-12 17:36 ` Barret Rhoden
  21 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-09  4:06 UTC (permalink / raw)
  To: bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert the simple page_frag allocator to a per-cpu page_frag allocator to
further stress test the combination of __arena global and static variables
and alloc/free from the arena.
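
The gist of the conversion (see the full diff below): the single global
cur_page/cur_offset pair becomes per-cpu state indexed by the current CPU:

  __u32 cpu = bpf_get_smp_processor_id();
  void __arena *page = page_frag_cur_page[cpu];
  int __arena *cur_offset = &page_frag_cur_offset[cpu];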

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/bpf_arena_alloc.h | 23 +++++++++++++------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_arena_alloc.h b/tools/testing/selftests/bpf/bpf_arena_alloc.h
index 0f4cb399b4c7..c27678299e0c 100644
--- a/tools/testing/selftests/bpf/bpf_arena_alloc.h
+++ b/tools/testing/selftests/bpf/bpf_arena_alloc.h
@@ -10,14 +10,19 @@
 #define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
 #endif
 
-void __arena *cur_page;
-int cur_offset;
+#ifdef __BPF__
+#define NR_CPUS (sizeof(struct cpumask) * 8)
+
+static void __arena * __arena page_frag_cur_page[NR_CPUS];
+static int __arena page_frag_cur_offset[NR_CPUS];
 
 /* Simple page_frag allocator */
 static inline void __arena* bpf_alloc(unsigned int size)
 {
 	__u64 __arena *obj_cnt;
-	void __arena *page = cur_page;
+	__u32 cpu = bpf_get_smp_processor_id();
+	void __arena *page = page_frag_cur_page[cpu];
+	int __arena *cur_offset = &page_frag_cur_offset[cpu];
 	int offset;
 
 	size = round_up(size, 8);
@@ -29,8 +34,8 @@ static inline void __arena* bpf_alloc(unsigned int size)
 		if (!page)
 			return NULL;
 		cast_kern(page);
-		cur_page = page;
-		cur_offset = PAGE_SIZE - 8;
+		page_frag_cur_page[cpu] = page;
+		*cur_offset = PAGE_SIZE - 8;
 		obj_cnt = page + PAGE_SIZE - 8;
 		*obj_cnt = 0;
 	} else {
@@ -38,12 +43,12 @@ static inline void __arena* bpf_alloc(unsigned int size)
 		obj_cnt = page + PAGE_SIZE - 8;
 	}
 
-	offset = cur_offset - size;
+	offset = *cur_offset - size;
 	if (offset < 0)
 		goto refill;
 
 	(*obj_cnt)++;
-	cur_offset = offset;
+	*cur_offset = offset;
 	return page + offset;
 }
 
@@ -56,3 +61,7 @@ static inline void bpf_free(void __arena *addr)
 	if (--(*obj_cnt) == 0)
 		bpf_arena_free_pages(&arena, addr, 1);
 }
+#else
+static inline void __arena* bpf_alloc(unsigned int size) { return NULL; }
+static inline void bpf_free(void __arena *addr) {}
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
@ 2024-02-09 17:20   ` Eduard Zingerman
  2024-02-13 22:20     ` Alexei Starovoitov
  2024-02-10  6:48   ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-09 17:20 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:05 -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
> They are similar to PROBE_MEM instructions with the following differences:
> - PROBE_MEM has to check that the address is in the kernel range with
>   src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
> - PROBE_MEM doesn't support store
> - PROBE_MEM32 relies on the verifier to clear upper 32-bit in the register
> - PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in %r12 in the prologue)
>   Due to bpf_arena constructions such %r12 + %reg + off16 access is guaranteed
>   to be within arena virtual range, so no address check at run-time.
> - PROBE_MEM32 allows STX and ST. If they fault the store is a nop.
>   When LDX faults the destination register is zeroed.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

It would be great to add support for these new probe instructions in disasm,
otherwise commands like "bpftool prog dump xlated" can't print them.

I sort of brute-force verified the JIT code generated for the new
instructions, and the disassembly seems to be as expected.

[...]

> @@ -1564,6 +1697,52 @@ st:			if (is_imm8(insn->off))
>  			emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>  			break;
>  
> +		case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
> +		case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
> +		case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
> +		case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
> +			start_of_ldx = prog;
> +			emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
> +			goto populate_extable;
> +
> +			/* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
> +		case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
> +		case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
> +		case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
> +		case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
> +		case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
> +		case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
> +		case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
> +		case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
> +			start_of_ldx = prog;
> +			if (BPF_CLASS(insn->code) == BPF_LDX)
> +				emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> +			else
> +				emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> +populate_extable:
> +			{
> +				struct exception_table_entry *ex;
> +				u8 *_insn = image + proglen + (start_of_ldx - temp);
> +				s64 delta;
> +
> +				if (!bpf_prog->aux->extable)
> +					break;
> +
> +				ex = &bpf_prog->aux->extable[excnt++];

Nit: this seems to mostly repeat the exception logic for
     "BPF_LDX | BPF_MEM | BPF_B" & co,
     is there a way to abstract it a bit?
     Also note that excnt is checked for overflow there.
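
     For comparison, the existing PROBE_MEM path bounds-checks the counter
     before filling the entry, roughly like this (paraphrased from the x86
     JIT, not an exact quote):

	if (excnt >= bpf_prog->aux->num_exentries) {
		pr_err("ex gen bug\n");
		return -EFAULT;
	}
	ex = &bpf_prog->aux->extable[excnt++];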

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
@ 2024-02-09 20:36   ` David Vernet
  2024-02-10  4:38     ` Alexei Starovoitov
  2024-02-10  7:40   ` Kumar Kartikeya Dwivedi
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 112+ messages in thread
From: David Vernet @ 2024-02-09 20:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

[-- Attachment #1: Type: text/plain, Size: 12246 bytes --]

On Thu, Feb 08, 2024 at 08:05:53PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
> 
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage. The bpf program implements an
>    in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>    value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>    The user space may mmap it, but bpf program will not convert pointers
>    to user base at run-time to improve bpf program speed.
> 
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
> 
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
> 
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of bpf
> program is more important than ease of sharing with user space. This is use
> case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
> tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
> 32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
> bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
> of user vm_start (if the pointer is not NULL) to arena pointers before they are
> stored into memory. This way, user space sees them as valid 64-bit pointers.
> 
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
> 
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
> 
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
> 
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
> rX = (u32)rY;
> if (rY)
>   rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
> 
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf.h            |   5 +-
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |   7 +
>  kernel/bpf/Makefile            |   3 +
>  kernel/bpf/arena.c             | 557 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c              |  11 +
>  kernel/bpf/syscall.c           |   3 +
>  kernel/bpf/verifier.c          |   1 +
>  tools/include/uapi/linux/bpf.h |   7 +
>  9 files changed, 593 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8b0dcb66eb33..de557c6c42e0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -37,6 +37,7 @@ struct perf_event;
>  struct bpf_prog;
>  struct bpf_prog_aux;
>  struct bpf_map;
> +struct bpf_arena;
>  struct sock;
>  struct seq_file;
>  struct btf;
> @@ -534,8 +535,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>  			struct bpf_spin_lock *spin_lock);
>  void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
>  		      struct bpf_spin_lock *spin_lock);
> -
> -
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
>  int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
>  
>  struct bpf_offload_dev;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 94baced5a1ad..9f2a6b83b49e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
>  
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d96708380e52..f6648851eae6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -983,6 +983,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_BLOOM_FILTER,
>  	BPF_MAP_TYPE_USER_RINGBUF,
>  	BPF_MAP_TYPE_CGRP_STORAGE,
> +	BPF_MAP_TYPE_ARENA,
>  	__MAX_BPF_MAP_TYPE
>  };
>  
> @@ -1370,6 +1371,12 @@ enum {
>  
>  /* BPF token FD is passed in a corresponding command's token_fd field */
>  	BPF_F_TOKEN_FD          = (1U << 16),
> +
> +/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
> +	BPF_F_SEGV_ON_FAULT	= (1U << 17),
> +
> +/* Do not translate kernel bpf_arena pointers to user pointers */
> +	BPF_F_NO_USER_CONV	= (1U << 18),
>  };
>  
>  /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 4ce95acfcaa7..368c5d86b5b7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> +ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
> +obj-$(CONFIG_BPF_SYSCALL) += arena.o
> +endif
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> new file mode 100644
> index 000000000000..5c1014471740
> --- /dev/null
> +++ b/kernel/bpf/arena.c
> @@ -0,0 +1,557 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/btf_ids.h>
> +#include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * bpf_arena is a sparsely populated shared memory region between bpf program and
> + * user space process.
> + *
> + * For example on x86-64 the values could be:
> + * user_vm_start 7f7d26200000     // picked by mmap()
> + * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
> + * For user space all pointers within the arena are normal 8-byte addresses.
> + * In this example 7f7d26200000 is the address of the first page (pgoff=0).
> + * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
> + * (u32)7f7d26200000 -> 26200000
> + * hence
> + * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
> + * kernel memory region.
> + *
> + * BPF JITs generate the following code to access arena:
> + *   mov eax, eax  // eax has lower 32-bit of user pointer
> + *   mov word ptr [rax + r12 + off], bx
> + * where r12 == kern_vm_start and off is s16.
> + * Hence allocate 4Gb + GUARD_SZ/2 on each side.
> + *
> + * Initially kernel vm_area and user vma are not populated.
> + * User space can fault-in any address which will insert the page
> + * into kernel and user vma.
> + * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
> + * which will insert it into kernel vm_area.
> + * The later fault-in from user space will populate that page into user vma.
> + */
> +
> +/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
> +#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
> +#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)
> +
> +struct bpf_arena {
> +	struct bpf_map map;
> +	u64 user_vm_start;
> +	u64 user_vm_end;
> +	struct vm_struct *kern_vm;
> +	struct maple_tree mt;
> +	struct list_head vma_list;
> +	struct mutex lock;
> +};
> +
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> +{
> +	return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
> +}
> +
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> +{
> +	return arena ? arena->user_vm_start : 0;
> +}
> +
> +static long arena_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_pop_elem(struct bpf_map *map, void *value)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_delete_elem(struct bpf_map *map, void *value)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> +{
> +	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> +}
> +
> +static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> +{
> +	struct vm_struct *kern_vm;
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	struct bpf_arena *arena;
> +	u64 vm_range;
> +	int err = -ENOMEM;
> +
> +	if (attr->key_size || attr->value_size || attr->max_entries == 0 ||
> +	    /* BPF_F_MMAPABLE must be set */
> +	    !(attr->map_flags & BPF_F_MMAPABLE) ||
> +	    /* No unsupported flags present */
> +	    (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->map_extra & ~PAGE_MASK)
> +		/* If non-zero the map_extra is an expected user VMA start address */
> +		return ERR_PTR(-EINVAL);

So I haven't done a thorough review of this patch, beyond trying to
understand the semantics of bpf arenas. On that note, could you please
document the semantics of map_extra with arena maps where map_extra is
defined in [0]?

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/uapi/linux/bpf.h#n1439
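
For context, the selftests later in this series pass the expected user VMA
start via map_extra; e.g. the arena_list selftest declares:

	struct {
		__uint(type, BPF_MAP_TYPE_ARENA);
		__uint(map_flags, BPF_F_MMAPABLE);
		__uint(max_entries, 1000); /* number of pages */
		__ulong(map_extra, 2ull << 44); /* start of mmap() region */
	} arena SEC(".maps");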

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  2024-02-09  4:06 ` [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages Alexei Starovoitov
@ 2024-02-09 23:14   ` David Vernet
  2024-02-10  4:35     ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: David Vernet @ 2024-02-09 23:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

[-- Attachment #1: Type: text/plain, Size: 10034 bytes --]

On Thu, Feb 08, 2024 at 08:06:05PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Add unit tests for bpf_arena_alloc/free_pages() functionality
> and bpf_arena_common.h with a set of common helpers and macros that
> is used in this test and the following patches.
> 
> Also modify test_loader that didn't support running bpf_prog_type_syscall
> programs.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/testing/selftests/bpf/DENYLIST.aarch64  |  1 +
>  tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
>  .../testing/selftests/bpf/bpf_arena_common.h  | 70 ++++++++++++++
>  .../selftests/bpf/prog_tests/verifier.c       |  2 +
>  .../selftests/bpf/progs/verifier_arena.c      | 91 +++++++++++++++++++
>  tools/testing/selftests/bpf/test_loader.c     |  9 +-
>  6 files changed, 172 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
>  create mode 100644 tools/testing/selftests/bpf/progs/verifier_arena.c
> 
> diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
> index 5c2cc7e8c5d0..8e70af386e52 100644
> --- a/tools/testing/selftests/bpf/DENYLIST.aarch64
> +++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
> @@ -11,3 +11,4 @@ fill_link_info/kprobe_multi_link_info            # bpf_program__attach_kprobe_mu
>  fill_link_info/kretprobe_multi_link_info         # bpf_program__attach_kprobe_multi_opts unexpected error: -95
>  fill_link_info/kprobe_multi_invalid_ubuff        # bpf_program__attach_kprobe_multi_opts unexpected error: -95
>  missed/kprobe_recursion                          # missed_kprobe_recursion__attach unexpected error: -95 (errno 95)
> +verifier_arena                                   # JIT does not support arena
> diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
> index 1a63996c0304..ded440277f6e 100644
> --- a/tools/testing/selftests/bpf/DENYLIST.s390x
> +++ b/tools/testing/selftests/bpf/DENYLIST.s390x
> @@ -3,3 +3,4 @@
>  exceptions				 # JIT does not support calling kfunc bpf_throw				       (exceptions)
>  get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
>  stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
> +verifier_arena                           # JIT does not support arena
> diff --git a/tools/testing/selftests/bpf/bpf_arena_common.h b/tools/testing/selftests/bpf/bpf_arena_common.h
> new file mode 100644
> index 000000000000..07849d502f40
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bpf_arena_common.h
> @@ -0,0 +1,70 @@
> +/* SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) */
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#pragma once
> +
> +#ifndef WRITE_ONCE
> +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *) &(x)) = (val))
> +#endif
> +
> +#ifndef NUMA_NO_NODE
> +#define	NUMA_NO_NODE	(-1)
> +#endif
> +
> +#ifndef arena_container_of

Why is this ifndef required if we have a pragma once above?

> +#define arena_container_of(ptr, type, member)			\
> +	({							\
> +		void __arena *__mptr = (void __arena *)(ptr);	\
> +		((type *)(__mptr - offsetof(type, member)));	\
> +	})
> +#endif
> +
> +#ifdef __BPF__ /* when compiled as bpf program */
> +
> +#ifndef PAGE_SIZE
> +#define PAGE_SIZE __PAGE_SIZE
> +/*
> + * for older kernels try sizeof(struct genradix_node)
> + * or flexible:
> + * static inline long __bpf_page_size(void) {
> + *   return bpf_core_enum_value(enum page_size_enum___l, __PAGE_SIZE___l) ?: sizeof(struct genradix_node);
> + * }
> + * but generated code is not great.
> + */
> +#endif
> +
> +#if defined(__BPF_FEATURE_ARENA_CAST) && !defined(BPF_ARENA_FORCE_ASM)
> +#define __arena __attribute__((address_space(1)))
> +#define cast_kern(ptr) /* nop for bpf prog. emitted by LLVM */
> +#define cast_user(ptr) /* nop for bpf prog. emitted by LLVM */
> +#else
> +#define __arena
> +#define cast_kern(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_KERN, 1)
> +#define cast_user(ptr) bpf_arena_cast(ptr, BPF_ARENA_CAST_USER, 1)
> +#endif
> +
> +void __arena* bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt,
> +				    int node_id, __u64 flags) __ksym __weak;
> +void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak;
> +
> +#else /* when compiled as user space code */
> +
> +#define __arena
> +#define __arg_arena
> +#define cast_kern(ptr) /* nop for user space */
> +#define cast_user(ptr) /* nop for user space */
> +__weak char arena[1];
> +
> +#ifndef offsetof
> +#define offsetof(type, member)  ((unsigned long)&((type *)0)->member)
> +#endif
> +
> +static inline void __arena* bpf_arena_alloc_pages(void *map, void *addr, __u32 page_cnt,
> +						  int node_id, __u64 flags)
> +{
> +	return NULL;
> +}
> +static inline void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt)
> +{
> +}
> +
> +#endif
> diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
> index 9c6072a19745..985273832f89 100644
> --- a/tools/testing/selftests/bpf/prog_tests/verifier.c
> +++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
> @@ -4,6 +4,7 @@
>  
>  #include "cap_helpers.h"
>  #include "verifier_and.skel.h"
> +#include "verifier_arena.skel.h"
>  #include "verifier_array_access.skel.h"
>  #include "verifier_basic_stack.skel.h"
>  #include "verifier_bitfield_write.skel.h"
> @@ -118,6 +119,7 @@ static void run_tests_aux(const char *skel_name,
>  #define RUN(skel) run_tests_aux(#skel, skel##__elf_bytes, NULL)
>  
>  void test_verifier_and(void)                  { RUN(verifier_and); }
> +void test_verifier_arena(void)                { RUN(verifier_arena); }
>  void test_verifier_basic_stack(void)          { RUN(verifier_basic_stack); }
>  void test_verifier_bitfield_write(void)       { RUN(verifier_bitfield_write); }
>  void test_verifier_bounds(void)               { RUN(verifier_bounds); }
> diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
> new file mode 100644
> index 000000000000..0e667132ef92
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bpf_misc.h"
> +#include "bpf_experimental.h"
> +#include "bpf_arena_common.h"
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARENA);
> +	__uint(map_flags, BPF_F_MMAPABLE);
> +	__uint(max_entries, 2); /* arena of two pages close to 32-bit boundary*/
> +	__ulong(map_extra, (1ull << 44) | (~0u - __PAGE_SIZE * 2 + 1)); /* start of mmap() region */
> +} arena SEC(".maps");
> +
> +SEC("syscall")
> +__success __retval(0)
> +int basic_alloc1(void *ctx)
> +{
> +	volatile int __arena *page1, *page2, *no_page, *page3;
> +
> +	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> +	if (!page1)
> +		return 1;
> +	*page1 = 1;
> +	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> +	if (!page2)
> +		return 2;
> +	*page2 = 2;
> +	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> +	if (no_page)
> +		return 3;
> +	if (*page1 != 1)
> +		return 4;
> +	if (*page2 != 2)
> +		return 5;
> +	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
> +	if (*page1 != 1)
> +		return 6;
> +	if (*page2 != 0) /* use-after-free should return 0 */

So I know that longer term the plan is to leverage exceptions and have
us exit and unwind the program here, but I think it's somewhat important
to underscore how significant of a usability improvement that will be.
Reading 0 isn't terribly uncommon for typical scheduling use cases. For
example, if we stored a set of cpumasks in arena pages, we may AND them
together and not be concerned if there are no CPUs set as that would be
a perfectly normal state. E.g. if we're using arena cpumasks to track
idle cores and a task's allowed CPUs, and we AND them together and see
0, we'd just assume there were no idle cores available on the system.
Another example would be scx_nest, where we would incorrectly think
that a nest didn't have enough cores, that a task couldn't run in a
domain, etc.

Obviously it's way better for us to actually have arenas in the interim
so this is fine for now, but UAF bugs could potentially be pretty
painful until we get proper exception unwinding support.

Otherwise, in terms of usability, this looks really good. The only thing
to bear in mind is that I don't think we can fully get away from kptrs
that will have some duplicated logic compared to what we can enable in
an arena. For example, we will have to retain at least some of the
struct cpumask * kptrs for e.g. copying a struct task_struct's struct
cpumask *cpus_ptr field.

For now, we could iterate over the cpumask and manually set the bits, so
maybe even just supporting bpf_cpumask_test_cpu() would be enough
(though doing a bitmap_copy() would be better, of course)? This is
probably fine for most use cases as we'd likely only be doing struct
cpumask * -> arena copies on slowpaths. But is there any kind of more
generalized integration we want to have between arenas and kptrs?  Not
sure, can't think of any off the top of my head.
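
To make that concrete, here is a minimal sketch of the manual-copy
workaround (untested; the arena global, the MAX_CPUS bound and the
helper name are made up for illustration, and it assumes vmlinux.h plus
the arena headers from this series are included; only
bpf_cpumask_test_cpu() is an existing kfunc):

bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;

#define MAX_CPUS 128 /* assumed bound; real code would use nr_cpu_ids */

/* arena-resident bitmap, using the __arena convention from this series */
unsigned long __arena idle_bits[MAX_CPUS / (8 * sizeof(long))];

static void copy_cpumask_to_arena(const struct cpumask *src)
{
	int cpu;

	for (cpu = 0; cpu < MAX_CPUS; cpu++)
		if (bpf_cpumask_test_cpu(cpu, src))
			idle_bits[cpu / (8 * sizeof(long))] |=
				1UL << (cpu % (8 * sizeof(long)));
}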

> +		return 7;
> +	page3 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> +	if (!page3)
> +		return 8;
> +	*page3 = 3;
> +	if (page2 != page3)
> +		return 9;
> +	if (*page1 != 1)
> +		return 10;

Should we also test doing a store after an arena has been freed?

> +	return 0;
> +}


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
  2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
@ 2024-02-10  0:06   ` kernel test robot
  2024-02-10  0:17   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 112+ messages in thread
From: kernel test robot @ 2024-02-10  0:06 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: llvm, oe-kbuild-all

Hi Alexei,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexei-Starovoitov/bpf-Allow-kfuncs-return-void/20240209-120941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20240209040608.98927-4-alexei.starovoitov%40gmail.com
patch subject: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
config: arm-randconfig-001-20240209 (https://download.01.org/0day-ci/archive/20240210/202402100739.m56uNkG2-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project ac0577177f053ba7e7016e1b7e44cf5932d00b03)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240210/202402100739.m56uNkG2-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402100739.m56uNkG2-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/bpf/syscall.c:948:22: error: no member named 'get_unmapped_area' in 'struct mm_struct'
     948 |         return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
         |                ~~~~~~~~~~~  ^
   1 error generated.


vim +948 kernel/bpf/syscall.c

   939	
   940	static unsigned long bpf_get_unmapped_area(struct file *filp, unsigned long addr,
   941						   unsigned long len, unsigned long pgoff,
   942						   unsigned long flags)
   943	{
   944		struct bpf_map *map = filp->private_data;
   945	
   946		if (map->ops->map_get_unmapped_area)
   947			return map->ops->map_get_unmapped_area(filp, addr, len, pgoff, flags);
 > 948		return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
   949	}
   950	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
  2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
  2024-02-10  0:06   ` kernel test robot
@ 2024-02-10  0:17   ` kernel test robot
  2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
  2024-02-10  9:06   ` kernel test robot
  3 siblings, 0 replies; 112+ messages in thread
From: kernel test robot @ 2024-02-10  0:17 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: oe-kbuild-all

Hi Alexei,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexei-Starovoitov/bpf-Allow-kfuncs-return-void/20240209-120941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20240209040608.98927-4-alexei.starovoitov%40gmail.com
patch subject: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
config: sh-allyesconfig (https://download.01.org/0day-ci/archive/20240210/202402100839.DY3MDEf1-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240210/202402100839.DY3MDEf1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402100839.DY3MDEf1-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/bpf/syscall.c: In function 'bpf_get_unmapped_area':
   kernel/bpf/syscall.c:948:27: error: 'struct mm_struct' has no member named 'get_unmapped_area'
     948 |         return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
         |                           ^~
>> kernel/bpf/syscall.c:949:1: warning: control reaches end of non-void function [-Wreturn-type]
     949 | }
         | ^


vim +949 kernel/bpf/syscall.c

   939	
   940	static unsigned long bpf_get_unmapped_area(struct file *filp, unsigned long addr,
   941						   unsigned long len, unsigned long pgoff,
   942						   unsigned long flags)
   943	{
   944		struct bpf_map *map = filp->private_data;
   945	
   946		if (map->ops->map_get_unmapped_area)
   947			return map->ops->map_get_unmapped_area(filp, addr, len, pgoff, flags);
   948		return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 > 949	}
   950	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
@ 2024-02-10  1:13   ` Eduard Zingerman
  2024-02-13  2:58     ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-10  1:13 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:05 -0800, Alexei Starovoitov wrote:
[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3c77a3ab1192..5eeb9bf7e324 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[...]

> @@ -13837,6 +13844,21 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
>  
>  	dst_reg = &regs[insn->dst_reg];
>  	src_reg = NULL;
> +
> +	if (dst_reg->type == PTR_TO_ARENA) {
> +		struct bpf_insn_aux_data *aux = cur_aux(env);
> +
> +		if (BPF_CLASS(insn->code) == BPF_ALU64)
> +			/*
> +			 * 32-bit operations zero upper bits automatically.
> +			 * 64-bit operations need to be converted to 32.
> +			 */
> +			aux->needs_zext = true;

It should be possible to write an example where the same insn is
visited with both PTR_TO_ARENA and some other PTR type.
Such examples should be rejected as is currently done in do_check()
for BPF_{ST,STX} using save_aux_ptr_type().

[...]

> @@ -13954,16 +13976,17 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
>  	} else if (opcode == BPF_MOV) {
>  
>  		if (BPF_SRC(insn->code) == BPF_X) {
> -			if (insn->imm != 0) {
> -				verbose(env, "BPF_MOV uses reserved fields\n");
> -				return -EINVAL;
> -			}
> -
>  			if (BPF_CLASS(insn->code) == BPF_ALU) {
> -				if (insn->off != 0 && insn->off != 8 && insn->off != 16) {
> +				if ((insn->off != 0 && insn->off != 8 && insn->off != 16) ||
> +				    insn->imm) {
>  					verbose(env, "BPF_MOV uses reserved fields\n");
>  					return -EINVAL;
>  				}
> +			} else if (insn->off == BPF_ARENA_CAST_KERN || insn->off == BPF_ARENA_CAST_USER) {
> +				if (!insn->imm) {
> +					verbose(env, "cast_kern/user insn must have non zero imm32\n");
> +					return -EINVAL;
> +				}
>  			} else {
>  				if (insn->off != 0 && insn->off != 8 && insn->off != 16 &&
>  				    insn->off != 32) {

I think it is now necessary to check insn->imm here;
as is, it allows an ALU64 move with non-zero imm.

> @@ -13993,7 +14016,12 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
>  			struct bpf_reg_state *dst_reg = regs + insn->dst_reg;
>  
>  			if (BPF_CLASS(insn->code) == BPF_ALU64) {
> -				if (insn->off == 0) {
> +				if (insn->imm) {
> +					/* off == BPF_ARENA_CAST_KERN || off == BPF_ARENA_CAST_USER */
> +					mark_reg_unknown(env, regs, insn->dst_reg);
> +					if (insn->off == BPF_ARENA_CAST_KERN)
> +						dst_reg->type = PTR_TO_ARENA;

This effectively allows casting anything to PTR_TO_ARENA.
Do we want to check that src_reg somehow originates from arena?
Might be tricky; perhaps a new type modifier bit or something like that.

> +				} else if (insn->off == 0) {
>  					/* case: R1 = R2
>  					 * copy register state to dest reg
>  					 */
> @@ -14059,6 +14087,9 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
>  						dst_reg->subreg_def = env->insn_idx + 1;
>  						coerce_subreg_to_size_sx(dst_reg, insn->off >> 3);
>  					}
> +				} else if (src_reg->type == PTR_TO_ARENA) {
> +					mark_reg_unknown(env, regs, insn->dst_reg);
> +					dst_reg->type = PTR_TO_ARENA;

This describes the case wX = wY, where rY is PTR_TO_ARENA;
should rX be marked as SCALAR instead of PTR_TO_ARENA?

[...]

> @@ -18235,6 +18272,31 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
>  				fdput(f);
>  				return -EBUSY;
>  			}
> +			if (map->map_type == BPF_MAP_TYPE_ARENA) {
> +				if (env->prog->aux->arena) {

Does this have to be (env->prog->aux->arena && env->prog->aux->arena != map) ?

> +					verbose(env, "Only one arena per program\n");
> +					fdput(f);
> +					return -EBUSY;
> +				}

[...]

> @@ -18799,6 +18861,18 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  			   insn->code == (BPF_ST | BPF_MEM | BPF_W) ||
>  			   insn->code == (BPF_ST | BPF_MEM | BPF_DW)) {
>  			type = BPF_WRITE;
> +		} else if (insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) && insn->imm) {
> +			if (insn->off == BPF_ARENA_CAST_KERN ||
> +			    (((struct bpf_map *)env->prog->aux->arena)->map_flags & BPF_F_NO_USER_CONV)) {
> +				/* convert to 32-bit mov that clears upper 32-bit */
> +				insn->code = BPF_ALU | BPF_MOV | BPF_X;
> +				/* clear off, so it's a normal 'wX = wY' from JIT pov */
> +				insn->off = 0;
> +			} /* else insn->off == BPF_ARENA_CAST_USER should be handled by JIT */
> +			continue;
> +		} else if (env->insn_aux_data[i + delta].needs_zext) {
> +			/* Convert BPF_CLASS(insn->code) == BPF_ALU64 to 32-bit ALU */
> +			insn->code = BPF_ALU | BPF_OP(insn->code) | BPF_SRC(insn->code);

Tbh, I think this should be done in do_misc_fixups();
mixing it with context handling in convert_ctx_accesses()
seems a bit confusing.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
@ 2024-02-10  1:15   ` Eduard Zingerman
  2024-02-10  8:40   ` Kumar Kartikeya Dwivedi
  1 sibling, 0 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-10  1:15 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:05 -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> LLVM generates bpf_cast_kern and bpf_cast_user instructions while translating
> pointers with __attribute__((address_space(1))).
> 
> rX = cast_kern(rY) is processed by the verifier and converted to
> normal 32-bit move: wX = wY
> 
> bpf_cast_user has to be converted by JIT.
> 
> rX = cast_user(rY) is
> 
> aux_reg = upper_32_bits of arena->user_vm_start
> aux_reg <<= 32
> wX = wY // clear upper 32 bits of dst register
> if (wX) // if not zero add upper bits of user_vm_start
>   wX |= aux_reg
> 
> JIT can do it more efficiently:
> 
> mov dst_reg32, src_reg32  // 32-bit move
> shl dst_reg, 32
> or dst_reg, user_vm_start
> rol dst_reg, 32
> xor r11, r11
> test dst_reg32, dst_reg32 // check if lower 32-bit are zero
> cmove r11, dst_reg	  // if so, set dst_reg to zero
> 			  // Intel swapped src/dst register encoding in CMOVcc
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Checked generated x86 code for all reg combinations, works as expected.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  2024-02-09 23:14   ` David Vernet
@ 2024-02-10  4:35     ` Alexei Starovoitov
  2024-02-10  7:03       ` Kumar Kartikeya Dwivedi
  2024-02-12 16:48       ` David Vernet
  0 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-10  4:35 UTC (permalink / raw)
  To: David Vernet
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 3:14 PM David Vernet <void@manifault.com> wrote:
>
> > +
> > +#ifndef arena_container_of
>
> Why is this ifndef required if we have a pragma once above?

Just a habit to check for a macro before defining it.

> Obviously it's way better for us to actually have arenas in the interim
> so this is fine for now, but UAF bugs could potentially be pretty
> painful until we get proper exception unwinding support.

Detection that arena access faulted doesn't have to come after
exception unwinding. Exceptions vs cancellable progs are also different.
A record of the line in bpf prog that caused the first fault is probably
good enough for prog debugging.

> Otherwise, in terms of usability, this looks really good. The only thing
> to bear in mind is that I don't think we can fully get away from kptrs
> that will have some duplicated logic compared to what we can enable in
> an arena. For example, we will have to retain at least some of the
> struct cpumask * kptrs for e.g. copying a struct task_struct's struct
> cpumask *cpus_ptr field.

I think that's a bit orthogonal.
task->cpus_ptr is a trusted_ptr_to_btf_id access that can be mixed
within a program with arena access.

> For now, we could iterate over the cpumask and manually set the bits, so
> maybe even just supporting bpf_cpumask_test_cpu() would be enough
> (though donig a bitmap_copy() would be better of course)? This is
> probably fine for most use cases as we'd likely only be doing struct
> cpumask * -> arena copies on slowpaths. But is there any kind of more
> generalized integration we want to have between arenas and kptrs?  Not
> sure, can't think of any off the top of my head.

Hopefully we'll be able to invent a way to store kptr-s inside the arena,
but from a cpumask perspective bpf_cpumask_test_cpu() can be made
polymorphic to work with arena ptrs and kptrs.
Same with bpf_cpumask_and(). Mixed arguments can be allowed.
Args can be either kptr or ptr_to_arena.

I still believe that we can deprecate 'struct bpf_cpumask'.
The cpumask_t will stay, of course, but we won't need to
bpf_obj_new(bpf_cpumask) and carefully track refcnt.
The arena can do the same much faster.

>
> > +             return 7;
> > +     page3 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> > +     if (!page3)
> > +             return 8;
> > +     *page3 = 3;
> > +     if (page2 != page3)
> > +             return 9;
> > +     if (*page1 != 1)
> > +             return 10;
>
> Should we also test doing a store after an arena has been freed?

You mean the whole bpf arena map was freed?
I don't see how the verifier would allow that.
If you meant a few pages were freed from the arena then such a test is
already in the patches.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-09 20:36   ` David Vernet
@ 2024-02-10  4:38     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-10  4:38 UTC (permalink / raw)
  To: David Vernet
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 12:36 PM David Vernet <void@manifault.com> wrote:
> > +     if (attr->map_extra & ~PAGE_MASK)
> > +             /* If non-zero the map_extra is an expected user VMA start address */
> > +             return ERR_PTR(-EINVAL);
>
> So I haven't done a thorough review of this patch, beyond trying to
> understand the semantics of bpf arenas. On that note, could you please
> document the semantics of map_extra with arena maps where map_extra is
> defined in [0]?
>
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/uapi/linux/bpf.h#n1439

Good point. Done.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
  2024-02-09 17:20   ` Eduard Zingerman
@ 2024-02-10  6:48   ` Kumar Kartikeya Dwivedi
  2024-02-13 22:00     ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  6:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
> They are similar to PROBE_MEM instructions with the following differences:
> - PROBE_MEM has to check that the address is in the kernel range with
>   src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
> - PROBE_MEM doesn't support store
> - PROBE_MEM32 relies on the verifier to clear upper 32-bit in the register
> - PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in %r12 in the prologue)
>   Due to bpf_arena constructions such %r12 + %reg + off16 access is guaranteed
>   to be within arena virtual range, so no address check at run-time.
> - PROBE_MEM32 allows STX and ST. If they fault the store is a nop.
>   When LDX faults the destination register is zeroed.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Just a potential issue with tail calls, but otherwise lgtm so:
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

>  arch/x86/net/bpf_jit_comp.c | 183 +++++++++++++++++++++++++++++++++++-
>  include/linux/bpf.h         |   1 +
>  include/linux/filter.h      |   3 +
>  3 files changed, 186 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index e1390d1e331b..883b7f604b9a 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -113,6 +113,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>  /* Pick a register outside of BPF range for JIT internal work */
>  #define AUX_REG (MAX_BPF_JIT_REG + 1)
>  #define X86_REG_R9 (MAX_BPF_JIT_REG + 2)
> +#define X86_REG_R12 (MAX_BPF_JIT_REG + 3)
>
> [...]
> +       arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
> +
>         detect_reg_usage(insn, insn_cnt, callee_regs_used,
>                          &tail_call_seen);
>
> @@ -1172,8 +1300,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>                 push_r12(&prog);
>                 push_callee_regs(&prog, all_callee_regs_used);
>         } else {
> +               if (arena_vm_start)
> +                       push_r12(&prog);

I believe since this is done on entry for arena_vm_start, we need to
do a matching pop_r12 in
emit_bpf_tail_call_indirect and emit_bpf_tail_call_direct before the
tail call, unless I'm missing something.
Otherwise r12 may be bad after prog (push + set to arena_vm_start) ->
tail call -> exit (no pop of r12 back from stack).

>                 push_callee_regs(&prog, callee_regs_used);
>         }
> +       if (arena_vm_start)
> +               emit_mov_imm64(&prog, X86_REG_R12,
> +                              arena_vm_start >> 32, (u32) arena_vm_start);
>
>         ilen = prog - temp;
>         if (rw_image)
> @@ -1564,6 +1697,52 @@ st:                      if (is_imm8(insn->off))
>                         emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>                         break;
>
> +               case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
> +               case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
> +               case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
> +               case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
> +                       start_of_ldx = prog;
> +                       emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
> +                       goto populate_extable;
> +
> +                       /* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
> +               case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
> +               case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
> +               case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
> +               case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
> +               case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
> +               case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
> +               case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
> +               case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
> +                       start_of_ldx = prog;
> +                       if (BPF_CLASS(insn->code) == BPF_LDX)
> +                               emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> +                       else
> +                               emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> +populate_extable:
> +                       {
> +                               struct exception_table_entry *ex;
> +                               u8 *_insn = image + proglen + (start_of_ldx - temp);
> +                               s64 delta;
> +
> +                               if (!bpf_prog->aux->extable)
> +                                       break;
> +
> +                               ex = &bpf_prog->aux->extable[excnt++];
> +
> +                               delta = _insn - (u8 *)&ex->insn;
> +                               /* switch ex to rw buffer for writes */
> +                               ex = (void *)rw_image + ((void *)ex - (void *)image);
> +
> +                               ex->insn = delta;
> +
> +                               ex->data = EX_TYPE_BPF;
> +
> +                               ex->fixup = (prog - start_of_ldx) |
> +                                       ((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 8);
> +                       }
> +                       break;
> +
>                         /* LDX: dst_reg = *(u8*)(src_reg + off) */
>                 case BPF_LDX | BPF_MEM | BPF_B:
>                 case BPF_LDX | BPF_PROBE_MEM | BPF_B:
> @@ -2036,6 +2215,8 @@ st:                       if (is_imm8(insn->off))
>                                 pop_r12(&prog);
>                         } else {
>                                 pop_callee_regs(&prog, callee_regs_used);
> +                               if (arena_vm_start)
> +                                       pop_r12(&prog);
>                         }

... Basically this if condition needs to be copied to those two other places.

> [...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *'
  2024-02-09  4:05 ` [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
@ 2024-02-10  6:49   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  6:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Recognize return of 'void *' from kfunc as returning unknown scalar.
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

>  kernel/bpf/verifier.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ddaf09db1175..d9c2dbb3939f 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12353,6 +12353,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>                                         meta.func_name);
>                                 return -EFAULT;
>                         }
> +               } else if (btf_type_is_void(ptr_type)) {
> +                       /* kfunc returning 'void *' is equivalent to returning scalar */
> +                       mark_reg_unknown(env, regs, BPF_REG_0);
>                 } else if (!__btf_type_is_struct(ptr_type)) {
>                         if (!meta.r0_size) {
>                                 __u32 sz;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 02/20] bpf: Recognize '__map' suffix in kfunc arguments
  2024-02-09  4:05 ` [PATCH v2 bpf-next 02/20] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
@ 2024-02-10  6:52   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  6:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Recognize 'void *p__map' kfunc argument as 'struct bpf_map *p__map'.
> It allows kfunc to have 'void *' argument for maps, since bpf progs
> will call them as:
> struct {
>         __uint(type, BPF_MAP_TYPE_ARENA);
>         ...
> } arena SEC(".maps");
>
> bpf_kfunc_with_map(... &arena ...);
>
> Underneath libbpf will load CONST_PTR_TO_MAP into the register via ld_imm64 insn.
> If kfunc was defined with 'struct bpf_map *' it would pass
> the verifier, but bpf prog would need to use '(void *)&arena'.
> Which is not clean.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

> [...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  2024-02-10  4:35     ` Alexei Starovoitov
@ 2024-02-10  7:03       ` Kumar Kartikeya Dwivedi
  2024-02-13 23:19         ` Alexei Starovoitov
  2024-02-12 16:48       ` David Vernet
  1 sibling, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Vernet, bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Sat, 10 Feb 2024 at 05:35, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 9, 2024 at 3:14 PM David Vernet <void@manifault.com> wrote:
> >
> > > +
> > > +#ifndef arena_container_of
> >
> > Why is this ifndef required if we have a pragma once above?
>
> Just a habit to check for a macro before defining it.
>
> > Obviously it's way better for us to actually have arenas in the interim
> > so this is fine for now, but UAF bugs could potentially be pretty
> > painful until we get proper exception unwinding support.
>
> Detection that arena access faulted doesn't have to come after
> exception unwinding. Exceptions vs cancellable progs are also different.

What do you mean exactly by 'cancellable progs'? That they can be
interrupted at any (or well-known) points and stopped? I believe
whatever plumbing was done to enable exceptions will be useful there
as well. The verifier would just need to know e.g. that a load into
PTR_TO_ARENA may fault, and thus generate descriptors for all frames
for that pc. Then, at runtime, you could technically release all
resources by looking up the frame descriptor, unwinding the stack, and
returning to the caller of the prog.

> A record of the line in bpf prog that caused the first fault is probably
> good enough for prog debugging.
>

I think it would make more sense to abort the program by default,
because use-after-free in the arena most certainly means a bug in the
program.
There is no speedup from zeroing faults; it only papers over
potential problems in the program.
Something is being accessed in a page that has since been freed,
or the pointer is bad/the access is out-of-bounds.
If aborting isn't warranted for all UAFs, it certainly is for guard
pages: in that case it is 100% a problem in the program.
Unlike PROBE_MEM where we cannot reason about what kernel memory
tracing programs may read from, there is no need for a best-effort
continuation here.

Now that the verifier will stop reasoning precisely about object
lifetimes (unlike with bpf_obj_new objects), all bugs that happen in
normal C can surface in a BPF program using arenas as heaps, so it is
more likely that these cases are hit.

> [...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
  2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
  2024-02-10  0:06   ` kernel test robot
  2024-02-10  0:17   ` kernel test robot
@ 2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
  2024-02-10  9:06   ` kernel test robot
  3 siblings, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Subsequent patches introduce bpf_arena that imposes special alignment
> requirements on address selection.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
@ 2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
  2024-02-14  8:36   ` Christoph Hellwig
  1 sibling, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
>
> The vmap API is generally public and has functions to request and
> release areas of kernel address space, as well as functions to map
> various types of backing memory into that space.
>
> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.
>
> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.
>
> Make it public again for the new external user.
>
> The next commits will introduce bpf_arena which is a sparsely populated shared
> memory region between bpf program and user space process. It will map
> privately-managed pages into an existing vm area. It's the same pattern and
> layer of abstraction as ioremap_pages_range().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu Alexei Starovoitov
@ 2024-02-10  7:05   ` Kumar Kartikeya Dwivedi
  2024-02-14  1:37     ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Convert simple page_frag allocator to per-cpu page_frag to further stress test
> a combination of __arena global and static variables and alloc/free from arena.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

I know this organically grew from a toy implementation, but since
people will most likely be looking at selftests as usage examples, it
might be better to expose bpf_preempt_disable/enable and use them for
the per-CPU page_frag allocator? No need to block on this, it can be
added on top later.

The kfunc is useful on its own for writing safe per-CPU data
structures or other memory allocators like bpf_ma on top of arenas.
It is also necessary as a building block for writing spin locks
natively in BPF on top of the arena map, which we may add later.
I have a patch lying around for this; the verifier plumbing is mostly
the same as for rcu_read_lock.
I can send it out with tests, or if you want to add it to this series,
go ahead.
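
To make the suggestion concrete, a rough sketch of what the per-CPU
page_frag path could look like with such kfuncs (both declarations and
the bpf_alloc() name are assumptions here, nothing below is in the tree
yet):

void bpf_preempt_disable(void) __ksym __weak; /* proposed kfunc */
void bpf_preempt_enable(void) __ksym __weak;  /* proposed kfunc */

/* stand-in for the per-CPU page_frag allocator from the last patches */
void __arena *bpf_alloc(unsigned int size);

static void __arena *alloc_nopreempt(unsigned int size)
{
	void __arena *obj;

	bpf_preempt_disable();	/* stay on this CPU's page_frag state */
	obj = bpf_alloc(size);
	bpf_preempt_enable();
	return obj;
}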

>  [...]
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena Alexei Starovoitov
@ 2024-02-10  7:16   ` Kumar Kartikeya Dwivedi
  2024-02-12 19:11     ` Andrii Nakryiko
  2024-02-12 18:12   ` Eduard Zingerman
  2024-02-13 23:15   ` Andrii Nakryiko
  2 siblings, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> mmap() bpf_arena right after creation, since the kernel needs to
> remember the address returned from mmap. This is user_vm_start.
> LLVM will generate bpf_arena_cast_user() instructions where
> necessary and JIT will add upper 32-bit of user_vm_start
> to such pointers.
>
> Fix up bpf_map_mmap_sz() to compute mmap size as
> map->value_size * map->max_entries for arrays and
> PAGE_SIZE * map->max_entries for arena.
>
> Don't set BTF at arena creation time, since it doesn't support it.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/lib/bpf/libbpf.c        | 43 ++++++++++++++++++++++++++++++-----
>  tools/lib/bpf/libbpf_probes.c |  7 ++++++
>  2 files changed, 44 insertions(+), 6 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 01f407591a92..4880d623098d 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
>         [BPF_MAP_TYPE_BLOOM_FILTER]             = "bloom_filter",
>         [BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
>         [BPF_MAP_TYPE_CGRP_STORAGE]             = "cgrp_storage",
> +       [BPF_MAP_TYPE_ARENA]                    = "arena",
>  };
>
>  static const char * const prog_type_name[] = {
> @@ -1577,7 +1578,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
>         return map;
>  }
>
> -static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> +static size_t __bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
>  {
>         const long page_sz = sysconf(_SC_PAGE_SIZE);
>         size_t map_sz;
> @@ -1587,6 +1588,20 @@ static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
>         return map_sz;
>  }
>
> +static size_t bpf_map_mmap_sz(const struct bpf_map *map)
> +{
> +       const long page_sz = sysconf(_SC_PAGE_SIZE);
> +
> +       switch (map->def.type) {
> +       case BPF_MAP_TYPE_ARRAY:
> +               return __bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> +       case BPF_MAP_TYPE_ARENA:
> +               return page_sz * map->def.max_entries;
> +       default:
> +               return 0; /* not supported */
> +       }
> +}
> +
>  static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
>  {
>         void *mmaped;
> @@ -1740,7 +1755,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>         pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
>                  map->name, map->sec_idx, map->sec_offset, def->map_flags);
>
> -       mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> +       mmap_sz = bpf_map_mmap_sz(map);
>         map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
>                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>         if (map->mmaped == MAP_FAILED) {
> @@ -4852,6 +4867,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>         case BPF_MAP_TYPE_SOCKHASH:
>         case BPF_MAP_TYPE_QUEUE:
>         case BPF_MAP_TYPE_STACK:
> +       case BPF_MAP_TYPE_ARENA:
>                 create_attr.btf_fd = 0;
>                 create_attr.btf_key_type_id = 0;
>                 create_attr.btf_value_type_id = 0;
> @@ -4908,6 +4924,21 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>         if (map->fd == map_fd)
>                 return 0;
>
> +       if (def->type == BPF_MAP_TYPE_ARENA) {
> +               map->mmaped = mmap((void *)map->map_extra, bpf_map_mmap_sz(map),
> +                                  PROT_READ | PROT_WRITE,
> +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> +                                  map_fd, 0);
> +               if (map->mmaped == MAP_FAILED) {
> +                       err = -errno;
> +                       map->mmaped = NULL;
> +                       close(map_fd);
> +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> +                               bpf_map__name(map), err);
> +                       return err;
> +               }
> +       }
> +

Would it be possible to introduce a public API accessor for getting
the value of map->mmaped?
Otherwise one would have to parse through /proc/self/maps in case
map_extra is 0.

The use case is to be able to use the arena as a backing store for
userspace malloc arenas, so that we can pass through malloc/mallocx
calls (or class-specific operator new) directly to a malloc arena
backed by the BPF arena.
In such a case a lot of the burden of converting existing data
structures or code can be avoided by making much of the process
transparent.
Userspace malloc'ed objects can also be easily shared with BPF progs as
a pool through a bpf_ma-style per-CPU allocator.
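
Something like the below is the shape I have in mind (the name and
exact semantics are made up, just to illustrate the request; it would
sit next to the other bpf_map__* accessors in libbpf):

/* hypothetical accessor, not in libbpf today */
void *bpf_map__arena_base(const struct bpf_map *map)
{
	if (map->def.type != BPF_MAP_TYPE_ARENA)
		return NULL;
	return map->mmaped; /* address mmap()-ed at map creation time */
}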

> [...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
  2024-02-09 20:36   ` David Vernet
@ 2024-02-10  7:40   ` Kumar Kartikeya Dwivedi
  2024-02-12 18:21     ` Alexei Starovoitov
  2024-02-12 15:56   ` Barret Rhoden
  2024-02-13 23:14   ` Andrii Nakryiko
  3 siblings, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage. The bpf program implements an
>    in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>    value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>    The user space may mmap it, but bpf program will not convert pointers
>    to user base at run-time to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of bpf
> program is more important than ease of sharing with user space. This is use
> case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
> tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
> 32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
> bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
> of user vm_start (if the pointer is not NULL) to arena pointers before they are
> stored into memory. This way, user space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
> rX = (u32)rY;
> if (rY)
>   rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

A few questions on the patch.

1. Is the expectation that user space would use syscall progs to
manipulate mappings in the arena?

2. I may have missed it, but which memcg are the allocations being
accounted against? Will it be the process that created the map?
When trying to explore bpf_map_alloc_pages, I could not figure out if
the obj_cgroup was being looked up anywhere.
I think it would be nice if it were accounted against the caller
of bpf_map_alloc_pages, since potentially the arena map can be shared
across multiple processes.
Tying it to bpf_map on bpf_map_alloc may be too coarse for arena maps.

3. A bit tangential, but what would the path to having huge page
mappings look like (mostly from an interface standpoint)? I gather we
could use the flags argument on the kernel side, and if 1 is true
above, it would mean userspace would do it from inside a syscall
program and then trigger a page fault? Because experience with use
case 1 in the commit log suggests it is desirable to have such memory
be backed by huge pages to reduce TLB misses.
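
Re question 1, the user space side I'm picturing would be roughly the
following (the prog/function names are illustrative;
bpf_prog_test_run_opts() is the existing libbpf API for running
SEC("syscall") progs):

#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* run a SEC("syscall") prog that calls bpf_arena_alloc_pages() inside */
static int run_arena_setup(struct bpf_program *prog)
{
	LIBBPF_OPTS(bpf_test_run_opts, opts);
	int err;

	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
	if (err)
		return err;
	return opts.retval; /* 0 means all allocations in the prog succeeded */
}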

> [...]
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 06/20] bpf: Disasm support for cast_kern/user instructions.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 06/20] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
@ 2024-02-10  7:41   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  7:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> LLVM generates rX = bpf_cast_kern/_user(rY, address_space) instructions
> when pointers in non-zero address space are used by the bpf program.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
  2024-02-10  1:15   ` Eduard Zingerman
@ 2024-02-10  8:40   ` Kumar Kartikeya Dwivedi
  2024-02-13 22:28     ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  8:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> LLVM generates bpf_cast_kern and bpf_cast_user instructions while translating
> pointers with __attribute__((address_space(1))).
>
> rX = cast_kern(rY) is processed by the verifier and converted to
> normal 32-bit move: wX = wY
>
> bpf_cast_user has to be converted by JIT.
>
> rX = cast_user(rY) is
>
> aux_reg = upper_32_bits of arena->user_vm_start
> aux_reg <<= 32
> wX = wY // clear upper 32 bits of dst register
> if (wX) // if not zero add upper bits of user_vm_start
>   wX |= aux_reg
>

Would this be ok if rY is somehow aligned at a 4GB boundary and
the lower 32 bits end up being zero?
Then this transformation would confuse it with the NULL case, right?
Or am I missing something?

> JIT can do it more efficiently:
>
> mov dst_reg32, src_reg32  // 32-bit move
> shl dst_reg, 32
> or dst_reg, user_vm_start
> rol dst_reg, 32
> xor r11, r11
> test dst_reg32, dst_reg32 // check if lower 32-bit are zero
> cmove r11, dst_reg        // if so, set dst_reg to zero
>                           // Intel swapped src/dst register encoding in CMOVcc
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  arch/x86/net/bpf_jit_comp.c | 41 ++++++++++++++++++++++++++++++++++++-
>  include/linux/filter.h      |  1 +
>  kernel/bpf/core.c           |  5 +++++
>  3 files changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 883b7f604b9a..a042ed57af7b 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1272,13 +1272,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>         bool tail_call_seen = false;
>         bool seen_exit = false;
>         u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
> -       u64 arena_vm_start;
> +       u64 arena_vm_start, user_vm_start;
>         int i, excnt = 0;
>         int ilen, proglen = 0;
>         u8 *prog = temp;
>         int err;
>
>         arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
> +       user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
>
>         detect_reg_usage(insn, insn_cnt, callee_regs_used,
>                          &tail_call_seen);
> @@ -1346,6 +1347,39 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>                         break;
>
>                 case BPF_ALU64 | BPF_MOV | BPF_X:
> +                       if (insn->off == BPF_ARENA_CAST_USER) {
> +                               if (dst_reg != src_reg)
> +                                       /* 32-bit mov */
> +                                       emit_mov_reg(&prog, false, dst_reg, src_reg);
> +                               /* shl dst_reg, 32 */
> +                               maybe_emit_1mod(&prog, dst_reg, true);
> +                               EMIT3(0xC1, add_1reg(0xE0, dst_reg), 32);
> +
> +                               /* or dst_reg, user_vm_start */
> +                               maybe_emit_1mod(&prog, dst_reg, true);
> +                               if (is_axreg(dst_reg))
> +                                       EMIT1_off32(0x0D,  user_vm_start >> 32);
> +                               else
> +                                       EMIT2_off32(0x81, add_1reg(0xC8, dst_reg),  user_vm_start >> 32);
> +
> +                               /* rol dst_reg, 32 */
> +                               maybe_emit_1mod(&prog, dst_reg, true);
> +                               EMIT3(0xC1, add_1reg(0xC0, dst_reg), 32);
> +
> +                               /* xor r11, r11 */
> +                               EMIT3(0x4D, 0x31, 0xDB);
> +
> +                               /* test dst_reg32, dst_reg32; check if lower 32-bit are zero */
> +                               maybe_emit_mod(&prog, dst_reg, dst_reg, false);
> +                               EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
> +
> +                               /* cmove r11, dst_reg; if so, set dst_reg to zero */
> +                               /* WARNING: Intel swapped src/dst register encoding in CMOVcc !!! */
> +                               maybe_emit_mod(&prog, AUX_REG, dst_reg, true);
> +                               EMIT3(0x0F, 0x44, add_2reg(0xC0, AUX_REG, dst_reg));
> +                               break;
> +                       }
> +                       fallthrough;
>                 case BPF_ALU | BPF_MOV | BPF_X:
>                         if (insn->off == 0)
>                                 emit_mov_reg(&prog,
> @@ -3424,6 +3458,11 @@ void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
>         }
>  }
>
> +bool bpf_jit_supports_arena(void)
> +{
> +       return true;
> +}
> +
>  bool bpf_jit_supports_ptr_xchg(void)
>  {
>         return true;
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index cd76d43412d0..78ea63002531 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -959,6 +959,7 @@ bool bpf_jit_supports_kfunc_call(void);
>  bool bpf_jit_supports_far_kfunc_call(void);
>  bool bpf_jit_supports_exceptions(void);
>  bool bpf_jit_supports_ptr_xchg(void);
> +bool bpf_jit_supports_arena(void);
>  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
>  bool bpf_helper_changes_pkt_data(void *func);
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 2539d9bfe369..2829077f0461 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2926,6 +2926,11 @@ bool __weak bpf_jit_supports_far_kfunc_call(void)
>         return false;
>  }
>
> +bool __weak bpf_jit_supports_arena(void)
> +{
> +       return false;
> +}
> +
>  /* Return TRUE if the JIT backend satisfies the following two conditions:
>   * 1) JIT backend supports atomic_xchg() on pointer-sized words.
>   * 2) Under the specific arch, the implementation of xchg() is the same
> --
> 2.34.1
>
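For reference, the BPF_ARENA_CAST_USER sequence quoted above computes roughly
the following (a C sketch; the function name is illustrative and user_vm_start
is the value obtained via bpf_arena_get_user_vm_start() above):

	/* Keep the lower 32 bits of the arena pointer and splice in the
	 * upper 32 bits of user_vm_start, unless the 32-bit offset is zero.
	 */
	static inline u64 arena_cast_user(u64 src, u64 user_vm_start)
	{
		u32 lo = (u32)src;		/* 32-bit mov */

		if (!lo)			/* test + cmove */
			return 0;
		/* shl 32; or user_vm_start >> 32; rol 32 */
		return (user_vm_start & ~0xffffffffULL) | lo;
	}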

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
@ 2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
  2024-02-13 23:14   ` Andrii Nakryiko
  1 sibling, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  8:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
>
> Note, when the verifier sees:
>
> __weak void foo(struct bar *p)
>
> it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars.
> Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h
  2024-02-09  4:05 ` [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
@ 2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
  2024-02-13 23:14   ` Andrii Nakryiko
  1 sibling, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  8:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Add __arg_arena to bpf_helpers.h
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 15/20] bpf: Tell bpf programs kernel's PAGE_SIZE
  2024-02-09  4:06 ` [PATCH v2 bpf-next 15/20] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
@ 2024-02-10  8:52   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> vmlinux BTF includes all kernel enums.
> Add __PAGE_SIZE = PAGE_SIZE enum, so that bpf programs
> that include vmlinux.h can easily access it.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast()
  2024-02-09  4:06 ` [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
@ 2024-02-10  8:54   ` Kumar Kartikeya Dwivedi
  2024-02-13 22:35     ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-10  8:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, eddyz87, tj, brho, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce helper macro bpf_arena_cast() that emits:
> rX = rX
> instruction with off = BPF_ARENA_CAST_KERN or off = BPF_ARENA_CAST_USER
> and encodes address_space into imm32.
>
> It's useful with older LLVM that doesn't emit this insn automatically.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

But could this simply be added to libbpf along with bpf_cast_user and
bpf_cast_kern? Since LLVM and the verifier support the new cast
instructions, I believe they are unlikely to disappear any time soon. It
would probably also make it easier to use them elsewhere (e.g. sched-ext)
without having to copy them.

I plan on doing the same eventually with assert macros too.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
  2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
                     ` (2 preceding siblings ...)
  2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
@ 2024-02-10  9:06   ` kernel test robot
  3 siblings, 0 replies; 112+ messages in thread
From: kernel test robot @ 2024-02-10  9:06 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: oe-kbuild-all

Hi Alexei,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexei-Starovoitov/bpf-Allow-kfuncs-return-void/20240209-120941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20240209040608.98927-4-alexei.starovoitov%40gmail.com
patch subject: [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20240210/202402101653.QJBLJLnM-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240210/202402101653.QJBLJLnM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202402101653.QJBLJLnM-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/syscall.c: In function 'bpf_get_unmapped_area':
>> kernel/bpf/syscall.c:948:27: error: 'struct mm_struct' has no member named 'get_unmapped_area'
     948 |         return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
         |                           ^~
   kernel/bpf/syscall.c:949:1: warning: control reaches end of non-void function [-Wreturn-type]
     949 | }
         | ^


vim +948 kernel/bpf/syscall.c

   939	
   940	static unsigned long bpf_get_unmapped_area(struct file *filp, unsigned long addr,
   941						   unsigned long len, unsigned long pgoff,
   942						   unsigned long flags)
   943	{
   944		struct bpf_map *map = filp->private_data;
   945	
   946		if (map->ops->map_get_unmapped_area)
   947			return map->ops->map_get_unmapped_area(filp, addr, len, pgoff, flags);
 > 948		return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
   949	}
   950	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
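One way to avoid this build error on !CONFIG_MMU configurations is to compile
the mm fallback out entirely (a sketch only, not necessarily the fix that ends
up being applied; the .get_unmapped_area assignment in bpf_map_fops would need
the same guard):

	#ifdef CONFIG_MMU
	static unsigned long bpf_get_unmapped_area(struct file *filp, unsigned long addr,
						   unsigned long len, unsigned long pgoff,
						   unsigned long flags)
	{
		struct bpf_map *map = filp->private_data;

		if (map->ops->map_get_unmapped_area)
			return map->ops->map_get_unmapped_area(filp, addr, len, pgoff, flags);
		/* mm->get_unmapped_area exists only when CONFIG_MMU is set */
		return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
	}
	#endif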

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (19 preceding siblings ...)
  2024-02-09  4:06 ` [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu Alexei Starovoitov
@ 2024-02-12 14:14 ` David Hildenbrand
  2024-02-12 18:14   ` Alexei Starovoitov
  2024-02-12 17:36 ` Barret Rhoden
  21 siblings, 1 reply; 112+ messages in thread
From: David Hildenbrand @ 2024-02-12 14:14 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On 09.02.24 05:05, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> v1->v2:
> - Improved commit log with reasons for using vmap_pages_range() in bpf_arena.
>    Thanks to Johannes
> - Added support for __arena global variables in bpf programs
> - Fixed race conditions spotted by Barret
> - Fixed wrap32 issue spotted by Barret
> - Fixed bpf_map_mmap_sz() the way Andrii suggested
> 
> The work on bpf_arena was inspired by Barret's work:
> https://github.com/google/ghost-userspace/blob/main/lib/queue.bpf.h
> that implements queues, lists and AVL trees completely as bpf programs
> using giant bpf array map and integer indices instead of pointers.
> bpf_arena is a sparse array that allows to use normal C pointers to
> build such data structures. Last few patches implement page_frag
> allocator, link list and hash table as bpf programs.
> 
> v1:
> bpf programs have multiple options to communicate with user space:
> - Various ring buffers (perf, ftrace, bpf): The data is streamed
>    unidirectionally from bpf to user space.
> - Hash map: The bpf program populates elements, and user space consumes them
>    via bpf syscall.
> - mmap()-ed array map: Libbpf creates an array map that is directly accessed by
>    the bpf program and mmap-ed to user space. It's the fastest way. Its
>    disadvantage is that memory for the whole array is reserved at the start.
> 
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
> 
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>     region, like memcached or any key/value storage. The bpf program implements an
>     in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>     value without going to user space.

Just so I understand it correctly: this is all backed by unmovable and 
unswappable memory.

Is there any (existing?) way to restrict/cap the memory consumption via 
this interface? How easy is this to access+use by unprivileged userspace?

arena_vm_fault() seems to allocate new pages simply via 
alloc_page(GFP_KERNEL | __GFP_ZERO); No memory accounting, mlock limit 
checks etc.

We certainly don't want each and every application to be able to break
page compaction, swapping, etc.; that's why I am asking.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
  2024-02-09 20:36   ` David Vernet
  2024-02-10  7:40   ` Kumar Kartikeya Dwivedi
@ 2024-02-12 15:56   ` Barret Rhoden
  2024-02-12 18:23     ` Alexei Starovoitov
  2024-02-13 23:14   ` Andrii Nakryiko
  3 siblings, 1 reply; 112+ messages in thread
From: Barret Rhoden @ 2024-02-12 15:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On 2/8/24 23:05, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.


One last check - did you have a diff for the verifier to enforce
user_vm_{start,end} somewhere?  I didn't see it in the patchset, but it's
also highly likely that I just skimmed past it.  =)

Reviewed-by: Barret Rhoden <brho@google.com>


thanks,

barret



> 
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>     region, like memcached or any key/value storage. The bpf program implements an
>     in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>     value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>     rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>     The user space may mmap it, but bpf program will not convert pointers
>     to user base at run-time to improve bpf program speed.
> 
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
> 
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
> 
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of bpf
> program is more important than ease of sharing with user space. This is use
> case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
> tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
> 32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
> bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
> of user vm_start (if the pointer is not NULL) to arena pointers before they are
> stored into memory. This way, user space sees them as valid 64-bit pointers.
> 
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
> 
>  From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
> 
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
> 
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
> rX = (u32)rY;
> if (rY)
>    rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
> 
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   5 +-
>   include/linux/bpf_types.h      |   1 +
>   include/uapi/linux/bpf.h       |   7 +
>   kernel/bpf/Makefile            |   3 +
>   kernel/bpf/arena.c             | 557 +++++++++++++++++++++++++++++++++
>   kernel/bpf/core.c              |  11 +
>   kernel/bpf/syscall.c           |   3 +
>   kernel/bpf/verifier.c          |   1 +
>   tools/include/uapi/linux/bpf.h |   7 +
>   9 files changed, 593 insertions(+), 2 deletions(-)
>   create mode 100644 kernel/bpf/arena.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8b0dcb66eb33..de557c6c42e0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -37,6 +37,7 @@ struct perf_event;
>   struct bpf_prog;
>   struct bpf_prog_aux;
>   struct bpf_map;
> +struct bpf_arena;
>   struct sock;
>   struct seq_file;
>   struct btf;
> @@ -534,8 +535,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>   			struct bpf_spin_lock *spin_lock);
>   void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
>   		      struct bpf_spin_lock *spin_lock);
> -
> -
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
>   int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
>   
>   struct bpf_offload_dev;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 94baced5a1ad..9f2a6b83b49e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
>   
>   BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>   BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d96708380e52..f6648851eae6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -983,6 +983,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_BLOOM_FILTER,
>   	BPF_MAP_TYPE_USER_RINGBUF,
>   	BPF_MAP_TYPE_CGRP_STORAGE,
> +	BPF_MAP_TYPE_ARENA,
>   	__MAX_BPF_MAP_TYPE
>   };
>   
> @@ -1370,6 +1371,12 @@ enum {
>   
>   /* BPF token FD is passed in a corresponding command's token_fd field */
>   	BPF_F_TOKEN_FD          = (1U << 16),
> +
> +/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
> +	BPF_F_SEGV_ON_FAULT	= (1U << 17),
> +
> +/* Do not translate kernel bpf_arena pointers to user pointers */
> +	BPF_F_NO_USER_CONV	= (1U << 18),
>   };
>   
>   /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 4ce95acfcaa7..368c5d86b5b7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>   obj-$(CONFIG_BPF_JIT) += trampoline.o
>   obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> +ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
> +obj-$(CONFIG_BPF_SYSCALL) += arena.o
> +endif
>   obj-$(CONFIG_BPF_JIT) += dispatcher.o
>   ifeq ($(CONFIG_NET),y)
>   obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> new file mode 100644
> index 000000000000..5c1014471740
> --- /dev/null
> +++ b/kernel/bpf/arena.c
> @@ -0,0 +1,557 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/btf_ids.h>
> +#include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * bpf_arena is a sparsely populated shared memory region between bpf program and
> + * user space process.
> + *
> + * For example on x86-64 the values could be:
> + * user_vm_start 7f7d26200000     // picked by mmap()
> + * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
> + * For user space all pointers within the arena are normal 8-byte addresses.
> + * In this example 7f7d26200000 is the address of the first page (pgoff=0).
> + * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
> + * (u32)7f7d26200000 -> 26200000
> + * hence
> + * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
> + * kernel memory region.
> + *
> + * BPF JITs generate the following code to access arena:
> + *   mov eax, eax  // eax has lower 32-bit of user pointer
> + *   mov word ptr [rax + r12 + off], bx
> + * where r12 == kern_vm_start and off is s16.
> + * Hence allocate 4Gb + GUARD_SZ/2 on each side.
> + *
> + * Initially kernel vm_area and user vma are not populated.
> + * User space can fault-in any address which will insert the page
> + * into kernel and user vma.
> + * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
> + * which will insert it into kernel vm_area.
> + * The later fault-in from user space will populate that page into user vma.
> + */
> +
> +/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
> +#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
> +#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)
> +
> +struct bpf_arena {
> +	struct bpf_map map;
> +	u64 user_vm_start;
> +	u64 user_vm_end;
> +	struct vm_struct *kern_vm;
> +	struct maple_tree mt;
> +	struct list_head vma_list;
> +	struct mutex lock;
> +};
> +
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> +{
> +	return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
> +}
> +
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> +{
> +	return arena ? arena->user_vm_start : 0;
> +}
> +
> +static long arena_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_pop_elem(struct bpf_map *map, void *value)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_delete_elem(struct bpf_map *map, void *value)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> +{
> +	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> +}
> +
> +static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> +{
> +	struct vm_struct *kern_vm;
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	struct bpf_arena *arena;
> +	u64 vm_range;
> +	int err = -ENOMEM;
> +
> +	if (attr->key_size || attr->value_size || attr->max_entries == 0 ||
> +	    /* BPF_F_MMAPABLE must be set */
> +	    !(attr->map_flags & BPF_F_MMAPABLE) ||
> +	    /* No unsupported flags present */
> +	    (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->map_extra & ~PAGE_MASK)
> +		/* If non-zero the map_extra is an expected user VMA start address */
> +		return ERR_PTR(-EINVAL);
> +
> +	vm_range = (u64)attr->max_entries * PAGE_SIZE;
> +	if (vm_range > (1ull << 32))
> +		return ERR_PTR(-E2BIG);
> +
> +	if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
> +		/* user vma must not cross 32-bit boundary */
> +		return ERR_PTR(-ERANGE);
> +
> +	kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
> +	if (!kern_vm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	arena = bpf_map_area_alloc(sizeof(*arena), numa_node);
> +	if (!arena)
> +		goto err;
> +
> +	arena->kern_vm = kern_vm;
> +	arena->user_vm_start = attr->map_extra;
> +	if (arena->user_vm_start)
> +		arena->user_vm_end = arena->user_vm_start + vm_range;
> +
> +	INIT_LIST_HEAD(&arena->vma_list);
> +	bpf_map_init_from_attr(&arena->map, attr);
> +	mt_init_flags(&arena->mt, MT_FLAGS_ALLOC_RANGE);
> +	mutex_init(&arena->lock);
> +
> +	return &arena->map;
> +err:
> +	free_vm_area(kern_vm);
> +	return ERR_PTR(err);
> +}
> +
> +static int for_each_pte(pte_t *ptep, unsigned long addr, void *data)
> +{
> +	struct page *page;
> +	pte_t pte;
> +
> +	pte = ptep_get(ptep);
> +	if (!pte_present(pte))
> +		return 0;
> +	page = pte_page(pte);
> +	/*
> +	 * We do not update pte here:
> +	 * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
> +	 * 2. TLB flushing is batched or deferred. Even if we clear pte,
> +	 * the TLB entries can stick around and continue to permit access to
> +	 * the freed page. So it all relies on 1.
> +	 */
> +	__free_page(page);
> +	return 0;
> +}
> +
> +static void arena_map_free(struct bpf_map *map)
> +{
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	/*
> +	 * Check that user vma-s are not around when bpf map is freed.
> +	 * mmap() holds vm_file which holds bpf_map refcnt.
> +	 * munmap() must have happened on vma followed by arena_vm_close()
> +	 * which would clear arena->vma_list.
> +	 */
> +	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
> +		return;
> +
> +	/*
> +	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
> +	 * It unmaps everything from vmalloc area and clears pgtables.
> +	 * Call apply_to_existing_page_range() first to find populated ptes and
> +	 * free those pages.
> +	 */
> +	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +				     KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);
> +	free_vm_area(arena->kern_vm);
> +	mtree_destroy(&arena->mt);
> +	bpf_map_area_free(arena);
> +}
> +
> +static void *arena_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static long arena_map_update_elem(struct bpf_map *map, void *key,
> +				  void *value, u64 flags)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int arena_map_check_btf(const struct bpf_map *map, const struct btf *btf,
> +			       const struct btf_type *key_type, const struct btf_type *value_type)
> +{
> +	return 0;
> +}
> +
> +static u64 arena_map_mem_usage(const struct bpf_map *map)
> +{
> +	return 0;
> +}
> +
> +struct vma_list {
> +	struct vm_area_struct *vma;
> +	struct list_head head;
> +};
> +
> +static int remember_vma(struct bpf_arena *arena, struct vm_area_struct *vma)
> +{
> +	struct vma_list *vml;
> +
> +	vml = kmalloc(sizeof(*vml), GFP_KERNEL);
> +	if (!vml)
> +		return -ENOMEM;
> +	vma->vm_private_data = vml;
> +	vml->vma = vma;
> +	list_add(&vml->head, &arena->vma_list);
> +	return 0;
> +}
> +
> +static void arena_vm_close(struct vm_area_struct *vma)
> +{
> +	struct bpf_map *map = vma->vm_file->private_data;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +	struct vma_list *vml;
> +
> +	guard(mutex)(&arena->lock);
> +	vml = vma->vm_private_data;
> +	list_del(&vml->head);
> +	vma->vm_private_data = NULL;
> +	kfree(vml);
> +}
> +
> +#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
> +
> +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> +{
> +	struct bpf_map *map = vmf->vma->vm_file->private_data;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +	struct page *page;
> +	long kbase, kaddr;
> +	int ret;
> +
> +	kbase = bpf_arena_get_kern_vm_start(arena);
> +	kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> +
> +	guard(mutex)(&arena->lock);
> +	page = vmalloc_to_page((void *)kaddr);
> +	if (page)
> +		/* already have a page vmap-ed */
> +		goto out;
> +
> +	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> +		/* User space requested to segfault when page is not allocated by bpf prog */
> +		return VM_FAULT_SIGSEGV;
> +
> +	ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		mtree_erase(&arena->mt, vmf->pgoff);
> +		return VM_FAULT_SIGSEGV;
> +	}
> +
> +	ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);
> +	if (ret) {
> +		mtree_erase(&arena->mt, vmf->pgoff);
> +		__free_page(page);
> +		return VM_FAULT_SIGSEGV;
> +	}
> +out:
> +	page_ref_add(page, 1);
> +	vmf->page = page;
> +	return 0;
> +}
> +
> +static const struct vm_operations_struct arena_vm_ops = {
> +	.close		= arena_vm_close,
> +	.fault          = arena_vm_fault,
> +};
> +
> +static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr,
> +					     unsigned long len, unsigned long pgoff,
> +					     unsigned long flags)
> +{
> +	struct bpf_map *map = filp->private_data;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +	long ret;
> +
> +	if (pgoff)
> +		return -EINVAL;
> +	if (len > (1ull << 32))
> +		return -E2BIG;
> +
> +	/* if user_vm_start was specified at arena creation time */
> +	if (arena->user_vm_start) {
> +		if (len > arena->user_vm_end - arena->user_vm_start)
> +			return -E2BIG;
> +		if (len != arena->user_vm_end - arena->user_vm_start)
> +			return -EINVAL;
> +		if (addr != arena->user_vm_start)
> +			return -EINVAL;
> +	}
> +
> +	ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
> +	if (IS_ERR_VALUE(ret))
> +                return 0;
> +	if ((ret >> 32) == ((ret + len - 1) >> 32))
> +		return ret;
> +	if (WARN_ON_ONCE(arena->user_vm_start))
> +		/* checks at map creation time should prevent this */
> +		return -EFAULT;
> +	return round_up(ret, 1ull << 32);
> +}
> +
> +static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	guard(mutex)(&arena->lock);
> +	if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
> +		/*
> +		 * If map_extra was not specified at arena creation time then
> +		 * 1st user process can do mmap(NULL, ...) to pick user_vm_start
> +		 * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
> +		 *   or
> +		 * specify addr in map_extra and
> +		 * use the same addr later with mmap(addr, MAP_FIXED..);
> +		 */
> +		return -EBUSY;
> +
> +	if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
> +		/* all user processes must have the same size of mmap-ed region */
> +		return -EBUSY;
> +
> +	/* Earlier checks should prevent this */
> +	if (WARN_ON_ONCE(vma->vm_end - vma->vm_start > (1ull << 32) || vma->vm_pgoff))
> +		return -EFAULT;
> +
> +	if (remember_vma(arena, vma))
> +		return -ENOMEM;
> +
> +	arena->user_vm_start = vma->vm_start;
> +	arena->user_vm_end = vma->vm_end;
> +	/*
> +	 * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> +	 * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> +	 * potential change of user_vm_start.
> +	 */
> +	vm_flags_set(vma, VM_DONTEXPAND);
> +	vma->vm_ops = &arena_vm_ops;
> +	return 0;
> +}
> +
> +static int arena_map_direct_value_addr(const struct bpf_map *map, u64 *imm, u32 off)
> +{
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	if ((u64)off > arena->user_vm_end - arena->user_vm_start)
> +		return -ERANGE;
> +	*imm = (unsigned long)arena->user_vm_start;
> +	return 0;
> +}
> +
> +BTF_ID_LIST_SINGLE(bpf_arena_map_btf_ids, struct, bpf_arena)
> +const struct bpf_map_ops arena_map_ops = {
> +	.map_meta_equal = bpf_map_meta_equal,
> +	.map_alloc = arena_map_alloc,
> +	.map_free = arena_map_free,
> +	.map_direct_value_addr = arena_map_direct_value_addr,
> +	.map_mmap = arena_map_mmap,
> +	.map_get_unmapped_area = arena_get_unmapped_area,
> +	.map_get_next_key = arena_map_get_next_key,
> +	.map_push_elem = arena_map_push_elem,
> +	.map_peek_elem = arena_map_peek_elem,
> +	.map_pop_elem = arena_map_pop_elem,
> +	.map_lookup_elem = arena_map_lookup_elem,
> +	.map_update_elem = arena_map_update_elem,
> +	.map_delete_elem = arena_map_delete_elem,
> +	.map_check_btf = arena_map_check_btf,
> +	.map_mem_usage = arena_map_mem_usage,
> +	.map_btf_id = &bpf_arena_map_btf_ids[0],
> +};
> +
> +static u64 clear_lo32(u64 val)
> +{
> +	return val & ~(u64)~0U;
> +}
> +
> +/*
> + * Allocate pages and vmap them into kernel vmalloc area.
> + * Later the pages will be mmaped into user space vma.
> + */
> +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> +{
> +	/* user_vm_end/start are fixed before bpf prog runs */
> +	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> +	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> +	long pgoff = 0, nr_pages = 0;
> +	struct page **pages;
> +	u32 uaddr32;
> +	int ret, i;
> +
> +	if (page_cnt > page_cnt_max)
> +		return 0;
> +
> +	if (uaddr) {
> +		if (uaddr & ~PAGE_MASK)
> +			return 0;
> +		pgoff = compute_pgoff(arena, uaddr);
> +		if (pgoff + page_cnt > page_cnt_max)
> +			/* requested address will be outside of user VMA */
> +			return 0;
> +	}
> +
> +	/* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
> +	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return 0;
> +
> +	guard(mutex)(&arena->lock);
> +
> +	if (uaddr)
> +		ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt - 1,
> +					 MT_ENTRY, GFP_KERNEL);
> +	else
> +		ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
> +					page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);
> +	if (ret)
> +		goto out_free_pages;
> +
> +	nr_pages = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_ZERO, node_id, page_cnt, pages);
> +	if (nr_pages != page_cnt)
> +		goto out;
> +
> +	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> +	/* Earlier checks make sure that uaddr32 + page_cnt * PAGE_SIZE will not overflow 32-bit */
> +	ret = vmap_pages_range(kern_vm_start + uaddr32,
> +			       kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE,
> +			       PAGE_KERNEL, pages, PAGE_SHIFT);
> +	if (ret)
> +		goto out;
> +	kvfree(pages);
> +	return clear_lo32(arena->user_vm_start) + uaddr32;
> +out:
> +	mtree_erase(&arena->mt, pgoff);
> +out_free_pages:
> +	if (pages)
> +		for (i = 0; i < nr_pages; i++)
> +			__free_page(pages[i]);
> +	kvfree(pages);
> +	return 0;
> +}
> +
> +/*
> + * If page is present in vmalloc area, unmap it from vmalloc area,
> + * unmap it from all user space vma-s,
> + * and free it.
> + */
> +static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +	struct vma_list *vml;
> +
> +	list_for_each_entry(vml, &arena->vma_list, head)
> +		zap_page_range_single(vml->vma, uaddr,
> +				      PAGE_SIZE * page_cnt, NULL);
> +}
> +
> +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +	u64 full_uaddr, uaddr_end;
> +	long kaddr, pgoff, i;
> +	struct page *page;
> +
> +	/* only aligned lower 32-bit are relevant */
> +	uaddr = (u32)uaddr;
> +	uaddr &= PAGE_MASK;
> +	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
> +	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
> +	if (full_uaddr >= uaddr_end)
> +		return;
> +
> +	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
> +
> +	guard(mutex)(&arena->lock);
> +
> +	pgoff = compute_pgoff(arena, uaddr);
> +	/* clear range */
> +	mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt - 1, NULL, GFP_KERNEL);
> +
> +	if (page_cnt > 1)
> +		/* bulk zap if multiple pages being freed */
> +		zap_pages(arena, full_uaddr, page_cnt);
> +
> +	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> +	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> +		page = vmalloc_to_page((void *)kaddr);
> +		if (!page)
> +			continue;
> +		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
> +			zap_pages(arena, full_uaddr, 1);
> +		vunmap_range(kaddr, kaddr + PAGE_SIZE);
> +		__free_page(page);
> +	}
> +}
> +
> +__bpf_kfunc_start_defs();
> +
> +__bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_cnt,
> +					int node_id, u64 flags)
> +{
> +	struct bpf_map *map = p__map;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
> +		return NULL;
> +
> +	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
> +}
> +
> +__bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
> +{
> +	struct bpf_map *map = p__map;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
> +		return;
> +	arena_free_pages(arena, (long)ptr__ign, page_cnt);
> +}
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(arena_kfuncs)
> +BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE)
> +BTF_KFUNCS_END(arena_kfuncs)
> +
> +static const struct btf_kfunc_id_set common_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &arena_kfuncs,
> +};
> +
> +static int __init kfunc_init(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &common_kfunc_set);
> +}
> +late_initcall(kfunc_init);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 71c459a51d9e..2539d9bfe369 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2970,6 +2970,17 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
>   {
>   }
>   
> +/* for configs without MMU or 32-bit */
> +__weak const struct bpf_map_ops arena_map_ops;
> +__weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> +{
> +	return 0;
> +}
> +__weak u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> +{
> +	return 0;
> +}
> +
>   #ifdef CONFIG_BPF_SYSCALL
>   static int __init bpf_global_ma_init(void)
>   {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8dd9814a0e14..6b9efb3f79dd 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -164,6 +164,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>   	if (bpf_map_is_offloaded(map)) {
>   		return bpf_map_offload_update_elem(map, key, value, flags);
>   	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
> +		   map->map_type == BPF_MAP_TYPE_ARENA ||
>   		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
>   		return map->ops->map_update_elem(map, key, value, flags);
>   	} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH ||
> @@ -1172,6 +1173,7 @@ static int map_create(union bpf_attr *attr)
>   	}
>   
>   	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
> +	    attr->map_type != BPF_MAP_TYPE_ARENA &&
>   	    attr->map_extra != 0)
>   		return -EINVAL;
>   
> @@ -1261,6 +1263,7 @@ static int map_create(union bpf_attr *attr)
>   	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
>   	case BPF_MAP_TYPE_STRUCT_OPS:
>   	case BPF_MAP_TYPE_CPUMAP:
> +	case BPF_MAP_TYPE_ARENA:
>   		if (!bpf_token_capable(token, CAP_BPF))
>   			goto put_token;
>   		break;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index db569ce89fb1..3c77a3ab1192 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18047,6 +18047,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>   		case BPF_MAP_TYPE_SK_STORAGE:
>   		case BPF_MAP_TYPE_TASK_STORAGE:
>   		case BPF_MAP_TYPE_CGRP_STORAGE:
> +		case BPF_MAP_TYPE_ARENA:
>   			break;
>   		default:
>   			verbose(env,
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index d96708380e52..f6648851eae6 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -983,6 +983,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_BLOOM_FILTER,
>   	BPF_MAP_TYPE_USER_RINGBUF,
>   	BPF_MAP_TYPE_CGRP_STORAGE,
> +	BPF_MAP_TYPE_ARENA,
>   	__MAX_BPF_MAP_TYPE
>   };
>   
> @@ -1370,6 +1371,12 @@ enum {
>   
>   /* BPF token FD is passed in a corresponding command's token_fd field */
>   	BPF_F_TOKEN_FD          = (1U << 16),
> +
> +/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
> +	BPF_F_SEGV_ON_FAULT	= (1U << 17),
> +
> +/* Do not translate kernel bpf_arena pointers to user pointers */
> +	BPF_F_NO_USER_CONV	= (1U << 18),
>   };
>   
>   /* Flags for BPF_PROG_QUERY. */


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  2024-02-10  4:35     ` Alexei Starovoitov
  2024-02-10  7:03       ` Kumar Kartikeya Dwivedi
@ 2024-02-12 16:48       ` David Vernet
  1 sibling, 0 replies; 112+ messages in thread
From: David Vernet @ 2024-02-12 16:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team


On Fri, Feb 09, 2024 at 08:35:01PM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 9, 2024 at 3:14 PM David Vernet <void@manifault.com> wrote:
> >
> > > +
> > > +#ifndef arena_container_of
> >
> > Why is this ifndef required if we have a pragma once above?
> 
> Just a habit to check for a macro before defining it.
> 
> > Obviously it's way better for us to actually have arenas in the interim
> > so this is fine for now, but UAF bugs could potentially be pretty
> > painful until we get proper exception unwinding support.
> 
> Detection that arena access faulted doesn't have to come after
> exception unwinding. Exceptions vs cancellable progs are also different.
> A record of the line in bpf prog that caused the first fault is probably
> good enough for prog debugging.
> 
> > Otherwise, in terms of usability, this looks really good. The only thing
> > to bear in mind is that I don't think we can fully get away from kptrs
> > that will have some duplicated logic compared to what we can enable in
> > an arena. For example, we will have to retain at least some of the
> > struct cpumask * kptrs for e.g. copying a struct task_struct's struct
> > cpumask *cpus_ptr field.
> 
> I think that's a bit orthogonal.
> task->cpus_ptr is a trusted_ptr_to_btf_id access that can be mixed
> within a program with arena access.

I see, so the idea is that we'd just use regular accesses to query the
bits in that cpumask rather than kfuncs? Similar to how we e.g. read a
task field such as p->comm with a regular load? Ok, that should work.

> > For now, we could iterate over the cpumask and manually set the bits, so
> > maybe even just supporting bpf_cpumask_test_cpu() would be enough
> > (though donig a bitmap_copy() would be better of course)? This is
> > probably fine for most use cases as we'd likely only be doing struct
> > cpumask * -> arena copies on slowpaths. But is there any kind of more
> > generalized integration we want to have between arenas and kptrs?  Not
> > sure, can't think of any off the top of my head.
> 
> Hopefully we'll be able to invent a way to store kptr-s inside the arena,
> but from a cpumask perspective bpf_cpumask_test_cpu() can be made
> polymorphic to work with arena ptrs and kptrs.
> Same with bpf_cpumask_and(). Mixed arguments can be allowed.
> Args can be either kptr or ptr_to_arena.

This would be ideal. And to make sure I understand, many of these
wouldn't even be kfuncs, right? We'd just be doing loads on two
safe/trusted objects that were both pointers to a bitmap of size
NR_CPUS?

> I still believe that we can deprecate 'struct bpf_cpumask'.
> The cpumask_t will stay, of course, but we won't need to
> bpf_obj_new(bpf_cpumask) and carefully track refcnt.
> The arena can do the same much faster.

Yes, I agree. Any use of struct bpf_cpumask * can just be stored in an
arena, and any kfuncs where we were previously passing a struct
bpf_cpumask * could instead just take an arena cpumask, or be done
entirely using BPF instructions per your point above.

> > > +             return 7;
> > > +     page3 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
> > > +     if (!page3)
> > > +             return 8;
> > > +     *page3 = 3;
> > > +     if (page2 != page3)
> > > +             return 9;
> > > +     if (*page1 != 1)
> > > +             return 10;
> >
> > Should we also test doing a store after an arena has been freed?
> 
> You mean the whole bpf arena map was freed ?
> I don't see how the verifier would allow that.
> If you meant a few pages were freed from the arena then such a test is
> already in the patches.

I meant a negative test that verifies we fail to load a prog that does a
write to a freed arena pointer.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena.
  2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
                   ` (20 preceding siblings ...)
  2024-02-12 14:14 ` [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena David Hildenbrand
@ 2024-02-12 17:36 ` Barret Rhoden
  21 siblings, 0 replies; 112+ messages in thread
From: Barret Rhoden @ 2024-02-12 17:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, hannes, lstoakes, akpm,
	urezki, hch, linux-mm, kernel-team

On 2/8/24 23:05, Alexei Starovoitov wrote:
> The work on bpf_arena was inspired by Barret's work:
> https://github.com/google/ghost-userspace/blob/main/lib/queue.bpf.h
> that implements queues, lists and AVL trees completely as bpf programs
> using giant bpf array map and integer indices instead of pointers.
> bpf_arena is a sparse array that allows to use normal C pointers to
> build such data structures. Last few patches implement page_frag
> allocator, link list and hash table as bpf programs.

Thanks for the shout-out.  FWIW, I'm really looking forward to the BPF
arena.  It'll be a little work to switch from array maps to the arena,
but in the long run it'll vastly simplify our scheduler code.

Additionally, the ability to map in pages on demand, instead of
preallocating a potentially large array map, will both save memory and
allow me to remove some artificial limitations on what our scheduler
can handle (e.g. not limiting ourselves to 64k threads).

thanks,

barret



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena Alexei Starovoitov
  2024-02-10  7:16   ` Kumar Kartikeya Dwivedi
@ 2024-02-12 18:12   ` Eduard Zingerman
  2024-02-12 20:14     ` Alexei Starovoitov
  2024-02-13 23:15   ` Andrii Nakryiko
  2 siblings, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-12 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
[...]

> @@ -9830,8 +9861,8 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
>  		int err;
>  		size_t mmap_old_sz, mmap_new_sz;
>  
> -		mmap_old_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> -		mmap_new_sz = bpf_map_mmap_sz(size, map->def.max_entries);
> +		mmap_old_sz = bpf_map_mmap_sz(map);
> +		mmap_new_sz = __bpf_map_mmap_sz(size, map->def.max_entries);
>  		err = bpf_map_mmap_resize(map, mmap_old_sz, mmap_new_sz);
>  		if (err) {
>  			pr_warn("map '%s': failed to resize memory-mapped region: %d\n",

I think that, as is, bpf_map__set_value_size() won't work for arenas.
The bpf_map_mmap_resize() does the following:

static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
{
	...
	mmaped = mmap(NULL, new_sz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	...
	memcpy(mmaped, map->mmaped, min(old_sz, new_sz));
	munmap(map->mmaped, old_sz);
	map->mmaped = mmaped;
	...
}

Which does not seem to tie the new mapping to arena, or am I missing something?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena.
  2024-02-12 14:14 ` [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena David Hildenbrand
@ 2024-02-12 18:14   ` Alexei Starovoitov
  2024-02-13 10:35     ` David Hildenbrand
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-12 18:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Mon, Feb 12, 2024 at 6:14 AM David Hildenbrand <david@redhat.com> wrote:
>
> How easy is this to access+use by unprivileged userspace?

Not possible. bpf_arena requires CAP_BPF + CAP_PERFMON.

> arena_vm_fault() seems to allocate new pages simply via
> alloc_page(GFP_KERNEL | __GFP_ZERO); No memory accounting, mlock limit
> checks etc.

Right. That's a bug. As Kumar commented on the patch 5 that it needs to
move to memcg accounting the way we do for all other maps.
It will be very similar to bpf_map_kmalloc_node().

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-10  7:40   ` Kumar Kartikeya Dwivedi
@ 2024-02-12 18:21     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-12 18:21 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 11:41 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> A few questions on the patch.
>
> 1. Is the expectation that user space would use syscall progs to
> manipulate mappings in the arena?

Any sleepable prog can alloc/free.
All other progs can access the arena.

> 2. I may have missed it, but which memcg are the allocations being
> accounted against? Will it be the process that created the map?
> When trying to explore bpf_map_alloc_pages, I could not figure out if
> the obj_cgroup was being looked up anywhere.
> I think it would be nice if it were accounted for against the caller
> of bpf_map_alloc_pages, since potentially the arena map can be shared
> across multiple processes.
> Tying it to bpf_map on bpf_map_alloc may be too coarse for arena maps.

yeah. will switch to memcg accounting the way it's done for all
other maps. It will be similar to bpf_map_kmalloc_node.
I don't think we should deviate from the standard.
bpf_map_save_memcg() is already done for bpf_arena.
I simply forgot to wrap alloc pages with set_active_memcg.
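
For reference, a minimal sketch of that direction, modeled on
bpf_map_kmalloc_node() (helper availability is an assumption:
bpf_map_get_memcg() is currently local to kernel/bpf/syscall.c):

	static long arena_alloc_pages_accounted(struct bpf_arena *arena, int node_id,
						long page_cnt, struct page **pages)
	{
		struct mem_cgroup *memcg, *old_memcg;
		long nr_pages;

		/* charge pages to the memcg saved by bpf_map_save_memcg() */
		memcg = bpf_map_get_memcg(&arena->map);
		old_memcg = set_active_memcg(memcg);
		nr_pages = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT,
						       node_id, page_cnt, pages);
		set_active_memcg(old_memcg);
		mem_cgroup_put(memcg);
		return nr_pages;
	}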

> 3. A bit tangential, but what would be the path to having huge page
> mappings look like (mostly from an interface standpoint)? I gather we
> could use the flags argument on the kernel side, and if 1 is true
> above, it would mean userspace would do it from inside a syscall
> program and then trigger a page fault? Because experience with use
> case 1 in the commit log suggests it is desirable to have such memory
> be backed by huge pages to reduce TLB misses.

Right. It will be a new flag.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-12 15:56   ` Barret Rhoden
@ 2024-02-12 18:23     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-12 18:23 UTC (permalink / raw)
  To: Barret Rhoden
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, Feb 12, 2024 at 7:56 AM Barret Rhoden <brho@google.com> wrote:
>
> On 2/8/24 23:05, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce bpf_arena, which is a sparse shared memory region between the bpf
> > program and user space.
>
>
> one last check - did you have a diff for the verifier to enforce
> user_vm_{start,end} somewhere?  didn't see it in the patchset, but also
> highly likely that i just skimmed past it.  =)

Yes. It's in the patch 9:
+   if (!bpf_arena_get_user_vm_start(env->prog->aux->arena)) {
+        verbose(env, "arena's user address must be set via map_extra
or mmap()\n");
+        return -EINVAL;
+   }


> Reviewed-by: Barret Rhoden <brho@google.com>

Thanks a lot for thorough code reviews!

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
@ 2024-02-12 18:58   ` Eduard Zingerman
  2024-02-13 23:15   ` Andrii Nakryiko
  1 sibling, 0 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-12 18:58 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> __uint() macro that is used to specify map attributes like:
>   __uint(type, BPF_MAP_TYPE_ARRAY);
>   __uint(map_flags, BPF_F_MMAPABLE);
> is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
> 
> Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> In map definition "map_extra" is the only u64 field.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

Another option would be something like:

    #define __ulong(name, val) int (*name)[val >> 32][(val << 32) >> 32]

thus avoiding generation of __unique_value_123 literals,
but these literals probably should not be an issue.
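
As a usage sketch (the map name and address value are illustrative), either
definition of __ulong lets a BTF map definition carry a 64-bit map_extra,
which for an arena is the expected user VMA start address:

	struct {
		__uint(type, BPF_MAP_TYPE_ARENA);
		__uint(map_flags, BPF_F_MMAPABLE);
		__uint(max_entries, 1000);		/* number of pages */
		__ulong(map_extra, 0x1ull << 44);	/* expected user_vm_start */
	} arena SEC(".maps");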


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-10  7:16   ` Kumar Kartikeya Dwivedi
@ 2024-02-12 19:11     ` Andrii Nakryiko
  2024-02-12 22:29       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-12 19:11 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Alexei Starovoitov, bpf, daniel, andrii, eddyz87, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Fri, Feb 9, 2024 at 11:17 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > mmap() bpf_arena right after creation, since the kernel needs to
> > remember the address returned from mmap. This is user_vm_start.
> > LLVM will generate bpf_arena_cast_user() instructions where
> > necessary and JIT will add upper 32-bit of user_vm_start
> > to such pointers.
> >
> > Fix up bpf_map_mmap_sz() to compute mmap size as
> > map->value_size * map->max_entries for arrays and
> > PAGE_SIZE * map->max_entries for arena.
> >
> > Don't set BTF at arena creation time, since it doesn't support it.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  tools/lib/bpf/libbpf.c        | 43 ++++++++++++++++++++++++++++++-----
> >  tools/lib/bpf/libbpf_probes.c |  7 ++++++
> >  2 files changed, 44 insertions(+), 6 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 01f407591a92..4880d623098d 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
> >         [BPF_MAP_TYPE_BLOOM_FILTER]             = "bloom_filter",
> >         [BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
> >         [BPF_MAP_TYPE_CGRP_STORAGE]             = "cgrp_storage",
> > +       [BPF_MAP_TYPE_ARENA]                    = "arena",
> >  };
> >
> >  static const char * const prog_type_name[] = {
> > @@ -1577,7 +1578,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
> >         return map;
> >  }
> >
> > -static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> > +static size_t __bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> >  {
> >         const long page_sz = sysconf(_SC_PAGE_SIZE);
> >         size_t map_sz;
> > @@ -1587,6 +1588,20 @@ static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> >         return map_sz;
> >  }
> >
> > +static size_t bpf_map_mmap_sz(const struct bpf_map *map)
> > +{
> > +       const long page_sz = sysconf(_SC_PAGE_SIZE);
> > +
> > +       switch (map->def.type) {
> > +       case BPF_MAP_TYPE_ARRAY:
> > +               return __bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > +       case BPF_MAP_TYPE_ARENA:
> > +               return page_sz * map->def.max_entries;
> > +       default:
> > +               return 0; /* not supported */
> > +       }
> > +}
> > +
> >  static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
> >  {
> >         void *mmaped;
> > @@ -1740,7 +1755,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
> >         pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
> >                  map->name, map->sec_idx, map->sec_offset, def->map_flags);
> >
> > -       mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > +       mmap_sz = bpf_map_mmap_sz(map);
> >         map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
> >                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >         if (map->mmaped == MAP_FAILED) {
> > @@ -4852,6 +4867,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> >         case BPF_MAP_TYPE_SOCKHASH:
> >         case BPF_MAP_TYPE_QUEUE:
> >         case BPF_MAP_TYPE_STACK:
> > +       case BPF_MAP_TYPE_ARENA:
> >                 create_attr.btf_fd = 0;
> >                 create_attr.btf_key_type_id = 0;
> >                 create_attr.btf_value_type_id = 0;
> > @@ -4908,6 +4924,21 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> >         if (map->fd == map_fd)
> >                 return 0;
> >
> > +       if (def->type == BPF_MAP_TYPE_ARENA) {
> > +               map->mmaped = mmap((void *)map->map_extra, bpf_map_mmap_sz(map),
> > +                                  PROT_READ | PROT_WRITE,
> > +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> > +                                  map_fd, 0);
> > +               if (map->mmaped == MAP_FAILED) {
> > +                       err = -errno;
> > +                       map->mmaped = NULL;
> > +                       close(map_fd);
> > +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> > +                               bpf_map__name(map), err);
> > +                       return err;
> > +               }
> > +       }
> > +
>
> Would it be possible to introduce a public API accessor for getting
> the value of map->mmaped?

That would be bpf_map__initial_value(), no?

> Otherwise one would have to parse through /proc/self/maps in case
> map_extra is 0.
>
> The use case is to be able to use the arena as a backing store for
> userspace malloc arenas, so that we can pass malloc/mallocx calls
> (or class-specific operator new) straight through to a malloc arena
> backed by the BPF arena.
> In such a case a lot of the burden of converting existing data
> structures or code can be avoided by making much of the process
> transparent.
> Userspace malloced objects can also be easily shared to BPF progs as a
> pool through bpf_ma style per-CPU allocator.
>
> > [...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-12 18:12   ` Eduard Zingerman
@ 2024-02-12 20:14     ` Alexei Starovoitov
  2024-02-12 20:21       ` Eduard Zingerman
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-12 20:14 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, Feb 12, 2024 at 10:12 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> [...]
>
> > @@ -9830,8 +9861,8 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
> >               int err;
> >               size_t mmap_old_sz, mmap_new_sz;
> >
> > -             mmap_old_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > -             mmap_new_sz = bpf_map_mmap_sz(size, map->def.max_entries);
> > +             mmap_old_sz = bpf_map_mmap_sz(map);
> > +             mmap_new_sz = __bpf_map_mmap_sz(size, map->def.max_entries);
> >               err = bpf_map_mmap_resize(map, mmap_old_sz, mmap_new_sz);
> >               if (err) {
> >                       pr_warn("map '%s': failed to resize memory-mapped region: %d\n",
>
> I think that, as-is, bpf_map__set_value_size() won't work for arenas.

It doesn't, and it doesn't work for ringbuf either.
I guess we can add a filter by map type, but I'm not sure
how big this can of worms (extra checks) will be.
There are probably many libbpf APIs that can be misused,
like bpf_map__set_type().

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-12 20:14     ` Alexei Starovoitov
@ 2024-02-12 20:21       ` Eduard Zingerman
  0 siblings, 0 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-12 20:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, 2024-02-12 at 12:14 -0800, Alexei Starovoitov wrote:

[...]

> It doesn't, and it doesn't work for ringbuf either.
> I guess we can add a filter by map type, but I'm not sure
> how big this can of worms (extra checks) will be.
> There are probably many libbpf APIs that can be misused,
> like bpf_map__set_type().

Right, probably such extra checks should be a subject of a different
patch-set (if any).

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-12 19:11     ` Andrii Nakryiko
@ 2024-02-12 22:29       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 112+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-12 22:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, daniel, andrii, eddyz87, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Mon, 12 Feb 2024 at 20:11, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Feb 9, 2024 at 11:17 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > mmap() bpf_arena right after creation, since the kernel needs to
> > > remember the address returned from mmap. This is user_vm_start.
> > > LLVM will generate bpf_arena_cast_user() instructions where
> > > necessary and JIT will add upper 32-bit of user_vm_start
> > > to such pointers.
> > >
> > > Fix up bpf_map_mmap_sz() to compute mmap size as
> > > map->value_size * map->max_entries for arrays and
> > > PAGE_SIZE * map->max_entries for arena.
> > >
> > > Don't set BTF at arena creation time, since it doesn't support it.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  tools/lib/bpf/libbpf.c        | 43 ++++++++++++++++++++++++++++++-----
> > >  tools/lib/bpf/libbpf_probes.c |  7 ++++++
> > >  2 files changed, 44 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > index 01f407591a92..4880d623098d 100644
> > > --- a/tools/lib/bpf/libbpf.c
> > > +++ b/tools/lib/bpf/libbpf.c
> > > @@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
> > >         [BPF_MAP_TYPE_BLOOM_FILTER]             = "bloom_filter",
> > >         [BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
> > >         [BPF_MAP_TYPE_CGRP_STORAGE]             = "cgrp_storage",
> > > +       [BPF_MAP_TYPE_ARENA]                    = "arena",
> > >  };
> > >
> > >  static const char * const prog_type_name[] = {
> > > @@ -1577,7 +1578,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
> > >         return map;
> > >  }
> > >
> > > -static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> > > +static size_t __bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> > >  {
> > >         const long page_sz = sysconf(_SC_PAGE_SIZE);
> > >         size_t map_sz;
> > > @@ -1587,6 +1588,20 @@ static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> > >         return map_sz;
> > >  }
> > >
> > > +static size_t bpf_map_mmap_sz(const struct bpf_map *map)
> > > +{
> > > +       const long page_sz = sysconf(_SC_PAGE_SIZE);
> > > +
> > > +       switch (map->def.type) {
> > > +       case BPF_MAP_TYPE_ARRAY:
> > > +               return __bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > > +       case BPF_MAP_TYPE_ARENA:
> > > +               return page_sz * map->def.max_entries;
> > > +       default:
> > > +               return 0; /* not supported */
> > > +       }
> > > +}
> > > +
> > >  static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
> > >  {
> > >         void *mmaped;
> > > @@ -1740,7 +1755,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
> > >         pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
> > >                  map->name, map->sec_idx, map->sec_offset, def->map_flags);
> > >
> > > -       mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > > +       mmap_sz = bpf_map_mmap_sz(map);
> > >         map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
> > >                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> > >         if (map->mmaped == MAP_FAILED) {
> > > @@ -4852,6 +4867,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> > >         case BPF_MAP_TYPE_SOCKHASH:
> > >         case BPF_MAP_TYPE_QUEUE:
> > >         case BPF_MAP_TYPE_STACK:
> > > +       case BPF_MAP_TYPE_ARENA:
> > >                 create_attr.btf_fd = 0;
> > >                 create_attr.btf_key_type_id = 0;
> > >                 create_attr.btf_value_type_id = 0;
> > > @@ -4908,6 +4924,21 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> > >         if (map->fd == map_fd)
> > >                 return 0;
> > >
> > > +       if (def->type == BPF_MAP_TYPE_ARENA) {
> > > +               map->mmaped = mmap((void *)map->map_extra, bpf_map_mmap_sz(map),
> > > +                                  PROT_READ | PROT_WRITE,
> > > +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> > > +                                  map_fd, 0);
> > > +               if (map->mmaped == MAP_FAILED) {
> > > +                       err = -errno;
> > > +                       map->mmaped = NULL;
> > > +                       close(map_fd);
> > > +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> > > +                               bpf_map__name(map), err);
> > > +                       return err;
> > > +               }
> > > +       }
> > > +
> >
> > Would it be possible to introduce a public API accessor for getting
> > the value of map->mmaped?
>
> That would be bpf_map__initial_value(), no?
>

Ah, indeed, that would do the trick. Thanks Andrii!

> [...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables Alexei Starovoitov
@ 2024-02-13  0:34   ` Eduard Zingerman
  2024-02-13  0:44     ` Alexei Starovoitov
  2024-02-13 23:11   ` Eduard Zingerman
  2024-02-13 23:15   ` Andrii Nakryiko
  2 siblings, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-13  0:34 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> LLVM automatically places __arena variables into ".arena.1" ELF section.
> When libbpf sees such section it creates internal 'struct bpf_map' LIBBPF_MAP_ARENA
> that is connected to actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
> They share the same kernel's side bpf map and single map_fd.
> Both are emitted into skeleton. Real arena with the name given by bpf program
> in SEC(".maps") and another with "__arena_internal" name.
> All global variables from ".arena.1" section are accessible from user space
> via skel->arena->name_of_var.

[...]

I hit a strange bug when playing with patch. Consider a simple example [0].
When the following BPF global variable:

    int __arena * __arena bar;

- is commented -- the test passes;
- is uncommented -- in the test fails because global variable 'shared' is NULL.

Note: the second __arena is necessary to put 'bar' to .arena.1 section.

[0] https://github.com/kernel-patches/bpf/commit/6d95c8557c25d01ef3f13e6aef2bda9ac2516484

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-13  0:34   ` Eduard Zingerman
@ 2024-02-13  0:44     ` Alexei Starovoitov
  2024-02-13  0:49       ` Eduard Zingerman
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13  0:44 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, Feb 12, 2024 at 4:34 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > LLVM automatically places __arena variables into ".arena.1" ELF section.
> > When libbpf sees such section it creates internal 'struct bpf_map' LIBBPF_MAP_ARENA
> > that is connected to actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
> > They share the same kernel's side bpf map and single map_fd.
> > Both are emitted into skeleton. Real arena with the name given by bpf program
> > in SEC(".maps") and another with "__arena_internal" name.
> > All global variables from ".arena.1" section are accessible from user space
> > via skel->arena->name_of_var.
>
> [...]
>
> I hit a strange bug when playing with patch. Consider a simple example [0].
> When the following BPF global variable:
>
>     int __arena * __arena bar;
>
> - is commented -- the test passes;
> - is uncommented -- in the test fails because global variable 'shared' is NULL.

Right. That's expected, because __uint(max_entries, 1);
The test creates an arena of 1 page and it's consumed
by the 'int __arena * __arena bar;' variable.
Of course, one variable doesn't take the whole page.
There could have been many arena global vars.
But that page is not available anymore to bpf_arena_alloc_pages,
so it returns NULL.
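
For instance, a sketch (not the actual selftest) that reserves enough pages
for both the globals and the runtime allocation would be:

    struct {
      __uint(type, BPF_MAP_TYPE_ARENA);
      __uint(map_flags, BPF_F_MMAPABLE);
      __uint(max_entries, 2); /* page 0 for __arena globals, page 1 for alloc */
    } arena SEC(".maps");

    int __arena * __arena bar; /* placed into .arena.1, consumes part of page 0 */

with bpf_arena_alloc_pages() then presumably able to grab the remaining page.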

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-13  0:44     ` Alexei Starovoitov
@ 2024-02-13  0:49       ` Eduard Zingerman
  2024-02-13  2:08         ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-13  0:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, 2024-02-12 at 16:44 -0800, Alexei Starovoitov wrote:
> > I hit a strange bug when playing with patch. Consider a simple example [0].
> > When the following BPF global variable:
> > 
> >     int __arena * __arena bar;
> > 
> > - is commented -- the test passes;
> > - is uncommented -- in the test fails because global variable 'shared' is NULL.
> 
> Right. That's expected, because __uint(max_entries, 1);
> The test creates an arena of 1 page and it's consumed
> by the 'int __arena * __arena bar;' variable.
> Of course, one variable doesn't take the whole page.
> There could have been many arena global vars.
> But that page is not available anymore to bpf_arena_alloc_pages,
> so it returns NULL.

My bad, thank you for explaining.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-13  0:49       ` Eduard Zingerman
@ 2024-02-13  2:08         ` Alexei Starovoitov
  2024-02-13 12:48           ` Eduard Zingerman
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13  2:08 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, Feb 12, 2024 at 4:49 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2024-02-12 at 16:44 -0800, Alexei Starovoitov wrote:
> > > I hit a strange bug when playing with patch. Consider a simple example [0].
> > > When the following BPF global variable:
> > >
> > >     int __arena * __arena bar;
> > >
> > > - is commented -- the test passes;
> > > - is uncommented -- in the test fails because global variable 'shared' is NULL.
> >
> > Right. That's expected, because __uint(max_entries, 1);
> > The test creates an arena of 1 page and it's consumed
> > by the 'int __arena * __arena bar;' variable.
> > Of course, one variable doesn't take the whole page.
> > There could have been many arena global vars.
> > But that page is not available anymore to bpf_arena_alloc_pages,
> > so it returns NULL.
>
> My bad, thank you for explaining.

Since this was surprising behavior, we can make libbpf
auto-extend max_entries by the number of pages necessary
for arena global vars, but that would be surprising too.

struct {
  __uint(type, BPF_MAP_TYPE_ARENA);
  __uint(map_flags, BPF_F_MMAPABLE);
  __ulong(map_extra, 2ull << 44);  // this is start of user VMA
  __uint(max_entries, 1000);       // this is length of user VMA in pages
} arena SEC(".maps");

If libbpf adds extra pages to max_entries, the user_vm_end shifts too,
and libbpf would need to mmap() it with that size.
When it's all hidden in libbpf it's fine, but it can still be a surprise
to see a different max_entries in map_info and bpftool map list.
Not sure which way is more user friendly.
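
For example (hypothetical numbers): with the declaration above and one page
worth of __arena globals, libbpf would create the map with max_entries = 1001,
mmap() 1001 pages starting at 2ull << 44, and map_info / bpftool map list
would then report 1001 instead of the declared 1000.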

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier.
  2024-02-10  1:13   ` Eduard Zingerman
@ 2024-02-13  2:58     ` Alexei Starovoitov
  2024-02-13 12:01       ` Eduard Zingerman
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13  2:58 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Fri, Feb 9, 2024 at 5:13 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-02-08 at 20:05 -0800, Alexei Starovoitov wrote:
> [...]
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 3c77a3ab1192..5eeb9bf7e324 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
>
> [...]
>
> > @@ -13837,6 +13844,21 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
> >
> >       dst_reg = &regs[insn->dst_reg];
> >       src_reg = NULL;
> > +
> > +     if (dst_reg->type == PTR_TO_ARENA) {
> > +             struct bpf_insn_aux_data *aux = cur_aux(env);
> > +
> > +             if (BPF_CLASS(insn->code) == BPF_ALU64)
> > +                     /*
> > +                      * 32-bit operations zero upper bits automatically.
> > +                      * 64-bit operations need to be converted to 32.
> > +                      */
> > +                     aux->needs_zext = true;
>
> It should be possible to write an example where the same insn is
> visited with both PTR_TO_ARENA and some other PTR type.
> Such examples should be rejected as is currently done in do_check()
> for BPF_{ST,STX} using save_aux_ptr_type().

Good catch. Fixed reg_type_mismatch_ok().
Didn't craft a unit test. That will be in a follow up.

> [...]
>
> > @@ -13954,16 +13976,17 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
> >       } else if (opcode == BPF_MOV) {
> >
> >               if (BPF_SRC(insn->code) == BPF_X) {
> > -                     if (insn->imm != 0) {
> > -                             verbose(env, "BPF_MOV uses reserved fields\n");
> > -                             return -EINVAL;
> > -                     }
> > -
> >                       if (BPF_CLASS(insn->code) == BPF_ALU) {
> > -                             if (insn->off != 0 && insn->off != 8 && insn->off != 16) {
> > +                             if ((insn->off != 0 && insn->off != 8 && insn->off != 16) ||
> > +                                 insn->imm) {
> >                                       verbose(env, "BPF_MOV uses reserved fields\n");
> >                                       return -EINVAL;
> >                               }
> > +                     } else if (insn->off == BPF_ARENA_CAST_KERN || insn->off == BPF_ARENA_CAST_USER) {
> > +                             if (!insn->imm) {
> > +                                     verbose(env, "cast_kern/user insn must have non zero imm32\n");
> > +                                     return -EINVAL;
> > +                             }
> >                       } else {
> >                               if (insn->off != 0 && insn->off != 8 && insn->off != 16 &&
> >                                   insn->off != 32) {
>
> I think it is now necessary to check insn->imm here,
> as is it allows ALU64 move with non-zero imm.

great catch too. Fixed.

> > @@ -13993,7 +14016,12 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
> >                       struct bpf_reg_state *dst_reg = regs + insn->dst_reg;
> >
> >                       if (BPF_CLASS(insn->code) == BPF_ALU64) {
> > -                             if (insn->off == 0) {
> > +                             if (insn->imm) {
> > +                                     /* off == BPF_ARENA_CAST_KERN || off == BPF_ARENA_CAST_USER */
> > +                                     mark_reg_unknown(env, regs, insn->dst_reg);
> > +                                     if (insn->off == BPF_ARENA_CAST_KERN)
> > +                                             dst_reg->type = PTR_TO_ARENA;
>
> This effectively allows casting anything to PTR_TO_ARENA.
> Do we want to check that src_reg somehow originates from arena?
> Might be tricky, a new type modifier bit or something like that.

Yes. Casting anything is fine.
I don't think we need to enforce anything.
Those insns will be llvm generated. If src_reg is somehow ptr_to_ctx
or something, it's likely an llvm bug or a crazy manual type cast
by the user, but if they do so, let them experience the debug pains.
The kernel won't crash.

> > +                             } else if (insn->off == 0) {
> >                                       /* case: R1 = R2
> >                                        * copy register state to dest reg
> >                                        */
> > @@ -14059,6 +14087,9 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
> >                                               dst_reg->subreg_def = env->insn_idx + 1;
> >                                               coerce_subreg_to_size_sx(dst_reg, insn->off >> 3);
> >                                       }
> > +                             } else if (src_reg->type == PTR_TO_ARENA) {
> > +                                     mark_reg_unknown(env, regs, insn->dst_reg);
> > +                                     dst_reg->type = PTR_TO_ARENA;
>
> This describes a case wX = wY, where rY is PTR_TO_ARENA,
> should rX be marked as SCALAR instead of PTR_TO_ARENA?

That was a leftover from earlier experiments when alu64->alu32 was
done early.
Removed this hunk now.

> [...]
>
> > @@ -18235,6 +18272,31 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
> >                               fdput(f);
> >                               return -EBUSY;
> >                       }
> > +                     if (map->map_type == BPF_MAP_TYPE_ARENA) {
> > +                             if (env->prog->aux->arena) {
>
> Does this have to be (env->prog->aux->arena && env->prog->aux->arena != map) ?

No. all maps in used_maps[] are unique.
Adding "env->prog->aux->arena != map" won't make any difference.
It will only be confusing.

> > +                                     verbose(env, "Only one arena per program\n");
> > +                                     fdput(f);
> > +                                     return -EBUSY;
> > +                             }
>
> [...]
>
> > @@ -18799,6 +18861,18 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >                          insn->code == (BPF_ST | BPF_MEM | BPF_W) ||
> >                          insn->code == (BPF_ST | BPF_MEM | BPF_DW)) {
> >                       type = BPF_WRITE;
> > +             } else if (insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) && insn->imm) {
> > +                     if (insn->off == BPF_ARENA_CAST_KERN ||
> > +                         (((struct bpf_map *)env->prog->aux->arena)->map_flags & BPF_F_NO_USER_CONV)) {
> > +                             /* convert to 32-bit mov that clears upper 32-bit */
> > +                             insn->code = BPF_ALU | BPF_MOV | BPF_X;
> > +                             /* clear off, so it's a normal 'wX = wY' from JIT pov */
> > +                             insn->off = 0;
> > +                     } /* else insn->off == BPF_ARENA_CAST_USER should be handled by JIT */
> > +                     continue;
> > +             } else if (env->insn_aux_data[i + delta].needs_zext) {
> > +                     /* Convert BPF_CLASS(insn->code) == BPF_ALU64 to 32-bit ALU */
> > +                     insn->code = BPF_ALU | BPF_OP(insn->code) | BPF_SRC(insn->code);
>
> Tbh, I think this should be done in do_misc_fixups(),
> mixing it with context handling in convert_ctx_accesses()
> seems a bit confusing.

Good point. Moved.

Thanks a lot for the review!

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena.
  2024-02-12 18:14   ` Alexei Starovoitov
@ 2024-02-13 10:35     ` David Hildenbrand
  0 siblings, 0 replies; 112+ messages in thread
From: David Hildenbrand @ 2024-02-13 10:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On 12.02.24 19:14, Alexei Starovoitov wrote:
> On Mon, Feb 12, 2024 at 6:14 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> How easy is this to access+use by unprivileged userspace?
> 
> not possible. bpf arena requires cap_bpf + cap_perfmon.
> 
>> arena_vm_fault() seems to allocate new pages simply via
>> alloc_page(GFP_KERNEL | __GFP_ZERO); No memory accounting, mlock limit
>> checks etc.
> 
> Right. That's a bug. As Kumar commented on the patch 5 that it needs to
> move to memcg accounting the way we do for all other maps.
> It will be very similar to bpf_map_kmalloc_node().
> 

Great, thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier.
  2024-02-13  2:58     ` Alexei Starovoitov
@ 2024-02-13 12:01       ` Eduard Zingerman
  0 siblings, 0 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-13 12:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, 2024-02-12 at 18:58 -0800, Alexei Starovoitov wrote:

[...]

> Yes. Casting anything is fine.
> I don't think we need to enforce anything.
> Those insns will be llvm generated. If src_reg is somehow ptr_to_ctx
> or something, it's likely an llvm bug or a crazy manual type cast
> by the user, but if they do so, let them experience the debug pains.
> The kernel won't crash.

Ok, makes sense.

[...]

> > > @@ -18235,6 +18272,31 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
> > >                               fdput(f);
> > >                               return -EBUSY;
> > >                       }
> > > +                     if (map->map_type == BPF_MAP_TYPE_ARENA) {
> > > +                             if (env->prog->aux->arena) {
> > 
> > Does this have to be (env->prog->aux->arena && env->prog->aux->arena != map) ?
> 
> No. all maps in used_maps[] are unique.
> Adding "env->prog->aux->arena != map" won't make any difference.
> It will only be confusing.

Right, sorry, I missed the loop above that checks if the map has
already been seen.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-13  2:08         ` Alexei Starovoitov
@ 2024-02-13 12:48           ` Eduard Zingerman
  0 siblings, 0 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-13 12:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Mon, 2024-02-12 at 18:08 -0800, Alexei Starovoitov wrote:
[...]

> Since this was surprising behavior, we can make libbpf
> auto-extend max_entries by the number of pages necessary
> for arena global vars, but that would be surprising too.
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ARENA);
>   __uint(map_flags, BPF_F_MMAPABLE);
>   __ulong(map_extra, 2ull << 44);  // this is start of user VMA
>   __uint(max_entries, 1000);       // this is length of user VMA in pages
> } arena SEC(".maps");
> 
> If libbpf adds extra pages to max_entries, the user_vm_end shifts too,
> and libbpf would need to mmap() it with that size.
> When it's all hidden in libbpf it's fine, but it can still be a surprise
> to see a different max_entries in map_info and bpftool map list.
> Not sure which way is more user friendly.

Adjusting max_entries would be surprising indeed.
On the other hand, it would remove the error condition about
"Declared arena map size %zd is too small ...".
Probably either way is fine, as long as it is documented.
Don't have a strong opinion.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  2024-02-10  6:48   ` Kumar Kartikeya Dwivedi
@ 2024-02-13 22:00     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13 22:00 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 10:49 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
> > +               if (arena_vm_start)
> > +                       push_r12(&prog);
>
> I believe since this is done on entry for arena_vm_start, we need to
> do a matching pop_r12 in
> emit_bpf_tail_call_indirect and emit_bpf_tail_call_direct before the
> tail call, unless I'm missing something.
> Otherwise r12 may be bad after prog (push + set to arena_vm_start) ->
> tail call -> exit (no pop of r12 back from stack).

Good catch! Fixed.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
  2024-02-09 17:20   ` Eduard Zingerman
@ 2024-02-13 22:20     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13 22:20 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Fri, Feb 9, 2024 at 9:20 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-02-08 at 20:05 -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
> > They are similar to PROBE_MEM instructions with the following differences:
> > - PROBE_MEM has to check that the address is in the kernel range with
> >   src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
> > - PROBE_MEM doesn't support store
> > - PROBE_MEM32 relies on the verifier to clear upper 32-bit in the register
> > - PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in %r12 in the prologue)
> >   Due to bpf_arena constructions such %r12 + %reg + off16 access is guaranteed
> >   to be within arena virtual range, so no address check at run-time.
> > - PROBE_MEM32 allows STX and ST. If they fault the store is a nop.
> >   When LDX faults the destination register is zeroed.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
>
> It would be great to add support for these new probe instructions in disasm,
> otherwise commands like "bpftool prog dump xlated" can't print them.
>
> I sort-of brute-force verified the jit code generated for the new
> instructions and the disassembly seems to be as expected.

yeah. added a fix to the verifier patch.

> [...]
>
> > @@ -1564,6 +1697,52 @@ st:                    if (is_imm8(insn->off))
> >                       emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> >                       break;
> >
> > +             case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
> > +             case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
> > +             case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
> > +             case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
> > +                     start_of_ldx = prog;
> > +                     emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
> > +                     goto populate_extable;
> > +
> > +                     /* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
> > +             case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
> > +             case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
> > +             case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
> > +             case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
> > +             case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
> > +             case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
> > +             case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
> > +             case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
> > +                     start_of_ldx = prog;
> > +                     if (BPF_CLASS(insn->code) == BPF_LDX)
> > +                             emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> > +                     else
> > +                             emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> > +populate_extable:
> > +                     {
> > +                             struct exception_table_entry *ex;
> > +                             u8 *_insn = image + proglen + (start_of_ldx - temp);
> > +                             s64 delta;
> > +
> > +                             if (!bpf_prog->aux->extable)
> > +                                     break;
> > +
> > +                             ex = &bpf_prog->aux->extable[excnt++];
>
> Nit: this seem to mostly repeat exception logic for
>      "BPF_LDX | BPF_MEM | BPF_B" & co,
>      is there a way to abstract it a bit?

I don't see a good way. A macro is meh.
A helper with 5+ args is also meh.

>      Also note that there excnt is checked for overflow.

indeed. added overflow check.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction.
  2024-02-10  8:40   ` Kumar Kartikeya Dwivedi
@ 2024-02-13 22:28     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13 22:28 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Sat, Feb 10, 2024 at 12:40 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > LLVM generates bpf_cast_kern and bpf_cast_user instructions while translating
> > pointers with __attribute__((address_space(1))).
> >
> > rX = cast_kern(rY) is processed by the verifier and converted to
> > normal 32-bit move: wX = wY
> >
> > bpf_cast_user has to be converted by JIT.
> >
> > rX = cast_user(rY) is
> >
> > aux_reg = upper_32_bits of arena->user_vm_start
> > aux_reg <<= 32
> > wX = wY // clear upper 32 bits of dst register
> > if (wX) // if not zero add upper bits of user_vm_start
> >   wX |= aux_reg
> >
>
> Would this be ok if the rY is somehow aligned at the 4GB boundary, and
> the lower 32-bits end up being zero.
> Then this transformation would confuse it with the NULL case, right?

Yes, it will. I tried to fix it by reserving a zero page,
but the end result was bad. See the discussion with Barret.
So we decided to drop this idea.
Might come back to it eventually.
Also, I was thinking of doing
if (rX) instead of if (wX) to mitigate it a bit,
but that is probably wrong too.
The best mitigation is inside the bpf program: never return a pointer with
zero lower 32 bits from the bpf_alloc() function.
In general, with the latest llvm we see close to zero cast_user instructions
when the bpf prog is not mixing (void *) with (void __arena *) casts,
so it shouldn't be an issue in practice with the patches as-is.
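
To make the corner case concrete, here is a rough C model of the emitted
sequence quoted above (illustration only, not the actual JIT output):

    #include <linux/types.h>

    static __u64 cast_user(__u64 user_vm_start, __u64 ry)
    {
            __u64 rx = (__u32)ry;   /* wX = wY: keep the lower 32 bits */

            /* an arena pointer whose lower 32 bits happen to be zero fails
             * this check and ends up indistinguishable from NULL */
            if (rx)
                    rx |= user_vm_start & 0xffffffff00000000ull;
            return rx;
    }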

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast()
  2024-02-10  8:54   ` Kumar Kartikeya Dwivedi
@ 2024-02-13 22:35     ` Alexei Starovoitov
  2024-02-14 16:47       ` Eduard Zingerman
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13 22:35 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Sat, Feb 10, 2024 at 12:54 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce helper macro bpf_arena_cast() that emits:
> > rX = rX
> > instruction with off = BPF_ARENA_CAST_KERN or off = BPF_ARENA_CAST_USER
> > and encodes address_space into imm32.
> >
> > It's useful with older LLVM that doesn't emit this insn automatically.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>
> But could this simply be added to libbpf along with bpf_cast_user and
> bpf_cast_kern? I believe since LLVM and the verifier support the new
> cast instructions, they are unlikely to disappear any time soon. It
> would probably also make it easier to use elsewhere (e.g. sched-ext)
> without having to copy them.

This bpf_arena_cast() macro will probably be removed
once llvm 19 is released and we upgrade bpf CI to it.
It's here for selftests only.
It's quite tricky and fragile to use in practice.
Notice it does:
"r"(__var)
which is not quite correct,
since llvm won't recognize it as an output that changes __var and
will use a copy of __var in a different register later.
But if the macro changes to "=r" or "+r" then llvm allocates
a register and that screws up codegen even more.

The __var;}) also doesn't always work.
So this macro is not suited for general use.
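
For readers less familiar with the constraint issue, a generic (non-BPF)
illustration of the trade-off; this is not the actual macro:

    static void constraint_demo(long v)
    {
            /* input-only constraint: if the asm body actually modified v,
             * the compiler could keep using a stale copy of v afterwards */
            asm volatile("" : : "r"(v));

            /* read-write constraint: the compiler tracks the change, but it
             * also picks the register, which can perturb surrounding codegen */
            asm volatile("" : "+r"(v));
    }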

> I plan on doing the same eventually with assert macros too.

I think it's too early to move them.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables Alexei Starovoitov
  2024-02-13  0:34   ` Eduard Zingerman
@ 2024-02-13 23:11   ` Eduard Zingerman
  2024-02-13 23:17     ` Andrii Nakryiko
  2024-02-14  1:02     ` Alexei Starovoitov
  2024-02-13 23:15   ` Andrii Nakryiko
  2 siblings, 2 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-13 23:11 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf
  Cc: daniel, andrii, memxor, tj, brho, hannes, lstoakes, akpm, urezki,
	hch, linux-mm, kernel-team

On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> LLVM automatically places __arena variables into ".arena.1" ELF section.
> When libbpf sees such section it creates internal 'struct bpf_map' LIBBPF_MAP_ARENA
> that is connected to actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
> They share the same kernel's side bpf map and single map_fd.
> Both are emitted into skeleton. Real arena with the name given by bpf program
> in SEC(".maps") and another with "__arena_internal" name.
> All global variables from ".arena.1" section are accessible from user space
> via skel->arena->name_of_var.
> 
> For bss/data/rodata the skeleton/libbpf perform the following sequence:
> 1. addr = mmap(MAP_ANONYMOUS)
> 2. user space optionally modifies global vars
> 3. map_fd = bpf_create_map()
> 4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
> 5. mmap(addr, MAP_FIXED, map_fd)
> after step 5 user spaces see the values it wrote at step 2 at the same addresses
> 
> arena doesn't support update_map_elem. Hence skeleton/libbpf do:
> 1. addr = mmap(MAP_ANONYMOUS)
> 2. user space optionally modifies global vars
> 3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
> 4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
> 5. memcpy(real_addr, addr) // this will fault-in and allocate pages
> 6. munmap(addr)
> 
> At the end look and feel of global data vs __arena global data is the same from bpf prog pov.
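
For illustration, steps 1-6 above correspond roughly to the following user
space sketch (error handling omitted; the sizes, map_extra and the map
creation step are placeholders, not the actual libbpf internals):

    #include <string.h>
    #include <sys/mman.h>

    static void *arena_init_sketch(size_t mmap_sz, size_t data_sz,
                                   unsigned long map_extra, int map_fd)
    {
            void *tmp = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);         /* step 1 */
            /* step 2: write initial values of __arena globals into tmp */
            /* step 3: map_fd is the BPF_MAP_TYPE_ARENA map, created before this */
            void *real = mmap((void *)map_extra, mmap_sz, PROT_READ | PROT_WRITE,
                              map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
                              map_fd, 0);                                /* step 4 */
            memcpy(real, tmp, data_sz);          /* step 5: faults in arena pages */
            munmap(tmp, mmap_sz);                                        /* step 6 */
            return real;
    }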

[...]

So, at first I thought that having two maps is a bit of a hack.
However, after trying to make it work with only one map I don't really
like that either :)

The patch looks good to me, have not spotted any logical issues.

I have two questions if you don't mind:

First is regarding initialization data.
In bpf_object__init_internal_map() bpf_map_mmap_sz(map) bytes are
mmap-ed and only data_sz bytes are copied,
then bpf_map_mmap_sz(map) bytes are copied in bpf_object__create_maps().
Is Linux/libc smart enough to skip work for pages which were mmap-ed but
never touched?

Second is regarding naming.
Currently only one arena is supported, and the generated skel has a single '->arena' field.
Is there a plan to support multiple arenas at some point?
If so, should the '->arena' field use the same name as the arena map declared in the program?


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
                     ` (2 preceding siblings ...)
  2024-02-12 15:56   ` Barret Rhoden
@ 2024-02-13 23:14   ` Andrii Nakryiko
  2024-02-13 23:29     ` Alexei Starovoitov
  3 siblings, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On Thu, Feb 8, 2024 at 8:06 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage. The bpf program implements an
>    in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>    value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>    The user space may mmap it, but bpf program will not convert pointers
>    to user base at run-time to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of bpf
> program is more important than ease of sharing with user space. This is use
> case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
> tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
> 32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
> bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
> of user vm_start (if the pointer is not NULL) to arena pointers before they are
> stored into memory. This way, user space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
> rX = (u32)rY;
> if (rY)
>   rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf.h            |   5 +-
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |   7 +
>  kernel/bpf/Makefile            |   3 +
>  kernel/bpf/arena.c             | 557 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c              |  11 +
>  kernel/bpf/syscall.c           |   3 +
>  kernel/bpf/verifier.c          |   1 +
>  tools/include/uapi/linux/bpf.h |   7 +
>  9 files changed, 593 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8b0dcb66eb33..de557c6c42e0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -37,6 +37,7 @@ struct perf_event;
>  struct bpf_prog;
>  struct bpf_prog_aux;
>  struct bpf_map;
> +struct bpf_arena;
>  struct sock;
>  struct seq_file;
>  struct btf;
> @@ -534,8 +535,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>                         struct bpf_spin_lock *spin_lock);
>  void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
>                       struct bpf_spin_lock *spin_lock);
> -
> -
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
>  int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
>
>  struct bpf_offload_dev;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 94baced5a1ad..9f2a6b83b49e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
>
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d96708380e52..f6648851eae6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -983,6 +983,7 @@ enum bpf_map_type {
>         BPF_MAP_TYPE_BLOOM_FILTER,
>         BPF_MAP_TYPE_USER_RINGBUF,
>         BPF_MAP_TYPE_CGRP_STORAGE,
> +       BPF_MAP_TYPE_ARENA,
>         __MAX_BPF_MAP_TYPE
>  };
>
> @@ -1370,6 +1371,12 @@ enum {
>
>  /* BPF token FD is passed in a corresponding command's token_fd field */
>         BPF_F_TOKEN_FD          = (1U << 16),
> +
> +/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
> +       BPF_F_SEGV_ON_FAULT     = (1U << 17),
> +
> +/* Do not translate kernel bpf_arena pointers to user pointers */
> +       BPF_F_NO_USER_CONV      = (1U << 18),
>  };
>
>  /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 4ce95acfcaa7..368c5d86b5b7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM}   += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> +ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
> +obj-$(CONFIG_BPF_SYSCALL) += arena.o
> +endif
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> new file mode 100644
> index 000000000000..5c1014471740
> --- /dev/null
> +++ b/kernel/bpf/arena.c
> @@ -0,0 +1,557 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/btf_ids.h>
> +#include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * bpf_arena is a sparsely populated shared memory region between bpf program and
> + * user space process.
> + *
> + * For example on x86-64 the values could be:
> + * user_vm_start 7f7d26200000     // picked by mmap()
> + * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
> + * For user space all pointers within the arena are normal 8-byte addresses.
> + * In this example 7f7d26200000 is the address of the first page (pgoff=0).
> + * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
> + * (u32)7f7d26200000 -> 26200000
> + * hence
> + * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
> + * kernel memory region.
> + *
> + * BPF JITs generate the following code to access arena:
> + *   mov eax, eax  // eax has lower 32-bit of user pointer
> + *   mov word ptr [rax + r12 + off], bx
> + * where r12 == kern_vm_start and off is s16.
> + * Hence allocate 4Gb + GUARD_SZ/2 on each side.
> + *
> + * Initially kernel vm_area and user vma are not populated.
> + * User space can fault-in any address which will insert the page
> + * into kernel and user vma.
> + * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
> + * which will insert it into kernel vm_area.
> + * The later fault-in from user space will populate that page into user vma.
> + */
> +
> +/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
> +#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
> +#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)

I feel like we need another named constant for those 4GB limits here,
something like:

#define MAX_ARENA_SZ (1ull << 32)
#define KERN_VM_SZ (MAX_ARENA_SZ + GUARD_SZ)

see below why

> +
> +struct bpf_arena {
> +       struct bpf_map map;
> +       u64 user_vm_start;
> +       u64 user_vm_end;
> +       struct vm_struct *kern_vm;
> +       struct maple_tree mt;
> +       struct list_head vma_list;
> +       struct mutex lock;
> +};
> +

[...]

> +static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> +{
> +       struct vm_struct *kern_vm;
> +       int numa_node = bpf_map_attr_numa_node(attr);
> +       struct bpf_arena *arena;
> +       u64 vm_range;
> +       int err = -ENOMEM;
> +
> +       if (attr->key_size || attr->value_size || attr->max_entries == 0 ||
> +           /* BPF_F_MMAPABLE must be set */
> +           !(attr->map_flags & BPF_F_MMAPABLE) ||
> +           /* No unsupported flags present */
> +           (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
> +               return ERR_PTR(-EINVAL);
> +
> +       if (attr->map_extra & ~PAGE_MASK)
> +               /* If non-zero the map_extra is an expected user VMA start address */
> +               return ERR_PTR(-EINVAL);
> +
> +       vm_range = (u64)attr->max_entries * PAGE_SIZE;
> +       if (vm_range > (1ull << 32))

here we can then use MAX_ARENA_SZ

> +               return ERR_PTR(-E2BIG);
> +
> +       if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
> +               /* user vma must not cross 32-bit boundary */
> +               return ERR_PTR(-ERANGE);
> +
> +       kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
> +       if (!kern_vm)
> +               return ERR_PTR(-ENOMEM);
> +
> +       arena = bpf_map_area_alloc(sizeof(*arena), numa_node);
> +       if (!arena)
> +               goto err;
> +
> +       arena->kern_vm = kern_vm;
> +       arena->user_vm_start = attr->map_extra;
> +       if (arena->user_vm_start)
> +               arena->user_vm_end = arena->user_vm_start + vm_range;
> +
> +       INIT_LIST_HEAD(&arena->vma_list);
> +       bpf_map_init_from_attr(&arena->map, attr);
> +       mt_init_flags(&arena->mt, MT_FLAGS_ALLOC_RANGE);
> +       mutex_init(&arena->lock);
> +
> +       return &arena->map;
> +err:
> +       free_vm_area(kern_vm);
> +       return ERR_PTR(err);
> +}
> +
> +static int for_each_pte(pte_t *ptep, unsigned long addr, void *data)
> +{
> +       struct page *page;
> +       pte_t pte;
> +
> +       pte = ptep_get(ptep);
> +       if (!pte_present(pte))
> +               return 0;
> +       page = pte_page(pte);
> +       /*
> +        * We do not update pte here:
> +        * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
> +        * 2. TLB flushing is batched or deferred. Even if we clear pte,
> +        * the TLB entries can stick around and continue to permit access to
> +        * the freed page. So it all relies on 1.
> +        */
> +       __free_page(page);
> +       return 0;
> +}
> +
> +static void arena_map_free(struct bpf_map *map)
> +{
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +       /*
> +        * Check that user vma-s are not around when bpf map is freed.
> +        * mmap() holds vm_file which holds bpf_map refcnt.
> +        * munmap() must have happened on vma followed by arena_vm_close()
> +        * which would clear arena->vma_list.
> +        */
> +       if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
> +               return;
> +
> +       /*
> +        * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
> +        * It unmaps everything from vmalloc area and clears pgtables.
> +        * Call apply_to_existing_page_range() first to find populated ptes and
> +        * free those pages.
> +        */
> +       apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +                                    KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);

I'm still reading the rest (so it might become obvious), but this
KERN_VM_SZ - GUARD_SZ / 2 is a bit surprising. I understand that
kern_vm_start is shifted by GUARD_SZ/2, but is the intent here to
actually go beyond the maximum 4GB by GUARD_SZ/2, or was the intent to
unmap 4GB (MAX_ARENA_SZ)?

> +       free_vm_area(arena->kern_vm);
> +       mtree_destroy(&arena->mt);
> +       bpf_map_area_free(arena);
> +}
> +

[...]

> +static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr,
> +                                            unsigned long len, unsigned long pgoff,
> +                                            unsigned long flags)
> +{
> +       struct bpf_map *map = filp->private_data;
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +       long ret;
> +
> +       if (pgoff)
> +               return -EINVAL;
> +       if (len > (1ull << 32))

MAX_ARENA_SZ ?

> +               return -E2BIG;
> +
> +       /* if user_vm_start was specified at arena creation time */
> +       if (arena->user_vm_start) {
> +               if (len > arena->user_vm_end - arena->user_vm_start)
> +                       return -E2BIG;
> +               if (len != arena->user_vm_end - arena->user_vm_start)
> +                       return -EINVAL;
> +               if (addr != arena->user_vm_start)
> +                       return -EINVAL;
> +       }
> +
> +       ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
> +       if (IS_ERR_VALUE(ret))
> +                return 0;

Can you leave a comment explaining why we are swallowing errors, if this is intentional?

> +       if ((ret >> 32) == ((ret + len - 1) >> 32))
> +               return ret;
> +       if (WARN_ON_ONCE(arena->user_vm_start))
> +               /* checks at map creation time should prevent this */
> +               return -EFAULT;
> +       return round_up(ret, 1ull << 32);

this is still probably MAX_ARENA_SZ, no?

> +}
> +
> +static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +       guard(mutex)(&arena->lock);
> +       if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
> +               /*
> +                * If map_extra was not specified at arena creation time then
> +                * 1st user process can do mmap(NULL, ...) to pick user_vm_start
> +                * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
> +                *   or
> +                * specify addr in map_extra and
> +                * use the same addr later with mmap(addr, MAP_FIXED..);
> +                */
> +               return -EBUSY;
> +
> +       if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
> +               /* all user processes must have the same size of mmap-ed region */
> +               return -EBUSY;
> +
> +       /* Earlier checks should prevent this */
> +       if (WARN_ON_ONCE(vma->vm_end - vma->vm_start > (1ull << 32) || vma->vm_pgoff))

MAX_ARENA_SZ ?

> +               return -EFAULT;
> +
> +       if (remember_vma(arena, vma))
> +               return -ENOMEM;
> +
> +       arena->user_vm_start = vma->vm_start;
> +       arena->user_vm_end = vma->vm_end;
> +       /*
> +        * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> +        * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> +        * potential change of user_vm_start.
> +        */
> +       vm_flags_set(vma, VM_DONTEXPAND);
> +       vma->vm_ops = &arena_vm_ops;
> +       return 0;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
  2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
@ 2024-02-13 23:14   ` Andrii Nakryiko
  2024-02-14  0:26     ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On Thu, Feb 8, 2024 at 8:06 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
>
> Note, when the verifier sees:
>
> __weak void foo(struct bar *p)
>
> it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars.
> Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf.h   |  1 +
>  kernel/bpf/btf.c      | 19 +++++++++++++++----
>  kernel/bpf/verifier.c | 15 +++++++++++++++
>  3 files changed, 31 insertions(+), 4 deletions(-)
>

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5eeb9bf7e324..fa49602194d5 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -9348,6 +9348,18 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
>                                 bpf_log(log, "arg#%d is expected to be non-NULL\n", i);
>                                 return -EINVAL;
>                         }
> +               } else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
> +                       /*
> +                        * Can pass any value and the kernel won't crash, but
> +                        * only PTR_TO_ARENA or SCALAR make sense. Everything
> +                        * else is a bug in the bpf program. Point it out to
> +                        * the user at the verification time instead of
> +                        * run-time debug nightmare.
> +                        */
> +                       if (reg->type != PTR_TO_ARENA && reg->type != SCALAR_VALUE) {

the comment above doesn't explain why it's ok to pass SCALAR_VALUE. Is
it because PTR_TO_ARENA will become SCALAR_VALUE after some arithmetic
operations and we don't want to regress user experience? If that's the
case, what's the way for the user to convert a SCALAR_VALUE back to
PTR_TO_ARENA without going through a global subprog? A bpf_cast_xxx
instruction through assembly?

> +                               bpf_log(log, "R%d is not a pointer to arena or scalar.\n", regno);
> +                               return -EINVAL;
> +                       }
>                 } else if (arg->arg_type == (ARG_PTR_TO_DYNPTR | MEM_RDONLY)) {
>                         ret = process_dynptr_func(env, regno, -1, arg->arg_type, 0);
>                         if (ret)
> @@ -20329,6 +20341,9 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
>                                 reg->btf = bpf_get_btf_vmlinux(); /* can't fail at this point */
>                                 reg->btf_id = arg->btf_id;
>                                 reg->id = ++env->id_gen;
> +                       } else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
> +                               /* caller can pass either PTR_TO_ARENA or SCALAR */
> +                               mark_reg_unknown(env, regs, i);

shouldn't we set the register type to PTR_TO_ARENA here?


>                         } else {
>                                 WARN_ONCE(1, "BUG: unhandled arg#%d type %d\n",
>                                           i - BPF_REG_1, arg->arg_type);
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h
  2024-02-09  4:05 ` [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
  2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
@ 2024-02-13 23:14   ` Andrii Nakryiko
  1 sibling, 0 replies; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Add __arg_arena to bpf_helpers.h
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/lib/bpf/bpf_helpers.h | 1 +
>  1 file changed, 1 insertion(+)
>

Acked-by: Andrii Nakryiko <andrii@kernel.org>


> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index 79eaa581be98..9c777c21da28 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -192,6 +192,7 @@ enum libbpf_tristate {
>  #define __arg_nonnull __attribute((btf_decl_tag("arg:nonnull")))
>  #define __arg_nullable __attribute((btf_decl_tag("arg:nullable")))
>  #define __arg_trusted __attribute((btf_decl_tag("arg:trusted")))
> +#define __arg_arena __attribute((btf_decl_tag("arg:arena")))
>
>  #ifndef ___bpf_concat
>  #define ___bpf_concat(a, b) a ## b
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena Alexei Starovoitov
  2024-02-10  7:16   ` Kumar Kartikeya Dwivedi
  2024-02-12 18:12   ` Eduard Zingerman
@ 2024-02-13 23:15   ` Andrii Nakryiko
  2024-02-14  0:32     ` Alexei Starovoitov
  2 siblings, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> mmap() bpf_arena right after creation, since the kernel needs to
> remember the address returned from mmap. This is user_vm_start.
> LLVM will generate bpf_arena_cast_user() instructions where
> necessary and JIT will add upper 32-bit of user_vm_start
> to such pointers.
>
> Fix up bpf_map_mmap_sz() to compute mmap size as
> map->value_size * map->max_entries for arrays and
> PAGE_SIZE * map->max_entries for arena.
>
> Don't set BTF at arena creation time, since it doesn't support it.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/lib/bpf/libbpf.c        | 43 ++++++++++++++++++++++++++++++-----
>  tools/lib/bpf/libbpf_probes.c |  7 ++++++
>  2 files changed, 44 insertions(+), 6 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 01f407591a92..4880d623098d 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
>         [BPF_MAP_TYPE_BLOOM_FILTER]             = "bloom_filter",
>         [BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
>         [BPF_MAP_TYPE_CGRP_STORAGE]             = "cgrp_storage",
> +       [BPF_MAP_TYPE_ARENA]                    = "arena",
>  };
>
>  static const char * const prog_type_name[] = {
> @@ -1577,7 +1578,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
>         return map;
>  }
>
> -static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> +static size_t __bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)

please rename this to array_map_mmap_sz, underscores are not very meaningful

>  {
>         const long page_sz = sysconf(_SC_PAGE_SIZE);
>         size_t map_sz;
> @@ -1587,6 +1588,20 @@ static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
>         return map_sz;
>  }
>
> +static size_t bpf_map_mmap_sz(const struct bpf_map *map)
> +{
> +       const long page_sz = sysconf(_SC_PAGE_SIZE);
> +
> +       switch (map->def.type) {
> +       case BPF_MAP_TYPE_ARRAY:
> +               return __bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> +       case BPF_MAP_TYPE_ARENA:
> +               return page_sz * map->def.max_entries;
> +       default:
> +               return 0; /* not supported */
> +       }
> +}
> +
>  static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
>  {
>         void *mmaped;
> @@ -1740,7 +1755,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>         pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
>                  map->name, map->sec_idx, map->sec_offset, def->map_flags);
>
> -       mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> +       mmap_sz = bpf_map_mmap_sz(map);
>         map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
>                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>         if (map->mmaped == MAP_FAILED) {
> @@ -4852,6 +4867,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>         case BPF_MAP_TYPE_SOCKHASH:
>         case BPF_MAP_TYPE_QUEUE:
>         case BPF_MAP_TYPE_STACK:
> +       case BPF_MAP_TYPE_ARENA:
>                 create_attr.btf_fd = 0;
>                 create_attr.btf_key_type_id = 0;
>                 create_attr.btf_value_type_id = 0;
> @@ -4908,6 +4924,21 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
>         if (map->fd == map_fd)
>                 return 0;
>
> +       if (def->type == BPF_MAP_TYPE_ARENA) {
> +               map->mmaped = mmap((void *)map->map_extra, bpf_map_mmap_sz(map),
> +                                  PROT_READ | PROT_WRITE,
> +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> +                                  map_fd, 0);
> +               if (map->mmaped == MAP_FAILED) {
> +                       err = -errno;
> +                       map->mmaped = NULL;
> +                       close(map_fd);
> +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> +                               bpf_map__name(map), err);

seems like we just use `map->name` directly elsewhere in this
function, let's keep it consistent

> +                       return err;
> +               }
> +       }
> +
>         /* Keep placeholder FD value but now point it to the BPF map object.
>          * This way everything that relied on this map's FD (e.g., relocated
>          * ldimm64 instructions) will stay valid and won't need adjustments.
> @@ -8582,7 +8613,7 @@ static void bpf_map__destroy(struct bpf_map *map)
>         if (map->mmaped) {
>                 size_t mmap_sz;
>
> -               mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> +               mmap_sz = bpf_map_mmap_sz(map);
>                 munmap(map->mmaped, mmap_sz);
>                 map->mmaped = NULL;
>         }
> @@ -9830,8 +9861,8 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
>                 int err;
>                 size_t mmap_old_sz, mmap_new_sz;
>

this logic assumes ARRAY (which is the only type so far that could
have `map->mmaped != NULL`), so I think we should error out for ARENA
maps here instead of silently doing the wrong thing?

if (map->type != BPF_MAP_TYPE_ARRAY)
    return -EOPNOTSUPP;

should do



> -               mmap_old_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> -               mmap_new_sz = bpf_map_mmap_sz(size, map->def.max_entries);
> +               mmap_old_sz = bpf_map_mmap_sz(map);
> +               mmap_new_sz = __bpf_map_mmap_sz(size, map->def.max_entries);
>                 err = bpf_map_mmap_resize(map, mmap_old_sz, mmap_new_sz);
>                 if (err) {
>                         pr_warn("map '%s': failed to resize memory-mapped region: %d\n",
> @@ -13356,7 +13387,7 @@ int bpf_object__load_skeleton(struct bpf_object_skeleton *s)
>
>         for (i = 0; i < s->map_cnt; i++) {
>                 struct bpf_map *map = *s->maps[i].map;
> -               size_t mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> +               size_t mmap_sz = bpf_map_mmap_sz(map);
>                 int prot, map_fd = map->fd;
>                 void **mmaped = s->maps[i].mmaped;
>
> diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
> index ee9b1dbea9eb..302188122439 100644
> --- a/tools/lib/bpf/libbpf_probes.c
> +++ b/tools/lib/bpf/libbpf_probes.c
> @@ -338,6 +338,13 @@ static int probe_map_create(enum bpf_map_type map_type)
>                 key_size = 0;
>                 max_entries = 1;
>                 break;
> +       case BPF_MAP_TYPE_ARENA:
> +               key_size        = 0;
> +               value_size      = 0;
> +               max_entries     = 1; /* one page */
> +               opts.map_extra  = 0; /* can mmap() at any address */
> +               opts.map_flags  = BPF_F_MMAPABLE;
> +               break;
>         case BPF_MAP_TYPE_HASH:
>         case BPF_MAP_TYPE_ARRAY:
>         case BPF_MAP_TYPE_PROG_ARRAY:
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
  2024-02-12 18:58   ` Eduard Zingerman
@ 2024-02-13 23:15   ` Andrii Nakryiko
  2024-02-14  0:47     ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> __uint() macro that is used to specify map attributes like:
>   __uint(type, BPF_MAP_TYPE_ARRAY);
>   __uint(map_flags, BPF_F_MMAPABLE);
> is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
>
> Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> In map definition "map_extra" is the only u64 field.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/lib/bpf/bpf_helpers.h |  5 +++++
>  tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
>  2 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index 9c777c21da28..0aeac8ea7af2 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -13,6 +13,11 @@
>  #define __uint(name, val) int (*name)[val]
>  #define __type(name, val) typeof(val) *name
>  #define __array(name, val) typeof(val) *name[]
> +#ifndef __PASTE
> +#define ___PASTE(a,b) a##b
> +#define __PASTE(a,b) ___PASTE(a,b)
> +#endif

we already have ___bpf_concat defined further down in this file (it's a
macro, so ordering shouldn't matter), let's just use that instead of
adding another variant

> +#define __ulong(name, val) enum { __PASTE(__unique_value, __COUNTER__) = val } name
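
For reference, the intended usage is something like this (a hypothetical
arena definition; the address value here is made up):

struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 1000);       /* number of pages */
        __ulong(map_extra, 1ull << 44);  /* optional user VM start address */
} arena SEC(".maps");

The enum trick works because an enumerator value, unlike the array
dimension behind __uint(), can be encoded as BTF_KIND_ENUM64 when it
doesn't fit in 32 bits.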
>
>  /*
>   * Helper macro to place programs, maps, license in
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 4880d623098d..f8158e250327 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -2243,6 +2243,39 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf,
>         return true;
>  }
>
> +static bool get_map_field_long(const char *map_name, const struct btf *btf,
> +                              const struct btf_member *m, __u64 *res)
> +{
> +       const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> +       const char *name = btf__name_by_offset(btf, m->name_off);
> +
> +       if (btf_is_ptr(t))
> +               return false;

It's not great that anyone who uses __uint(map_extra, ...) would get
warnings now.
Let's just teach get_map_field_long to recognize __uint vs __ulong?

Let's call into get_map_field_int() here if we have a pointer, and
then upcast u32 into u64?
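
Something along these lines, perhaps (an untested sketch that folds the
__uint() fallback into the new helper from the patch):

static bool get_map_field_long(const char *map_name, const struct btf *btf,
                               const struct btf_member *m, __u64 *res)
{
        const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
        const char *name = btf__name_by_offset(btf, m->name_off);

        if (btf_is_ptr(t)) {
                /* __uint(...) form: reuse the 32-bit parser and upcast,
                 * so existing __uint(map_extra, ...) users don't start
                 * seeing warnings */
                __u32 res32;

                if (!get_map_field_int(map_name, btf, m, &res32))
                        return false;
                *res = (__u64)res32;
                return true;
        }

        if (!btf_is_enum(t) && !btf_is_enum64(t)) {
                pr_warn("map '%s': attr '%s': expected ENUM or ENUM64, got %s.\n",
                        map_name, name, btf_kind_str(t));
                return false;
        }

        if (btf_vlen(t) != 1) {
                pr_warn("map '%s': attr '%s': invalid __ulong\n",
                        map_name, name);
                return false;
        }

        if (btf_is_enum(t))
                *res = btf_enum(t)->val;
        else
                *res = btf_enum64_value(btf_enum64(t));
        return true;
}

With that, the map_extra branch in parse_btf_map_def() becomes the plain
s/get_map_field_int/get_map_field_long/ mentioned further down.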

> +
> +       if (!btf_is_enum(t) && !btf_is_enum64(t)) {
> +               pr_warn("map '%s': attr '%s': expected enum or enum64, got %s.\n",

seems like get_map_field_int() is using "PTR" and "ARRAY" all caps
spelling in warnings, let's use ENUM and ENUM64 for consistency?

> +                       map_name, name, btf_kind_str(t));
> +               return false;
> +       }
> +
> +       if (btf_vlen(t) != 1) {
> +               pr_warn("map '%s': attr '%s': invalid __ulong\n",
> +                       map_name, name);
> +               return false;
> +       }
> +
> +       if (btf_is_enum(t)) {
> +               const struct btf_enum *e = btf_enum(t);
> +
> +               *res = e->val;
> +       } else {
> +               const struct btf_enum64 *e = btf_enum64(t);
> +
> +               *res = btf_enum64_value(e);
> +       }
> +       return true;
> +}
> +
>  static int pathname_concat(char *buf, size_t buf_sz, const char *path, const char *name)
>  {
>         int len;
> @@ -2476,10 +2509,15 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
>                         map_def->pinning = val;
>                         map_def->parts |= MAP_DEF_PINNING;
>                 } else if (strcmp(name, "map_extra") == 0) {
> -                       __u32 map_extra;
> +                       __u64 map_extra;
>
> -                       if (!get_map_field_int(map_name, btf, m, &map_extra))
> -                               return -EINVAL;
> +                       if (!get_map_field_long(map_name, btf, m, &map_extra)) {
> +                               __u32 map_extra_u32;
> +
> +                               if (!get_map_field_int(map_name, btf, m, &map_extra_u32))
> +                                       return -EINVAL;
> +                               map_extra = map_extra_u32;
> +                       }

with the above change it would be a simple
s/get_map_field_int/get_map_field_long/ (and __u32 -> __u64, of
course)


>                         map_def->map_extra = map_extra;
>                         map_def->parts |= MAP_DEF_MAP_EXTRA;
>                 } else {
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles.
  2024-02-09  4:06 ` [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles Alexei Starovoitov
  2024-02-13  0:34   ` Eduard Zingerman
  2024-02-13 23:11   ` Eduard Zingerman
@ 2024-02-13 23:15   ` Andrii Nakryiko
  2 siblings, 0 replies; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> LLVM automatically places __arena variables into ".arena.1" ELF section.
> When libbpf sees such section it creates internal 'struct bpf_map' LIBBPF_MAP_ARENA
> that is connected to actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
> They share the same kernel's side bpf map and single map_fd.
> Both are emitted into skeleton. Real arena with the name given by bpf program
> in SEC(".maps") and another with "__arena_internal" name.
> All global variables from ".arena.1" section are accessible from user space
> via skel->arena->name_of_var.

This "real arena" vs "__arena_internal" seems like an unnecessary
complication. When processing .arena.1 ELF section, we search for
explicit BPF_MAP_TYPE_ARENA map, right? Great, at that point, we can
use that map and it's map->mmapable pointer (we mmap() anonymous
memory to hold initial values of global variables). We copy init
values into map->mmapable on actual arena struct bpf_map. Then during
map creation (during load) we do a new mmap(map_fd), taking into
account map_extra. Then memcpy() from the original anonymous mmap into
this arena-linked mmap. Then discard the original mmap.

This way we don't have fake maps anymore, we initialize actual map (we
might need to just remember smaller init mmap_sz, which doesn't seem
like a big complication). WDYT?
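
To make that concrete, roughly this kind of flow (a plain
mmap()/memcpy() sketch with placeholder names, not actual libbpf code):

#include <string.h>
#include <sys/mman.h>

/* called during load, once the arena map_fd exists; 'init_data' is the
 * anonymous mapping holding the .arena.1 initial values */
static void *arena_mmap_and_init(int map_fd, size_t arena_sz,
                                 unsigned long long map_extra,
                                 void *init_data, size_t init_sz)
{
        void *arena;

        arena = mmap((void *)map_extra, arena_sz, PROT_READ | PROT_WRITE,
                     map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
                     map_fd, 0);
        if (arena == MAP_FAILED)
                return NULL;

        /* copy initial values of __arena globals; this faults-in arena pages */
        memcpy(arena, init_data, init_sz);

        /* the temporary anonymous mapping is no longer needed */
        munmap(init_data, init_sz);
        return arena;
}

map->mmaped would presumably be repointed at the returned mapping, so
the skeleton fix-up in bpf_object__load_skeleton() mentioned below
keeps working.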

BTW, I think bpf_object__load_skeleton() can re-assign skeleton's
arena data pointer, so user accessing skel->arena->var before and
after skeleton load will be getting correct pointer.

>
> For bss/data/rodata the skeleton/libbpf perform the following sequence:
> 1. addr = mmap(MAP_ANONYMOUS)
> 2. user space optionally modifies global vars
> 3. map_fd = bpf_create_map()
> 4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
> 5. mmap(addr, MAP_FIXED, map_fd)
> after step 5 user space sees the values it wrote at step 2 at the same addresses
>
> arena doesn't support update_map_elem. Hence skeleton/libbpf do:
> 1. addr = mmap(MAP_ANONYMOUS)
> 2. user space optionally modifies global vars
> 3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
> 4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
> 5. memcpy(real_addr, addr) // this will fault-in and allocate pages
> 6. munmap(addr)
>
> At the end look and feel of global data vs __arena global data is the same from bpf prog pov.
>
> Another complication is:
> struct {
>   __uint(type, BPF_MAP_TYPE_ARENA);
> } arena SEC(".maps");
>
> int __arena foo;
> int bar;
>
>   ptr1 = &foo;   // relocation against ".arena.1" section
>   ptr2 = &arena; // relocation against ".maps" section
>   ptr3 = &bar;   // relocation against ".bss" section
>
> For the kernel, ptr1 and ptr2 point to the same arena's map_fd
> while ptr3 points to a different global array's map_fd.
> For the verifier:
> ptr1->type == unknown_scalar
> ptr2->type == const_ptr_to_map
> ptr3->type == ptr_to_map_value
>
> after the verifier and for JIT all 3 ptr-s are normal ld_imm64 insns.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  tools/bpf/bpftool/gen.c |  13 ++++-
>  tools/lib/bpf/libbpf.c  | 102 +++++++++++++++++++++++++++++++++++-----
>  2 files changed, 101 insertions(+), 14 deletions(-)
>

[...]

> @@ -1718,10 +1722,34 @@ static int
>  bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>                               const char *real_name, int sec_idx, void *data, size_t data_sz)
>  {
> +       const long page_sz = sysconf(_SC_PAGE_SIZE);
> +       struct bpf_map *map, *arena = NULL;
>         struct bpf_map_def *def;
> -       struct bpf_map *map;
>         size_t mmap_sz;
> -       int err;
> +       int err, i;
> +
> +       if (type == LIBBPF_MAP_ARENA) {
> +               for (i = 0; i < obj->nr_maps; i++) {
> +                       map = &obj->maps[i];
> +                       if (map->def.type != BPF_MAP_TYPE_ARENA)
> +                               continue;
> +                       arena = map;
> +                       real_name = "__arena_internal";
> +                       mmap_sz = bpf_map_mmap_sz(map);
> +                       if (roundup(data_sz, page_sz) > mmap_sz) {
> +                               pr_warn("Declared arena map size %zd is too small to hold"
> +                                       "global __arena variables of size %zd\n",
> +                                       mmap_sz, data_sz);
> +                               return -E2BIG;
> +                       }
> +                       break;
> +               }
> +               if (!arena) {
> +                       pr_warn("To use global __arena variables the arena map should"
> +                               "be declared explicitly in SEC(\".maps\")\n");
> +                       return -ENOENT;
> +               }

so basically here we found arena, let's arena->mmapable =
mmap(MAP_ANONYMOUS) here, memcpy(arena->mmapable, data, data_sz) and
exit early, not doing bpf_object__add_map()?


> +       }
>
>         map = bpf_object__add_map(obj);
>         if (IS_ERR(map))
> @@ -1732,6 +1760,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>         map->sec_offset = 0;
>         map->real_name = strdup(real_name);
>         map->name = internal_map_name(obj, real_name);
> +       map->arena = arena;
>         if (!map->real_name || !map->name) {
>                 zfree(&map->real_name);
>                 zfree(&map->name);

[...]

> @@ -13437,6 +13508,11 @@ int bpf_object__load_skeleton(struct bpf_object_skeleton *s)
>                         continue;
>                 }
>
> +               if (map->arena) {
> +                       *mmaped = map->arena->mmaped;
> +                       continue;
> +               }
> +

yep, I was going to suggest that we can fix up this pointer in
load_skeleton, nice


>                 if (map->def.map_flags & BPF_F_RDONLY_PROG)
>                         prot = PROT_READ;
>                 else
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles.
  2024-02-13 23:11   ` Eduard Zingerman
@ 2024-02-13 23:17     ` Andrii Nakryiko
  2024-02-13 23:36       ` Eduard Zingerman
  2024-02-14  1:02     ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-13 23:17 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Alexei Starovoitov, bpf, daniel, andrii, memxor, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Tue, Feb 13, 2024 at 3:11 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > LLVM automatically places __arena variables into ".arena.1" ELF section.
> > When libbpf sees such section it creates internal 'struct bpf_map' LIBBPF_MAP_ARENA
> > that is connected to actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
> > They share the same kernel's side bpf map and single map_fd.
> > Both are emitted into skeleton. Real arena with the name given by bpf program
> > in SEC(".maps") and another with "__arena_internal" name.
> > All global variables from ".arena.1" section are accessible from user space
> > via skel->arena->name_of_var.
> >
> > For bss/data/rodata the skeleton/libbpf perform the following sequence:
> > 1. addr = mmap(MAP_ANONYMOUS)
> > 2. user space optionally modifies global vars
> > 3. map_fd = bpf_create_map()
> > 4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
> > 5. mmap(addr, MAP_FIXED, map_fd)
> > after step 5 user space sees the values it wrote at step 2 at the same addresses
> >
> > arena doesn't support update_map_elem. Hence skeleton/libbpf do:
> > 1. addr = mmap(MAP_ANONYMOUS)
> > 2. user space optionally modifies global vars
> > 3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
> > 4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
> > 5. memcpy(real_addr, addr) // this will fault-in and allocate pages
> > 6. munmap(addr)
> >
> > At the end look and feel of global data vs __arena global data is the same from bpf prog pov.
>
> [...]
>
> So, at first I thought that having two maps is a bit of a hack.

yep, that was my instinct as well

> However, after trying to make it work with only one map I don't really
> like that either :)

Can you elaborate? See my reply to Alexei; I wonder how you were
thinking of doing this?

>
> The patch looks good to me, have not spotted any logical issues.
>
> I have two questions if you don't mind:
>
> First is regarding initialization data.
> In bpf_object__init_internal_map() the amount of bpf_map_mmap_sz(map)
> bytes is mmaped and only data_sz bytes are copied,
> then bpf_map_mmap_sz(map) bytes are copied in bpf_object__create_maps().
> Is Linux/libc smart enough to skip action on pages which were mmaped but
> never touched?
>
> Second is regarding naming.
> Currently only one arena is supported, and generated skel has a single '->arena' field.
> Is there a plan to support multiple arenas at some point?
> If so, should '->arena' field use the same name as arena map declared in program?
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  2024-02-10  7:03       ` Kumar Kartikeya Dwivedi
@ 2024-02-13 23:19         ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13 23:19 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: David Vernet, bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Fri, Feb 9, 2024 at 11:03 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 10 Feb 2024 at 05:35, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Feb 9, 2024 at 3:14 PM David Vernet <void@manifault.com> wrote:
> > >
> > > > +
> > > > +#ifndef arena_container_of
> > >
> > > Why is this ifndef required if we have a pragma once above?
> >
> > Just a habit to check for a macro before defining it.
> >
> > > Obviously it's way better for us to actually have arenas in the interim
> > > so this is fine for now, but UAF bugs could potentially be pretty
> > > painful until we get proper exception unwinding support.
> >
> > Detection that arena access faulted doesn't have to come after
> > exception unwinding. Exceptions vs cancellable progs are also different.
>
> What do you mean exactly by 'cancellable progs'? That they can be
> interrupted at any (or well-known) points and stopped? I believe
> whatever plumbing was done to enable exceptions will be useful there
> as well. The verifier would just need to know e.g. that a load into
> PTR_TO_ARENA may fault, and thus generate descriptors for all frames
> for that pc. Then, at runtime, you could technically release all
> resources by looking up the frame descriptor and unwind the stack and
> return back to the caller of the prog.

I don't think it's a scalable approach.
I'm still trying to understand your exceptions part 2 series,
but from what I understand so far the scalability is a real concern.

>
> > A record of the line in bpf prog that caused the first fault is probably
> > good enough for prog debugging.
> >
>
> I think it would make more sense to abort the program by default,
> because use-after-free in the arena most certainly means a bug in the
> program.

yes, but from a debuggability pov, aborting vs safely continuing while
remembering the first wrong access is the same thing.
Aborting by itself also doesn't mean that the prog is auto-detached.
It may run again a split second later and not hit the abort condition.

Recording the first wrong access (either an abort or a pf in the arena)
is a must-have regardless.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-13 23:14   ` Andrii Nakryiko
@ 2024-02-13 23:29     ` Alexei Starovoitov
  2024-02-14  0:03       ` Andrii Nakryiko
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-13 23:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 3:14 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
>
> here we can then use MAX_ARENA_SZ

I thought about it, but decided against it, since it will be
too tempting to change it without understanding the consequences.
Like...

> > +       if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
> > +               /* user vma must not cross 32-bit boundary */
> > +               return ERR_PTR(-ERANGE);

here the >> 32 is related to the size, as is pretty much every such shift.
Hence 1ull << 32 matches all those shifts,
and they have to be analyzed together.

> > +       apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> > +                                    KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);
>
> I'm still reading the rest (so it might become obvious), but this
> KERN_VM_SZ - GUARD_SZ / 2 is a bit surprising. I understand that
> kern_vm_start is shifted by GUARD_SZ/2, but is the intent here is to
> actually go beyond maximum 4GB by GUARD_SZ/2, or the intent was to
> unmap 4GB (MAX_ARENA_SZ)?

here it's just the range for apply_to_existing_page_range() to walk.
There are no pages mapped into the lower GUARD_SZ / 2 and upper GUARD_SZ / 2.
So no reason to ask apply_to_existing_page_range() to walk
the whole KERN_VM_SZ.

> > +       ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
> > +       if (IS_ERR_VALUE(ret))
> > +                return 0;
>
> Can you leave a comment why we are swallowing errors, if this is intentional?

argh. good catch. it's a bug.

> > +       if ((ret >> 32) == ((ret + len - 1) >> 32))
> > +               return ret;
> > +       if (WARN_ON_ONCE(arena->user_vm_start))
> > +               /* checks at map creation time should prevent this */
> > +               return -EFAULT;
> > +       return round_up(ret, 1ull << 32);
>
> this is still probably MAX_ARENA_SZ, no?

and here it would be wrong to do that.
This line has to match the logic of the 'if' a few lines above.
Hiding it behind a macro is a dangerous obfuscation.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles.
  2024-02-13 23:17     ` Andrii Nakryiko
@ 2024-02-13 23:36       ` Eduard Zingerman
  2024-02-14  0:09         ` Andrii Nakryiko
  0 siblings, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-13 23:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, daniel, andrii, memxor, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Tue, 2024-02-13 at 15:17 -0800, Andrii Nakryiko wrote:

[...]

> > So, at first I thought that having two maps is a bit of a hack.
> 
> yep, that was my instinct as well
> 
> > However, after trying to make it work with only one map I don't really
> > like that either :)
> 
> Can you elaborate? see my reply to Alexei, I wonder how did you think
> about doing this?

Relocations in the ELF file are against a new section: ".arena.1".
This works nicely with the logic in bpf_program__record_reloc().
If a single map is used, we effectively need to track two indexes for
the map section:
- one used for relocations against the map variables themselves
  (named "generic map reference relocation" in the function code);
- one used for relocations against ".arena.1"
  (named "global data map relocation" in the function code).

This is what spooked me:
- either bpf_object__init_internal_map() would have a specialized
  branch for arenas, as with the current approach;
- or bpf_program__record_reloc() would have a specialized branch for arenas,
  as with the one-map approach.

Additionally, the skel generation logic currently assumes that mmapable
bindings are generated only for internal maps,
but that is probably not a big deal.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-13 23:29     ` Alexei Starovoitov
@ 2024-02-14  0:03       ` Andrii Nakryiko
  2024-02-14  0:14         ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-14  0:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 3:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 13, 2024 at 3:14 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> >
> > here we can then use MAX_ARENA_SZ
>
> I thought about it, but decided against it, since it will be
> too tempting to change it without understanding the consequences.
> Like...
>
> > > +       if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
> > > +               /* user vma must not cross 32-bit boundary */
> > > +               return ERR_PTR(-ERANGE);
>
> here >> 32 is relevant to size and pretty much every such shift.
> Hence 1ull << 32 matches all those shifts.
> And they have to be analyzed together.
>
> > > +       apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> > > +                                    KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);
> >
> > I'm still reading the rest (so it might become obvious), but this
> > KERN_VM_SZ - GUARD_SZ / 2 is a bit surprising. I understand that
> > kern_vm_start is shifted by GUARD_SZ/2, but is the intent here is to
> > actually go beyond maximum 4GB by GUARD_SZ/2, or the intent was to
> > unmap 4GB (MAX_ARENA_SZ)?
>
> here it's just the range for apply_to_existing_page_range() to walk.
> There are no pages mapped into the lower GUARD_SZ / 2 and upper GUARD_SZ / 2.
> So no reason to ask apply_to_existing_page_range() to walk
> the whole KERN_VM_SZ.

right, so I expected to see KERN_VM_SZ - GUARD_SZ, but instead we have
KERN_VM_SZ - GUARD_SZ/2 (so we'll iterate GUARD_SZ/2 too far, into the
upper guard pages, which are supposed to not be allocated), which is
why I'm asking. It's minor, I was probing whether I'm missing something
subtle.

>
> > > +       ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
> > > +       if (IS_ERR_VALUE(ret))
> > > +                return 0;
> >
> > Can you leave a comment why we are swallowing errors, if this is intentional?
>
> argh. good catch. it's a bug.
>
> > > +       if ((ret >> 32) == ((ret + len - 1) >> 32))
> > > +               return ret;
> > > +       if (WARN_ON_ONCE(arena->user_vm_start))
> > > +               /* checks at map creation time should prevent this */
> > > +               return -EFAULT;
> > > +       return round_up(ret, 1ull << 32);
> >
> > this is still probably MAX_ARENA_SZ, no?
>
> and here it would be wrong to do that.
> This line has to match the logic with 'if' few lines above.
> Hiding behind macro is a dangerous obfuscation.

ok, no big deal

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles.
  2024-02-13 23:36       ` Eduard Zingerman
@ 2024-02-14  0:09         ` Andrii Nakryiko
  2024-02-14  0:16           ` Eduard Zingerman
  2024-02-14  1:24           ` Alexei Starovoitov
  0 siblings, 2 replies; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-14  0:09 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Alexei Starovoitov, bpf, daniel, andrii, memxor, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Tue, Feb 13, 2024 at 3:37 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-02-13 at 15:17 -0800, Andrii Nakryiko wrote:
>
> [...]
>
> > > So, at first I thought that having two maps is a bit of a hack.
> >
> > yep, that was my instinct as well
> >
> > > However, after trying to make it work with only one map I don't really
> > > like that either :)
> >
> > Can you elaborate? see my reply to Alexei, I wonder how did you think
> > about doing this?
>
> Relocations in the ELF file are against a new section: ".arena.1".
> This works nicely with logic in bpf_program__record_reloc().
> If single map is used, we effectively need to track two indexes for
> the map section:
> - one used for relocations against map variables themselves
>   (named "generic map reference relocation" in the function code);
> - one used for relocations against ".arena.1"
>   (named "global data map relocation" in the function code).
>
> This spooked me off:
> - either bpf_object__init_internal_map() would have a specialized
>   branch for arenas, as with current approach;
> - or bpf_program__record_reloc() would have a specialized branch for arenas,
>   as with one map approach.

Yes, relocations would know about .arena.1, but it's a pretty simple
check in a few places. We basically have arena *definition* sec_idx
(corresponding to SEC(".maps")) and arena *data* sec_idx. The latter
is what is recorded for global variables in .arena.1. We can remember
this arena data sec_idx in struct bpf_object once during ELF
processing, and then just special case it internally in a few places.

The "fake" bpf_map for __arena_internal is user-visible and requires
autocreate=false tricks, etc. I feel like it's a worse tradeoff from a
user API perspective than a few extra ARENA-specific internal checks
(which we already have a few anyways, ARENA is not completely
transparent internally anyways).


>
> Additionally, skel generation logic currently assumes that mmapable
> bindings would be generated only for internal maps,
> but that is probably not a big deal.

yeah, it's not, we will have STRUCT_OPS maps handled specially anyway
(Kui-Feng posted an RFC already), so ARENA won't be the only
special case

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena.
  2024-02-14  0:03       ` Andrii Nakryiko
@ 2024-02-14  0:14         ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  0:14 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 4:03 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
 > > > +       apply_to_existing_page_range(&init_mm,
bpf_arena_get_kern_vm_start(arena),
> > > > +                                    KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);
> > >
> > > I'm still reading the rest (so it might become obvious), but this
> > > KERN_VM_SZ - GUARD_SZ / 2 is a bit surprising. I understand that
> > > kern_vm_start is shifted by GUARD_SZ/2, but is the intent here is to
> > > actually go beyond maximum 4GB by GUARD_SZ/2, or the intent was to
> > > unmap 4GB (MAX_ARENA_SZ)?
> >
> > here it's just the range for apply_to_existing_page_range() to walk.
> > There are no pages mapped into the lower GUARD_SZ / 2 and upper GUARD_SZ / 2.
> > So no reason to ask apply_to_existing_page_range() to walk
> > the whole KERN_VM_SZ.
>
> right, so I expected to see KERN_VM_SZ - GUARD_SZ, but instead we have
> KERN_VM_SZ - GUARD_SZ/2 (so we'll iterate GUARD_SZ/2 too much, into
> upper guard pages which are supposed to be not allocated), which is
> why I'm asking. It's minor, I was probing if I'm missing something
> subtle.

ahh. you're correct, of course. braino.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles.
  2024-02-14  0:09         ` Andrii Nakryiko
@ 2024-02-14  0:16           ` Eduard Zingerman
  2024-02-14  0:29             ` Andrii Nakryiko
  2024-02-14  1:24           ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-14  0:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, daniel, andrii, memxor, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Tue, 2024-02-13 at 16:09 -0800, Andrii Nakryiko wrote:
[...]

> The "fake" bpf_map for __arena_internal is user-visible and requires
> autocreate=false tricks, etc. I feel like it's a worse tradeoff from a
> user API perspective than a few extra ARENA-specific internal checks
> (which we already have a few anyways, ARENA is not completely
> transparent internally anyways).

By user-visible you mean when doing "bpf_object__for_each_map()", right?
Shouldn't users ignore bpf_map__is_internal() maps?
But I agree that having one map might be a bit cleaner.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
  2024-02-13 23:14   ` Andrii Nakryiko
@ 2024-02-14  0:26     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  0:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 3:15 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 8:06 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
> >
> > Note, when the verifier sees:
> >
> > __weak void foo(struct bar *p)
> >
> > it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars.
> > Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  include/linux/bpf.h   |  1 +
> >  kernel/bpf/btf.c      | 19 +++++++++++++++----
> >  kernel/bpf/verifier.c | 15 +++++++++++++++
> >  3 files changed, 31 insertions(+), 4 deletions(-)
> >
>
> [...]
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 5eeb9bf7e324..fa49602194d5 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -9348,6 +9348,18 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
> >                                 bpf_log(log, "arg#%d is expected to be non-NULL\n", i);
> >                                 return -EINVAL;
> >                         }
> > +               } else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
> > +                       /*
> > +                        * Can pass any value and the kernel won't crash, but
> > +                        * only PTR_TO_ARENA or SCALAR make sense. Everything
> > +                        * else is a bug in the bpf program. Point it out to
> > +                        * the user at the verification time instead of
> > +                        * run-time debug nightmare.
> > +                        */
> > +                       if (reg->type != PTR_TO_ARENA && reg->type != SCALAR_VALUE) {
>
> the comment above doesn't explain why it's ok to pass SCALAR_VALUE. Is
> it because PTR_TO_ARENA will become SCALAR_VALUE after some arithmetic
> operations and we don't want to regress user experience? If that's the
> case, what's the way for user to convert SCALAR_VALUE back to
> PTR_TO_ARENA without going through global subprog? bpf_cast_xxx
> instruction through assembly?

The bpf_cast_xxx inline asm should never be used.
It's for selftests only, until llvm 19 is released and lands in distros.
A scalar_value can come up in lots of cases.
Any pointer dereference returns a scalar.
Most of the time all arena math is on scalars.
Scalars are passed into global and static funcs and
become ptr_to_arena right before the LDX/STX through that pointer.
Sometimes llvm still does a bit of math after a scalar became ptr_to_arena,
hence the needs_zext flag to downgrade alu64 to alu32.
In the selftests that produce non-trivial bpf progs
there are 4 such insns with needs_zext in arena_htab.bpf.o,
along with 23 cast_kern, zero cast_user,
and 57 ldx/stx from arena.
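
For illustration, this is roughly the shape of code being discussed (a
rough BPF C sketch using the __arg_arena tag from patch 11; 'struct
node' and the function are made up):

struct node {
        int val;
};

/* the caller typically passes a value the verifier sees as a SCALAR
 * (e.g. the result of arena pointer math); inside the subprog the arg
 * is PTR_TO_ARENA, so the load below goes through the arena and the
 * JIT rebases it onto kern_vm_start */
__weak int node_get_val(struct node *n __arg_arena)
{
        return n->val;
}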

>
> > +                               bpf_log(log, "R%d is not a pointer to arena or scalar.\n", regno);
> > +                               return -EINVAL;
> > +                       }
> >                 } else if (arg->arg_type == (ARG_PTR_TO_DYNPTR | MEM_RDONLY)) {
> >                         ret = process_dynptr_func(env, regno, -1, arg->arg_type, 0);
> >                         if (ret)
> > @@ -20329,6 +20341,9 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> >                                 reg->btf = bpf_get_btf_vmlinux(); /* can't fail at this point */
> >                                 reg->btf_id = arg->btf_id;
> >                                 reg->id = ++env->id_gen;
> > +                       } else if (base_type(arg->arg_type) == ARG_PTR_TO_ARENA) {
> > +                               /* caller can pass either PTR_TO_ARENA or SCALAR */
> > +                               mark_reg_unknown(env, regs, i);
>
> shouldn't we set the register type to PTR_TO_ARENA here?

No. It has to be a scalar.
It's not ok to deref it with kern_vm_base yet.
It's a full 64-bit value here and the upper 32 bits are likely the correct user_vm_start.

Hence my struggle with this __arg_arena feature, since its only real purpose
is to keep the verifier from complaining at the call site.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global varaibles.
  2024-02-14  0:16           ` Eduard Zingerman
@ 2024-02-14  0:29             ` Andrii Nakryiko
  0 siblings, 0 replies; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-14  0:29 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Alexei Starovoitov, bpf, daniel, andrii, memxor, tj, brho,
	hannes, lstoakes, akpm, urezki, hch, linux-mm, kernel-team

On Tue, Feb 13, 2024 at 4:16 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-02-13 at 16:09 -0800, Andrii Nakryiko wrote:
> [...]
>
> > The "fake" bpf_map for __arena_internal is user-visible and requires
> > autocreate=false tricks, etc. I feel like it's a worse tradeoff from a
> > user API perspective than a few extra ARENA-specific internal checks
> > (which we already have a few anyways, ARENA is not completely
> > transparent internally anyways).
>
> By user-visible you mean when doing "bpf_object__for_each_map()", right?
> Shouldn't users ignore bpf_map__is_internal() maps?

no, not really, they are valid maps, and you can call
bpf_map__set_value_size() on them (for example). Similarly for
bpf_map__initial_value(). And the odd part here is that before
load you would have to use bpf_map__initial_value(__arena_internal) if
you want to tune something, while after load it would be
bpf_map__initial_value(real_arena).

Whereas if we combine them, it actually works more naturally.
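
A sketch of how that combined-map flow would look from the user side,
assuming bpf_map__initial_value() is made to work directly on the arena
map as discussed (the "arena" map name and the data layout are made up):

#include <bpf/libbpf.h>

static int tune_arena_before_load(struct bpf_object *obj)
{
        struct bpf_map *arena = bpf_object__find_map_by_name(obj, "arena");
        size_t sz;
        long *first;

        if (!arena)
                return -1;

        /* pre-load: tweak initial values of __arena globals */
        first = bpf_map__initial_value(arena, &sz);
        if (!first || sz < sizeof(*first))
                return -1;

        *first = 42;    /* visible in the arena after bpf_object__load() */
        return 0;
}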

> But I agree that having one map might be a bit cleaner.

+1

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena.
  2024-02-13 23:15   ` Andrii Nakryiko
@ 2024-02-14  0:32     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  0:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 3:15 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > mmap() bpf_arena right after creation, since the kernel needs to
> > remember the address returned from mmap. This is user_vm_start.
> > LLVM will generate bpf_arena_cast_user() instructions where
> > necessary and JIT will add upper 32-bit of user_vm_start
> > to such pointers.
> >
> > Fix up bpf_map_mmap_sz() to compute mmap size as
> > map->value_size * map->max_entries for arrays and
> > PAGE_SIZE * map->max_entries for arena.
> >
> > Don't set BTF at arena creation time, since it doesn't support it.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  tools/lib/bpf/libbpf.c        | 43 ++++++++++++++++++++++++++++++-----
> >  tools/lib/bpf/libbpf_probes.c |  7 ++++++
> >  2 files changed, 44 insertions(+), 6 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 01f407591a92..4880d623098d 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -185,6 +185,7 @@ static const char * const map_type_name[] = {
> >         [BPF_MAP_TYPE_BLOOM_FILTER]             = "bloom_filter",
> >         [BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
> >         [BPF_MAP_TYPE_CGRP_STORAGE]             = "cgrp_storage",
> > +       [BPF_MAP_TYPE_ARENA]                    = "arena",
> >  };
> >
> >  static const char * const prog_type_name[] = {
> > @@ -1577,7 +1578,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj)
> >         return map;
> >  }
> >
> > -static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> > +static size_t __bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
>
> please rename this to array_map_mmap_sz, underscores are not very meaningful

makes sense.

> >  {
> >         const long page_sz = sysconf(_SC_PAGE_SIZE);
> >         size_t map_sz;
> > @@ -1587,6 +1588,20 @@ static size_t bpf_map_mmap_sz(unsigned int value_sz, unsigned int max_entries)
> >         return map_sz;
> >  }
> >
> > +static size_t bpf_map_mmap_sz(const struct bpf_map *map)
> > +{
> > +       const long page_sz = sysconf(_SC_PAGE_SIZE);
> > +
> > +       switch (map->def.type) {
> > +       case BPF_MAP_TYPE_ARRAY:
> > +               return __bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > +       case BPF_MAP_TYPE_ARENA:
> > +               return page_sz * map->def.max_entries;
> > +       default:
> > +               return 0; /* not supported */
> > +       }
> > +}
> > +
> >  static int bpf_map_mmap_resize(struct bpf_map *map, size_t old_sz, size_t new_sz)
> >  {
> >         void *mmaped;
> > @@ -1740,7 +1755,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
> >         pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
> >                  map->name, map->sec_idx, map->sec_offset, def->map_flags);
> >
> > -       mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > +       mmap_sz = bpf_map_mmap_sz(map);
> >         map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
> >                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >         if (map->mmaped == MAP_FAILED) {
> > @@ -4852,6 +4867,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> >         case BPF_MAP_TYPE_SOCKHASH:
> >         case BPF_MAP_TYPE_QUEUE:
> >         case BPF_MAP_TYPE_STACK:
> > +       case BPF_MAP_TYPE_ARENA:
> >                 create_attr.btf_fd = 0;
> >                 create_attr.btf_key_type_id = 0;
> >                 create_attr.btf_value_type_id = 0;
> > @@ -4908,6 +4924,21 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
> >         if (map->fd == map_fd)
> >                 return 0;
> >
> > +       if (def->type == BPF_MAP_TYPE_ARENA) {
> > +               map->mmaped = mmap((void *)map->map_extra, bpf_map_mmap_sz(map),
> > +                                  PROT_READ | PROT_WRITE,
> > +                                  map->map_extra ? MAP_SHARED | MAP_FIXED : MAP_SHARED,
> > +                                  map_fd, 0);
> > +               if (map->mmaped == MAP_FAILED) {
> > +                       err = -errno;
> > +                       map->mmaped = NULL;
> > +                       close(map_fd);
> > +                       pr_warn("map '%s': failed to mmap bpf_arena: %d\n",
> > +                               bpf_map__name(map), err);
>
> seems like we just use `map->name` directly elsewhere in this
> function, let's keep it consistent

that was to match the next patch, since arena is using real_name.
map->name is also correct and will have the same name here.
The next patch will have two arena maps, but one will never be
passed into this function to create a real kernel map.
So I can use map->name here, but bpf_map__name() is a bit more correct.

> > +                       return err;
> > +               }
> > +       }
> > +
> >         /* Keep placeholder FD value but now point it to the BPF map object.
> >          * This way everything that relied on this map's FD (e.g., relocated
> >          * ldimm64 instructions) will stay valid and won't need adjustments.
> > @@ -8582,7 +8613,7 @@ static void bpf_map__destroy(struct bpf_map *map)
> >         if (map->mmaped) {
> >                 size_t mmap_sz;
> >
> > -               mmap_sz = bpf_map_mmap_sz(map->def.value_size, map->def.max_entries);
> > +               mmap_sz = bpf_map_mmap_sz(map);
> >                 munmap(map->mmaped, mmap_sz);
> >                 map->mmaped = NULL;
> >         }
> > @@ -9830,8 +9861,8 @@ int bpf_map__set_value_size(struct bpf_map *map, __u32 size)
> >                 int err;
> >                 size_t mmap_old_sz, mmap_new_sz;
> >
>
> this logic assumes ARRAY (which are the only ones so far that could
> have `map->mapped != NULL`, so I think we should error out for ARENA
> maps here, instead of silently doing the wrong thing?
>
> if (map->type != BPF_MAP_TYPE_ARRAY)
>     return -EOPNOTSUPP;
>
> should do

Good point. Will do.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-13 23:15   ` Andrii Nakryiko
@ 2024-02-14  0:47     ` Alexei Starovoitov
  2024-02-14  0:51       ` Andrii Nakryiko
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  0:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 3:15 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > __uint() macro that is used to specify map attributes like:
> >   __uint(type, BPF_MAP_TYPE_ARRAY);
> >   __uint(map_flags, BPF_F_MMAPABLE);
> > is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
> >
> > Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> > In map definition "map_extra" is the only u64 field.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  tools/lib/bpf/bpf_helpers.h |  5 +++++
> >  tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
> >  2 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> > index 9c777c21da28..0aeac8ea7af2 100644
> > --- a/tools/lib/bpf/bpf_helpers.h
> > +++ b/tools/lib/bpf/bpf_helpers.h
> > @@ -13,6 +13,11 @@
> >  #define __uint(name, val) int (*name)[val]
> >  #define __type(name, val) typeof(val) *name
> >  #define __array(name, val) typeof(val) *name[]
> > +#ifndef __PASTE
> > +#define ___PASTE(a,b) a##b
> > +#define __PASTE(a,b) ___PASTE(a,b)
> > +#endif
>
> we already have ___bpf_concat defined further in this file (it's macro
> so ordering shouldn't matter), let's just use that instead of adding
> another variant

Ohh. forgot about this one. will do.

> > +#define __ulong(name, val) enum { __PASTE(__unique_value, __COUNTER__) = val } name
> >
> >  /*
> >   * Helper macro to place programs, maps, license in
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 4880d623098d..f8158e250327 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -2243,6 +2243,39 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf,
> >         return true;
> >  }
> >
> > +static bool get_map_field_long(const char *map_name, const struct btf *btf,
> > +                              const struct btf_member *m, __u64 *res)
> > +{
> > +       const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> > +       const char *name = btf__name_by_offset(btf, m->name_off);
> > +
> > +       if (btf_is_ptr(t))
> > +               return false;
>
> It's not great that anyone that uses __uint(map_extra, ...) would get
> warnings now.

What warning?
This specific check makes it fall back to the ptr path without a warning.
We have a bloom filter test that uses map_extra.
No warnings there.
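
For reference, the two spellings look roughly like this in map definitions
(a sketch; the values are placeholders, not copied from the selftests):

/* 64-bit attribute: only __ulong() can express it */
struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 1000);          /* number of arena pages */
        __ulong(map_extra, 2ull << 44);     /* user_vm_start hint, > 32 bits */
} arena SEC(".maps");

/* 32-bit attribute: plain __uint() keeps working */
struct {
        __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
        __type(value, __u32);
        __uint(max_entries, 1000);
        __uint(map_extra, 3);               /* number of hash functions, fits in u32 */
} bloom SEC(".maps");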

> Let's just teach get_map_field_long to recognize __uint vs __ulong?
>
> Let's call into get_map_field_int() here if we have a pointer, and
> then upcast u32 into u64?

makes sense.

> > +
> > +       if (!btf_is_enum(t) && !btf_is_enum64(t)) {
> > +               pr_warn("map '%s': attr '%s': expected enum or enum64, got %s.\n",
>
> seems like get_map_field_int() is using "PTR" and "ARRAY" all caps
> spelling in warnings, let's use ENUM and ENUM64 for consistency?

done.

> > +                       map_name, name, btf_kind_str(t));
> > +               return false;
> > +       }
> > +
> > +       if (btf_vlen(t) != 1) {
> > +               pr_warn("map '%s': attr '%s': invalid __ulong\n",
> > +                       map_name, name);
> > +               return false;
> > +       }
> > +
> > +       if (btf_is_enum(t)) {
> > +               const struct btf_enum *e = btf_enum(t);
> > +
> > +               *res = e->val;
> > +       } else {
> > +               const struct btf_enum64 *e = btf_enum64(t);
> > +
> > +               *res = btf_enum64_value(e);
> > +       }
> > +       return true;
> > +}
> > +
> >  static int pathname_concat(char *buf, size_t buf_sz, const char *path, const char *name)
> >  {
> >         int len;
> > @@ -2476,10 +2509,15 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
> >                         map_def->pinning = val;
> >                         map_def->parts |= MAP_DEF_PINNING;
> >                 } else if (strcmp(name, "map_extra") == 0) {
> > -                       __u32 map_extra;
> > +                       __u64 map_extra;
> >
> > -                       if (!get_map_field_int(map_name, btf, m, &map_extra))
> > -                               return -EINVAL;
> > +                       if (!get_map_field_long(map_name, btf, m, &map_extra)) {
> > +                               __u32 map_extra_u32;
> > +
> > +                               if (!get_map_field_int(map_name, btf, m, &map_extra_u32))
> > +                                       return -EINVAL;
> > +                               map_extra = map_extra_u32;
> > +                       }
>
> with the above change it would be a simple
> s/get_map_field_int/get_map_field_long/ (and __u32 -> __u64, of
> course)

so this logic will move into get_map_field_long.
makes sense.

I thought about making get_map_field_int() handle enums,
but way too many places would need refactoring, since it's called like:
get_map_field_int(map_name, btf, m, &map_def->map_type)
get_map_field_int(map_name, btf, m, &map_def->max_entries)
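
Roughly, the combined helper would then look like this (just a sketch of the
direction above, not the final code):

static bool get_map_field_long(const char *map_name, const struct btf *btf,
                               const struct btf_member *m, __u64 *res)
{
        const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);

        if (btf_is_ptr(t)) {
                /* __uint(name, val) case: reuse the 32-bit parser and upcast */
                __u32 res32;

                if (!get_map_field_int(map_name, btf, m, &res32))
                        return false;
                *res = (__u64)res32;
                return true;
        }

        /* ... enum/enum64 handling exactly as in the patch above ... */
        return true;
}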

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF.
  2024-02-14  0:47     ` Alexei Starovoitov
@ 2024-02-14  0:51       ` Andrii Nakryiko
  0 siblings, 0 replies; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-14  0:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 4:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 13, 2024 at 3:15 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Feb 8, 2024 at 8:07 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > __uint() macro that is used to specify map attributes like:
> > >   __uint(type, BPF_MAP_TYPE_ARRAY);
> > >   __uint(map_flags, BPF_F_MMAPABLE);
> > > is limited to 32-bit, since BTF_KIND_ARRAY has u32 "number of elements" field.
> > >
> > > Introduce __ulong() macro that allows specifying values bigger than 32-bit.
> > > In map definition "map_extra" is the only u64 field.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  tools/lib/bpf/bpf_helpers.h |  5 +++++
> > >  tools/lib/bpf/libbpf.c      | 44 ++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 46 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> > > index 9c777c21da28..0aeac8ea7af2 100644
> > > --- a/tools/lib/bpf/bpf_helpers.h
> > > +++ b/tools/lib/bpf/bpf_helpers.h
> > > @@ -13,6 +13,11 @@
> > >  #define __uint(name, val) int (*name)[val]
> > >  #define __type(name, val) typeof(val) *name
> > >  #define __array(name, val) typeof(val) *name[]
> > > +#ifndef __PASTE
> > > +#define ___PASTE(a,b) a##b
> > > +#define __PASTE(a,b) ___PASTE(a,b)
> > > +#endif
> >
> > we already have ___bpf_concat defined further in this file (it's macro
> > so ordering shouldn't matter), let's just use that instead of adding
> > another variant
>
> Ohh. forgot about this one. will do.
>
> > > +#define __ulong(name, val) enum { __PASTE(__unique_value, __COUNTER__) = val } name
> > >
> > >  /*
> > >   * Helper macro to place programs, maps, license in
> > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > index 4880d623098d..f8158e250327 100644
> > > --- a/tools/lib/bpf/libbpf.c
> > > +++ b/tools/lib/bpf/libbpf.c
> > > @@ -2243,6 +2243,39 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf,
> > >         return true;
> > >  }
> > >
> > > +static bool get_map_field_long(const char *map_name, const struct btf *btf,
> > > +                              const struct btf_member *m, __u64 *res)
> > > +{
> > > +       const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> > > +       const char *name = btf__name_by_offset(btf, m->name_off);
> > > +
> > > +       if (btf_is_ptr(t))
> > > +               return false;
> >
> > It's not great that anyone that uses __uint(map_extra, ...) would get
> > warnings now.
>
> What warning?
> This specific check makes it fall back to the ptr path without a warning.
> We have a bloom filter test that uses map_extra.
> No warnings there.

ah, right, forget about the warning, you exit early. But it still makes
sense to handle __ulong vs __uint transparently.


>
> > Let's just teach get_map_field_long to recognize __uint vs __ulong?
> >
> > Let's call into get_map_field_int() here if we have a pointer, and
> > then upcast u32 into u64?
>
> makes sense.
>
> > > +
> > > +       if (!btf_is_enum(t) && !btf_is_enum64(t)) {
> > > +               pr_warn("map '%s': attr '%s': expected enum or enum64, got %s.\n",
> >
> > seems like get_map_field_int() is using "PTR" and "ARRAY" all caps
> > spelling in warnings, let's use ENUM and ENUM64 for consistency?
>
> done.
>
> > > +                       map_name, name, btf_kind_str(t));
> > > +               return false;
> > > +       }
> > > +
> > > +       if (btf_vlen(t) != 1) {
> > > +               pr_warn("map '%s': attr '%s': invalid __ulong\n",
> > > +                       map_name, name);
> > > +               return false;
> > > +       }
> > > +
> > > +       if (btf_is_enum(t)) {
> > > +               const struct btf_enum *e = btf_enum(t);
> > > +
> > > +               *res = e->val;
> > > +       } else {
> > > +               const struct btf_enum64 *e = btf_enum64(t);
> > > +
> > > +               *res = btf_enum64_value(e);
> > > +       }
> > > +       return true;
> > > +}
> > > +
> > >  static int pathname_concat(char *buf, size_t buf_sz, const char *path, const char *name)
> > >  {
> > >         int len;
> > > @@ -2476,10 +2509,15 @@ int parse_btf_map_def(const char *map_name, struct btf *btf,
> > >                         map_def->pinning = val;
> > >                         map_def->parts |= MAP_DEF_PINNING;
> > >                 } else if (strcmp(name, "map_extra") == 0) {
> > > -                       __u32 map_extra;
> > > +                       __u64 map_extra;
> > >
> > > -                       if (!get_map_field_int(map_name, btf, m, &map_extra))
> > > -                               return -EINVAL;
> > > +                       if (!get_map_field_long(map_name, btf, m, &map_extra)) {
> > > +                               __u32 map_extra_u32;
> > > +
> > > +                               if (!get_map_field_int(map_name, btf, m, &map_extra_u32))
> > > +                                       return -EINVAL;
> > > +                               map_extra = map_extra_u32;
> > > +                       }
> >
> > with the above change it would be a simple
> > s/get_map_field_int/get_map_field_long/ (and __u32 -> __u64, of
> > course)
>
> so this logic will move into get_map_field_long.
> makes sense.

yep, seems good to not care about int vs long here

>
> I thought about making get_map_field_int() handle enums,
> but way too many places would need refactoring, since it's called like:
> get_map_field_int(map_name, btf, m, &map_def->map_type)
> get_map_field_int(map_name, btf, m, &map_def->max_entries)

yeah, not worth it

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-13 23:11   ` Eduard Zingerman
  2024-02-13 23:17     ` Andrii Nakryiko
@ 2024-02-14  1:02     ` Alexei Starovoitov
  2024-02-14 15:10       ` Eduard Zingerman
  1 sibling, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  1:02 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Tue, Feb 13, 2024 at 3:11 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-02-08 at 20:06 -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > LLVM automatically places __arena variables into ".arena.1" ELF section.
> > When libbpf sees such section it creates internal 'struct bpf_map' LIBBPF_MAP_ARENA
> > that is connected to actual BPF_MAP_TYPE_ARENA 'struct bpf_map'.
> > They share the same kernel's side bpf map and single map_fd.
> > Both are emitted into skeleton. Real arena with the name given by bpf program
> > in SEC(".maps") and another with "__arena_internal" name.
> > All global variables from ".arena.1" section are accessible from user space
> > via skel->arena->name_of_var.
> >
> > For bss/data/rodata the skeleton/libbpf perform the following sequence:
> > 1. addr = mmap(MAP_ANONYMOUS)
> > 2. user space optionally modifies global vars
> > 3. map_fd = bpf_create_map()
> > 4. bpf_update_map_elem(map_fd, addr) // to store values into the kernel
> > 5. mmap(addr, MAP_FIXED, map_fd)
> > after step 5 user spaces see the values it wrote at step 2 at the same addresses
> >
> > arena doesn't support update_map_elem. Hence skeleton/libbpf do:
> > 1. addr = mmap(MAP_ANONYMOUS)
> > 2. user space optionally modifies global vars
> > 3. map_fd = bpf_create_map(MAP_TYPE_ARENA)
> > 4. real_addr = mmap(map->map_extra, MAP_SHARED | MAP_FIXED, map_fd)
> > 5. memcpy(real_addr, addr) // this will fault-in and allocate pages
> > 6. munmap(addr)
> >
> > At the end look and feel of global data vs __arena global data is the same from bpf prog pov.
>
> [...]
>
> So, at first I thought that having two maps is a bit of a hack.
> However, after trying to make it work with only one map I don't really
> like that either :)

My first attempt was with a single arena map, but it ended up with
hacks all over libbpf and bpftool to treat one map differently depending
on conditions.
Two maps simplified the code a lot.

> The patch looks good to me, have not spotted any logical issues.
>
> I have two questions if you don't mind:
>
> First is regarding initialization data.
> In bpf_object__init_internal_map() the amount of bpf_map_mmap_sz(map)
> bytes is mmaped and only data_sz bytes are copied,
> then bpf_map_mmap_sz(map) bytes are copied in bpf_object__create_maps().
> Is Linux/libc smart enough to skip action on pages which were mmaped but
> never touched?

The kernel gives zeroed-out pages to user space.
So it's ok to mmap a page, copy data_sz bytes into it,
and later copy the full page from one addr to another.
No garbage is copied.
In this case there will be data, by LLVM's construction of ".arena.1".
It looks to me that even .bss-like __arena vars have zeros in the data
and a non-zero data_sz.

>
> Second is regarding naming.
> Currently only one arena is supported, and generated skel has a single '->arena' field.
> Is there a plan to support multiple arenas at some point?
> If so, should '->arena' field use the same name as arena map declared in program?

I wanted to place all global arena vars under a default name, "arena",
and let the skeleton use that name without caring what name
the bpf prog gave to the actual map.
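
i.e. from the skeleton user's side it would be roughly (a sketch; the skeleton
name and 'counter' are made up):

struct arena_test_bpf *skel = arena_test_bpf__open_and_load();

if (!skel)
        return -1;
/* any __arena global from ".arena.1" shows up behind the default name */
skel->arena->counter++;
arena_test_bpf__destroy(skel);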

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-14  0:09         ` Andrii Nakryiko
  2024-02-14  0:16           ` Eduard Zingerman
@ 2024-02-14  1:24           ` Alexei Starovoitov
  2024-02-14 17:24             ` Andrii Nakryiko
  1 sibling, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  1:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Eduard Zingerman, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 4:09 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Feb 13, 2024 at 3:37 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Tue, 2024-02-13 at 15:17 -0800, Andrii Nakryiko wrote:
> >
> > [...]
> >
> > > > So, at first I thought that having two maps is a bit of a hack.
> > >
> > > yep, that was my instinct as well
> > >
> > > > However, after trying to make it work with only one map I don't really
> > > > like that either :)
> > >
> > > Can you elaborate? see my reply to Alexei, I wonder how did you think
> > > about doing this?
> >
> > Relocations in the ELF file are against a new section: ".arena.1".
> > This works nicely with logic in bpf_program__record_reloc().
> > If single map is used, we effectively need to track two indexes for
> > the map section:
> > - one used for relocations against map variables themselves
> >   (named "generic map reference relocation" in the function code);
> > - one used for relocations against ".arena.1"
> >   (named "global data map relocation" in the function code).
> >
> > This spooked me off:
> > - either bpf_object__init_internal_map() would have a specialized
> >   branch for arenas, as with current approach;
> > - or bpf_program__record_reloc() would have a specialized branch for arenas,
> >   as with one map approach.
>
> Yes, relocations would know about .arena.1, but it's a pretty simple
> check in a few places. We basically have arena *definition* sec_idx
> (corresponding to SEC(".maps")) and arena *data* sec_idx. The latter
> is what is recorded for global variables in .arena.1. We can remember
> this arena data sec_idx in struct bpf_object once during ELF
> processing, and then just special case it internally in a few places.

That was my first attempt and bpf_program__record_reloc()
became a mess.
Currently it does relo search either in internal maps
or in obj->efile.btf_maps_shndx.
Doing double search wasn't nice.
And further, such dual meaning of 'struct bpf_map' object messes
assumptions of bpf_object__shndx_is_maps, bpf_object__shndx_is_data
and the way libbpf treats map->libbpf_type everywhere.

bpf_map__is_internal() cannot really say true or false
for such dual use map.
Then skeleton gen gets ugly.
Needs more public libbpf APIs to use in bpftool gen.
Just a mess.

> The "fake" bpf_map for __arena_internal is user-visible and requires
> autocreate=false tricks, etc. I feel like it's a worse tradeoff from a
> user API perspective than a few extra ARENA-specific internal checks
> (which we already have a few anyways, ARENA is not completely
> transparent internally anyways).

what do you mean 'user visible'?
I can add a filter to avoid generating a pointer for it in a skeleton.
Then it won't be any more visible than other bss/data fake maps.
The 2nd fake arena returns true out of bpf_map__is_internal.

The key comment in the patch:
                /* bpf_object will contain two arena maps:
                 * LIBBPF_MAP_ARENA & BPF_MAP_TYPE_ARENA
                 * and
                 * LIBBPF_MAP_UNSPEC & BPF_MAP_TYPE_ARENA.
                 * The former map->arena will point to latter.
                 */

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu.
  2024-02-10  7:05   ` Kumar Kartikeya Dwivedi
@ 2024-02-14  1:37     ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14  1:37 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Fri, Feb 9, 2024 at 11:06 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 9 Feb 2024 at 05:07, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Convert simple page_frag allocator to per-cpu page_frag to further stress test
> > a combination of __arena global and static variables and alloc/free from arena.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
>
> I know this organically grew from a toy implementation, but since
> people will most likely be looking at selftests as usage examples, it
> might be better to expose bpf_preempt_disable/enable and use it in the
> case of per-CPU page_frag allocator? No need to block on this, can be
> added on top later.
>
> The kfunc is useful on its own for writing safe per-CPU data
> structures or other memory allocators like bpf_ma on top of arenas.
> It is also necessary as a building block for writing spin locks
> natively in BPF on top of the arena map which we may add later.
> I have a patch lying around for this, verifier plumbing is mostly the
> same as rcu_read_lock.
> I can send it out with tests, or otherwise if you want to add it to
> this series, you go ahead.

Please send it.
I think the verifier checks need to be tighter than for rcu_read_lock.
preempt_enable/disable should be as strict as bpf_spin_lock.

The plan is to add bpf_arena_spin_lock() in a follow-up and use
it in this bpf page_frag allocator to make it work properly outside of
tracing context.
I'm not sure yet whether bpf_preempt_disable() will be sufficient.

And in the long run the idea is to convert all these bpf_arena*
facilities into a libc equivalent.
Probably not part of libbpf, but some new package; name tbd.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-09  4:05 ` [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
  2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
@ 2024-02-14  8:36   ` Christoph Hellwig
  2024-02-14 20:53     ` Alexei Starovoitov
  1 sibling, 1 reply; 112+ messages in thread
From: Christoph Hellwig @ 2024-02-14  8:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, daniel, andrii, memxor, eddyz87, tj, brho, hannes, lstoakes,
	akpm, urezki, hch, linux-mm, kernel-team

NAK.  Please

On Thu, Feb 08, 2024 at 08:05:52PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> BPF would like to use the vmap API to implement a lazily-populated
> memory space which can be shared by multiple userspace threads.
> The vmap API is generally public and has functions to request and

What is "the vmap API"?

> For example, there is the public ioremap_page_range(), which is used
> to map device memory into addressable kernel space.

It's not really public.  It's a helper for the ioremap implementations,
which really should not be arch-specific to start with and are in
the process of being consolidated into common code.

> The new BPF code needs the functionality of vmap_pages_range() in
> order to incrementally map privately managed arrays of pages into its
> vmap area. Indeed this function used to be public, but became private
> when usecases other than vmalloc happened to disappear.

Yes, for a freaking good reason.  The vmap area is not for general abuse
by random callers.  We have a few of those left, but we need to get rid
of that and not add more.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-14  1:02     ` Alexei Starovoitov
@ 2024-02-14 15:10       ` Eduard Zingerman
  0 siblings, 0 replies; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-14 15:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Tue, 2024-02-13 at 17:02 -0800, Alexei Starovoitov wrote:
[...]
> > First is regarding initialization data.
> > In bpf_object__init_internal_map() the amount of bpf_map_mmap_sz(map)
> > bytes is mmaped and only data_sz bytes are copied,
> > then bpf_map_mmap_sz(map) bytes are copied in bpf_object__create_maps().
> > Is Linux/libc smart enough to skip action on pages which were mmaped but
> > never touched?
> 
> The kernel gives zeroed-out pages to user space.
> So it's ok to mmap a page, copy data_sz bytes into it,
> and later copy the full page from one addr to another.
> No garbage is copied.
> In this case there will be data, by LLVM's construction of ".arena.1".
> It looks to me that even .bss-like __arena vars have zeros in the data
> and a non-zero data_sz.

I was actually worried about the second memcpy increasing RSS unnecessarily,
but I missed that the internal map does:

  def->max_entries = roundup(data_sz, page_sz) / page_sz;

So that is not an issue, as bpf_map_mmap_sz() for the internal map is
proportional to data_sz, not to the full arena.
Sorry for the noise.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast()
  2024-02-13 22:35     ` Alexei Starovoitov
@ 2024-02-14 16:47       ` Eduard Zingerman
  2024-02-14 17:45         ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Eduard Zingerman @ 2024-02-14 16:47 UTC (permalink / raw)
  To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Tue, 2024-02-13 at 14:35 -0800, Alexei Starovoitov wrote:
[...]

> This arena bpf_arena_cast() macro probably will be removed
> once llvm 19 is released and we upgrade bpf CI to it.
> It's here for selftests only.
> It's quite tricky and fragile to use in practice.
> Notice it does:
> "r"(__var)
> which is not quite correct,
> since llvm won't recognize it as output that changes __var and
> will use a copy of __var in a different register later.
> But if the macro changes to "=r" or "+r" then llvm allocates
> a register and that screws up codegen even more.
> 
> The __var;}) also doesn't always work.
> So this macro is not suited for all to use.

Could you please elaborate a bit on why this macro is fragile?
I toyed a bit with a version patched as below and it seems to work fine.
I don't see how ": [reg]"+r"(var) : ..." could be broken by the compiler
(when "+r" is in the "output constraint" position):
from clang's pov the variable 'var' would be in a register and updated
after the asm volatile part.

---

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index e73b7d48439f..488001236506 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -334,8 +334,6 @@ l_true:                                                                                             \
 /* emit instruction: rX=rX .off = mode .imm32 = address_space */
 #ifndef bpf_arena_cast
 #define bpf_arena_cast(var, mode, addr_space)  \
-       ({                                      \
-       typeof(var) __var = var;                \
        asm volatile(".byte 0xBF;               \
                     .ifc %[reg], r0;           \
                     .byte 0x00;                \
@@ -368,8 +366,7 @@ l_true:                                                                                             \
                     .byte 0x99;                \
                     .endif;                    \
                     .short %[off]; .long %[as]"        \
-                    :: [reg]"r"(__var), [off]"i"(mode), [as]"i"(addr_space)); __var; \
-       })
+                    : [reg]"+r"(var) : [off]"i"(mode), [as]"i"(addr_space))
 #endif

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-14  1:24           ` Alexei Starovoitov
@ 2024-02-14 17:24             ` Andrii Nakryiko
  2024-02-15 23:22               ` Andrii Nakryiko
  0 siblings, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-14 17:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Tue, Feb 13, 2024 at 5:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 13, 2024 at 4:09 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Feb 13, 2024 at 3:37 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > >
> > > On Tue, 2024-02-13 at 15:17 -0800, Andrii Nakryiko wrote:
> > >
> > > [...]
> > >
> > > > > So, at first I thought that having two maps is a bit of a hack.
> > > >
> > > > yep, that was my instinct as well
> > > >
> > > > > However, after trying to make it work with only one map I don't really
> > > > > like that either :)
> > > >
> > > > Can you elaborate? see my reply to Alexei, I wonder how did you think
> > > > about doing this?
> > >
> > > Relocations in the ELF file are against a new section: ".arena.1".
> > > This works nicely with logic in bpf_program__record_reloc().
> > > If single map is used, we effectively need to track two indexes for
> > > the map section:
> > > - one used for relocations against map variables themselves
> > >   (named "generic map reference relocation" in the function code);
> > > - one used for relocations against ".arena.1"
> > >   (named "global data map relocation" in the function code).
> > >
> > > This spooked me off:
> > > - either bpf_object__init_internal_map() would have a specialized
> > >   branch for arenas, as with current approach;
> > > - or bpf_program__record_reloc() would have a specialized branch for arenas,
> > >   as with one map approach.
> >
> > Yes, relocations would know about .arena.1, but it's a pretty simple
> > check in a few places. We basically have arena *definition* sec_idx
> > (corresponding to SEC(".maps")) and arena *data* sec_idx. The latter
> > is what is recorded for global variables in .arena.1. We can remember
> > this arena data sec_idx in struct bpf_object once during ELF
> > processing, and then just special case it internally in a few places.
>
> That was my first attempt and bpf_program__record_reloc()
> became a mess.
> Currently it does relo search either in internal maps
> or in obj->efile.btf_maps_shndx.
> Doing double search wasn't nice.
> And further, such dual meaning of 'struct bpf_map' object messes
> assumptions of bpf_object__shndx_is_maps, bpf_object__shndx_is_data
> and the way libbpf treats map->libbpf_type everywhere.
>
> bpf_map__is_internal() cannot really say true or false
> for such dual use map.
> Then skeleton gen gets ugly.
> Needs more public libbpf APIs to use in bpftool gen.
> Just a mess.

It might be easier for me to try implement it the way I see it than
discuss it over emails. I'll give it a try today-tomorrow and get back
to you.

>
> > The "fake" bpf_map for __arena_internal is user-visible and requires
> > autocreate=false tricks, etc. I feel like it's a worse tradeoff from a
> > user API perspective than a few extra ARENA-specific internal checks
> > (which we already have a few anyways, ARENA is not completely
> > transparent internally anyways).
>
> what do you mean 'user visible'?

That __arena_internal (representing the .arena.1 data section) is actually
separate from the actual ARENA map (represented by a variable in the .maps
section). Both have separate `struct bpf_map`s, which you can look
up by name or by iterating all maps of the bpf_object. And you
can call getters/setters on __arena_internal, even though the only
thing that actually makes sense there is bpf_map__initial_value(),
which would make just as much sense on the ARENA map itself.

> I can add a filter to avoid generating a pointer for it in a skeleton.
> Then it won't be any more visible than other bss/data fake maps.

bss/data are not fake maps; they have a corresponding BPF map (ARRAY) in
the kernel, which is different from __arena_internal. And even if we
hide it from the skeleton, it's still there in the bpf_object, as I mentioned
above.

Let me try implementing what I have in mind and see how bad it is.

> The 2nd fake arena returns true out of bpf_map__is_internal.
>
> The key comment in the patch:
>                 /* bpf_object will contain two arena maps:
>                  * LIBBPF_MAP_ARENA & BPF_MAP_TYPE_ARENA
>                  * and
>                  * LIBBPF_MAP_UNSPEC & BPF_MAP_TYPE_ARENA.
>                  * The former map->arena will point to latter.
>                  */

Yes, and I'd like to not have two arena maps because they are logically one.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast()
  2024-02-14 16:47       ` Eduard Zingerman
@ 2024-02-14 17:45         ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14 17:45 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Kumar Kartikeya Dwivedi, bpf, Daniel Borkmann, Andrii Nakryiko,
	Tejun Heo, Barret Rhoden, Johannes Weiner, Lorenzo Stoakes,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm,
	Kernel Team

On Wed, Feb 14, 2024 at 8:47 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-02-13 at 14:35 -0800, Alexei Starovoitov wrote:
> [...]
>
> > This arena bpf_arena_cast() macro probably will be removed
> > once llvm 19 is released and we upgrade bpf CI to it.
> > It's here for selftests only.
> > It's quite tricky and fragile to use in practice.
> > Notice it does:
> > "r"(__var)
> > which is not quite correct,
> > since llvm won't recognize it as output that changes __var and
> > will use a copy of __var in a different register later.
> > But if the macro changes to "=r" or "+r" then llvm allocates
> > a register and that screws up codegen even more.
> >
> > The __var;}) also doesn't always work.
> > So this macro is not suited for all to use.
>
> Could you please elaborate a bit on why this macro is fragile?
> I toyed a bit with a version patched as below and it seems to work fine.
> I don't see how ": [reg]"+r"(var) : ..." could be broken by the compiler
> (when "+r" is in the "output constraint" position):
> from clang's pov the variable 'var' would be in a register and updated
> after the asm volatile part.
>
> ---
>
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index e73b7d48439f..488001236506 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -334,8 +334,6 @@ l_true:                                                                                             \
>  /* emit instruction: rX=rX .off = mode .imm32 = address_space */
>  #ifndef bpf_arena_cast
>  #define bpf_arena_cast(var, mode, addr_space)  \
> -       ({                                      \
> -       typeof(var) __var = var;                \
>         asm volatile(".byte 0xBF;               \
>                      .ifc %[reg], r0;           \
>                      .byte 0x00;                \
> @@ -368,8 +366,7 @@ l_true:                                                                                             \
>                      .byte 0x99;                \
>                      .endif;                    \
>                      .short %[off]; .long %[as]"        \
> -                    :: [reg]"r"(__var), [off]"i"(mode), [as]"i"(addr_space)); __var; \
> -       })
> +                    : [reg]"+r"(var) : [off]"i"(mode), [as]"i"(addr_space))

Earlier I tried "+r" while keeping __var.
Using var directly does indeed seem to work.
I'll apply this change.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-14  8:36   ` Christoph Hellwig
@ 2024-02-14 20:53     ` Alexei Starovoitov
  2024-02-15  6:58       ` Christoph Hellwig
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-14 20:53 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds
  Cc: bpf, Daniel Borkmann, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Eddy Z, Tejun Heo, Barret Rhoden, Johannes Weiner,
	Lorenzo Stoakes, Andrew Morton, Uladzislau Rezki, linux-mm,
	Kernel Team

On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> NAK.  Please

What is the alternative?
Remember, maintainers cannot tell developers "go away".
They must suggest a different path.

> On Thu, Feb 08, 2024 at 08:05:52PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > BPF would like to use the vmap API to implement a lazily-populated
> > memory space which can be shared by multiple userspace threads.
> > The vmap API is generally public and has functions to request and
>
> What is "the vmap API"?

I mean an API that manages kernel virtual address space:

. get_vm_area - external
. free_vm_area - EXPORT_SYMBOL_GPL
. vunmap_range - external
. vmalloc_to_page - EXPORT_SYMBOL
. apply_to_page_range - EXPORT_SYMBOL_GPL

and the last one is pretty much equivalent to vmap_pages_range,
hence I'm surprised by push back to make vmap_pages_range available to bpf.

> > For example, there is the public ioremap_page_range(), which is used
> > to map device memory into addressable kernel space.
>
> It's not really public.  It's a helper for the ioremap implementations,
> which really should not be arch-specific to start with and are in
> the process of being consolidated into common code.

Any link to such consolidation of ioremap ? I couldn't find one.
I surely don't want bpf_arena to cause headaches to mm folks.

Anyway, ioremap_page_range() was just an example.
I could have used vmap() as an equivalent example.
vmap is EXPORT_SYMBOL, btw.

What bpf_arena needs is pretty much vmap(), but instead of
allocating all pages in advance, allocate them and insert on demand.

As you saw in the next patch bpf_arena does:
get_vm_area(4Gbyte, VM_MAP | VM_USERMAP);
and then alloc_page + vmap_pages_range into this region on demand.
Nothing fancy.
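
In other words, the per-page path is roughly (a simplified sketch; GUARD_SZ and
pgoff are made-up names, error handling omitted):

/* once, at arena creation: reserve the kernel VA range, unpopulated */
struct vm_struct *kern_vm = get_vm_area(SZ_4G + GUARD_SZ, VM_MAP | VM_USERMAP);

/* later, one page at a time, on demand */
unsigned long kaddr = (unsigned long)kern_vm->addr + pgoff * PAGE_SIZE;
struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT);

vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);

/* and the reverse, when the page is freed */
vunmap_range(kaddr, kaddr + PAGE_SIZE);
__free_page(page);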

> > The new BPF code needs the functionality of vmap_pages_range() in
> > order to incrementally map privately managed arrays of pages into its
> > vmap area. Indeed this function used to be public, but became private
> > when usecases other than vmalloc happened to disappear.
>
> Yes, for a freaking good reason.  The vmap area is not for general abuse
> by random callers.  We have a few of those left, but we need to get rid
> of that and not add more.

What do you mean by "vmap area" ? The vmalloc virtual region ?
Are you suggesting that bpf_arena should reserve its own virtual region of
kernel memory instead of vmalloc region ?
That's doable, but I don't quite see the point.
Instead of VMALLOC_START/END we can carve a bpf specific region and
do __get_vm_area_node() from there, but why?
vmalloc region fits the best.
bpf_arena's mm manipulations don't interfere with kasan either.

Or you meant vm_map_ram() ? Don't care about those. bpf_arena doesn't
touch that.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-14 20:53     ` Alexei Starovoitov
@ 2024-02-15  6:58       ` Christoph Hellwig
  2024-02-15 20:50         ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Christoph Hellwig @ 2024-02-15  6:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Linus Torvalds, bpf, Daniel Borkmann,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Wed, Feb 14, 2024 at 12:53:42PM -0800, Alexei Starovoitov wrote:
> On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > NAK.  Please
> 
> What is the alternative?
> Remember, maintainers cannot tell developers "go away".
> They must suggest a different path.

That criteria is something you've made up.   Telling that something
is not ok is the most important job of not just maintainers but all
developers.  Maybe start with a description of the problem you're
solving and why you think it matters and needs different APIs.

> . get_vm_area - external
> . free_vm_area - EXPORT_SYMBOL_GPL
> . vunmap_range - external
> . vmalloc_to_page - EXPORT_SYMBOL
> . apply_to_page_range - EXPORT_SYMBOL_GPL
> 
> and the last one is pretty much equivalent to vmap_pages_range,
> hence I'm surprised by push back to make vmap_pages_range available to bpf.

And the last one we've been trying to get rid of for ages because we don't
want random modules to

> > > For example, there is the public ioremap_page_range(), which is used
> > > to map device memory into addressable kernel space.
> >
> > It's not really public.  It's a helper for the ioremap implementations,
> > which really should not be arch-specific to start with and are in
> > the process of being consolidated into common code.
> 
> Any link to such consolidation of ioremap ? I couldn't find one.

Second hit on google:

https://lore.kernel.org/lkml/20230609075528.9390-1-bhe@redhat.com/T/

> I surely don't want bpf_arena to cause headaches to mm folks.
> 
> Anyway, ioremap_page_range() was just an example.
> I could have used vmap() as an equivalent example.
> vmap is EXPORT_SYMBOL, btw.

vmap is a good well defined API.  vmap_pages_range is not.

> What bpf_arena needs is pretty much vmap(), but instead of
> allocating all pages in advance, allocate them and insert on demand.

So propose an API that does that instead of exposing random low-level
details.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-15  6:58       ` Christoph Hellwig
@ 2024-02-15 20:50         ` Alexei Starovoitov
  2024-02-15 21:26           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-15 20:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Wed, Feb 14, 2024 at 10:58 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Feb 14, 2024 at 12:53:42PM -0800, Alexei Starovoitov wrote:
> > On Wed, Feb 14, 2024 at 12:36 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > NAK.  Please
> >
> > What is the alternative?
> > Remember, maintainers cannot tell developers "go away".
> > They must suggest a different path.
>
> That criteria is something you've made up.

I didn't invent it. I internalized it based on the feedback received.

> Telling that something
> is not ok is the most important job of not just maintainers but all
> developers.

I'm not saying that maintainers should not say "no",
I'm saying that maintainers should say "no", understand the problem
being solved, and suggest an alternative.

> Maybe start with a description of the problem you're
> solving and why you think it matters and needs different APIs.

bpf_arena doesn't need a different API. The 5 APIs below are enough.
I'm saying that vmap_pages_range() is equivalent to apply_to_page_range()
for all practical purposes.
So, since apply_to_page_range() is available to the kernel
(xen, gpu, kasan, etc.), I see no reason why
vmap_pages_range() shouldn't be available as well, since:

struct vmap_ctx {
     struct page **pages;
     int idx;
};

static int __for_each_pte(pte_t *ptep, unsigned long addr, void *data)
{
     struct vmap_ctx *ctx = data;
     struct page *page = ctx->pages[ctx->idx++];

     /* TODO: sanity checks here */
     set_pte_at(&init_mm, addr, ptep, mk_pte(page, PAGE_KERNEL));
     return 0;
}

static int vmap_pages_range_hack(unsigned long addr, unsigned long end,
                                 struct page **pages)
{
    struct vmap_ctx ctx = { .pages = pages };

    return apply_to_page_range(&init_mm, addr, end - addr,
                               __for_each_pte, &ctx);
}

Anything I miss?

> > . get_vm_area - external
> > . free_vm_area - EXPORT_SYMBOL_GPL
> > . vunmap_range - external
> > . vmalloc_to_page - EXPORT_SYMBOL
> > . apply_to_page_range - EXPORT_SYMBOL_GPL
> >
> > and the last one is pretty much equivalent to vmap_pages_range,
> > hence I'm surprised by push back to make vmap_pages_range available to bpf.
>
> And the last one we've been trying to get rid of for ages because we don't
> want random modules to

Get rid of EXPORT_SYMBOL from it? Fine by me.
Or are you saying that you have a plan to replace apply_to_page_range()
with something else? With what?

> > > > For example, there is the public ioremap_page_range(), which is used
> > > > to map device memory into addressable kernel space.
> > >
> > > It's not really public.  It's a helper for the ioremap implementation
> > > which really should not be arch specific to start with and are in
> > > the process of beeing consolidatd into common code.
> >
> > Any link to such consolidation of ioremap ? I couldn't find one.
>
> Second hit on google:
>
> https://lore.kernel.org/lkml/20230609075528.9390-1-bhe@redhat.com/T/

Thanks.
It sounded like you were referring to some future work.
The series that landed was a good cleanup.
No questions about it.

> > I surely don't want bpf_arena to cause headaches to mm folks.
> >
> > Anyway, ioremap_page_range() was just an example.
> > I could have used vmap() as an equivalent example.
> > vmap is EXPORT_SYMBOL, btw.
>
> vmap is a good well defined API.  vmap_pages_range is not.

since vmap() is nothing but get_vm_area() + vmap_pages_range()
and a few checks... I'm missing the point.
Pls elaborate.
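
For reference, stripped of checks, vmap() itself boils down to roughly this
(simplified from mm/vmalloc.c; the wrapper name is mine):

void *vmap_sketch(struct page **pages, unsigned int count,
                  unsigned long flags, pgprot_t prot)
{
        unsigned long size = (unsigned long)count << PAGE_SHIFT;
        struct vm_struct *area = get_vm_area(size, flags);

        if (!area)
                return NULL;
        if (vmap_pages_range((unsigned long)area->addr,
                             (unsigned long)area->addr + size,
                             prot, pages, PAGE_SHIFT) < 0) {
                vunmap(area->addr);
                return NULL;
        }
        return area->addr;
}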

> > What bpf_arena needs is pretty much vmap(), but instead of
> > allocating all pages in advance, allocate them and insert on demand.
>
> So propose an API that does that instead of exposing random low-level
> details.

The generic_ioremap_prot() and vmap() APIs make sense for the cases
where physical memory exists with a known size. It needs to be vmap-ed and
not touched afterwards.
The bpf_arena use case is similar to kasan, which
reserves a giant virtual memory region, then
does apply_to_page_range() to populate certain pte-s with pages in that region,
and later apply_to_existing_page_range() to free pages in kasan's region.

bpf_arena is very similar, except it currently calls get_vm_area()
to get a 4Gb+guard_pages region, then vmap_pages_range() to
populate a page in it, and vunmap_range() to remove a page.

These existing APIs work, so I'm not sure what you're requesting.
I can guess many different things, but please clarify to reduce
this back and forth.
Are you worried about range checking? That vmap_pages_range()
could accidentally hit an unintended range?

btw the cover letter and patch 5 explain the higher level motivation
from bpf pov in detail.
There was a bunch of feedback on that patch, which was addressed,
and the latest version is here:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=a752b4122071adb5307d7ab3ae6736a9a0e45317

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-15 20:50         ` Alexei Starovoitov
@ 2024-02-15 21:26           ` Linus Torvalds
  2024-02-16  1:34           ` Alexei Starovoitov
  2024-02-16  9:31           ` Christoph Hellwig
  2 siblings, 0 replies; 112+ messages in thread
From: Linus Torvalds @ 2024-02-15 21:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Thu, 15 Feb 2024 at 12:51, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> I didn't invent it. I internalized it based on the feedback received.

No. It's not up to maintainers to suggest alternatives. Sometimes it's
simply enough to explain *why* something isn't acceptable.

A plain "no" without explanation isn't sufficient. NAKs need a good
reason. But they don't need more than that.

The onus of coming up with an acceptable solution is on the person who
needs something new.

          Linus

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-14 17:24             ` Andrii Nakryiko
@ 2024-02-15 23:22               ` Andrii Nakryiko
  2024-02-16  2:45                 ` Alexei Starovoitov
  0 siblings, 1 reply; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-15 23:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Wed, Feb 14, 2024 at 9:24 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Feb 13, 2024 at 5:24 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Feb 13, 2024 at 4:09 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 3:37 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > >
> > > > On Tue, 2024-02-13 at 15:17 -0800, Andrii Nakryiko wrote:
> > > >
> > > > [...]
> > > >
> > > > > > So, at first I thought that having two maps is a bit of a hack.
> > > > >
> > > > > yep, that was my instinct as well
> > > > >
> > > > > > However, after trying to make it work with only one map I don't really
> > > > > > like that either :)
> > > > >
> > > > > Can you elaborate? see my reply to Alexei, I wonder how did you think
> > > > > about doing this?
> > > >
> > > > Relocations in the ELF file are against a new section: ".arena.1".
> > > > This works nicely with logic in bpf_program__record_reloc().
> > > > If single map is used, we effectively need to track two indexes for
> > > > the map section:
> > > > - one used for relocations against map variables themselves
> > > >   (named "generic map reference relocation" in the function code);
> > > > - one used for relocations against ".arena.1"
> > > >   (named "global data map relocation" in the function code).
> > > >
> > > > This spooked me off:
> > > > - either bpf_object__init_internal_map() would have a specialized
> > > >   branch for arenas, as with current approach;
> > > > - or bpf_program__record_reloc() would have a specialized branch for arenas,
> > > >   as with one map approach.
> > >
> > > Yes, relocations would know about .arena.1, but it's a pretty simple
> > > check in a few places. We basically have arena *definition* sec_idx
> > > (corresponding to SEC(".maps")) and arena *data* sec_idx. The latter
> > > is what is recorded for global variables in .arena.1. We can remember
> > > this arena data sec_idx in struct bpf_object once during ELF
> > > processing, and then just special case it internally in a few places.
> >
> > That was my first attempt and bpf_program__record_reloc()
> > became a mess.
> > Currently it does relo search either in internal maps
> > or in obj->efile.btf_maps_shndx.
> > Doing double search wasn't nice.
> > And further, such dual meaning of 'struct bpf_map' object messes
> > assumptions of bpf_object__shndx_is_maps, bpf_object__shndx_is_data
> > and the way libbpf treats map->libbpf_type everywhere.
> >
> > bpf_map__is_internal() cannot really say true or false
> > for such dual use map.
> > Then skeleton gen gets ugly.
> > Needs more public libbpf APIs to use in bpftool gen.
> > Just a mess.
>
> It might be easier for me to try implement it the way I see it than
> discuss it over emails. I'll give it a try today-tomorrow and get back
> to you.
>
> >
> > > The "fake" bpf_map for __arena_internal is user-visible and requires
> > > autocreate=false tricks, etc. I feel like it's a worse tradeoff from a
> > > user API perspective than a few extra ARENA-specific internal checks
> > > (which we already have a few anyways, ARENA is not completely
> > > transparent internally anyways).
> >
> > what do you mean 'user visible'?
>
> That __arena_internal (representing the .arena.1 data section) actually is
> separate from actual ARENA map (represented by variable in .maps
> section). And both have separate `struct bpf_map`, which you can look
> up by name or through iterating all maps of bpf_object. And that you
> can call getters/setters on __arena_internal, even though the only
> thing that actually makes sense there is bpf_map__initial_value(),
> which would just as much make sense on ARENA map itself.
>
> > I can add a filter to avoid generating a pointer for it in a skeleton.
> > Then it won't be any more visible than other bss/data fake maps.
>
> bss/data are not fake maps, they have corresponding BPF map (ARRAY) in
> the kernel. Which is different from __arena_internal. And even if we
> hide it from skeleton, it's still there in bpf_object, as I mentioned
> above.
>
> Let me try implementing what I have in mind and see how bad it is.
>
> > The 2nd fake arena returns true out of bpf_map__is_internal.
> >
> > The key comment in the patch:
> >                 /* bpf_object will contain two arena maps:
> >                  * LIBBPF_MAP_ARENA & BPF_MAP_TYPE_ARENA
> >                  * and
> >                  * LIBBPF_MAP_UNSPEC & BPF_MAP_TYPE_ARENA.
> >                  * The former map->arena will point to latter.
> >                  */
>
> Yes, and I'd like to not have two arena maps because they are logically one.

Alright, I'm back. I pushed 3 patches on top of your patches into [0].
Available also at [1], if that's more convenient. I'll paste the main
diff below; gmail will inevitably butcher the formatting, but it's
easier to discuss the code this way.

  [0] https://github.com/anakryiko/linux/commits/arena/
  [1] https://git.kernel.org/pub/scm/linux/kernel/git/andrii/bpf-next.git/log/?h=arena

First, as I was working on the code, I realized that the place where
we do mmap() after creating the ARENA map is different from where we
normally do post-creation steps, so I moved the code to keep all those
extra steps in one place. No changes in logic, but now we also don't
need to close map_fd and so on; I think it's better this way.
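
To make that post-creation step concrete, it boils down to roughly the
following (a sketch, not the exact patch code; error paths and the
map_extra/address-hint handling are simplified):

    size_t mmap_sz = bpf_map_mmap_sz(map);

    /* map the freshly created ARENA fd so map->mmaped points at the
     * shared region that bpf progs and user space will both see
     */
    map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
    if (map->mmaped == MAP_FAILED) {
        err = -errno;
        map->mmaped = NULL;
        pr_warn("map '%s': failed to mmap arena: %d\n", map->name, err);
        return err;
    }
    /* copy the .arena.1 initial image, if any, then drop the temporary copy */
    if (obj->arena_data) {
        memcpy(map->mmaped, obj->arena_data, obj->arena_data_sz);
        zfree(&obj->arena_data);
    }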

And so the main changes are below. There are definitely a few
ARENA-specific checks here and there, but I don't think it's that bad.
A bunch of code is just undoing code changes from previous patch, so
once you incorporate these changes into your patches it will be even
smaller.

The main outcome is that we don't have a fake map as an independent
struct bpf_map and bpf_map__initial_value() logic works transparently.

We'll probably need similar special-casing for STRUCT_OPS maps that
Kui-Feng is adding, so ARENA won't be the only one.

The slightly annoying part is that special casing is necessary because
of the assumption that map->mmaped has to be munmap()'ed and that its
size is calculated by bpf_map_mmap_sz() (I could have hacked
map->def.value_size for this, but that felt wrong). We could
generalize/fix that, but I chose not to do that just yet.
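
For reference, a minimal end-to-end sketch of what this unified handling
means for users (assuming the __arena tag and the map-definition
conventions used in the selftests of this series; nothing below is the
literal patch code):

/* BPF side (e.g. progs/foo.bpf.c), where __arena is assumed to resolve
 * to __attribute__((address_space(1))) so globals land in ".arena.1":
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 10);        /* size of the arena in pages */
} arena SEC(".maps");

int __arena shared_cnt = 1;             /* placed into ".arena.1" by LLVM */

/* user-space side: there is a single `struct bpf_map *` for the arena, and
 * bpf_map__initial_value() hands back the (smaller) .arena.1 image, which
 * can be tweaked before load and is copied into the arena at load time:
 */
size_t sz;
void *image = bpf_map__initial_value(bpf_object__find_map_by_name(obj, "arena"), &sz);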


commit 2a7a90e06d02a4edb60cf92c19aee2b3f05d3cca
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Feb 15 14:55:00 2024 -0800

    libbpf: remove fake __arena_internal map

    Unify actual ARENA map and fake __arena_internal map. .arena.1 ELF
    section isn't a stand-alone BPF map, so it shouldn't be represented as
    `struct bpf_map *` instance in libbpf APIs. Instead, use ELF section
    data as initial data image, which is exposed through skeleton and
    bpf_map__initial_value() to the user, if they need to tune it before the
    load phase. During load phase, this initial image is copied over into
    mmap()'ed region corresponding to ARENA, and discarded.

    A few small checks here and there had to be added to make sure this
    approach works with bpf_map__initial_value(), mostly due to the hard-coded
    assumption that map->mmaped is set up with the mmap() syscall and should be
    munmap()'ed. For ARENA, .arena.1 can be (much) smaller than the maximum
    ARENA size, so this smaller data size has to be tracked separately.
    Given that only one ARENA map is allowed per bpf_object instance, we just
    keep it in a separate field. This can be generalized later if necessary.

    bpftool is adjusted a bit as well, because the ARENA map is not reported as
    "internal" (it's not a great fit in this case), plus we need to take into
    account that an ARENA map can exist without a corresponding .arena.1 ELF
    section, so the user-facing data section in the skeleton is optional.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c
index 273da2098231..6e17b95436de 100644
--- a/tools/bpf/bpftool/gen.c
+++ b/tools/bpf/bpftool/gen.c
@@ -82,7 +82,7 @@ static bool get_map_ident(const struct bpf_map *map, char *buf, size_t buf_sz)
     const char *name = bpf_map__name(map);
     int i, n;

-    if (!bpf_map__is_internal(map) || bpf_map__type(map) == BPF_MAP_TYPE_ARENA) {
+    if (!bpf_map__is_internal(map)) {
         snprintf(buf, buf_sz, "%s", name);
         return true;
     }
@@ -109,7 +109,7 @@ static bool get_datasec_ident(const char *sec_name, char *buf, size_t buf_sz)
     /* recognize hard coded LLVM section name */
     if (strcmp(sec_name, ".arena.1") == 0) {
         /* this is the name to use in skeleton */
-        strncpy(buf, "arena", buf_sz);
+        snprintf(buf, buf_sz, "arena");
         return true;
     }
     for  (i = 0, n = ARRAY_SIZE(pfxs); i < n; i++) {
@@ -242,14 +242,16 @@ static const struct btf_type *find_type_for_map(struct btf *btf, const char *map

 static bool is_mmapable_map(const struct bpf_map *map, char *buf, size_t sz)
 {
-    if (!bpf_map__is_internal(map) || !(bpf_map__map_flags(map) & BPF_F_MMAPABLE))
-        return false;
+    size_t tmp_sz;

-    if (bpf_map__type(map) == BPF_MAP_TYPE_ARENA) {
-        strncpy(buf, "arena", sz);
+    if (bpf_map__type(map) == BPF_MAP_TYPE_ARENA && bpf_map__initial_value(map, &tmp_sz)) {
+        snprintf(buf, sz, "arena");
         return true;
     }

+    if (!bpf_map__is_internal(map) || !(bpf_map__map_flags(map) & BPF_F_MMAPABLE))
+        return false;
+
     if (!get_map_ident(map, buf, sz))
         return false;

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5a53f1ed87f2..c72577bef439 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -506,7 +506,6 @@ enum libbpf_map_type {
     LIBBPF_MAP_BSS,
     LIBBPF_MAP_RODATA,
     LIBBPF_MAP_KCONFIG,
-    LIBBPF_MAP_ARENA,
 };

 struct bpf_map_def {
@@ -549,7 +548,6 @@ struct bpf_map {
     bool reused;
     bool autocreate;
     __u64 map_extra;
-    struct bpf_map *arena;
 };

 enum extern_type {
@@ -616,7 +614,6 @@ enum sec_type {
     SEC_BSS,
     SEC_DATA,
     SEC_RODATA,
-    SEC_ARENA,
 };

 struct elf_sec_desc {
@@ -634,6 +631,7 @@ struct elf_state {
     Elf_Data *symbols;
     Elf_Data *st_ops_data;
     Elf_Data *st_ops_link_data;
+    Elf_Data *arena_data;
     size_t shstrndx; /* section index for section name strings */
     size_t strtabidx;
     struct elf_sec_desc *secs;
@@ -644,6 +642,7 @@ struct elf_state {
     int symbols_shndx;
     int st_ops_shndx;
     int st_ops_link_shndx;
+    int arena_data_shndx;
 };

 struct usdt_manager;
@@ -703,6 +702,10 @@ struct bpf_object {

     struct usdt_manager *usdt_man;

+    struct bpf_map *arena_map;
+    void *arena_data;
+    size_t arena_data_sz;
+
     struct kern_feature_cache *feat_cache;
     char *token_path;
     int token_fd;
@@ -1340,6 +1343,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
     obj->efile.symbols = NULL;
     obj->efile.st_ops_data = NULL;
     obj->efile.st_ops_link_data = NULL;
+    obj->efile.arena_data = NULL;

     zfree(&obj->efile.secs);
     obj->efile.sec_cnt = 0;
@@ -1722,34 +1726,10 @@ static int
 bpf_object__init_internal_map(struct bpf_object *obj, enum
libbpf_map_type type,
                   const char *real_name, int sec_idx, void *data,
size_t data_sz)
 {
-    const long page_sz = sysconf(_SC_PAGE_SIZE);
-    struct bpf_map *map, *arena = NULL;
     struct bpf_map_def *def;
+    struct bpf_map *map;
     size_t mmap_sz;
-    int err, i;
-
-    if (type == LIBBPF_MAP_ARENA) {
-        for (i = 0; i < obj->nr_maps; i++) {
-            map = &obj->maps[i];
-            if (map->def.type != BPF_MAP_TYPE_ARENA)
-                continue;
-            arena = map;
-            real_name = "__arena_internal";
-                mmap_sz = bpf_map_mmap_sz(map);
-            if (roundup(data_sz, page_sz) > mmap_sz) {
-                pr_warn("Declared arena map size %zd is too small to hold"
-                    "global __arena variables of size %zd\n",
-                    mmap_sz, data_sz);
-                return -E2BIG;
-            }
-            break;
-        }
-        if (!arena) {
-            pr_warn("To use global __arena variables the arena map should"
-                "be declared explicitly in SEC(\".maps\")\n");
-            return -ENOENT;
-        }
-    }
+    int err;

     map = bpf_object__add_map(obj);
     if (IS_ERR(map))
@@ -1760,7 +1740,6 @@ bpf_object__init_internal_map(struct bpf_object
*obj, enum libbpf_map_type type,
     map->sec_offset = 0;
     map->real_name = strdup(real_name);
     map->name = internal_map_name(obj, real_name);
-    map->arena = arena;
     if (!map->real_name || !map->name) {
         zfree(&map->real_name);
         zfree(&map->name);
@@ -1768,32 +1747,18 @@ bpf_object__init_internal_map(struct
bpf_object *obj, enum libbpf_map_type type,
     }

     def = &map->def;
-    if (type == LIBBPF_MAP_ARENA) {
-        /* bpf_object will contain two arena maps:
-         * LIBBPF_MAP_ARENA & BPF_MAP_TYPE_ARENA
-         * and
-         * LIBBPF_MAP_UNSPEC & BPF_MAP_TYPE_ARENA.
-         * The former map->arena will point to latter.
-         */
-        def->type = BPF_MAP_TYPE_ARENA;
-        def->key_size = 0;
-        def->value_size = 0;
-        def->max_entries = roundup(data_sz, page_sz) / page_sz;
-        def->map_flags = BPF_F_MMAPABLE;
-    } else {
-        def->type = BPF_MAP_TYPE_ARRAY;
-        def->key_size = sizeof(int);
-        def->value_size = data_sz;
-        def->max_entries = 1;
-        def->map_flags = type == LIBBPF_MAP_RODATA || type ==
LIBBPF_MAP_KCONFIG
-            ? BPF_F_RDONLY_PROG : 0;
+    def->type = BPF_MAP_TYPE_ARRAY;
+    def->key_size = sizeof(int);
+    def->value_size = data_sz;
+    def->max_entries = 1;
+    def->map_flags = type == LIBBPF_MAP_RODATA || type == LIBBPF_MAP_KCONFIG
+        ? BPF_F_RDONLY_PROG : 0;

-        /* failures are fine because of maps like .rodata.str1.1 */
-        (void) map_fill_btf_type_info(obj, map);
+    /* failures are fine because of maps like .rodata.str1.1 */
+    (void) map_fill_btf_type_info(obj, map);

-        if (map_is_mmapable(obj, map))
-            def->map_flags |= BPF_F_MMAPABLE;
-    }
+    if (map_is_mmapable(obj, map))
+        def->map_flags |= BPF_F_MMAPABLE;

     pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
          map->name, map->sec_idx, map->sec_offset, def->map_flags);
@@ -1857,13 +1822,6 @@ static int
bpf_object__init_global_data_maps(struct bpf_object *obj)
                                 NULL,
                                 sec_desc->data->d_size);
             break;
-        case SEC_ARENA:
-            sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, sec_idx));
-            err = bpf_object__init_internal_map(obj, LIBBPF_MAP_ARENA,
-                                sec_name, sec_idx,
-                                sec_desc->data->d_buf,
-                                sec_desc->data->d_size);
-            break;
         default:
             /* skip */
             break;
@@ -2786,6 +2744,32 @@ static int bpf_object__init_user_btf_map(struct
bpf_object *obj,
     return 0;
 }

+static int init_arena_map_data(struct bpf_object *obj, struct bpf_map *map,
+                   const char *sec_name, int sec_idx,
+                   void *data, size_t data_sz)
+{
+    const long page_sz = sysconf(_SC_PAGE_SIZE);
+    size_t mmap_sz;
+
+    mmap_sz = bpf_map_mmap_sz(obj->arena_map);
+    if (roundup(data_sz, page_sz) > mmap_sz) {
+        pr_warn("elf: sec '%s': declared ARENA map size (%zu) is too
small to hold global __arena variables of size %zu\n",
+            sec_name, mmap_sz, data_sz);
+        return -E2BIG;
+    }
+
+    obj->arena_data = malloc(data_sz);
+    if (!obj->arena_data)
+        return -ENOMEM;
+    memcpy(obj->arena_data, data, data_sz);
+    obj->arena_data_sz = data_sz;
+
+    /* make bpf_map__init_value() work for ARENA maps */
+    map->mmaped = obj->arena_data;
+
+    return 0;
+}
+
 static int bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict,
                       const char *pin_root_path)
 {
@@ -2835,6 +2819,33 @@ static int
bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict,
             return err;
     }

+    for (i = 0; i < obj->nr_maps; i++) {
+        struct bpf_map *map = &obj->maps[i];
+
+        if (map->def.type != BPF_MAP_TYPE_ARENA)
+            continue;
+
+        if (obj->arena_map) {
+            pr_warn("map '%s': only single ARENA map is supported
(map '%s' is also ARENA)\n",
+                map->name, obj->arena_map->name);
+            return -EINVAL;
+        }
+        obj->arena_map = map;
+
+        if (obj->efile.arena_data) {
+            err = init_arena_map_data(obj, map, ARENA_SEC, obj->efile.arena_data_shndx,
+                          obj->efile.arena_data->d_buf,
+                          obj->efile.arena_data->d_size);
+            if (err)
+                return err;
+        }
+    }
+    if (obj->efile.arena_data && !obj->arena_map) {
+        pr_warn("elf: sec '%s': to use global __arena variables the
ARENA map should be explicitly declared in SEC(\".maps\")\n",
+            ARENA_SEC);
+        return -ENOENT;
+    }
+
     return 0;
 }

@@ -3699,9 +3710,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
                 obj->efile.st_ops_link_data = data;
                 obj->efile.st_ops_link_shndx = idx;
             } else if (strcmp(name, ARENA_SEC) == 0) {
-                sec_desc->sec_type = SEC_ARENA;
-                sec_desc->shdr = sh;
-                sec_desc->data = data;
+                obj->efile.arena_data = data;
+                obj->efile.arena_data_shndx = idx;
             } else {
                 pr_info("elf: skipping unrecognized data section(%d) %s\n",
                     idx, name);
@@ -4204,7 +4214,6 @@ static bool bpf_object__shndx_is_data(const
struct bpf_object *obj,
     case SEC_BSS:
     case SEC_DATA:
     case SEC_RODATA:
-    case SEC_ARENA:
         return true;
     default:
         return false;
@@ -4230,8 +4239,6 @@ bpf_object__section_to_libbpf_map_type(const
struct bpf_object *obj, int shndx)
         return LIBBPF_MAP_DATA;
     case SEC_RODATA:
         return LIBBPF_MAP_RODATA;
-    case SEC_ARENA:
-        return LIBBPF_MAP_ARENA;
     default:
         return LIBBPF_MAP_UNSPEC;
     }
@@ -4332,6 +4339,15 @@ static int bpf_program__record_reloc(struct
bpf_program *prog,
     type = bpf_object__section_to_libbpf_map_type(obj, shdr_idx);
     sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, shdr_idx));

+    /* arena data relocation */
+    if (shdr_idx == obj->efile.arena_data_shndx) {
+        reloc_desc->type = RELO_DATA;
+        reloc_desc->insn_idx = insn_idx;
+        reloc_desc->map_idx = obj->arena_map - obj->maps;
+        reloc_desc->sym_off = sym->st_value;
+        return 0;
+    }
+
     /* generic map reference relocation */
     if (type == LIBBPF_MAP_UNSPEC) {
         if (!bpf_object__shndx_is_maps(obj, shdr_idx)) {
@@ -4385,7 +4401,7 @@ static int bpf_program__record_reloc(struct
bpf_program *prog,

     reloc_desc->type = RELO_DATA;
     reloc_desc->insn_idx = insn_idx;
-    reloc_desc->map_idx = map->arena ? map->arena - obj->maps : map_idx;
+    reloc_desc->map_idx = map_idx;
     reloc_desc->sym_off = sym->st_value;
     return 0;
 }
@@ -4872,8 +4888,6 @@ bpf_object__populate_internal_map(struct
bpf_object *obj, struct bpf_map *map)
             bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);
         return 0;
     }
-    if (map_type == LIBBPF_MAP_ARENA)
-        return 0;

     err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);
     if (err) {
@@ -5166,15 +5180,6 @@ bpf_object__create_maps(struct bpf_object *obj)
         if (bpf_map__is_internal(map) && !kernel_supports(obj,
FEAT_GLOBAL_DATA))
             map->autocreate = false;

-        if (map->libbpf_type == LIBBPF_MAP_ARENA) {
-            size_t len = bpf_map_mmap_sz(map);
-
-            memcpy(map->arena->mmaped, map->mmaped, len);
-            map->autocreate = false;
-            munmap(map->mmaped, len);
-            map->mmaped = NULL;
-        }
-
         if (!map->autocreate) {
             pr_debug("map '%s': skipped auto-creating...\n", map->name);
             continue;
@@ -5229,6 +5234,10 @@ bpf_object__create_maps(struct bpf_object *obj)
                         map->name, err);
                     return err;
                 }
+                if (obj->arena_data) {
+                    memcpy(map->mmaped, obj->arena_data, obj->arena_data_sz);
+                    zfree(&obj->arena_data);
+                }
             }
             if (map->init_slots_sz && map->def.type !=
BPF_MAP_TYPE_PROG_ARRAY) {
                 err = init_map_in_map_slots(obj, map);
@@ -8716,13 +8725,9 @@ static void bpf_map__destroy(struct bpf_map *map)
     zfree(&map->init_slots);
     map->init_slots_sz = 0;

-    if (map->mmaped) {
-        size_t mmap_sz;
-
-        mmap_sz = bpf_map_mmap_sz(map);
-        munmap(map->mmaped, mmap_sz);
-        map->mmaped = NULL;
-    }
+    if (map->mmaped && map->mmaped != map->obj->arena_data)
+        munmap(map->mmaped, bpf_map_mmap_sz(map));
+    map->mmaped = NULL;

     if (map->st_ops) {
         zfree(&map->st_ops->data);
@@ -8782,6 +8787,8 @@ void bpf_object__close(struct bpf_object *obj)
     if (obj->token_fd > 0)
         close(obj->token_fd);

+    zfree(&obj->arena_data);
+
     free(obj);
 }

@@ -9803,8 +9810,6 @@ static bool map_uses_real_name(const struct bpf_map *map)
         return true;
     if (map->libbpf_type == LIBBPF_MAP_RODATA &&
strcmp(map->real_name, RODATA_SEC) != 0)
         return true;
-    if (map->libbpf_type == LIBBPF_MAP_ARENA)
-        return true;
     return false;
 }

@@ -10006,22 +10011,35 @@ __u32 bpf_map__btf_value_type_id(const
struct bpf_map *map)
 int bpf_map__set_initial_value(struct bpf_map *map,
                    const void *data, size_t size)
 {
+    size_t actual_sz;
+
     if (map->obj->loaded || map->reused)
         return libbpf_err(-EBUSY);

-    if (!map->mmaped || map->libbpf_type == LIBBPF_MAP_KCONFIG ||
-        size != map->def.value_size)
+    if (!map->mmaped || map->libbpf_type == LIBBPF_MAP_KCONFIG)
+        return libbpf_err(-EINVAL);
+
+    if (map->def.type == BPF_MAP_TYPE_ARENA)
+        actual_sz = map->obj->arena_data_sz;
+    else
+        actual_sz = map->def.value_size;
+    if (size != actual_sz)
         return libbpf_err(-EINVAL);

     memcpy(map->mmaped, data, size);
     return 0;
 }

-void *bpf_map__initial_value(struct bpf_map *map, size_t *psize)
+void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
 {
     if (!map->mmaped)
         return NULL;
-    *psize = map->def.value_size;
+
+    if (map->def.type == BPF_MAP_TYPE_ARENA)
+        *psize = map->obj->arena_data_sz;
+    else
+        *psize = map->def.value_size;
+
     return map->mmaped;
 }

@@ -13510,8 +13528,8 @@ int bpf_object__load_skeleton(struct
bpf_object_skeleton *s)
             continue;
         }

-        if (map->arena) {
-            *mmaped = map->arena->mmaped;
+        if (map->def.type == BPF_MAP_TYPE_ARENA) {
+            *mmaped = map->mmaped;
             continue;
         }

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 5723cbbfcc41..7b510761f545 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1014,7 +1014,7 @@ LIBBPF_API int bpf_map__set_map_extra(struct
bpf_map *map, __u64 map_extra);

 LIBBPF_API int bpf_map__set_initial_value(struct bpf_map *map,
                       const void *data, size_t size);
-LIBBPF_API void *bpf_map__initial_value(struct bpf_map *map, size_t *psize);
+LIBBPF_API void *bpf_map__initial_value(const struct bpf_map *map,
size_t *psize);

 /**
  * @brief **bpf_map__is_internal()** tells the caller whether or not the

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-15 20:50         ` Alexei Starovoitov
  2024-02-15 21:26           ` Linus Torvalds
@ 2024-02-16  1:34           ` Alexei Starovoitov
  2024-02-16  9:31           ` Christoph Hellwig
  2 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-16  1:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Thu, Feb 15, 2024 at 12:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> >
> > So propose an API that does that instead of exposing random low-level
> > details.
>
> The generic_ioremap_prot() and vmap() APIs make sense for the cases
> when phys memory exists with known size. It needs to vmap-ed and
> not touched after.
> bpf_arena use case is similar to kasan which
> reserves a giant virtual memory region, and then
> does apply_to_page_range() to populate certain pte-s with pages in that region,
> and later apply_to_existing_page_range() to free pages in kasan's region.
>
> bpf_arena is very similar, except it currently calls get_vm_area()
> to get a 4Gb+guard_pages region, and then vmap_pages_range() to
> populate a page in it, and vunmap_range() to remove a page.
>
> These existing api-s work, so not sure what you're requesting.
> I can guess many different things, but pls clarify to reduce
> this back and forth.
> Are you worried about range checking? That vmap_pages_range()
> can accidently hit an unintended range?

Guessing... like this?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..3bc67b526272 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -635,6 +635,18 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
        return err;
 }

+
+int vm_area_map_pages(struct vm_struct *area, unsigned long addr, unsigned int count,
+                     struct page **pages)
+{
+       unsigned long size = ((unsigned long)count) * PAGE_SIZE;
+       unsigned long end = addr + size;
+
+       if (addr < (unsigned long)area->addr || (void *)end > area->addr + area->size)
+               return -EINVAL;
+       return vmap_pages_range(addr, end, PAGE_KERNEL, pages, PAGE_SHIFT);
+}

In addition, we could conditionally silence the WARN_ON()s in vmap_pages_pte_range(),
but imo that's overkill.
What did you have in mind?
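
For illustration, this is roughly how bpf_arena could sit on top of such a
helper (a sketch of the pattern described upthread; get_vm_area() is an
existing kernel API and vm_area_map_pages() is the helper proposed above,
while GUARD_SZ, kaddr and the page allocation are placeholders):

struct vm_struct *area;
struct page *page;
int err;

/* reserve the 4Gb+guard virtual range once, at arena creation time */
area = get_vm_area(SZ_4G + GUARD_SZ, VM_MAP);
if (!area)
        return -ENOMEM;

/* later, when the bpf prog (or a user page fault) needs a page at kaddr */
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page)
        return -ENOMEM;
err = vm_area_map_pages(area, kaddr, 1, &page);   /* range-checked against area */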

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-15 23:22               ` Andrii Nakryiko
@ 2024-02-16  2:45                 ` Alexei Starovoitov
  2024-02-16  4:51                   ` Andrii Nakryiko
  0 siblings, 1 reply; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-16  2:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Eduard Zingerman, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Thu, Feb 15, 2024 at 3:22 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
>  {
> @@ -2835,6 +2819,33 @@ static int
> bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict,
>              return err;
>      }
>
> +    for (i = 0; i < obj->nr_maps; i++) {
> +        struct bpf_map *map = &obj->maps[i];
> +
> +        if (map->def.type != BPF_MAP_TYPE_ARENA)
> +            continue;
> +
> +        if (obj->arena_map) {
> +            pr_warn("map '%s': only single ARENA map is supported
> (map '%s' is also ARENA)\n",
> +                map->name, obj->arena_map->name);
> +            return -EINVAL;
> +        }
> +        obj->arena_map = map;
> +
> +        if (obj->efile.arena_data) {
> +            err = init_arena_map_data(obj, map, ARENA_SEC, obj->efile.arena_data_shndx,
> +                          obj->efile.arena_data->d_buf,
> +                          obj->efile.arena_data->d_size);
> +            if (err)
> +                return err;
> +        }
> +    }
> +    if (obj->efile.arena_data && !obj->arena_map) {
> +        pr_warn("elf: sec '%s': to use global __arena variables the
> ARENA map should be explicitly declared in SEC(\".maps\")\n",
> +            ARENA_SEC);
> +        return -ENOENT;
> +    }
> +
>      return 0;
>  }
>
> @@ -3699,9 +3710,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
>                  obj->efile.st_ops_link_data = data;
>                  obj->efile.st_ops_link_shndx = idx;
>              } else if (strcmp(name, ARENA_SEC) == 0) {
> -                sec_desc->sec_type = SEC_ARENA;
> -                sec_desc->shdr = sh;
> -                sec_desc->data = data;
> +                obj->efile.arena_data = data;
> +                obj->efile.arena_data_shndx = idx;

I see. So these two are sort-of main tricks.
Special case ARENA_SEC like ".maps" and then look for this
obj level map in the right spots.
The special case around bpf_map__[set_]initial_value kind of breaks
the layering with:
if (map->def.type == BPF_MAP_TYPE_ARENA)
  actual_sz = map->obj->arena_data_sz;
but no big deal.

How do you want me to squash the patches?
Keep "rename is_internal_mmapable_map into is_mmapable_map" patch as-is
and then squash mine and your 2nd and 3rd?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables.
  2024-02-16  2:45                 ` Alexei Starovoitov
@ 2024-02-16  4:51                   ` Andrii Nakryiko
  0 siblings, 0 replies; 112+ messages in thread
From: Andrii Nakryiko @ 2024-02-16  4:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm, Kernel Team

On Thu, Feb 15, 2024 at 6:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 15, 2024 at 3:22 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> >  {
> > @@ -2835,6 +2819,33 @@ static int
> > bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict,
> >              return err;
> >      }
> >
> > +    for (i = 0; i < obj->nr_maps; i++) {
> > +        struct bpf_map *map = &obj->maps[i];
> > +
> > +        if (map->def.type != BPF_MAP_TYPE_ARENA)
> > +            continue;
> > +
> > +        if (obj->arena_map) {
> > +            pr_warn("map '%s': only single ARENA map is supported
> > (map '%s' is also ARENA)\n",
> > +                map->name, obj->arena_map->name);
> > +            return -EINVAL;
> > +        }
> > +        obj->arena_map = map;
> > +
> > +        if (obj->efile.arena_data) {
> > +            err = init_arena_map_data(obj, map, ARENA_SEC, obj->efile.arena_data_shndx,
> > +                          obj->efile.arena_data->d_buf,
> > +                          obj->efile.arena_data->d_size);
> > +            if (err)
> > +                return err;
> > +        }
> > +    }
> > +    if (obj->efile.arena_data && !obj->arena_map) {
> > +        pr_warn("elf: sec '%s': to use global __arena variables the
> > ARENA map should be explicitly declared in SEC(\".maps\")\n",
> > +            ARENA_SEC);
> > +        return -ENOENT;
> > +    }
> > +
> >      return 0;
> >  }
> >
> > @@ -3699,9 +3710,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
> >                  obj->efile.st_ops_link_data = data;
> >                  obj->efile.st_ops_link_shndx = idx;
> >              } else if (strcmp(name, ARENA_SEC) == 0) {
> > -                sec_desc->sec_type = SEC_ARENA;
> > -                sec_desc->shdr = sh;
> > -                sec_desc->data = data;
> > +                obj->efile.arena_data = data;
> > +                obj->efile.arena_data_shndx = idx;
>
> I see. So these two are sort-of main tricks.
> Special case ARENA_SEC like ".maps" and then look for this
> obj level map in the right spots.

yep

> The special case around bpf_map__[set_]initial_value kind of breaks
> the layering with:
> if (map->def.type == BPF_MAP_TYPE_ARENA)
>   actual_sz = map->obj->arena_data_sz;
> but no big deal.
>

true, and struct_ops will be another special case, so we might want to
think about generalizing that a bit, but that's a separate thing we
can handle later on

> How do you want me to squash the patches?
> Keep "rename is_internal_mmapable_map into is_mmapable_map" patch as-is

yep

> and then squash mine and your 2nd and 3rd?

I think `libbpf: move post-creation steps for ARENA map` should be
squashed into your `libbpf: Add support for bpf_arena.` which
introduces ARENA map by itself. And then `libbpf: Recognize __arena
global varaibles.` and `libbpf: remove fake __arena_internal map` can
be squashed together as well.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-15 20:50         ` Alexei Starovoitov
  2024-02-15 21:26           ` Linus Torvalds
  2024-02-16  1:34           ` Alexei Starovoitov
@ 2024-02-16  9:31           ` Christoph Hellwig
  2024-02-16 16:54             ` Alexei Starovoitov
  2 siblings, 1 reply; 112+ messages in thread
From: Christoph Hellwig @ 2024-02-16  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Linus Torvalds, bpf, Daniel Borkmann,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> So, since apply_to_page_range() is available to the kernel
> (xen, gpu, kasan, etc) then I see no reason why
> vmap_pages_range() shouldn't be available as well, since:

In case it wasn't clear before:  apply_to_page_range is a bad API to
be exported.  We've been working on removing it but it stalled.
Exposing something that allows a module to change arbitrary page table
bits is not a good idea.

Please take a step back and think of how to expose a vmalloc like
allocation that grows only when used as a proper abstraction.  I could
actually think of various other uses for it.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-16  9:31           ` Christoph Hellwig
@ 2024-02-16 16:54             ` Alexei Starovoitov
  2024-02-16 17:18               ` Uladzislau Rezki
  2024-02-20  6:57               ` Christoph Hellwig
  0 siblings, 2 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-16 16:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, bpf, Daniel Borkmann, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo, Barret Rhoden,
	Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > So, since apply_to_page_range() is available to the kernel
> > (xen, gpu, kasan, etc) then I see no reason why
> > vmap_pages_range() shouldn't be available as well, since:
>
> In case it wasn't clear before:  apply_to_page_range is a bad API to
> be exported.  We've been working on removing it but it stalled.
> Exposing something that allows a module to change arbitrary page table
> bits is not a good idea.

I never said that that module should do that.

> Please take a step back and think of how to expose a vmalloc like
> allocation that grows only when used as a proper abstraction.  I could
> actually think of various other uses for it.

"vmalloc like allocation that grows" is not what I'm after.
I need 4G+guard region at the start.
Please read my earlier email and reply to my questions and api proposals.
Replying to half of the sentence, and out of context, is not a
productive discussion.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-16 16:54             ` Alexei Starovoitov
@ 2024-02-16 17:18               ` Uladzislau Rezki
  2024-02-18  2:06                 ` Alexei Starovoitov
  2024-02-20  6:57               ` Christoph Hellwig
  1 sibling, 1 reply; 112+ messages in thread
From: Uladzislau Rezki @ 2024-02-16 17:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Linus Torvalds, bpf, Daniel Borkmann,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > > So, since apply_to_page_range() is available to the kernel
> > > (xen, gpu, kasan, etc) then I see no reason why
> > > vmap_pages_range() shouldn't be available as well, since:
> >
> > In case it wasn't clear before:  apply_to_page_range is a bad API to
> > be exported.  We've been working on removing it but it stalled.
> > Exposing something that allows a module to change arbitrary page table
> > bits is not a good idea.
> 
> I never said that that module should do that.
> 
> > Please take a step back and think of how to expose a vmalloc like
> > allocation that grows only when used as a proper abstraction.  I could
> > actually think of various other uses for it.
> 
> "vmalloc like allocation that grows" is not what I'm after.
> I need 4G+guard region at the start.
> Please read my earlier email and reply to my questions and api proposals.
> Replying to half of the sentence, and out of context, is not a
> productive discussion.
>
1. The concern here is that this interface, which you would like to add,
exposes the "addr", "end" to upper layer, so fake values can easily be
passed to vmap internals.

2. Other users can start using this API/function which is hidden now
and is not supposed to be used outside of vmap code. Because it is a
static helper.

3. It opens new dependencies which we would like to avoid. As a second
step someone will want to dump such a region (4G+guard region) over
/proc/vmallocinfo to see what is mapped, which requires a certain amount
of tracking.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-16 17:18               ` Uladzislau Rezki
@ 2024-02-18  2:06                 ` Alexei Starovoitov
  0 siblings, 0 replies; 112+ messages in thread
From: Alexei Starovoitov @ 2024-02-18  2:06 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Christoph Hellwig, Linus Torvalds, bpf, Daniel Borkmann,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	linux-mm, Kernel Team

On Fri, Feb 16, 2024 at 9:18 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> > On Fri, Feb 16, 2024 at 1:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Feb 15, 2024 at 12:50:55PM -0800, Alexei Starovoitov wrote:
> > > > So, since apply_to_page_range() is available to the kernel
> > > > (xen, gpu, kasan, etc) then I see no reason why
> > > > vmap_pages_range() shouldn't be available as well, since:
> > >
> > > In case it wasn't clear before:  apply_to_page_range is a bad API to
> > > be exported.  We've been working on removing it but it stalled.
> > > Exposing something that allows a module to change arbitrary page table
> > > bits is not a good idea.
> >
> > I never said that that module should do that.
> >
> > > Please take a step back and think of how to expose a vmalloc like
> > > allocation that grows only when used as a proper abstraction.  I could
> > > actually think of various other uses for it.
> >
> > "vmalloc like allocation that grows" is not what I'm after.
> > I need 4G+guard region at the start.
> > Please read my earlier email and reply to my questions and api proposals.
> > Replying to half of the sentence, and out of context, is not a
> > productive discussion.
> >
> 1. The concern here is that this interface, which you would like to add,
> exposes the "addr", "end" to upper layer, so fake values can easily be
> passed to vmap internals.
>
> 2. Other users can start using this API/function which is hidden now
> and is not supposed to be used outside of vmap code. Because it is a
> static helper.

I suspect you're replying to the original patch that just
makes vmap_pages_range() external.
It was discarded already.
The request for feedback is for vm_area_map_pages proposal upthread:

+int vm_area_map_pages(struct vm_struct *area, unsigned long addr, unsigned int count,
+                     struct page **pages)

There is no "end" and "addr" is range checked.

> 3. It opens new dependencies which we would like to avoid. As a second
> step someone will want to dump such a region (4G+guard region) over
> /proc/vmallocinfo to see what is mapped, which requires a certain amount
> of tracking.

What do you mean by "dump over /proc/vmallocinfo" ?
Privileged user space can see the start/end of every region.
And if some regions have all pages mapped and others don't, so?
vmallocinfo is racy. By the time user space sees the range
it can be unmapped already.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel.
  2024-02-16 16:54             ` Alexei Starovoitov
  2024-02-16 17:18               ` Uladzislau Rezki
@ 2024-02-20  6:57               ` Christoph Hellwig
  1 sibling, 0 replies; 112+ messages in thread
From: Christoph Hellwig @ 2024-02-20  6:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christoph Hellwig, Linus Torvalds, bpf, Daniel Borkmann,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Eddy Z, Tejun Heo,
	Barret Rhoden, Johannes Weiner, Lorenzo Stoakes, Andrew Morton,
	Uladzislau Rezki, linux-mm, Kernel Team

On Fri, Feb 16, 2024 at 08:54:08AM -0800, Alexei Starovoitov wrote:
> "vmalloc like allocation that grows" is not what I'm after.
> I need 4G+guard region at the start.
> Please read my earlier email and reply to my questions and api proposals.
> Replying to half of the sentence, and out of context, is not a
> productive discussion.

If you can't explain what you are trying to do to the reviewers, you're
doing something wrong.

^ permalink raw reply	[flat|nested] 112+ messages in thread

end of thread, other threads:[~2024-02-20  6:57 UTC | newest]

Thread overview: 112+ messages
2024-02-09  4:05 [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena Alexei Starovoitov
2024-02-09  4:05 ` [PATCH v2 bpf-next 01/20] bpf: Allow kfuncs return 'void *' Alexei Starovoitov
2024-02-10  6:49   ` Kumar Kartikeya Dwivedi
2024-02-09  4:05 ` [PATCH v2 bpf-next 02/20] bpf: Recognize '__map' suffix in kfunc arguments Alexei Starovoitov
2024-02-10  6:52   ` Kumar Kartikeya Dwivedi
2024-02-09  4:05 ` [PATCH v2 bpf-next 03/20] bpf: Plumb get_unmapped_area() callback into bpf_map_ops Alexei Starovoitov
2024-02-10  0:06   ` kernel test robot
2024-02-10  0:17   ` kernel test robot
2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
2024-02-10  9:06   ` kernel test robot
2024-02-09  4:05 ` [PATCH v2 bpf-next 04/20] mm: Expose vmap_pages_range() to the rest of the kernel Alexei Starovoitov
2024-02-10  7:04   ` Kumar Kartikeya Dwivedi
2024-02-14  8:36   ` Christoph Hellwig
2024-02-14 20:53     ` Alexei Starovoitov
2024-02-15  6:58       ` Christoph Hellwig
2024-02-15 20:50         ` Alexei Starovoitov
2024-02-15 21:26           ` Linus Torvalds
2024-02-16  1:34           ` Alexei Starovoitov
2024-02-16  9:31           ` Christoph Hellwig
2024-02-16 16:54             ` Alexei Starovoitov
2024-02-16 17:18               ` Uladzislau Rezki
2024-02-18  2:06                 ` Alexei Starovoitov
2024-02-20  6:57               ` Christoph Hellwig
2024-02-09  4:05 ` [PATCH v2 bpf-next 05/20] bpf: Introduce bpf_arena Alexei Starovoitov
2024-02-09 20:36   ` David Vernet
2024-02-10  4:38     ` Alexei Starovoitov
2024-02-10  7:40   ` Kumar Kartikeya Dwivedi
2024-02-12 18:21     ` Alexei Starovoitov
2024-02-12 15:56   ` Barret Rhoden
2024-02-12 18:23     ` Alexei Starovoitov
2024-02-13 23:14   ` Andrii Nakryiko
2024-02-13 23:29     ` Alexei Starovoitov
2024-02-14  0:03       ` Andrii Nakryiko
2024-02-14  0:14         ` Alexei Starovoitov
2024-02-09  4:05 ` [PATCH v2 bpf-next 06/20] bpf: Disasm support for cast_kern/user instructions Alexei Starovoitov
2024-02-10  7:41   ` Kumar Kartikeya Dwivedi
2024-02-09  4:05 ` [PATCH v2 bpf-next 07/20] bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions Alexei Starovoitov
2024-02-09 17:20   ` Eduard Zingerman
2024-02-13 22:20     ` Alexei Starovoitov
2024-02-10  6:48   ` Kumar Kartikeya Dwivedi
2024-02-13 22:00     ` Alexei Starovoitov
2024-02-09  4:05 ` [PATCH v2 bpf-next 08/20] bpf: Add x86-64 JIT support for bpf_cast_user instruction Alexei Starovoitov
2024-02-10  1:15   ` Eduard Zingerman
2024-02-10  8:40   ` Kumar Kartikeya Dwivedi
2024-02-13 22:28     ` Alexei Starovoitov
2024-02-09  4:05 ` [PATCH v2 bpf-next 09/20] bpf: Recognize cast_kern/user instructions in the verifier Alexei Starovoitov
2024-02-10  1:13   ` Eduard Zingerman
2024-02-13  2:58     ` Alexei Starovoitov
2024-02-13 12:01       ` Eduard Zingerman
2024-02-09  4:05 ` [PATCH v2 bpf-next 10/20] bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA Alexei Starovoitov
2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
2024-02-13 23:14   ` Andrii Nakryiko
2024-02-14  0:26     ` Alexei Starovoitov
2024-02-09  4:05 ` [PATCH v2 bpf-next 11/20] libbpf: Add __arg_arena to bpf_helpers.h Alexei Starovoitov
2024-02-10  8:51   ` Kumar Kartikeya Dwivedi
2024-02-13 23:14   ` Andrii Nakryiko
2024-02-09  4:06 ` [PATCH v2 bpf-next 12/20] libbpf: Add support for bpf_arena Alexei Starovoitov
2024-02-10  7:16   ` Kumar Kartikeya Dwivedi
2024-02-12 19:11     ` Andrii Nakryiko
2024-02-12 22:29       ` Kumar Kartikeya Dwivedi
2024-02-12 18:12   ` Eduard Zingerman
2024-02-12 20:14     ` Alexei Starovoitov
2024-02-12 20:21       ` Eduard Zingerman
2024-02-13 23:15   ` Andrii Nakryiko
2024-02-14  0:32     ` Alexei Starovoitov
2024-02-09  4:06 ` [PATCH v2 bpf-next 13/20] libbpf: Allow specifying 64-bit integers in map BTF Alexei Starovoitov
2024-02-12 18:58   ` Eduard Zingerman
2024-02-13 23:15   ` Andrii Nakryiko
2024-02-14  0:47     ` Alexei Starovoitov
2024-02-14  0:51       ` Andrii Nakryiko
2024-02-09  4:06 ` [PATCH v2 bpf-next 14/20] libbpf: Recognize __arena global variables Alexei Starovoitov
2024-02-13  0:34   ` Eduard Zingerman
2024-02-13  0:44     ` Alexei Starovoitov
2024-02-13  0:49       ` Eduard Zingerman
2024-02-13  2:08         ` Alexei Starovoitov
2024-02-13 12:48           ` Eduard Zingerman
2024-02-13 23:11   ` Eduard Zingerman
2024-02-13 23:17     ` Andrii Nakryiko
2024-02-13 23:36       ` Eduard Zingerman
2024-02-14  0:09         ` Andrii Nakryiko
2024-02-14  0:16           ` Eduard Zingerman
2024-02-14  0:29             ` Andrii Nakryiko
2024-02-14  1:24           ` Alexei Starovoitov
2024-02-14 17:24             ` Andrii Nakryiko
2024-02-15 23:22               ` Andrii Nakryiko
2024-02-16  2:45                 ` Alexei Starovoitov
2024-02-16  4:51                   ` Andrii Nakryiko
2024-02-14  1:02     ` Alexei Starovoitov
2024-02-14 15:10       ` Eduard Zingerman
2024-02-13 23:15   ` Andrii Nakryiko
2024-02-09  4:06 ` [PATCH v2 bpf-next 15/20] bpf: Tell bpf programs kernel's PAGE_SIZE Alexei Starovoitov
2024-02-10  8:52   ` Kumar Kartikeya Dwivedi
2024-02-09  4:06 ` [PATCH v2 bpf-next 16/20] bpf: Add helper macro bpf_arena_cast() Alexei Starovoitov
2024-02-10  8:54   ` Kumar Kartikeya Dwivedi
2024-02-13 22:35     ` Alexei Starovoitov
2024-02-14 16:47       ` Eduard Zingerman
2024-02-14 17:45         ` Alexei Starovoitov
2024-02-09  4:06 ` [PATCH v2 bpf-next 17/20] selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages Alexei Starovoitov
2024-02-09 23:14   ` David Vernet
2024-02-10  4:35     ` Alexei Starovoitov
2024-02-10  7:03       ` Kumar Kartikeya Dwivedi
2024-02-13 23:19         ` Alexei Starovoitov
2024-02-12 16:48       ` David Vernet
2024-02-09  4:06 ` [PATCH v2 bpf-next 18/20] selftests/bpf: Add bpf_arena_list test Alexei Starovoitov
2024-02-09  4:06 ` [PATCH v2 bpf-next 19/20] selftests/bpf: Add bpf_arena_htab test Alexei Starovoitov
2024-02-09  4:06 ` [PATCH v2 bpf-next 20/20] selftests/bpf: Convert simple page_frag allocator to per-cpu Alexei Starovoitov
2024-02-10  7:05   ` Kumar Kartikeya Dwivedi
2024-02-14  1:37     ` Alexei Starovoitov
2024-02-12 14:14 ` [PATCH v2 bpf-next 00/20] bpf: Introduce BPF arena David Hildenbrand
2024-02-12 18:14   ` Alexei Starovoitov
2024-02-13 10:35     ` David Hildenbrand
2024-02-12 17:36 ` Barret Rhoden
