* [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
@ 2022-11-17 20:23 Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

This patchset tries to address the following issues:

1. Direct map fragmentation

On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
RO+X. These set_memory_* calls cause 1GB page table entries to be split
into 2MB and 4kB ones. This fragmentation of the direct map results in
bigger and slower page tables, and in pressure on both the instruction and
data TLB.

Our previous work in bpf_prog_pack tries to address this issue from the
BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
has greatly reduced direct map fragmentation from BPF programs.

2. iTLB pressure from BPF program

Dynamic kernel text such as modules and BPF programs (even with the current
bpf_prog_pack) uses 4kB pages on x86. When the total size of modules and
BPF programs is big, we can see a visible performance drop caused by the
high iTLB miss rate.

3. TLB shootdown for short-living BPF programs

Before bpf_prog_pack, loading and unloading BPF programs required a global
TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
TLB flush.

4. Reduce memory usage by BPF programs (in some cases)

Most BPF programs and various trampolines are small, yet they often
occupy a whole page each. On a random server in our fleet, 50% of the
loaded BPF programs are less than 500 bytes in size, and 75% of them are
less than 2kB in size. Allowing these BPF programs to share 2MB pages
would yield some memory savings for systems with many BPF programs. For
systems with only a small number of BPF programs, this patchset may waste
a little memory by allocating one 2MB page but using only part of it.

5. Introduce a unified API to allocate memory with special permissions.

This will help get rid of set_vm_flush_reset_perms calls from users of
vmalloc, module_alloc, etc.
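
For illustration only, today users like bpf_prog_pack (see the code removed
in patch 4/6) need a sequence like the following to get RO+X memory; with
this set, a single execmem_alloc() call replaces it (sketch, error handling
omitted):

  /* current pattern: allocate, then flip permissions */
  ptr = module_alloc(size);
  set_vm_flush_reset_perms(ptr);
  set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
  set_memory_x((unsigned long)ptr, size / PAGE_SIZE);

  /* with this set: memory comes back RO+X already */
  ptr = execmem_alloc(size, align);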


Based on our experiments [5], we measured ~0.6% performance improvement
from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
The difference is because bpf_prog_pack uses 512x 4kB pages instead of
1x 2MB page, so bpf_prog_pack as-is doesn't resolve #2 above.

This patchset replaces bpf_prog_pack with a better API and makes it
available for other dynamic kernel text, such as modules, ftrace, kprobe.


This set enables bpf programs and bpf dispatchers to share huge pages with
the new APIs:
  execmem_alloc()
  execmem_free()
  execmem_fill()

The idea is similar to Peter's suggestion in [1].

execmem_alloc() manages a set of PMD_SIZE RO+X memory regions, and allocates
this memory to its users. execmem_free() is used to free memory allocated
by execmem_alloc(). execmem_fill() is used to update memory allocated by
execmem_alloc().

Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
The caller has to update the content with a text_poke-like mechanism.
Specifically, execmem_fill() is provided to update memory allocated by
execmem_alloc(). execmem_fill() also makes sure the update stays within the
boundary of one chunk allocated by execmem_alloc(). Please refer to patch
1/6 for more details.
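
To illustrate the intended usage, here is a minimal, hypothetical caller
(names like rw_buf and prog_len are made up, and error handling is
abbreviated):

  void *ro_buf;

  ro_buf = execmem_alloc(prog_len, 64);    /* returns RO+X memory */
  if (!ro_buf)
          return -ENOMEM;

  /* ro_buf cannot be written directly; copy through execmem_fill() */
  if (IS_ERR(execmem_fill(ro_buf, rw_buf, prog_len))) {
          execmem_free(ro_buf);
          return -EINVAL;
  }

  /* ... run the code ... */

  execmem_free(ro_buf);                    /* local TLB flush only */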

Patch 4/6 uses these new APIs in bpf program and bpf dispatcher.

Patches 5/6 and 6/6 allow static kernel text (_stext to _etext) to share
PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
[3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
[4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
[5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

Changes v3 => v4:
1. Fixed a bug found with test_vmalloc.

Changes v2 => v3:
1. Add selftests in 4/6. (Luis Chamberlain)
2. Add more motivation and test results. (Luis Chamberlain)
3. Fix error handling in execmem_alloc().

Changes PATCH v1 => v2:
1. Rename the APIs as execmem_* (Christoph Hellwig)
2. Add more information about the motivation of this work (and follow-up
   work for kernel modules, various trampolines, etc).
   (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
3. Include experimental results from previous bpf_prog_pack and the
   community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)

Changes RFC v2 => PATCH v1:
1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
   also ensures vcopy_exec() is only used to update memory from one single
   vmalloc_exec() call. (Christoph Hellwig)
2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrapper for the
   text_poke() like logic.
3. Drop changes for kernel modules and focus on BPF side changes.

Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
   work fine with BPF programs (patches 1, 2, 4), but the module side
   (patch 3) still needs some work.

Song Liu (6):
  vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  x86/alternative: support execmem_alloc() and execmem_free()
  selftests/vm: extend test_vmalloc to test execmem_* APIs
  bpf: use execmem_alloc for bpf program and bpf dispatcher
  vmalloc: introduce register_text_tail_vm()
  x86: use register_text_tail_vm

 Documentation/x86/x86_64/mm.rst         |   4 +-
 arch/x86/include/asm/pgtable_64_types.h |   1 +
 arch/x86/kernel/alternative.c           |  12 +
 arch/x86/mm/init_64.c                   |   4 +-
 arch/x86/net/bpf_jit_comp.c             |  23 +-
 include/linux/bpf.h                     |   3 -
 include/linux/filter.h                  |   5 -
 include/linux/vmalloc.h                 |   9 +
 kernel/bpf/core.c                       | 180 +-----------
 kernel/bpf/dispatcher.c                 |  11 +-
 lib/test_vmalloc.c                      |  30 ++
 mm/nommu.c                              |  12 +
 mm/vmalloc.c                            | 362 ++++++++++++++++++++++++
 13 files changed, 452 insertions(+), 204 deletions(-)

--
2.30.2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
@ 2022-11-17 20:23 ` Song Liu
  2022-11-21 15:52   ` Daniel Borkmann
  2022-11-17 20:23 ` [PATCH bpf-next v4 2/6] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

execmem_alloc is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new tree of vmap_area, the free_text_area_* tree, is introduced in
addition to free_vmap_area_* and vmap_area_*. execmem_alloc allocates pages
from free_text_area_*. When there isn't enough space left in
free_text_area_*, new PMD_SIZE page(s) are allocated from free_vmap_area_*
and added to free_text_area_*. To be more accurate, the vmap_area is first
added to the vmap_area_* tree and then moved to free_text_area_*. This
extra move simplifies the logic of execmem_alloc.

vmap_areas in the free_text_area_* tree are backed by memory, but we need
subtree_max_size for tree operations. Therefore, the vm_structs for these
vmap_areas are stored in a separate list, all_text_vm.

The new tree allows separate handling of < PAGE_SIZE allocations, as the
current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
version of execmem_alloc can handle bpf programs, which use 64 byte
aligned allocations, and modules, which use PAGE_SIZE aligned
allocations.

Memory allocated by execmem_alloc() is set to RO+X before returning to the
caller. Therefore, the caller cannot write directly to the memory.
Instead, the caller is required to use execmem_fill() to update the memory.
For the safety and security of X memory, execmem_fill() checks that the
data being updated always stays within memory allocated by one
execmem_alloc() call. execmem_fill() uses a text_poke-like mechanism and
requires arch support. Specifically, the arch needs to implement
arch_fill_execmem().

In execmem_free(), the memory is first erased with arch_invalidate_execmem().
Then, the memory is added to free_text_area_*. If this free creates a big
enough contiguous free space (> PMD_SIZE), execmem_free() will try to free
the backing vm_struct.

Hopefully, this will be the first step towards a unified memory allocator
for memory with special permissions.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |   5 +
 mm/nommu.c              |  12 ++
 mm/vmalloc.c            | 328 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 345 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..30aa8c187d40 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller) __alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *execmem_alloc(unsigned long size, unsigned long align) __alloc_size(1);
+void *execmem_fill(void *dst, void *src, size_t len);
+void execmem_free(void *addr);
+void *arch_fill_execmem(void *dst, void *src, size_t len);
+int arch_invalidate_execmem(void *ptr, size_t len);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..e3039fd4f65b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 }
 EXPORT_SYMBOL(vm_map_pages_zero);
 
+void *execmem_alloc(unsigned long size, unsigned long align)
+{
+	return NULL;
+}
+
+void *execmem_fill(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+void execmem_free(const void *addr) { }
+
 /*
  *  sys_brk() for the most part doesn't need the global kernel
  *  lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..75ec8ab24929 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+#ifndef PMD_ALIGN
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#endif
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -769,6 +774,38 @@ static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * free_text_area for execmem_alloc()
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augment red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains a maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find a lowest match of free area.
+ *
+ * vmap_area in this tree are backed by RO+X memory, but they do
+ * not have valid vm pointer (because we need subtree_max_size).
+ * The vm for these vmap_area are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
 /*
  * Preload a CPU with one object for "no edge" split case. The
  * aim is to get rid of allocations from the atomic context, thus
@@ -3313,6 +3350,297 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define EXEC_MEM_START MODULES_VADDR
+#define EXEC_MEM_END MODULES_END
+#else
+#define EXEC_MEM_START VMALLOC_START
+#define EXEC_MEM_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+	struct vmap_area *va;
+
+	/* remove from vmap_area_root */
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* make the memory RO+X */
+	memset(addr, 0, va->va_end - va->va_start);
+	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+	spin_lock(&free_text_area_lock);
+	/* add to all_text_vm */
+	va->vm->next = all_text_vm;
+	all_text_vm = va->vm;
+
+	/* add to free_text_area_root */
+	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+	spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * execmem_alloc - allocate virtually contiguous RO+X memory
+ * @size:    allocation size
+ *
+ * This is used to allocate dynamic kernel text, such as module text, BPF
+ * programs, etc. User need to use text_poke to update the memory allocated
+ * by execmem_alloc.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *execmem_alloc(unsigned long size, unsigned long align)
+{
+	struct vmap_area *va, *tmp;
+	unsigned long addr;
+	enum fit_type type;
+	int ret;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!va))
+		return NULL;
+
+again:
+	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+	if (!tmp) {
+		unsigned long alloc_size;
+		void *ptr;
+
+		spin_unlock(&free_text_area_lock);
+
+		/*
+		 * Not enough continuous space in free_text_area_root, try
+		 * allocate more memory. The memory is first added to
+		 * vmap_area_root, and then moved to free_text_area_root.
+		 */
+		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, EXEC_MEM_START,
+					   EXEC_MEM_END, GFP_KERNEL, PAGE_KERNEL,
+					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (unlikely(!ptr))
+			goto err_out;
+
+		move_vmap_to_free_text_tree(ptr);
+		goto again;
+	}
+
+	addr = roundup(tmp->va_start, align);
+	type = classify_va_fit_type(tmp, addr, size);
+	if (WARN_ON_ONCE(type == NOTHING_FIT))
+		goto err_out_unlock;
+
+	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+				    tmp, addr, size);
+	if (ret)
+		goto err_out_unlock;
+
+	spin_unlock(&free_text_area_lock);
+
+	va->va_start = addr;
+	va->va_end = addr + size;
+	va->vm = NULL;
+
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	return (void *)addr;
+
+err_out_unlock:
+	spin_unlock(&free_text_area_lock);
+err_out:
+	kmem_cache_free(vmap_area_cachep, va);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(execmem_alloc);
+
+void __weak *arch_fill_execmem(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+int __weak arch_invalidate_execmem(void *ptr, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * execmem_fill - Copy text to RO+X memory allocated by execmem_alloc()
+ * @dst:  pointer to memory allocated by execmem_alloc()
+ * @src:  pointer to data being copied from
+ * @len:  number of bytes to be copied
+ *
+ * execmem_fill() will only update memory allocated by a single execmem_alloc()
+ * call. If dst + len goes beyond the boundary of one allocation,
+ * execmem_fill() is aborted.
+ *
+ * Return: @dst on success, or an ERR_PTR() on failure.
+ */
+void *execmem_fill(void *dst, void *src, size_t len)
+{
+	struct vmap_area *va;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
+
+	/*
+	 * If no va, or va has a vm attached, this memory is not allocated
+	 * by execmem_alloc().
+	 */
+	if (WARN_ON_ONCE(!va))
+		goto err_out;
+	if (WARN_ON_ONCE(va->vm))
+		goto err_out;
+
+	/* Disallow filling across execmem_alloc boundaries */
+	if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
+		goto err_out;
+
+	spin_unlock(&vmap_area_lock);
+
+	return arch_fill_execmem(dst, src, len);
+
+err_out:
+	spin_unlock(&vmap_area_lock);
+	return ERR_PTR(-EINVAL);
+}
+EXPORT_SYMBOL_GPL(execmem_fill);
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+	struct vm_struct *vm, *prev_vm;
+
+	lockdep_assert_held(&free_text_area_lock);
+
+	vm = all_text_vm;
+	while (vm) {
+		unsigned long vm_addr = (unsigned long)vm->addr;
+
+		/* vm is within this free space, we can free it */
+		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+			goto unlink_vm;
+		vm = vm->next;
+	}
+	return NULL;
+
+unlink_vm:
+	if (all_text_vm == vm) {
+		all_text_vm = vm->next;
+	} else {
+		prev_vm = all_text_vm;
+		while (prev_vm->next != vm)
+			prev_vm = prev_vm->next;
+		prev_vm->next = vm->next;
+	}
+	return vm;
+}
+
+/**
+ * execmem_free - Release memory allocated by execmem_alloc()
+ * @addr:  Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void execmem_free(void *addr)
+{
+	unsigned long free_start, free_end, free_addr;
+	struct vm_struct *vm;
+	struct vmap_area *va;
+
+	might_sleep();
+
+	if (!addr)
+		return;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	WARN_ON_ONCE(va->vm);
+
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* Invalidate text in the region */
+	arch_invalidate_execmem(addr, va->va_end - va->va_start);
+
+	spin_lock(&free_text_area_lock);
+	va = merge_or_add_vmap_area_augment(va,
+		&free_text_area_root, &free_text_area_list);
+
+	if (WARN_ON_ONCE(!va))
+		goto out;
+
+	free_start = PMD_ALIGN(va->va_start);
+	free_end = PMD_ALIGN_DOWN(va->va_end);
+
+	/*
+	 * Only try to free vm when there is at least one PMD_SIZE free
+	 * continuous memory.
+	 */
+	if (free_start >= free_end)
+		goto out;
+
+	/*
+	 * TODO: It is possible that multiple vm are ready to be freed
+	 * after one execmem_free(). But we free at most one vm for now.
+	 */
+	vm = find_and_unlink_text_vm(free_start, free_end);
+	if (!vm)
+		goto out;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+	if (unlikely(!va))
+		goto out_save_vm;
+
+	free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+				      vm->size, 1, (unsigned long)vm->addr,
+				      (unsigned long)vm->addr + vm->size);
+
+	if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+		goto out_save_vm;
+
+	va->va_start = (unsigned long)vm->addr;
+	va->va_end = va->va_start + vm->size;
+	va->vm = vm;
+	spin_unlock(&free_text_area_lock);
+
+	set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+	set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+	/* put the va to vmap_area_root, and then free it with vfree */
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	vfree(vm->addr);
+	return;
+
+out_save_vm:
+	/*
+	 * vm is removed from all_text_vm, but not freed. Add it back,
+	 * so that we can use or free it later.
+	 */
+	vm->next = all_text_vm;
+	all_text_vm = vm;
+out:
+	spin_unlock(&free_text_area_lock);
+}
+EXPORT_SYMBOL_GPL(execmem_free);
+
 /**
  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
  * @size:      allocation size
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 2/6] x86/alternative: support execmem_alloc() and execmem_free()
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
@ 2022-11-17 20:23 ` Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 3/6] selftests/vm: extend test_vmalloc to test execmem_* APIs Song Liu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

Implement arch_fill_execmem() and arch_invalidate_execmem() to support
execmem_alloc.

arch_fill_execmem() copies dynamic kernel text (such as BPF programs) to
the RO+X memory region allocated by execmem_alloc().

arch_invalidate_execmem() fills memory with 0xcc after it is released by
execmem_free().

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/kernel/alternative.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cadcea035e0..e38829f19a81 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1270,6 +1270,18 @@ void *text_poke_copy(void *addr, const void *opcode, size_t len)
 	return addr;
 }
 
+void *arch_fill_execmem(void *dst, void *src, size_t len)
+{
+	if (text_poke_copy(dst, src, len) == NULL)
+		return ERR_PTR(-EINVAL);
+	return dst;
+}
+
+int arch_invalidate_execmem(void *ptr, size_t len)
+{
+	return IS_ERR_OR_NULL(text_poke_set(ptr, 0xcc, len));
+}
+
 /**
  * text_poke_set - memset into (an unused part of) RX memory
  * @addr: address to modify
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 3/6] selftests/vm: extend test_vmalloc to test execmem_* APIs
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 2/6] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
@ 2022-11-17 20:23 ` Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 4/6] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

Add logic to test execmem_[alloc|fill|free] in test_vmalloc.c.
No need to change tools/testing/selftests/vm/test_vmalloc.sh.
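
For a quick manual run of just the new test case (assuming the usual
test_vmalloc module parameters), something like the following should work:

  modprobe test_vmalloc nr_threads=2 run_test_mask=1024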

Signed-off-by: Song Liu <song@kernel.org>
---
 lib/test_vmalloc.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index cf7780572f5b..6591c4932c3c 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -50,6 +50,7 @@ __param(int, run_test_mask, INT_MAX,
 		"\t\tid: 128,  name: pcpu_alloc_test\n"
 		"\t\tid: 256,  name: kvfree_rcu_1_arg_vmalloc_test\n"
 		"\t\tid: 512,  name: kvfree_rcu_2_arg_vmalloc_test\n"
+		"\t\tid: 1024, name: execmem_alloc_test\n"
 		/* Add a new test case description here. */
 );
 
@@ -352,6 +353,34 @@ kvfree_rcu_2_arg_vmalloc_test(void)
 	return 0;
 }
 
+static int
+execmem_alloc_test(void)
+{
+	void *p, *tmp;
+	int i;
+
+	for (i = 0; i < test_loop_count; i++) {
+		/* allocate variable size, up to 64kB */
+		size_t size = (i % 1024 + 1) * 64;
+
+		p = execmem_alloc(size, 64);
+		if (!p)
+			return -1;
+
+		tmp = execmem_fill(p, "a", 1);
+		if (tmp != p)
+			return -1;
+
+		tmp = execmem_fill(p + size - 1, "b", 1);
+		if (tmp != p + size - 1)
+			return -1;
+
+		execmem_free(p);
+	}
+
+	return 0;
+}
+
 struct test_case_desc {
 	const char *test_name;
 	int (*test_func)(void);
@@ -368,6 +397,7 @@ static struct test_case_desc test_case_array[] = {
 	{ "pcpu_alloc_test", pcpu_alloc_test },
 	{ "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test },
 	{ "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test },
+	{ "execmem_alloc_test", execmem_alloc_test },
 	/* Add a new test case here. */
 };
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 4/6] bpf: use execmem_alloc for bpf program and bpf dispatcher
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
                   ` (2 preceding siblings ...)
  2022-11-17 20:23 ` [PATCH bpf-next v4 3/6] selftests/vm: extend test_vmalloc to test execmem_* APIs Song Liu
@ 2022-11-17 20:23 ` Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 5/6] vmalloc: introduce register_text_tail_vm() Song Liu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

Use execmem_alloc, execmem_free, and execmem_fill instead of
bpf_prog_pack_alloc, bpf_prog_pack_free, and bpf_arch_text_copy.

execmem_free doesn't require extra size information. Therefore, the free
and error handling path can be simplified.

There are some tests that show the benefit of execmem_alloc.

Run 100 instances of the following benchmark from bpf selftests:
  tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe
which loads 7 BPF programs, and triggers one of them.

Then use perf to monitor TLB related counters:
   perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \
           itlb_misses.walk_completed_2m_4m -a

The following results are from a qemu VM with 32 cores.

Before bpf_prog_pack:
  iTLB-load-misses: 350k/s
  itlb_misses.walk_completed_4k: 90k/s
  itlb_misses.walk_completed_2m_4m: 0.1/s

With bpf_prog_pack (current upstream):
  iTLB-load-misses: 220k/s
  itlb_misses.walk_completed_4k: 68k/s
  itlb_misses.walk_completed_2m_4m: 0.2/s

With execmem_alloc (with this set):
  iTLB-load-misses: 185k/s
  itlb_misses.walk_completed_4k: 58k/s
  itlb_misses.walk_completed_2m_4m: 1/s

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c |  23 +----
 include/linux/bpf.h         |   3 -
 include/linux/filter.h      |   5 -
 kernel/bpf/core.c           | 180 +++---------------------------------
 kernel/bpf/dispatcher.c     |  11 +--
 5 files changed, 21 insertions(+), 201 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index cec5195602bc..43b93570d8f9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -229,11 +229,6 @@ static void jit_fill_hole(void *area, unsigned int size)
 	memset(area, 0xcc, size);
 }
 
-int bpf_arch_text_invalidate(void *dst, size_t len)
-{
-	return IS_ERR_OR_NULL(text_poke_set(dst, 0xcc, len));
-}
-
 struct jit_context {
 	int cleanup_addr; /* Epilogue code offset */
 
@@ -2509,11 +2504,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		if (proglen <= 0) {
 out_image:
 			image = NULL;
-			if (header) {
-				bpf_arch_text_copy(&header->size, &rw_header->size,
-						   sizeof(rw_header->size));
+			if (header)
 				bpf_jit_binary_pack_free(header, rw_header);
-			}
+
 			/* Fall back to interpreter mode */
 			prog = orig_prog;
 			if (extra_pass) {
@@ -2563,8 +2556,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		if (!prog->is_func || extra_pass) {
 			/*
 			 * bpf_jit_binary_pack_finalize fails in two scenarios:
-			 *   1) header is not pointing to proper module memory;
-			 *   2) the arch doesn't support bpf_arch_text_copy().
+			 *   1) header is not pointing to memory allocated by
+			 *      execmem_alloc;
+			 *   2) the arch doesn't support execmem_free().
 			 *
 			 * Both cases are serious bugs and justify WARN_ON.
 			 */
@@ -2610,13 +2604,6 @@ bool bpf_jit_supports_kfunc_call(void)
 	return true;
 }
 
-void *bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
-	if (text_poke_copy(dst, src, len) == NULL)
-		return ERR_PTR(-EINVAL);
-	return dst;
-}
-
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 54462dd28824..9018f4149310 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2733,9 +2733,6 @@ enum bpf_text_poke_type {
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
 
-void *bpf_arch_text_copy(void *dst, void *src, size_t len);
-int bpf_arch_text_invalidate(void *dst, size_t len);
-
 struct btf_id_set;
 bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index bf701976056e..4373ff388759 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1023,8 +1023,6 @@ extern long bpf_jit_limit_max;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size);
-
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
@@ -1037,9 +1035,6 @@ void bpf_jit_free(struct bpf_prog *fp);
 struct bpf_binary_header *
 bpf_jit_binary_pack_hdr(const struct bpf_prog *fp);
 
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns);
-void bpf_prog_pack_free(struct bpf_binary_header *hdr);
-
 static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 {
 	return list_empty(&fp->aux->ksym.lnode) ||
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9c16338bcbe8..86b4f640e3c0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -806,149 +806,6 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	return slot;
 }
 
-/*
- * BPF program pack allocator.
- *
- * Most BPF programs are pretty small. Allocating a hole page for each
- * program is sometime a waste. Many small bpf program also adds pressure
- * to instruction TLB. To solve this issue, we introduce a BPF program pack
- * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
- * to host BPF programs.
- */
-#define BPF_PROG_CHUNK_SHIFT	6
-#define BPF_PROG_CHUNK_SIZE	(1 << BPF_PROG_CHUNK_SHIFT)
-#define BPF_PROG_CHUNK_MASK	(~(BPF_PROG_CHUNK_SIZE - 1))
-
-struct bpf_prog_pack {
-	struct list_head list;
-	void *ptr;
-	unsigned long bitmap[];
-};
-
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size)
-{
-	memset(area, 0, size);
-}
-
-#define BPF_PROG_SIZE_TO_NBITS(size)	(round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
-
-static DEFINE_MUTEX(pack_mutex);
-static LIST_HEAD(pack_list);
-
-/* PMD_SIZE is not available in some special config, e.g. ARCH=arm with
- * CONFIG_MMU=n. Use PAGE_SIZE in these cases.
- */
-#ifdef PMD_SIZE
-#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())
-#else
-#define BPF_PROG_PACK_SIZE PAGE_SIZE
-#endif
-
-#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
-
-static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	struct bpf_prog_pack *pack;
-
-	pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)),
-		       GFP_KERNEL);
-	if (!pack)
-		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
-	if (!pack->ptr) {
-		kfree(pack);
-		return NULL;
-	}
-	bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE);
-	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
-	list_add_tail(&pack->list, &pack_list);
-
-	set_vm_flush_reset_perms(pack->ptr);
-	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	return pack;
-}
-
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
-	struct bpf_prog_pack *pack;
-	unsigned long pos;
-	void *ptr = NULL;
-
-	mutex_lock(&pack_mutex);
-	if (size > BPF_PROG_PACK_SIZE) {
-		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
-		if (ptr) {
-			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
-			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
-			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
-		}
-		goto out;
-	}
-	list_for_each_entry(pack, &pack_list, list) {
-		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-						 nbits, 0);
-		if (pos < BPF_PROG_CHUNK_COUNT)
-			goto found_free_area;
-	}
-
-	pack = alloc_new_pack(bpf_fill_ill_insns);
-	if (!pack)
-		goto out;
-
-	pos = 0;
-
-found_free_area:
-	bitmap_set(pack->bitmap, pos, nbits);
-	ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
-
-out:
-	mutex_unlock(&pack_mutex);
-	return ptr;
-}
-
-void bpf_prog_pack_free(struct bpf_binary_header *hdr)
-{
-	struct bpf_prog_pack *pack = NULL, *tmp;
-	unsigned int nbits;
-	unsigned long pos;
-
-	mutex_lock(&pack_mutex);
-	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
-		goto out;
-	}
-
-	list_for_each_entry(tmp, &pack_list, list) {
-		if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) {
-			pack = tmp;
-			break;
-		}
-	}
-
-	if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
-		goto out;
-
-	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
-	pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT;
-
-	WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
-		  "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
-
-	bitmap_clear(pack->bitmap, pos, nbits);
-	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
-		list_del(&pack->list);
-		module_memfree(pack->ptr);
-		kfree(pack);
-	}
-out:
-	mutex_unlock(&pack_mutex);
-}
-
 static atomic_long_t bpf_jit_current;
 
 /* Can be overridden by an arch's JIT compiler if it has a custom,
@@ -1048,6 +905,9 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 	bpf_jit_uncharge_modmem(size);
 }
 
+#define BPF_PROG_EXEC_ALIGN	64
+#define BPF_PROG_EXEC_MASK	(~(BPF_PROG_EXEC_ALIGN - 1))
+
 /* Allocate jit binary from bpf_prog_pack allocator.
  * Since the allocated memory is RO+X, the JIT engine cannot write directly
  * to the memory. To solve this problem, a RW buffer is also allocated at
@@ -1070,11 +930,11 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 		     alignment > BPF_IMAGE_ALIGNMENT);
 
 	/* add 16 bytes for a random section of illegal instructions */
-	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_CHUNK_SIZE);
+	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_EXEC_ALIGN);
 
 	if (bpf_jit_charge_modmem(size))
 		return NULL;
-	ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns);
+	ro_header = execmem_alloc(size, BPF_PROG_EXEC_ALIGN);
 	if (!ro_header) {
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
@@ -1082,8 +942,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 
 	*rw_header = kvmalloc(size, GFP_KERNEL);
 	if (!*rw_header) {
-		bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
-		bpf_prog_pack_free(ro_header);
+		execmem_free(ro_header);
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
 	}
@@ -1093,7 +952,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 	(*rw_header)->size = size;
 
 	hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),
-		     BPF_PROG_CHUNK_SIZE - sizeof(*ro_header));
+		     BPF_PROG_EXEC_ALIGN - sizeof(*ro_header));
 	start = prandom_u32_max(hole) & ~(alignment - 1);
 
 	*image_ptr = &ro_header->image[start];
@@ -1109,12 +968,12 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
 {
 	void *ptr;
 
-	ptr = bpf_arch_text_copy(ro_header, rw_header, rw_header->size);
+	ptr = execmem_fill(ro_header, rw_header, rw_header->size);
 
 	kvfree(rw_header);
 
 	if (IS_ERR(ptr)) {
-		bpf_prog_pack_free(ro_header);
+		execmem_free(ro_header);
 		return PTR_ERR(ptr);
 	}
 	return 0;
@@ -1124,18 +983,13 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
  *   1) when the program is freed after;
  *   2) when the JIT engine fails (before bpf_jit_binary_pack_finalize).
  * For case 2), we need to free both the RO memory and the RW buffer.
- *
- * bpf_jit_binary_pack_free requires proper ro_header->size. However,
- * bpf_jit_binary_pack_alloc does not set it. Therefore, ro_header->size
- * must be set with either bpf_jit_binary_pack_finalize (normal path) or
- * bpf_arch_text_copy (when jit fails).
  */
 void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
 			      struct bpf_binary_header *rw_header)
 {
-	u32 size = ro_header->size;
+	u32 size = rw_header ? rw_header->size : ro_header->size;
 
-	bpf_prog_pack_free(ro_header);
+	execmem_free(ro_header);
 	kvfree(rw_header);
 	bpf_jit_uncharge_modmem(size);
 }
@@ -1146,7 +1000,7 @@ bpf_jit_binary_pack_hdr(const struct bpf_prog *fp)
 	unsigned long real_start = (unsigned long)fp->bpf_func;
 	unsigned long addr;
 
-	addr = real_start & BPF_PROG_CHUNK_MASK;
+	addr = real_start & BPF_PROG_EXEC_MASK;
 	return (void *)addr;
 }
 
@@ -2736,16 +2590,6 @@ int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return -ENOTSUPP;
 }
 
-void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
-	return ERR_PTR(-ENOTSUPP);
-}
-
-int __weak bpf_arch_text_invalidate(void *dst, size_t len)
-{
-	return -ENOTSUPP;
-}
-
 DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 EXPORT_SYMBOL(bpf_stats_enabled_key);
 
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
index 04f0a045dcaa..c41dff3379db 100644
--- a/kernel/bpf/dispatcher.c
+++ b/kernel/bpf/dispatcher.c
@@ -126,11 +126,11 @@ static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
 	tmp = d->num_progs ? d->rw_image + noff : NULL;
 	if (new) {
 		/* Prepare the dispatcher in d->rw_image. Then use
-		 * bpf_arch_text_copy to update d->image, which is RO+X.
+		 * execmem_fill to update d->image, which is RO+X.
 		 */
 		if (bpf_dispatcher_prepare(d, new, tmp))
 			return;
-		if (IS_ERR(bpf_arch_text_copy(new, tmp, PAGE_SIZE / 2)))
+		if (IS_ERR(execmem_fill(new, tmp, PAGE_SIZE / 2)))
 			return;
 	}
 
@@ -152,15 +152,12 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
 
 	mutex_lock(&d->mutex);
 	if (!d->image) {
-		d->image = bpf_prog_pack_alloc(PAGE_SIZE, bpf_jit_fill_hole_with_zero);
+		d->image = execmem_alloc(PAGE_SIZE, PAGE_SIZE /* align */);
 		if (!d->image)
 			goto out;
 		d->rw_image = bpf_jit_alloc_exec(PAGE_SIZE);
 		if (!d->rw_image) {
-			u32 size = PAGE_SIZE;
-
-			bpf_arch_text_copy(d->image, &size, sizeof(size));
-			bpf_prog_pack_free((struct bpf_binary_header *)d->image);
+			execmem_free((struct bpf_binary_header *)d->image);
 			d->image = NULL;
 			goto out;
 		}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 5/6] vmalloc: introduce register_text_tail_vm()
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
                   ` (3 preceding siblings ...)
  2022-11-17 20:23 ` [PATCH bpf-next v4 4/6] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
@ 2022-11-17 20:23 ` Song Liu
  2022-11-17 20:23 ` [PATCH bpf-next v4 6/6] x86: use register_text_tail_vm Song Liu
  2022-11-21 20:12 ` [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Luis Chamberlain
  6 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

Allow arch code to register some memory to be used by execmem_alloc().
One possible use case is to allocate PMD pages for kernel text up to
PMD_ALIGN(_etext), and use the range (_etext, PMD_ALIGN(_etext)) for
execmem_alloc. Currently, only one such region is supported.
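
As a reference, the x86 wiring in patch 6/6 boils down to a call like:

  register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
                        PMD_ALIGN((unsigned long)_etext));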

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |  4 ++++
 mm/vmalloc.c            | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 30aa8c187d40..2babbe3031a5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -132,11 +132,15 @@ extern void vm_unmap_aliases(void);
 #ifdef CONFIG_MMU
 extern void __init vmalloc_init(void);
 extern unsigned long vmalloc_nr_pages(void);
+void register_text_tail_vm(unsigned long start, unsigned long end);
 #else
 static inline void vmalloc_init(void)
 {
 }
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
+void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+}
 #endif
 
 extern void *vmalloc(unsigned long size) __alloc_size(1);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 75ec8ab24929..31d04ae81cd9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -77,6 +77,9 @@ static const bool vmap_allow_huge = false;
 #endif
 #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
 
+static struct vm_struct text_tail_vm;
+static struct vmap_area text_tail_va;
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -655,6 +658,8 @@ int is_vmalloc_or_module_addr(const void *x)
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
 	if (addr >= MODULES_VADDR && addr < MODULES_END)
 		return 1;
+	if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
+		return 1;
 #endif
 	return is_vmalloc_addr(x);
 }
@@ -2439,6 +2444,35 @@ static void vmap_init_free_space(void)
 	}
 }
 
+/*
+ * register_text_tail_vm() allows arch code to register memory regions
+ * for execmem_alloc. Unlike regular memory regions used by execmem_alloc,
+ * this region is never freed by vfree_exec.
+ *
+ * One possible use case is to allocate PMD pages for kernel text up to
+ * PMD_ALIGN(_etext), and use (_etext, PMD_ALIGN(_etext)) for
+ * execmem_alloc.
+ */
+void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+	struct vmap_area *va;
+
+	/* only support one region */
+	if (WARN_ON_ONCE(text_tail_vm.addr))
+		return;
+
+	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+	if (WARN_ON_ONCE(!va))
+		return;
+	text_tail_vm.addr = (void *)start;
+	text_tail_vm.size = end - start;
+	text_tail_va.va_start = start;
+	text_tail_va.va_end = end;
+	text_tail_va.vm = &text_tail_vm;
+	memcpy(va, &text_tail_va, sizeof(*va));
+	insert_vmap_area_augment(va, NULL, &free_text_area_root, &free_text_area_list);
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 6/6] x86: use register_text_tail_vm
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
                   ` (4 preceding siblings ...)
  2022-11-17 20:23 ` [PATCH bpf-next v4 5/6] vmalloc: introduce register_text_tail_vm() Song Liu
@ 2022-11-17 20:23 ` Song Liu
  2022-11-21 20:12 ` [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Luis Chamberlain
  6 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-17 20:23 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof, Song Liu

Allocate 2MB pages up to round_up(_etext, 2MB), and register the memory
range [round_up(_etext, 4kB), round_up(_etext, 2MB)] with
register_text_tail_vm so that we can use this part of memory for dynamic
kernel text (BPF programs, etc.).

Here is an example:

[root@eth50-1 ~]# grep _etext /proc/kallsyms
ffffffff82202a08 T _etext

[root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]

[root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd

ffffffff82200000-ffffffff82400000 is a 2MB page, serving both kernel text
and bpf programs.

Also update Documentation/x86/x86_64/mm.rst to show execmem can be mapped
to kernel text addresses.

Signed-off-by: Song Liu <song@kernel.org>
---
 Documentation/x86/x86_64/mm.rst         | 4 ++--
 arch/x86/include/asm/pgtable_64_types.h | 1 +
 arch/x86/mm/init_64.c                   | 4 +++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst
index 9798676bb0bf..6ee95e5fa7e9 100644
--- a/Documentation/x86/x86_64/mm.rst
+++ b/Documentation/x86/x86_64/mm.rst
@@ -62,7 +62,7 @@ Complete virtual memory map with 4-level page tables
    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping and execmem, mapped to physical address 0
    ffffffff80000000 |-2048    MB |                  |         |
    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
    ffffffffff000000 |  -16    MB |                  |         |
@@ -121,7 +121,7 @@ Complete virtual memory map with 5-level page tables
    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping and execmem, mapped to physical address 0
    ffffffff80000000 |-2048    MB |                  |         |
    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
    ffffffffff000000 |  -16    MB |                  |         |
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04f36063ad54..c0f9cceb109a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -101,6 +101,7 @@ extern unsigned int ptrs_per_p4d;
 #define PUD_MASK	(~(PUD_SIZE - 1))
 #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
+#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
 
 /*
  * See Documentation/x86/x86_64/mm.rst for a description of the memory map.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..5b42fc0c6099 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1373,7 +1373,7 @@ void mark_rodata_ro(void)
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
-	unsigned long text_end = PFN_ALIGN(_etext);
+	unsigned long text_end = PMD_ALIGN(_etext);
 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
 	unsigned long all_end;
 
@@ -1414,6 +1414,8 @@ void mark_rodata_ro(void)
 				(void *)rodata_end, (void *)_sdata);
 
 	debug_checkwx();
+	register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
+			      PMD_ALIGN((unsigned long)_etext));
 }
 
 int kern_addr_valid(unsigned long addr)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-17 20:23 ` [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
@ 2022-11-21 15:52   ` Daniel Borkmann
  2022-11-21 15:55     ` Christoph Hellwig
  2022-11-28 17:53     ` Song Liu
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel Borkmann @ 2022-11-21 15:52 UTC (permalink / raw)
  To: Song Liu, bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof

On 11/17/22 9:23 PM, Song Liu wrote:
> execmem_alloc is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].
[...]

Looks like the series needs a rebase, meanwhile just a few nits:

> diff --git a/mm/nommu.c b/mm/nommu.c
> index 214c70e1d059..e3039fd4f65b 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
>   }
>   EXPORT_SYMBOL(vm_map_pages_zero);
>   
> +void *execmem_alloc(unsigned long size, unsigned long align)
> +{
> +	return NULL;
> +}
> +
> +void *execmem_fill(void *dst, void *src, size_t len)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}

Don't they need EXPORT_SYMBOL_GPL, too?

> +
> +void execmem_free(const void *addr) { }
> +
>   /*
>    *  sys_brk() for the most part doesn't need the global kernel
>    *  lock, except when an application is doing something nasty
[...]
> +
> +/**
> + * execmem_alloc - allocate virtually contiguous RO+X memory
> + * @size:    allocation size

nit: @align missing in kdoc

> + *
> + * This is used to allocate dynamic kernel text, such as module text, BPF
> + * programs, etc. User need to use text_poke to update the memory allocated
> + * by execmem_alloc.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *execmem_alloc(unsigned long size, unsigned long align)
> +{
> +	struct vmap_area *va, *tmp;
> +	unsigned long addr;
> +	enum fit_type type;
> +	int ret;
> +
> +	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> +	if (unlikely(!va))
> +		return NULL;
> +
> +again:
> +	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> +	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> +	if (!tmp) {
> +		unsigned long alloc_size;
> +		void *ptr;
> +
> +		spin_unlock(&free_text_area_lock);
> +
> +		/*
> +		 * Not enough continuous space in free_text_area_root, try
> +		 * allocate more memory. The memory is first added to
> +		 * vmap_area_root, and then moved to free_text_area_root.
> +		 */
> +		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> +		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, EXEC_MEM_START,
> +					   EXEC_MEM_END, GFP_KERNEL, PAGE_KERNEL,
> +					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> +					   NUMA_NO_NODE, __builtin_return_address(0));
> +		if (unlikely(!ptr))
> +			goto err_out;
> +
> +		move_vmap_to_free_text_tree(ptr);
> +		goto again;
> +	}
> +
> +	addr = roundup(tmp->va_start, align);
> +	type = classify_va_fit_type(tmp, addr, size);
> +	if (WARN_ON_ONCE(type == NOTHING_FIT))
> +		goto err_out_unlock;

Isn't this already covered in adjust_va_to_fit_type()?

> +	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
> +				    tmp, addr, size);

nit:

	spin_unlock(&free_text_area_lock);
	if (ret)
		goto err_out;

> +	if (ret)
> +		goto err_out_unlock;
> +
> +	spin_unlock(&free_text_area_lock);
> +
> +	va->va_start = addr;
> +	va->va_end = addr + size;
> +	va->vm = NULL;
> +
> +	spin_lock(&vmap_area_lock);
> +	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> +	spin_unlock(&vmap_area_lock);
> +
> +	return (void *)addr;
> +
> +err_out_unlock:
> +	spin_unlock(&free_text_area_lock);
> +err_out:
> +	kmem_cache_free(vmap_area_cachep, va);
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(execmem_alloc);
> +
> +void __weak *arch_fill_execmem(void *dst, void *src, size_t len)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +int __weak arch_invalidate_execmem(void *ptr, size_t len)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +/**
> + * execmem_fill - Copy text to RO+X memory allocated by execmem_alloc()
> + * @dst:  pointer to memory allocated by execmem_alloc()
> + * @src:  pointer to data being copied from
> + * @len:  number of bytes to be copied
> + *
> + * execmem_fill() will only update memory allocated by a single execmem_alloc()
> + * call. If dst + len goes beyond the boundary of one allocation,
> + * execmem_fill() is aborted.
> + *
> + * Return: @dst on success, or an ERR_PTR() on failure.
> + */
> +void *execmem_fill(void *dst, void *src, size_t len)
> +{
> +	struct vmap_area *va;
> +
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
> +
> +	/*
> +	 * If no va, or va has a vm attached, this memory is not allocated
> +	 * by execmem_alloc().
> +	 */
> +	if (WARN_ON_ONCE(!va))
> +		goto err_out;
> +	if (WARN_ON_ONCE(va->vm))
> +		goto err_out;
> +
> +	/* Disallow filling across execmem_alloc boundaries */
> +	if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
> +		goto err_out;
> +
> +	spin_unlock(&vmap_area_lock);
> +
> +	return arch_fill_execmem(dst, src, len);
> +
> +err_out:
> +	spin_unlock(&vmap_area_lock);
> +	return ERR_PTR(-EINVAL);
> +}
> +EXPORT_SYMBOL_GPL(execmem_fill);
> +
> +static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
> +{
> +	struct vm_struct *vm, *prev_vm;
> +
> +	lockdep_assert_held(&free_text_area_lock);
> +
> +	vm = all_text_vm;
> +	while (vm) {
> +		unsigned long vm_addr = (unsigned long)vm->addr;
> +
> +		/* vm is within this free space, we can free it */
> +		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
> +			goto unlink_vm;
> +		vm = vm->next;
> +	}
> +	return NULL;
> +
> +unlink_vm:
> +	if (all_text_vm == vm) {
> +		all_text_vm = vm->next;
> +	} else {
> +		prev_vm = all_text_vm;
> +		while (prev_vm->next != vm)
> +			prev_vm = prev_vm->next;
> +		prev_vm->next = vm->next;
> +	}
> +	return vm;
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-21 15:52   ` Daniel Borkmann
@ 2022-11-21 15:55     ` Christoph Hellwig
  2022-11-21 16:29       ` Song Liu
  2022-11-28 17:53     ` Song Liu
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-21 15:55 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, rppt, mcgrof

On Mon, Nov 21, 2022 at 04:52:24PM +0100, Daniel Borkmann wrote:
>> +void *execmem_fill(void *dst, void *src, size_t len)
>> +{
>> +	return ERR_PTR(-EOPNOTSUPP);
>> +}
>
> Don't they need EXPORT_SYMBOL_GPL, too?

None of these should be exported.  Modular code has absolutely no
business creating executable mappings.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-21 15:55     ` Christoph Hellwig
@ 2022-11-21 16:29       ` Song Liu
  2022-11-21 19:55         ` Luis Chamberlain
  0 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-11-21 16:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Borkmann, bpf, linux-mm, akpm, x86, peterz,
	rick.p.edgecombe, rppt, mcgrof

On Mon, Nov 21, 2022 at 8:55 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Nov 21, 2022 at 04:52:24PM +0100, Daniel Borkmann wrote:
> >> +void *execmem_fill(void *dst, void *src, size_t len)
> >> +{
> >> +    return ERR_PTR(-EOPNOTSUPP);
> >> +}
> >
> > Don't they need EXPORT_SYMBOL_GPL, too?
>
> None of these should be exported.  Modular code has absolutely no
> business creating executable mappings.

I added these exports for test_vmalloc.ko. Is there a way to only export
them to test_vmalloc.ko but nothing else?

Thanks,
Song

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-21 16:29       ` Song Liu
@ 2022-11-21 19:55         ` Luis Chamberlain
  2022-11-22  2:55           ` Song Liu
  2022-11-22  6:13           ` Christoph Hellwig
  0 siblings, 2 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-11-21 19:55 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Daniel Borkmann, bpf, linux-mm, akpm, x86,
	peterz, rick.p.edgecombe, rppt

On Mon, Nov 21, 2022 at 09:29:20AM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 8:55 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Mon, Nov 21, 2022 at 04:52:24PM +0100, Daniel Borkmann wrote:
> > >> +void *execmem_fill(void *dst, void *src, size_t len)
> > >> +{
> > >> +    return ERR_PTR(-EOPNOTSUPP);
> > >> +}
> > >
> > > Don't they need EXPORT_SYMBOL_GPL, too?
> >
> > None of these should be exported.  Modular code has absolutely no
> > business creating executable mappings.
> 
> I added these exports for test_vmalloc.ko. Is there a way to only export
> them to test_vmalloc.ko but nothing else?

See EXPORT_SYMBOL_NS_GPL()
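
A minimal sketch of how that could look; the namespace name here is
made up for illustration:

	/* mm/vmalloc.c: tag the exports with a symbol namespace */
	EXPORT_SYMBOL_NS_GPL(execmem_alloc, EXECMEM_SELFTEST);
	EXPORT_SYMBOL_NS_GPL(execmem_fill, EXECMEM_SELFTEST);
	EXPORT_SYMBOL_NS_GPL(execmem_free, EXECMEM_SELFTEST);

	/* lib/test_vmalloc.c: modules must declare the import to
	 * link against the namespaced symbols */
	MODULE_IMPORT_NS(EXECMEM_SELFTEST);

Note this only adds an explicit import step; any module can still add
the MODULE_IMPORT_NS() line, so it documents intent rather than
enforcing a hard restriction.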

  Luis


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
                   ` (5 preceding siblings ...)
  2022-11-17 20:23 ` [PATCH bpf-next v4 6/6] x86: use register_text_tail_vm Song Liu
@ 2022-11-21 20:12 ` Luis Chamberlain
  2022-11-21 20:20   ` Luis Chamberlain
                     ` (2 more replies)
  6 siblings, 3 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-11-21 20:12 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe, rppt,
	willy, dave, a.manzanares

On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
> 
> 1. Direct map fragmentation
> 
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> and slower page table, and pressure for both instruction and data TLB.
> 
> Our previous work in bpf_prog_pack tries to address this issue from BPF
> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> greatly reduced direct map fragmentation from BPF programs.

This value is clear, but I'd like to see at least one other new user,
with the respective commit log showing the gains as Aaron Lu did.

> 2. iTLB pressure from BPF program
> 
> Dynamic kernel text such as modules and BPF programs (even with current
> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> BPF program is big, we can see visible performance drop caused by high
> iTLB miss rate.

As suggested by Mike Rapoport, "benchmarking iTLB performance on an idle
system is not very representative. TLB is a scarce resource, so it'd be
interesting to see this benchmark on a loaded system."

This would also help pave the way to measure this for more possible
future callers like modules. Therein lies the true value of this
consideration.

Also, you mention your perf stats are run on a VM. I am curious what
you need to get TLB events properly measured on the VM, and whether
that data is really reliable vs. bare metal. I haven't yet been
successful at getting perf stat for TLB to work on a VM, and based on
what I've read I have been cautious about the results.

So curious if you'd see something different on bare metal.

[0] https://lkml.kernel.org/r/Y3YA2mRZDJkB4lmP@kernel.org

> 3. TLB shootdown for short-living BPF programs
> 
> Before bpf_prog_pack loading and unloading BPF programs requires global
> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> TLB flush.

If this is all done on the bpf code replacement, then the commit log
should clarify that, so that future users are not surprised if they
don't see these gains, as this is specific to the way the bpf code used
bpf_prog_pack. Also, you can measure the shootdowns and show the
differences with perf stat tlb:tlb_flush.

> 4. Reduce memory usage by BPF programs (in some cases)
> 
> Most BPF programs and various trampolines are small, and they often
> occupies a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 byte in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.
> 
> 5. Introduce a unified API to allocate memory with special permissions.
> 
> This will help get rid of set_vm_flush_reset_perms calls from users of
> vmalloc, module_alloc, etc.

And *this* is one of the reasons I'm so eager to see a proper solution
drawn up. This would be a huge win for modules; however, since some of
the complexity of special permissions with modules lies in all the
cross-architecture hanky panky, I'd prefer to see this merged *iff* we
have modules converted as well, as that would give us a clearer picture
of whether the solution covers the bases. And we'd get proper testing
on this, rather than it being a special thing for BPF.

> Based on our experiments [5], we measured ~0.6% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.

I'd prefer we leave out arbitrary performance data, as it does not help much.

> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> 
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.

Let's see that through; then I think the series builds confidence in
the implementation.

  Luis


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-21 20:12 ` [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Luis Chamberlain
@ 2022-11-21 20:20   ` Luis Chamberlain
  2022-11-22  2:36     ` Song Liu
  2022-11-22  2:28   ` Song Liu
  2022-11-22  2:55   ` Song Liu
  2 siblings, 1 reply; 25+ messages in thread
From: Luis Chamberlain @ 2022-11-21 20:20 UTC (permalink / raw)
  To: Song Liu, Linus Torvalds
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe, rppt,
	willy, dave, a.manzanares

On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > Based on our experiments [5], we measured ~0.6% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> 
> I'd prefer we leave out arbitrary performance data, as it does not help much.

I'd like to clarify what I mean by this, as Linus has suggested before, you are
missing the opportunity to present "actual numbers on real loads. (not
some artificial benchmark)"

[0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com

  Luis


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-21 20:12 ` [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Luis Chamberlain
  2022-11-21 20:20   ` Luis Chamberlain
@ 2022-11-22  2:28   ` Song Liu
  2022-11-23  0:21     ` Luis Chamberlain
  2022-11-30  9:41     ` Mike Rapoport
  2022-11-22  2:55   ` Song Liu
  2 siblings, 2 replies; 25+ messages in thread
From: Song Liu @ 2022-11-22  2:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe, rppt,
	willy, dave, a.manzanares

On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> >
> > 1. Direct map fragmentation
> >
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> > and slower page table, and pressure for both instruction and data TLB.
> >
> > Our previous work in bpf_prog_pack tries to address this issue from BPF
> > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> > greatly reduced direct map fragmentation from BPF programs.
>
> This value is clear, but I'd like to see at least another new user and
> the respective commit log show the gains as Aaron Lu showed.
>
> > 2. iTLB pressure from BPF program
> >
> > Dynamic kernel text such as modules and BPF programs (even with current
> > bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> > BPF program is big, we can see visible performance drop caused by high
> > iTLB miss rate.
>
> As suggested by Mike Rapoport, "benchmarking iTLB performance on an idle
> system is not very representative. TLB is a scarce resource, so it'd be
> interesting to see this benchmark on a loaded system."
>
> This would also help pave the way to measure this for more possible
> future callers like modules. There in lies true value to this
> consideration.
>
> Also, you mention your perf stats are run on a VM, I am curious what
> things you need to get TLB to be properly measured on the VM and if
> this is really reliable data Vs bare metal. I haven't yet been sucessful
> on getting perf stat for TBL to work on a VM and based on what I've read
> have been catious about the results.

To make these perf counters work in a VM, we need a newer host kernel
(my system is running a 5.6-based kernel, but I am not sure what the
minimum required version is). Then we need to run qemu with the option
"-cpu host" (both host and guest are x86_64).

>
> So curious if you'd see something different on bare metal.

Once all of the above is worked out, the VM runs the same as bare
metal from the perf counters' point of view.

>
> [0] https://lkml.kernel.org/r/Y3YA2mRZDJkB4lmP@kernel.org
>
> > 3. TLB shootdown for short-living BPF programs
> >
> > Before bpf_prog_pack loading and unloading BPF programs requires global
> > TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> > TLB flush.
>
> If this is all done on the bpf code replacement then the commit log
> should clarify that in the commit log, as then it allows future users
> to not be surprised if they don't see these gains as this is specific
> to the way bpf code used bpf_prog_pag. Also, you can measure the
> shootdowns and show the differences with perf stat tlb:tlb_flush.
>
> > 4. Reduce memory usage by BPF programs (in some cases)
> >
> > Most BPF programs and various trampolines are small, and they often
> > occupies a whole page. From a random server in our fleet, 50% of the
> > loaded BPF programs are less than 500 byte in size, and 75% of them are
> > less than 2kB in size. Allowing these BPF programs to share 2MB pages
> > would yield some memory saving for systems with many BPF programs. For
> > systems with only small number of BPF programs, this patch may waste a
> > little memory by allocating one 2MB page, but using only part of it.
> >
> > 5. Introduce a unified API to allocate memory with special permissions.
> >
> > This will help get rid of set_vm_flush_reset_perms calls from users of
> > vmalloc, module_alloc, etc.
>
> And *this* is one of the reasons I'm so eager to see a proper solution
> drawn up. This would be a huge win for modules, however since some of
> the complexities in special permissions with modules lies in all the
> cross architecture hanky panky, I'd prefer to see this through merged
> *iff* we have modules converted as well as it would give us a clearer
> picture if the solution covers the bases. And we'd get proper testing
> on this. Rather than it being a special thing for BPF.
>
> > Based on our experiments [5], we measured ~0.6% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
>
> I'd prefer we leave out arbitrary performance data, as it does not help much.

This really bothers me. With a real workload, we are talking about a
performance difference of ~1%. I don't think there is any open source
benchmark that can show this level of performance difference. In our
case, we used an A/B test with 80 hosts (40 vs. 40), running for many
hours, to confidently show a 1% performance difference.

This exact benchmark has a very good record of reporting smallish
performance regressions. For example, this commit

  commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when
  ftrace flips RW on kernel text")

fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
text. The bug stayed in the kernel for almost a year. None of the
available open source benchmarks had caught it before this specific
benchmark did.

We have used this benchmark to demonstrate the performance benefits of
many optimizations. I don't understand why it has suddenly become
"arbitrary performance data".

Song

>
> > The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> >
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace, kprobe.
>
> Let's see that through, then I think the series builds confidence in
> implementation.


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-21 20:20   ` Luis Chamberlain
@ 2022-11-22  2:36     ` Song Liu
  2022-12-08  2:48       ` Luis Chamberlain
  0 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-11-22  2:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linus Torvalds, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, rppt, willy, dave, a.manzanares

On Mon, Nov 21, 2022 at 1:20 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> >
> > I'd prefer we leave out arbitrary performance data, as it does not help much.
>
I'd like to clarify what I mean by this: as Linus has suggested before, you are
missing the opportunity to present "actual numbers on real loads. (not
some artificial benchmark)"
>
> [0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com

Unless I made some serious mistakes, Linus' concern with performance was
addressed by the exact performance results above. [5]

Thanks,
Song

[5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-21 20:12 ` [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Luis Chamberlain
  2022-11-21 20:20   ` Luis Chamberlain
  2022-11-22  2:28   ` Song Liu
@ 2022-11-22  2:55   ` Song Liu
  2 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-22  2:55 UTC (permalink / raw)
  To: Luis Chamberlain, akpm, Linus Torvalds
  Cc: bpf, linux-mm, x86, peterz, hch, rick.p.edgecombe, rppt, willy,
	dave, a.manzanares

On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:

[...]

> > 5. Introduce a unified API to allocate memory with special permissions.
> >
> > This will help get rid of set_vm_flush_reset_perms calls from users of
> > vmalloc, module_alloc, etc.
>
> And *this* is one of the reasons I'm so eager to see a proper solution
> drawn up. This would be a huge win for modules, however since some of
> the complexities in special permissions with modules lies in all the
> cross architecture hanky panky, I'd prefer to see this through merged
> *iff* we have modules converted as well as it would give us a clearer
> picture if the solution covers the bases. And we'd get proper testing
> on this. Rather than it being a special thing for BPF.

Module code is clearly the most difficult to migrate. (It has to work on
almost all archs, and it contains 3 allocations: core, data, init.) If we
want an actionable path towards fixing all of these, I don't think we
should use module code as the bar for the very first set. (Of course, if
Andrew or Linus insists on that, I will rethink this.)

PS: I don't quite understand why there is such strong concern about
adding this to the core mm API, especially given the argument that
this is only for BPF.

IIUC, the real concern is with a core API that
   1. is easy to use and has many users;
   2. has a horrible internal implementation (maybe bpf_prog_pack
      falls into this category, but it is not easy to use).

Such an API would cause a lot of problems, and it would also be hard to
remove. The execmem_* APIs are quite the opposite: they are hard to use,
and they have a decent internal implementation (at least better than
bpf_prog_pack).

In patch 4/6 of the set, we easily reverted all the bpf_prog_pack code
and used execmem_* instead. If execmem_* turns out to be
horrible and only useful for BPF, we can easily migrate it to the
next good API, right?

Thanks,
Song


* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-21 19:55         ` Luis Chamberlain
@ 2022-11-22  2:55           ` Song Liu
  2022-11-22  6:13           ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-22  2:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Daniel Borkmann, bpf, linux-mm, akpm, x86,
	peterz, rick.p.edgecombe, rppt

On Mon, Nov 21, 2022 at 12:55 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 09:29:20AM -0700, Song Liu wrote:
> > On Mon, Nov 21, 2022 at 8:55 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Mon, Nov 21, 2022 at 04:52:24PM +0100, Daniel Borkmann wrote:
> > > >> +void *execmem_fill(void *dst, void *src, size_t len)
> > > >> +{
> > > >> +    return ERR_PTR(-EOPNOTSUPP);
> > > >> +}
> > > >
> > > > Don't they need EXPORT_SYMBOL_GPL, too?
> > >
> > > None of these should be exported.  Modular code has absolutely no
> > > business creating executable mappings.
> >
> > I added these exports for test_vmalloc.ko. Is there a way to only export
> > them to test_vmalloc.ko but nothing else?
>
> See EXPORT_SYMBOL_NS_GPL()

Thanks! I will take a look at this later (vacation this week).

Song


* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-21 19:55         ` Luis Chamberlain
  2022-11-22  2:55           ` Song Liu
@ 2022-11-22  6:13           ` Christoph Hellwig
  2022-11-22 17:25             ` Song Liu
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2022-11-22  6:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, Christoph Hellwig, Daniel Borkmann, bpf, linux-mm,
	akpm, x86, peterz, rick.p.edgecombe, rppt

On Mon, Nov 21, 2022 at 11:55:14AM -0800, Luis Chamberlain wrote:
> > I added these exports for test_vmalloc.ko. Is there a way to only export
> > them to test_vmalloc.ko but nothing else?
> 
> See EXPORT_SYMBOL_NS_GPL()

No, that in no way limits who uses it; it just makes them go
through extra hoops.

The functionality to allocate executable memory is highly dangerous
and absolutely must be limited to built-in code.

So the tests should just be forced to be built-in here as well.


* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-22  6:13           ` Christoph Hellwig
@ 2022-11-22 17:25             ` Song Liu
  0 siblings, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-22 17:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Daniel Borkmann, bpf, linux-mm, akpm, x86,
	peterz, rick.p.edgecombe, rppt

On Mon, Nov 21, 2022 at 11:13 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Nov 21, 2022 at 11:55:14AM -0800, Luis Chamberlain wrote:
> > > I added these exports for test_vmalloc.ko. Is there a way to only export
> > > them to test_vmalloc.ko but nothing else?
> >
> > See EXPORT_SYMBOL_NS_GPL()
>
> No, that is in no way limiting who uses it, it just makes them go
> through extra hoops.
>
> The funtionality to allocate exectuable memory is highly dangerous
> and absolutely must be limited to built-in code.
>
> So the tests should just be forced to be built-in here as well.

I guess we can use some debug macro similar to
DEBUG_AUGMENT_LOWEST_MATCH_CHECK to gate
test_vmalloc.ko?
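
Something along these lines, as a rough sketch (the knob name is made
up, and defaulting it to 0 keeps the exports out of normal builds):

	/* mm/vmalloc.c, next to DEBUG_AUGMENT_LOWEST_MATCH_CHECK */
	#define DEBUG_EXECMEM_TEST_EXPORTS	0	/* hypothetical knob */

	#if DEBUG_EXECMEM_TEST_EXPORTS
	EXPORT_SYMBOL_GPL(execmem_alloc);
	EXPORT_SYMBOL_GPL(execmem_fill);
	EXPORT_SYMBOL_GPL(execmem_free);
	#endif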

Otherwise, we can just drop the changes to test_vmalloc.c.

Thanks,
Song


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-22  2:28   ` Song Liu
@ 2022-11-23  0:21     ` Luis Chamberlain
  2022-11-23  5:06       ` Song Liu
  2022-11-30  9:41     ` Mike Rapoport
  1 sibling, 1 reply; 25+ messages in thread
From: Luis Chamberlain @ 2022-11-23  0:21 UTC (permalink / raw)
  To: Song Liu, Mel Gorman, Michal Hocko, Aaron Lu, Linus Torvalds, tglx
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe, rppt,
	willy, dave, a.manzanares, Javier González, anton,
	colin.i.king

On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > This patchset tries to address the following issues:
> > >
> > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> >
> > I'd prefer we leave out arbitrary performance data, as it does not help much.
> 
> This really bothers me. With real workload, we are talking about performance
> difference of ~1%. I don't think there is any open source benchmark that can
> show this level of performance difference.

I *highly* doubt that.

> In our case, we used A/B test with 80 hosts (40 vs. 40) and runs for
> many hours to confidently show 1% performance difference. This exact
> benchmark has a very good record of reporting smallish performance
> regression.

As per wikipedia, "A/B tests are useful for understanding user
engagement and satisfaction of online features like a new feature or
product". Let us disregards what is going on with user experience and
consider evaluating the performance instead of what goes on behind the
scenes.

> For example, this commit
> 
>   commit 7af0145067bc ("x86/mm/cpa: Avoid the 4k pages check completely")
> 
> fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> text. The bug stayed in the kernel for almost a year. None of all the available
> open source benchmark had caught it before this specific benchmark.

That doesn't mean enterprise-level testing would not have caught it, and
enterprise kernels are based on ancient kernels, so they would not catch
up that fast. RHEL uses even more ancient kernels than SUSE, so let's
consider where SUSE was during this regression. The fix you mentioned,
commit 7af0145067bc, went upstream in v5.3-rc7~4^2, and that was in
August 2019. The bug was introduced through commit 585948f4f695
("x86/mm/cpa: Avoid the 4k pages check completely"), and that was in
v4.20-rc1~159^2~41 around September 2018. Around September 2018, when
the regression was committed, the most bleeding-edge Enterprise Linux
kernel in the industry was the one in SLE15, i.e. v4.12, so there is no
way in hell the performance team at SUSE, for instance, would have even
come close to evaluating code with that regression. In fact, they
wouldn't come across it in testing until SLE15-SP2 on the v5.3 kernel,
but by then the regression would have been fixed.

Yes, 0-day does *some* performance testing, but it does not do any
justice to the monumental effort that goes into performance testing at
Enterprise Linux distributions. The gap that leaves should perhaps be
solved in the community long term; however, that's a separate problem.

But to suggest that there is *nothing* like what you have is probably
pretty inaccurate.

> We have used this benchmark to demonstrate performance benefits of many
> optimizations. I don't understand why it suddenly becomes "arbitrary
> performance data".

It's because typically you'd want a benchmark that others can reproduce
results with, and some "A/B testing" reference really doesn't help
future developers who are evaluating performance regressions, or who
want to provide critical feedback to you on things you may have
overlooked when selling a generic performance improvement into the
kernel.

  Luis


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-23  0:21     ` Luis Chamberlain
@ 2022-11-23  5:06       ` Song Liu
  2022-11-30  9:53         ` Mike Rapoport
  0 siblings, 1 reply; 25+ messages in thread
From: Song Liu @ 2022-11-23  5:06 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Mel Gorman, Michal Hocko, Aaron Lu, Linus Torvalds, tglx, bpf,
	linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe, rppt, willy,
	dave, a.manzanares, Javier González, anton, colin.i.king

On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > >
[...]
> > fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> > text. The bug stayed in the kernel for almost a year. None of all the available
> > open source benchmark had caught it before this specific benchmark.
>
> That doesn't mean enterpise level testing would not have caught it, and
> enteprise kernels run on ancient kernels so they would not catch up that
> fast. RHEL uses even more ancient kernels than SUSE so let's consider
> where SUSE was during this regression. The commit you mentioned the fix
> 7af0145067bc went upstream on v5.3-rc7~4^2, and that was in August 2019.
> The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
> the 4k pages check completely") and that was on v4.20-rc1~159^2~41
> around September 2018. Around September 2018, the time the regression was
> committed, the most bleeding edge Enterprise Linux kernel in the industry was
> that on SLE15 and so v4.12 and so there is no way in hell the performance
> team at SUSE for instance would have even come close to evaluating code with
> that regression. In fact, they wouldn't come accross it in testing until
> SLE15-SP2 on the v5.3 kernel but by then the regression would have been fixed.

Can you refer me to one enterprise performance report with an open
source benchmark that shows a ~1% performance regression? If one is
available, I am more than happy to try it out. Note that we need some
BPF programs to show the benefit of this set. On most production hosts,
network-related BPF programs are the busiest. Therefore, single-host
benchmarks will not show the benefit.

Thanks,
Song

PS: The data in [1] is full of noise:

"""
2. For each benchmark/system combination, the 1G mapping had the highest
performance for 45% of the tests, 2M for ~30%, and 4k for~20%.

3. From the average delta, among 1G/2M/4K, 4K gets the lowest
performance in all the 4 test machines, while 1G gets the best
performance on 2 test machines and 2M gets the best performance on the
other 2 machines.
"""

There is no way we can get a consistent result of a 1% performance
improvement from experiments like those.


[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/


* Re: [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-21 15:52   ` Daniel Borkmann
  2022-11-21 15:55     ` Christoph Hellwig
@ 2022-11-28 17:53     ` Song Liu
  1 sibling, 0 replies; 25+ messages in thread
From: Song Liu @ 2022-11-28 17:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe, rppt, mcgrof

On Mon, Nov 21, 2022 at 7:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:

[...]

> > + */
> > +void *execmem_alloc(unsigned long size, unsigned long align)
> > +{
> > +     struct vmap_area *va, *tmp;
> > +     unsigned long addr;
> > +     enum fit_type type;
> > +     int ret;
> > +
> > +     va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> > +     if (unlikely(!va))
> > +             return NULL;
> > +
> > +again:
> > +     preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> > +     tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> > +
> > +     if (!tmp) {
> > +             unsigned long alloc_size;
> > +             void *ptr;
> > +
> > +             spin_unlock(&free_text_area_lock);
> > +
> > +             /*
> > +              * Not enough continuous space in free_text_area_root, try
> > +              * allocate more memory. The memory is first added to
> > +              * vmap_area_root, and then moved to free_text_area_root.
> > +              */
> > +             alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> > +             ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, EXEC_MEM_START,
> > +                                        EXEC_MEM_END, GFP_KERNEL, PAGE_KERNEL,
> > +                                        VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> > +                                        NUMA_NO_NODE, __builtin_return_address(0));
> > +             if (unlikely(!ptr))
> > +                     goto err_out;
> > +
> > +             move_vmap_to_free_text_tree(ptr);
> > +             goto again;
> > +     }
> > +
> > +     addr = roundup(tmp->va_start, align);
> > +     type = classify_va_fit_type(tmp, addr, size);
> > +     if (WARN_ON_ONCE(type == NOTHING_FIT))
> > +             goto err_out_unlock;
>
> Isn't this already covered in adjust_va_to_fit_type()?

That's right! Now we can get rid of err_out_unlock. Thanks!

Also fixed other nits.

Song

[...]


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-22  2:28   ` Song Liu
  2022-11-23  0:21     ` Luis Chamberlain
@ 2022-11-30  9:41     ` Mike Rapoport
  1 sibling, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2022-11-30  9:41 UTC (permalink / raw)
  To: Song Liu
  Cc: Luis Chamberlain, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, willy, dave, a.manzanares

On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > Also, you mention your perf stats are run on a VM, I am curious what
> > things you need to get TLB to be properly measured on the VM and if
> > this is really reliable data Vs bare metal. I haven't yet been sucessful
> > on getting perf stat for TBL to work on a VM and based on what I've read
> > have been catious about the results.
> 
> To make these perf counters work on VM, we need a newer host kernel
> (my system is running 5.6 based kernel, but I am not sure what is the
> minimum required version). Then we need to run qemu with option
> "-cpu host" (both host and guest are x86_64).
> 
> >
> > So curious if you'd see something different on bare metal.
> 
> Once the above all worked out, VM runs the same as bare metal from
> perf counter's point of view.

TLBs behave differently because of EPT.

-- 
Sincerely yours,
Mike.


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-23  5:06       ` Song Liu
@ 2022-11-30  9:53         ` Mike Rapoport
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2022-11-30  9:53 UTC (permalink / raw)
  To: Song Liu
  Cc: Luis Chamberlain, Mel Gorman, Michal Hocko, Aaron Lu,
	Linus Torvalds, tglx, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, willy, dave, a.manzanares,
	Javier González, anton, colin.i.king

On Tue, Nov 22, 2022 at 10:06:06PM -0700, Song Liu wrote:
> On Tue, Nov 22, 2022 at 5:21 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 07:28:36PM -0700, Song Liu wrote:
> > > On Mon, Nov 21, 2022 at 1:12 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > >
> [...]
> > > fixes a bug that splits the page table (from 2MB to 4kB) for the WHOLE kernel
> > > text. The bug stayed in the kernel for almost a year. None of all the available
> > > open source benchmark had caught it before this specific benchmark.
> >
> > That doesn't mean enterpise level testing would not have caught it, and
> > enteprise kernels run on ancient kernels so they would not catch up that
> > fast. RHEL uses even more ancient kernels than SUSE so let's consider
> > where SUSE was during this regression. The commit you mentioned the fix
> > 7af0145067bc went upstream on v5.3-rc7~4^2, and that was in August 2019.
> > The bug was introduced through commit 585948f4f695 ("x86/mm/cpa: Avoid
> > the 4k pages check completely") and that was on v4.20-rc1~159^2~41
> > around September 2018. Around September 2018, the time the regression was
> > committed, the most bleeding edge Enterprise Linux kernel in the industry was
> > that on SLE15 and so v4.12 and so there is no way in hell the performance
> > team at SUSE for instance would have even come close to evaluating code with
> > that regression. In fact, they wouldn't come accross it in testing until
> > SLE15-SP2 on the v5.3 kernel but by then the regression would have been fixed.
> 
> Can you refer me to one enterprise performance report with open source
> benchmark that shows ~1% performance regression? If it is available, I am
> more than happy to try it out. Note that, we need some BPF programs to show
> the benefit of this set. In most production hosts, network related BPF programs
> are the busiest. Therefore, single host benchmarks will not show the benefit.
> 
> Thanks,
> Song
> 
> PS: Data in [1] if full of noise:
> 
> """
> 2. For each benchmark/system combination, the 1G mapping had the highest
> performance for 45% of the tests, 2M for ~30%, and 4k for~20%.
> 
> 3. From the average delta, among 1G/2M/4K, 4K gets the lowest
> performance in all the 4 test machines, while 1G gets the best
> performance on 2 test machines and 2M gets the best performance on the
> other 2 machines.
> """

I don't think it's noise. IMO, this means that performance degradation
caused by the fragmentation of the direct map highly depends on workload
and microarchitecture.
 
> There is no way we can get consistent result of 1% performance improvement
> from experiments like those.

Experiments like those show how a change in kernel behaviour affects
different workloads, not just a single benchmark. Having a performance
improvement in a single benchmark does not necessarily mean other
benchmarks won't regress.
 
> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

-- 
Sincerely yours,
Mike.


* Re: [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs
  2022-11-22  2:36     ` Song Liu
@ 2022-12-08  2:48       ` Luis Chamberlain
  0 siblings, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2022-12-08  2:48 UTC (permalink / raw)
  To: Song Liu
  Cc: Linus Torvalds, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, rppt, willy, dave, a.manzanares

On Mon, Nov 21, 2022 at 07:36:21PM -0700, Song Liu wrote:
> On Mon, Nov 21, 2022 at 1:20 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 21, 2022 at 12:12:49PM -0800, Luis Chamberlain wrote:
> > > On Thu, Nov 17, 2022 at 12:23:16PM -0800, Song Liu wrote:
> > > > Based on our experiments [5], we measured ~0.6% performance improvement
> > > > from bpf_prog_pack. This patchset further boosts the improvement to ~0.8%.
> > >
> > > I'd prefer we leave out arbitrary performance data, as it does not help much.
> >
> > I'd like to clarify what I mean by this, as Linus has suggested before, you are
> > missing the opportunity to present "actual numbers on real loads. (not
> > some artificial benchmark)"
> >
> > [0] https://lkml.kernel.org/r/CAHk-=wiF1KnM1_paB3jCONR9Mh1D_RCsnXKBau1K7XLG-mwwTQ@mail.gmail.com
> 
> Unless I made some serious mistakes, Linus' concern with performance was
> addressed by the exact performance results above. [5]
> 
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

AFAICT no, that is all still artificial.

  Luis



Thread overview: 25+ messages
2022-11-17 20:23 [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Song Liu
2022-11-17 20:23 ` [PATCH bpf-next v4 1/6] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
2022-11-21 15:52   ` Daniel Borkmann
2022-11-21 15:55     ` Christoph Hellwig
2022-11-21 16:29       ` Song Liu
2022-11-21 19:55         ` Luis Chamberlain
2022-11-22  2:55           ` Song Liu
2022-11-22  6:13           ` Christoph Hellwig
2022-11-22 17:25             ` Song Liu
2022-11-28 17:53     ` Song Liu
2022-11-17 20:23 ` [PATCH bpf-next v4 2/6] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
2022-11-17 20:23 ` [PATCH bpf-next v4 3/6] selftests/vm: extend test_vmalloc to test execmem_* APIs Song Liu
2022-11-17 20:23 ` [PATCH bpf-next v4 4/6] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
2022-11-17 20:23 ` [PATCH bpf-next v4 5/6] vmalloc: introduce register_text_tail_vm() Song Liu
2022-11-17 20:23 ` [PATCH bpf-next v4 6/6] x86: use register_text_tail_vm Song Liu
2022-11-21 20:12 ` [PATCH bpf-next v4 0/6] execmem_alloc for BPF programs Luis Chamberlain
2022-11-21 20:20   ` Luis Chamberlain
2022-11-22  2:36     ` Song Liu
2022-12-08  2:48       ` Luis Chamberlain
2022-11-22  2:28   ` Song Liu
2022-11-23  0:21     ` Luis Chamberlain
2022-11-23  5:06       ` Song Liu
2022-11-30  9:53         ` Mike Rapoport
2022-11-30  9:41     ` Mike Rapoport
2022-11-22  2:55   ` Song Liu
