* [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs
@ 2022-10-31 22:25 Song Liu
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Song Liu @ 2022-10-31 22:25 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, mcgrof, Song Liu

The original email didn't make it to lore; resending with a shorter CC list.

This set enables bpf programs and bpf dispatchers to share huge pages via a
new API:
  vmalloc_exec()
  vfree_exec()
  vcopy_exec()

The idea is similar to Peter's suggestion in [1].

vmalloc_exec() manages a pool of PMD_SIZE RO+X memory and hands out chunks of
it to its users. vfree_exec() frees memory allocated by vmalloc_exec(), and
vcopy_exec() updates memory allocated by vmalloc_exec().

Memory allocated by vmalloc_exec() is RO+X, so this does not violate W^X. The
caller has to update the content with a text_poke-like mechanism. Specifically,
vcopy_exec() is provided to update memory allocated by vmalloc_exec(), and it
also makes sure the update stays within the boundary of one chunk allocated by
vmalloc_exec(). Please refer to patch 1/5 for more details.
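
A minimal caller-side sketch of the intended lifecycle (image_size and
rw_image below are made-up names; error handling is trimmed):

	/* get a 64-byte aligned chunk of RO+X memory from the shared pool */
	void *ro_image = vmalloc_exec(image_size, 64);

	if (!ro_image)
		return -ENOMEM;

	/* RO+X memory cannot be written directly; publish the JITed text
	 * through the text_poke()-based helper instead.
	 */
	if (IS_ERR(vcopy_exec(ro_image, rw_image, image_size))) {
		vfree_exec(ro_image);
		return -EINVAL;
	}

	/* ... later, when the program is unloaded ... */
	vfree_exec(ro_image);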

Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.

Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/

Changes RFC v2 => PATCH v1:
1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
   also ensures that each vcopy_exec() call only updates memory from a single
   vmalloc_exec() allocation. (Christoph Hellwig)
2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrappers for the
   text_poke()-like logic.
3. Drop changes for kernel modules and focus on BPF side changes.

Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
   work fine with BPF programs (patches 1, 2, 4), but the module side
   (patch 3) still needs some work.

Song Liu (5):
  vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  x86/alternative: support vmalloc_exec() and vfree_exec()
  bpf: use vmalloc_exec for bpf program and bpf dispatcher
  vmalloc: introduce register_text_tail_vm()
  x86: use register_text_tail_vm

 arch/x86/include/asm/pgtable_64_types.h |   1 +
 arch/x86/kernel/alternative.c           |  12 +
 arch/x86/mm/init_64.c                   |   4 +-
 arch/x86/net/bpf_jit_comp.c             |  23 +-
 include/linux/bpf.h                     |   3 -
 include/linux/filter.h                  |   5 -
 include/linux/vmalloc.h                 |   9 +
 kernel/bpf/core.c                       | 180 +-----------
 kernel/bpf/dispatcher.c                 |  11 +-
 mm/nommu.c                              |   7 +
 mm/vmalloc.c                            | 351 ++++++++++++++++++++++++
 11 files changed, 404 insertions(+), 202 deletions(-)

--
2.30.2

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
@ 2022-10-31 22:25 ` Song Liu
  2022-11-02 23:41   ` Luis Chamberlain
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-31 22:25 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, mcgrof, Song Liu

vmalloc_exec is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new vmap_area tree, the free_text_area_* tree, is introduced in addition
to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates pages from
free_text_area_*. When there isn't enough space left in free_text_area_*,
new PMD_SIZE page(s) are allocated from free_vmap_area_* and added to
free_text_area_*. To be more accurate, the vmap_area is first added to the
vmap_area_* tree and then moved to free_text_area_*. This extra move
simplifies the logic of vmalloc_exec.

vmap_areas in the free_text_area_* tree are backed by memory, but we need
subtree_max_size for tree operations. Therefore, the vm_structs for these
vmap_areas are stored in a separate list, all_text_vm.

The new tree allows separate handling of < PAGE_SIZE allocations, as the
current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
version of vmalloc_exec can handle bpf programs, which use 64-byte aligned
allocations, and modules, which use PAGE_SIZE aligned allocations.

Memory allocated by vmalloc_exec() is set to RO+X before it is returned to
the caller, so the caller cannot write directly to the memory. Instead, the
caller is required to use vcopy_exec() to update it. For the safety and
security of X memory, vcopy_exec() checks that the data being updated always
stays within memory allocated by a single vmalloc_exec() call. vcopy_exec()
uses a text_poke-like mechanism and requires arch support; specifically, the
arch needs to implement arch_vcopy_exec().

In vfree_exec(), the memory is first erased with arch_invalidate_exec() and
then added to free_text_area_*. If this free creates a big enough contiguous
free range (at least one naturally aligned PMD_SIZE block), vfree_exec() will
try to free the backing vm_struct.
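
As an illustration of that last check, using the PMD_ALIGN helpers the patch
adds (the base address A below is made up and assumed to be PMD (2MB) aligned):

	/* merged free range in free_text_area_* after a vfree_exec() */
	va->va_start = A + 0x030000;
	va->va_end   = A + 0x450000;

	free_start = PMD_ALIGN(va->va_start);      /* A + 0x200000, round up   */
	free_end   = PMD_ALIGN_DOWN(va->va_end);   /* A + 0x400000, round down */

	/* free_start < free_end, so [free_start, free_end) covers one full
	 * 2MB block; a backing vm_struct inside it can be unlinked and freed.
	 */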

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |   5 +
 mm/nommu.c              |  12 ++
 mm/vmalloc.c            | 318 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 335 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..9b2042313c12 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller) __alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
+void *vcopy_exec(void *dst, void *src, size_t len);
+void vfree_exec(void *addr);
+void *arch_vcopy_exec(void *dst, void *src, size_t len);
+int arch_invalidate_exec(void *ptr, size_t len);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..8a1317247ef0 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 }
 EXPORT_SYMBOL(vm_map_pages_zero);
 
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+	return NULL;
+}
+
+void *vcopy_exec(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+void vfree_exec(void *addr) { }
+
 /*
  *  sys_brk() for the most part doesn't need the global kernel
  *  lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..6f4c73e67191 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -769,6 +772,38 @@ static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * free_text_area for vmalloc_exec()
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augment red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains a maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find a lowest match of free area.
+ *
+ * vmap_area in this tree are backed by RO+X memory, but they do
+ * not have valid vm pointer (because we need subtree_max_size).
+ * The vm for these vmap_area are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
 /*
  * Preload a CPU with one object for "no edge" split case. The
  * aim is to get rid of allocations from the atomic context, thus
@@ -3313,6 +3348,289 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define VMALLOC_EXEC_START MODULES_VADDR
+#define VMALLOC_EXEC_END MODULES_END
+#else
+#define VMALLOC_EXEC_START VMALLOC_START
+#define VMALLOC_EXEC_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+	struct vmap_area *va;
+
+	/* remove from vmap_area_root */
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* make the memory RO+X */
+	memset(addr, 0, va->va_end - va->va_start);
+	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+	/* add to all_text_vm */
+	va->vm->next = all_text_vm;
+	all_text_vm = va->vm;
+
+	/* add to free_text_area_root */
+	spin_lock(&free_text_area_lock);
+	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+	spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * vmalloc_exec - allocate virtually contiguous RO+X memory
+ * @size:    allocation size
+ * @align:   desired alignment of the allocation
+ *
+ * This is used to allocate dynamic kernel text, such as module text and BPF
+ * programs. Users must use vcopy_exec() to update the allocated memory.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+	struct vmap_area *va, *tmp;
+	unsigned long addr;
+	enum fit_type type;
+	int ret;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!va))
+		return NULL;
+
+again:
+	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+	if (!tmp) {
+		unsigned long alloc_size;
+		void *ptr;
+
+		spin_unlock(&free_text_area_lock);
+
+		/*
+		 * Not enough continuous space in free_text_area_root, try
+		 * allocate more memory. The memory is first added to
+		 * vmap_area_root, and then moved to free_text_area_root.
+		 */
+		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
+					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
+					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (unlikely(!ptr))
+			goto err_out;
+
+		move_vmap_to_free_text_tree(ptr);
+		goto again;
+	}
+
+	addr = roundup(tmp->va_start, align);
+	type = classify_va_fit_type(tmp, addr, size);
+	if (WARN_ON_ONCE(type == NOTHING_FIT))
+		goto err_out;
+
+	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+				    tmp, addr, size);
+	if (ret)
+		goto err_out;
+
+	spin_unlock(&free_text_area_lock);
+
+	va->va_start = addr;
+	va->va_end = addr + size;
+	va->vm = NULL;
+
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	return (void *)addr;
+
+err_out:
+	spin_unlock(&free_text_area_lock);
+	kmem_cache_free(vmap_area_cachep, va);
+	return NULL;
+}
+
+void __weak *arch_vcopy_exec(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+int __weak arch_invalidate_exec(void *ptr, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * vcopy_exec - Copy text to RO+X memory allocated by vmalloc_exec()
+ * @dst:  pointer to memory allocated by vmalloc_exec()
+ * @src:  pointer to data being copied from
+ * @len:  number of bytes to be copied
+ *
+ * vcopy_exec() will only update memory allocated by a single vmalloc_exec()
+ * call. If @dst + @len goes beyond the boundary of one allocation,
+ * vcopy_exec() is aborted.
+ *
+ * Return: @dst on success, or an ERR_PTR() on failure.
+ */
+void *vcopy_exec(void *dst, void *src, size_t len)
+{
+	struct vmap_area *va;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
+
+	/*
+	 * If no va, or va has a vm attached, this memory is not allocated
+	 * by vmalloc_exec().
+	 */
+	if (WARN_ON_ONCE(!va) || WARN_ON_ONCE(va->vm))
+		goto err_out;
+	if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
+		goto err_out;
+
+	spin_unlock(&vmap_area_lock);
+
+	return arch_vcopy_exec(dst, src, len);
+
+err_out:
+	spin_unlock(&vmap_area_lock);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+	struct vm_struct *vm, *prev_vm;
+
+	lockdep_assert_held(&free_text_area_lock);
+
+	vm = all_text_vm;
+	while (vm) {
+		unsigned long vm_addr = (unsigned long)vm->addr;
+
+		/* vm is within this free space, we can free it */
+		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+			goto unlink_vm;
+		vm = vm->next;
+	}
+	return NULL;
+
+unlink_vm:
+	if (all_text_vm == vm) {
+		all_text_vm = vm->next;
+	} else {
+		prev_vm = all_text_vm;
+		while (prev_vm->next != vm)
+			prev_vm = prev_vm->next;
+		prev_vm->next = vm->next;
+	}
+	return vm;
+}
+
+/**
+ * vfree_exec - Release memory allocated by vmalloc_exec()
+ * @addr:  Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void vfree_exec(void *addr)
+{
+	unsigned long free_start, free_end, free_addr;
+	struct vm_struct *vm;
+	struct vmap_area *va;
+
+	might_sleep();
+
+	if (!addr)
+		return;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	WARN_ON_ONCE(va->vm);
+
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* Invalidate text in the region */
+	arch_invalidate_exec(addr, va->va_end - va->va_start);
+
+	spin_lock(&free_text_area_lock);
+	va = merge_or_add_vmap_area_augment(va,
+		&free_text_area_root, &free_text_area_list);
+
+	if (WARN_ON_ONCE(!va))
+		goto out;
+
+	free_start = PMD_ALIGN(va->va_start);
+	free_end = PMD_ALIGN_DOWN(va->va_end);
+
+	/*
+	 * Only try to free vm when there is at least one PMD_SIZE free
+	 * continuous memory.
+	 */
+	if (free_start >= free_end)
+		goto out;
+
+	/*
+	 * TODO: It is possible that multiple vm are ready to be freed
+	 * after one vfree_exec(). But we free at most one vm for now.
+	 */
+	vm = find_and_unlink_text_vm(free_start, free_end);
+	if (!vm)
+		goto out;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+	if (unlikely(!va))
+		goto out_save_vm;
+
+	free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+				      vm->size, 1, (unsigned long)vm->addr,
+				      (unsigned long)vm->addr + vm->size);
+
+	if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+		goto out_save_vm;
+
+	va->va_start = (unsigned long)vm->addr;
+	va->va_end = va->va_start + vm->size;
+	va->vm = vm;
+	spin_unlock(&free_text_area_lock);
+
+	set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+	set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+	/* put the va to vmap_area_root, and then free it with vfree */
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	vfree(vm->addr);
+	return;
+
+out_save_vm:
+	/*
+	 * vm is removed from all_text_vm, but not freed. Add it back,
+	 * so that we can use or free it later.
+	 */
+	vm->next = all_text_vm;
+	all_text_vm = vm;
+out:
+	spin_unlock(&free_text_area_lock);
+}
+
 /**
  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
  * @size:      allocation size
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec()
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
@ 2022-10-31 22:25 ` Song Liu
  2022-11-02 22:21   ` Edgecombe, Rick P
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-31 22:25 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, mcgrof, Song Liu

Implement arch_vcopy_exec() and arch_invalidate_exec() to support
vmalloc_exec.

arch_vcopy_exec() copies dynamic kernel text (such as BPF programs) to
RO+X memory region allocated by vmalloc_exec().

arch_invalidate_exec() fills memory with 0xcc after it is released by
vfree_exec().

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/kernel/alternative.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cadcea035e0..73d89774ace3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1270,6 +1270,18 @@ void *text_poke_copy(void *addr, const void *opcode, size_t len)
 	return addr;
 }
 
+void *arch_vcopy_exec(void *dst, void *src, size_t len)
+{
+	if (text_poke_copy(dst, src, len) == NULL)
+		return ERR_PTR(-EINVAL);
+	return dst;
+}
+
+int arch_invalidate_exec(void *ptr, size_t len)
+{
+	return IS_ERR_OR_NULL(text_poke_set(ptr, 0xcc, len));
+}
+
 /**
  * text_poke_set - memset into (an unused part of) RX memory
  * @addr: address to modify
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v1 RESEND 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
@ 2022-10-31 22:25 ` Song Liu
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-31 22:25 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, mcgrof, Song Liu

Use vmalloc_exec, vfree_exec, and vcopy_exec instead of
bpf_prog_pack_alloc, bpf_prog_pack_free, and bpf_arch_text_copy.

vfree_exec doesn't require extra size information. Therefore, the free
and error handling path can be simplified.

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c |  23 +----
 include/linux/bpf.h         |   3 -
 include/linux/filter.h      |   5 -
 kernel/bpf/core.c           | 180 +++---------------------------------
 kernel/bpf/dispatcher.c     |  11 +--
 5 files changed, 21 insertions(+), 201 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36ffe67ad6e5..3db316b2f3a7 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -228,11 +228,6 @@ static void jit_fill_hole(void *area, unsigned int size)
 	memset(area, 0xcc, size);
 }
 
-int bpf_arch_text_invalidate(void *dst, size_t len)
-{
-	return IS_ERR_OR_NULL(text_poke_set(dst, 0xcc, len));
-}
-
 struct jit_context {
 	int cleanup_addr; /* Epilogue code offset */
 
@@ -2496,11 +2491,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		if (proglen <= 0) {
 out_image:
 			image = NULL;
-			if (header) {
-				bpf_arch_text_copy(&header->size, &rw_header->size,
-						   sizeof(rw_header->size));
+			if (header)
 				bpf_jit_binary_pack_free(header, rw_header);
-			}
+
 			/* Fall back to interpreter mode */
 			prog = orig_prog;
 			if (extra_pass) {
@@ -2550,8 +2543,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		if (!prog->is_func || extra_pass) {
 			/*
 			 * bpf_jit_binary_pack_finalize fails in two scenarios:
-			 *   1) header is not pointing to proper module memory;
-			 *   2) the arch doesn't support bpf_arch_text_copy().
+			 *   1) header is not pointing to memory allocated by
+			 *      vmalloc_exec;
+			 *   2) the arch doesn't support vcopy_exec().
 			 *
 			 * Both cases are serious bugs and justify WARN_ON.
 			 */
@@ -2597,13 +2591,6 @@ bool bpf_jit_supports_kfunc_call(void)
 	return true;
 }
 
-void *bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
-	if (text_poke_copy(dst, src, len) == NULL)
-		return ERR_PTR(-EINVAL);
-	return dst;
-}
-
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9fd68b0b3e9c..1d42f58334d0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2654,9 +2654,6 @@ enum bpf_text_poke_type {
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
 
-void *bpf_arch_text_copy(void *dst, void *src, size_t len);
-int bpf_arch_text_invalidate(void *dst, size_t len);
-
 struct btf_id_set;
 bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index efc42a6e3aed..98e28126c24b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1023,8 +1023,6 @@ extern long bpf_jit_limit_max;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size);
-
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
@@ -1037,9 +1035,6 @@ void bpf_jit_free(struct bpf_prog *fp);
 struct bpf_binary_header *
 bpf_jit_binary_pack_hdr(const struct bpf_prog *fp);
 
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns);
-void bpf_prog_pack_free(struct bpf_binary_header *hdr);
-
 static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 {
 	return list_empty(&fp->aux->ksym.lnode) ||
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a0e762a2bf97..ca722078697b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -806,149 +806,6 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	return slot;
 }
 
-/*
- * BPF program pack allocator.
- *
- * Most BPF programs are pretty small. Allocating a hole page for each
- * program is sometime a waste. Many small bpf program also adds pressure
- * to instruction TLB. To solve this issue, we introduce a BPF program pack
- * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
- * to host BPF programs.
- */
-#define BPF_PROG_CHUNK_SHIFT	6
-#define BPF_PROG_CHUNK_SIZE	(1 << BPF_PROG_CHUNK_SHIFT)
-#define BPF_PROG_CHUNK_MASK	(~(BPF_PROG_CHUNK_SIZE - 1))
-
-struct bpf_prog_pack {
-	struct list_head list;
-	void *ptr;
-	unsigned long bitmap[];
-};
-
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size)
-{
-	memset(area, 0, size);
-}
-
-#define BPF_PROG_SIZE_TO_NBITS(size)	(round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
-
-static DEFINE_MUTEX(pack_mutex);
-static LIST_HEAD(pack_list);
-
-/* PMD_SIZE is not available in some special config, e.g. ARCH=arm with
- * CONFIG_MMU=n. Use PAGE_SIZE in these cases.
- */
-#ifdef PMD_SIZE
-#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())
-#else
-#define BPF_PROG_PACK_SIZE PAGE_SIZE
-#endif
-
-#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
-
-static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	struct bpf_prog_pack *pack;
-
-	pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)),
-		       GFP_KERNEL);
-	if (!pack)
-		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
-	if (!pack->ptr) {
-		kfree(pack);
-		return NULL;
-	}
-	bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE);
-	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
-	list_add_tail(&pack->list, &pack_list);
-
-	set_vm_flush_reset_perms(pack->ptr);
-	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	return pack;
-}
-
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
-	struct bpf_prog_pack *pack;
-	unsigned long pos;
-	void *ptr = NULL;
-
-	mutex_lock(&pack_mutex);
-	if (size > BPF_PROG_PACK_SIZE) {
-		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
-		if (ptr) {
-			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
-			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
-			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
-		}
-		goto out;
-	}
-	list_for_each_entry(pack, &pack_list, list) {
-		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-						 nbits, 0);
-		if (pos < BPF_PROG_CHUNK_COUNT)
-			goto found_free_area;
-	}
-
-	pack = alloc_new_pack(bpf_fill_ill_insns);
-	if (!pack)
-		goto out;
-
-	pos = 0;
-
-found_free_area:
-	bitmap_set(pack->bitmap, pos, nbits);
-	ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
-
-out:
-	mutex_unlock(&pack_mutex);
-	return ptr;
-}
-
-void bpf_prog_pack_free(struct bpf_binary_header *hdr)
-{
-	struct bpf_prog_pack *pack = NULL, *tmp;
-	unsigned int nbits;
-	unsigned long pos;
-
-	mutex_lock(&pack_mutex);
-	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
-		goto out;
-	}
-
-	list_for_each_entry(tmp, &pack_list, list) {
-		if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) {
-			pack = tmp;
-			break;
-		}
-	}
-
-	if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
-		goto out;
-
-	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
-	pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT;
-
-	WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
-		  "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
-
-	bitmap_clear(pack->bitmap, pos, nbits);
-	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
-		list_del(&pack->list);
-		module_memfree(pack->ptr);
-		kfree(pack);
-	}
-out:
-	mutex_unlock(&pack_mutex);
-}
-
 static atomic_long_t bpf_jit_current;
 
 /* Can be overridden by an arch's JIT compiler if it has a custom,
@@ -1048,6 +905,9 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 	bpf_jit_uncharge_modmem(size);
 }
 
+#define BPF_PROG_EXEC_ALIGN	64
+#define BPF_PROG_EXEC_MASK	(~(BPF_PROG_EXEC_ALIGN - 1))
+
 /* Allocate jit binary from bpf_prog_pack allocator.
  * Since the allocated memory is RO+X, the JIT engine cannot write directly
  * to the memory. To solve this problem, a RW buffer is also allocated at
@@ -1070,11 +930,11 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 		     alignment > BPF_IMAGE_ALIGNMENT);
 
 	/* add 16 bytes for a random section of illegal instructions */
-	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_CHUNK_SIZE);
+	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_EXEC_ALIGN);
 
 	if (bpf_jit_charge_modmem(size))
 		return NULL;
-	ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns);
+	ro_header = vmalloc_exec(size, BPF_PROG_EXEC_ALIGN);
 	if (!ro_header) {
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
@@ -1082,8 +942,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 
 	*rw_header = kvmalloc(size, GFP_KERNEL);
 	if (!*rw_header) {
-		bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
-		bpf_prog_pack_free(ro_header);
+		vfree_exec(ro_header);
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
 	}
@@ -1093,7 +952,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 	(*rw_header)->size = size;
 
 	hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),
-		     BPF_PROG_CHUNK_SIZE - sizeof(*ro_header));
+		     BPF_PROG_EXEC_ALIGN - sizeof(*ro_header));
 	start = (get_random_int() % hole) & ~(alignment - 1);
 
 	*image_ptr = &ro_header->image[start];
@@ -1109,12 +968,12 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
 {
 	void *ptr;
 
-	ptr = bpf_arch_text_copy(ro_header, rw_header, rw_header->size);
+	ptr = vcopy_exec(ro_header, rw_header, rw_header->size);
 
 	kvfree(rw_header);
 
 	if (IS_ERR(ptr)) {
-		bpf_prog_pack_free(ro_header);
+		vfree_exec(ro_header);
 		return PTR_ERR(ptr);
 	}
 	return 0;
@@ -1124,18 +983,13 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
  *   1) when the program is freed after;
  *   2) when the JIT engine fails (before bpf_jit_binary_pack_finalize).
  * For case 2), we need to free both the RO memory and the RW buffer.
- *
- * bpf_jit_binary_pack_free requires proper ro_header->size. However,
- * bpf_jit_binary_pack_alloc does not set it. Therefore, ro_header->size
- * must be set with either bpf_jit_binary_pack_finalize (normal path) or
- * bpf_arch_text_copy (when jit fails).
  */
 void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
 			      struct bpf_binary_header *rw_header)
 {
-	u32 size = ro_header->size;
+	u32 size = rw_header ? rw_header->size : ro_header->size;
 
-	bpf_prog_pack_free(ro_header);
+	vfree_exec(ro_header);
 	kvfree(rw_header);
 	bpf_jit_uncharge_modmem(size);
 }
@@ -1146,7 +1000,7 @@ bpf_jit_binary_pack_hdr(const struct bpf_prog *fp)
 	unsigned long real_start = (unsigned long)fp->bpf_func;
 	unsigned long addr;
 
-	addr = real_start & BPF_PROG_CHUNK_MASK;
+	addr = real_start & BPF_PROG_EXEC_MASK;
 	return (void *)addr;
 }
 
@@ -2736,16 +2590,6 @@ int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return -ENOTSUPP;
 }
 
-void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
-	return ERR_PTR(-ENOTSUPP);
-}
-
-int __weak bpf_arch_text_invalidate(void *dst, size_t len)
-{
-	return -ENOTSUPP;
-}
-
 DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 EXPORT_SYMBOL(bpf_stats_enabled_key);
 
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
index fa64b80b8bca..7ca8138fb8fa 100644
--- a/kernel/bpf/dispatcher.c
+++ b/kernel/bpf/dispatcher.c
@@ -120,11 +120,11 @@ static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
 	tmp = d->num_progs ? d->rw_image + noff : NULL;
 	if (new) {
 		/* Prepare the dispatcher in d->rw_image. Then use
-		 * bpf_arch_text_copy to update d->image, which is RO+X.
+		 * vcopy_exec to update d->image, which is RO+X.
 		 */
 		if (bpf_dispatcher_prepare(d, new, tmp))
 			return;
-		if (IS_ERR(bpf_arch_text_copy(new, tmp, PAGE_SIZE / 2)))
+		if (IS_ERR(vcopy_exec(new, tmp, PAGE_SIZE / 2)))
 			return;
 	}
 
@@ -146,15 +146,12 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
 
 	mutex_lock(&d->mutex);
 	if (!d->image) {
-		d->image = bpf_prog_pack_alloc(PAGE_SIZE, bpf_jit_fill_hole_with_zero);
+		d->image = vmalloc_exec(PAGE_SIZE, PAGE_SIZE /* align */);
 		if (!d->image)
 			goto out;
 		d->rw_image = bpf_jit_alloc_exec(PAGE_SIZE);
 		if (!d->rw_image) {
-			u32 size = PAGE_SIZE;
-
-			bpf_arch_text_copy(d->image, &size, sizeof(size));
-			bpf_prog_pack_free((struct bpf_binary_header *)d->image);
+			vfree_exec((struct bpf_binary_header *)d->image);
 			d->image = NULL;
 			goto out;
 		}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v1 RESEND 4/5] vmalloc: introduce register_text_tail_vm()
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
                   ` (2 preceding siblings ...)
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
@ 2022-10-31 22:25 ` Song Liu
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm Song Liu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-31 22:25 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, mcgrof, Song Liu

Allow arch code to register some memory to be used by vmalloc_exec().
One possible use case is to allocate PMD pages for kernel text up to
PMD_ALIGN(_etext), and use (_etext, PMD_ALIGN(_etext)) for
vmalloc_exec. Currently, only one such region is supported.

Signed-off-by: Song Liu <song@kernel.org>
---
 mm/vmalloc.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6f4c73e67191..46f2b7e56670 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
 #define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
 #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
 
+static struct vm_struct text_tail_vm;
+static struct vmap_area text_tail_va;
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -653,6 +656,8 @@ int is_vmalloc_or_module_addr(const void *x)
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
 	if (addr >= MODULES_VADDR && addr < MODULES_END)
 		return 1;
+	if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
+		return 1;
 #endif
 	return is_vmalloc_addr(x);
 }
@@ -2437,6 +2442,34 @@ static void vmap_init_free_space(void)
 	}
 }
 
+/*
+ * register_text_tail_vm() allows arch code to register memory regions
+ * for vmalloc_exec. Unlike regular memory regions used by vmalloc_exec,
+ * this region is never freed by vfree_exec.
+ *
+ * One possible use case is to allocate PMD pages for kernel text up to
+ * PMD_ALIGN(_etext), and use (_etext, PMD_ALIGN(_etext)) for vmalloc_exec.
+ */
+void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+	struct vmap_area *va;
+
+	/* only support one region */
+	if (WARN_ON_ONCE(text_tail_vm.addr))
+		return;
+
+	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+	if (WARN_ON_ONCE(!va))
+		return;
+	text_tail_vm.addr = (void *)start;
+	text_tail_vm.size = end - start;
+	text_tail_va.va_start = start;
+	text_tail_va.va_end = end;
+	text_tail_va.vm = &text_tail_vm;
+	memcpy(va, &text_tail_va, sizeof(*va));
+	insert_vmap_area_augment(va, NULL, &free_text_area_root, &free_text_area_list);
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
                   ` (3 preceding siblings ...)
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
@ 2022-10-31 22:25 ` Song Liu
  2022-11-02 22:24   ` Edgecombe, Rick P
  2022-11-01 11:26 ` [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Christoph Hellwig
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-31 22:25 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, mcgrof, Song Liu

Allocate 2MB pages up to round_up(_etext, 2MB), and register the memory range
[round_up(_etext, 4KB), round_up(_etext, 2MB)) with register_text_tail_vm so
that this part of memory can be used for dynamic kernel text (BPF programs,
etc.).

Here is an example:

[root@eth50-1 ~]# grep _etext /proc/kallsyms
ffffffff82202a08 T _etext

[root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]

[root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd

ffffffff82200000-ffffffff82400000 is a 2MB page serving both kernel text and
bpf programs.
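
Worked through with the example addresses above:

  _etext            = ffffffff82202a08
  PFN_ALIGN(_etext) = ffffffff82203000   (4KB round-up)
  PMD_ALIGN(_etext) = ffffffff82400000   (2MB round-up)

so register_text_tail_vm(ffffffff82203000, ffffffff82400000) donates the
2036KB tail of this shared 2MB page to vmalloc_exec(); the bpf_prog_*
symbols listed above fall in that tail, while the head of the page still
maps static kernel text.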

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h | 1 +
 arch/x86/mm/init_64.c                   | 4 +++-
 include/linux/vmalloc.h                 | 4 ++++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04f36063ad54..c0f9cceb109a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -101,6 +101,7 @@ extern unsigned int ptrs_per_p4d;
 #define PUD_MASK	(~(PUD_SIZE - 1))
 #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
+#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
 
 /*
  * See Documentation/x86/x86_64/mm.rst for a description of the memory map.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..5b42fc0c6099 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1373,7 +1373,7 @@ void mark_rodata_ro(void)
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
-	unsigned long text_end = PFN_ALIGN(_etext);
+	unsigned long text_end = PMD_ALIGN(_etext);
 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
 	unsigned long all_end;
 
@@ -1414,6 +1414,8 @@ void mark_rodata_ro(void)
 				(void *)rodata_end, (void *)_sdata);
 
 	debug_checkwx();
+	register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
+			      PMD_ALIGN((unsigned long)_etext));
 }
 
 int kern_addr_valid(unsigned long addr)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 9b2042313c12..7365cf9c4e7f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -132,11 +132,15 @@ extern void vm_unmap_aliases(void);
 #ifdef CONFIG_MMU
 extern void __init vmalloc_init(void);
 extern unsigned long vmalloc_nr_pages(void);
+void register_text_tail_vm(unsigned long start, unsigned long end);
 #else
 static inline void vmalloc_init(void)
 {
 }
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
+static inline void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+}
 #endif
 
 extern void *vmalloc(unsigned long size) __alloc_size(1);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
                   ` (4 preceding siblings ...)
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm Song Liu
@ 2022-11-01 11:26 ` Christoph Hellwig
  2022-11-01 15:10   ` Song Liu
  2022-11-02 20:45 ` Luis Chamberlain
  2022-11-02 22:29 ` Edgecombe, Rick P
  7 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2022-11-01 11:26 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	dave.hansen, mcgrof

On Mon, Oct 31, 2022 at 03:25:36PM -0700, Song Liu wrote:
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
>   vmalloc_exec()
>   vfree_exec()
>   vcopy_exec()

Maybe it's just me, but I don't like the names very much.  They imply
a slight extension to the vmalloc API, but while they use the vmalloc
mechanisms internally, the API is actually quite different.

So why not something like:

   execmem_alloc
   execmem_free
   execmem_fill or execmem_set or copy_to_execmem

?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs
  2022-11-01 11:26 ` [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Christoph Hellwig
@ 2022-11-01 15:10   ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-11-01 15:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: bpf, linux-mm, akpm, x86, peterz, rick.p.edgecombe, dave.hansen, mcgrof

On Tue, Nov 1, 2022 at 4:26 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Oct 31, 2022 at 03:25:36PM -0700, Song Liu wrote:
> > This set enables bpf programs and bpf dispatchers to share huge pages with
> > new API:
> >   vmalloc_exec()
> >   vfree_exec()
> >   vcopy_exec()
>
> Maybe it's just me, but I don't like the names very much.  They imply
> a slight extension to the vmalloc API, but while they use the vmalloc
> mechanisms internally, the API is actually quite different.
>
> So why not something like:
>
>    execmem_alloc
>    execmem_free
>    execmem_fill or execmem_set or copy_to_execmem
>
> ?

I don't have a strong preference on names. We can change the name
to whatever we agree on.

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
                   ` (5 preceding siblings ...)
  2022-11-01 11:26 ` [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Christoph Hellwig
@ 2022-11-02 20:45 ` Luis Chamberlain
  2022-11-02 22:29 ` Edgecombe, Rick P
  7 siblings, 0 replies; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-02 20:45 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	dave.hansen, rppt, zhengjun.xing, kbusch, p.raghav, dave

On Mon, Oct 31, 2022 at 03:25:36PM -0700, Song Liu wrote:
> Changes RFC v2 => PATCH v1:

<-- snip -->

> 3. Drop changes for kernel modules and focus on BPF side changes.

Yeah, because in my testing of RFCv2 things blow up pretty hard; I haven't
been able to successfully boot *any* system with those patches, virtual or
bare metal.

I'm very happy to help test new iterations if you have anything.

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec()
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
@ 2022-11-02 22:21   ` Edgecombe, Rick P
  2022-11-03 21:03     ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-11-02 22:21 UTC (permalink / raw)
  To: linux-mm, song, bpf; +Cc: hch, mcgrof, peterz, akpm, x86, Hansen, Dave

On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> diff --git a/arch/x86/kernel/alternative.c
> b/arch/x86/kernel/alternative.c
> index 5cadcea035e0..73d89774ace3 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1270,6 +1270,18 @@ void *text_poke_copy(void *addr, const void
> *opcode, size_t len)
>         return addr;
>  }
>  
> +void *arch_vcopy_exec(void *dst, void *src, size_t len)
> +{
> +       if (text_poke_copy(dst, src, len) == NULL)
> +               return ERR_PTR(-EINVAL);
> +       return dst;
> +}

Except for this, there are no more users of text_poke_copy() right?
Should it just be replaced with arch_vcopy_exec()?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm Song Liu
@ 2022-11-02 22:24   ` Edgecombe, Rick P
  2022-11-03 21:04     ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-11-02 22:24 UTC (permalink / raw)
  To: linux-mm, song, bpf; +Cc: hch, mcgrof, peterz, akpm, x86, Hansen, Dave

On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
> [round_up(_etext, 4kb), round_up(_etext, 2MB)] with
> register_text_tail_vm
> so that we can use this part of memory for dynamic kernel text (BPF
> programs, etc.).
> 
> Here is an example:
> 
> [root@eth50-1 ~]# grep _etext /proc/kallsyms
> ffffffff82202a08 T _etext
> 
> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> ffffffff8220f920 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> ffffffff8220fa28 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> ffffffff8220fad4 t
> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> 
> [root@eth50-1 ~]#  grep 0xffffffff82200000
> /sys/kernel/debug/page_tables/kernel
> 0xffffffff82200000-
> 0xffffffff82400000     2M     ro   PSE         x  pmd
> 
> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
> and
> bpf programs.
> 
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  arch/x86/include/asm/pgtable_64_types.h | 1 +
>  arch/x86/mm/init_64.c                   | 4 +++-
>  include/linux/vmalloc.h                 | 4 ++++
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_64_types.h
> b/arch/x86/include/asm/pgtable_64_types.h
> index 04f36063ad54..c0f9cceb109a 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -101,6 +101,7 @@ extern unsigned int ptrs_per_p4d;
>  #define PUD_MASK	(~(PUD_SIZE - 1))
>  #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
>  #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) &
> PMD_MASK)
>  
>  /*
>   * See Documentation/x86/x86_64/mm.rst for a description of the
> memory map.
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..5b42fc0c6099 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1373,7 +1373,7 @@ void mark_rodata_ro(void)
>  	unsigned long start = PFN_ALIGN(_text);
>  	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>  	unsigned long end = (unsigned long)__end_rodata_hpage_align;
> -	unsigned long text_end = PFN_ALIGN(_etext);
> +	unsigned long text_end = PMD_ALIGN(_etext);
>  	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>  	unsigned long all_end;

Check out is_errata93(). Right now it assumes all text is between
_text-_etext and MODULES_VADDR-MODULES_END. It's a quite old erratum, but it
would be nice if we had an is_text_addr() helper or something, to help keep
track of the places where text might pop up.
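
Something along these lines, purely as a sketch of the idea (is_text_addr()
does not exist today, and text_tail_va, the region this series registers, is
currently static to mm/vmalloc.c):

    static bool is_text_addr(unsigned long addr)
    {
            /* static kernel text */
            if (addr >= (unsigned long)_text && addr < (unsigned long)_etext)
                    return true;
            /* modules and vmalloc_exec() allocations */
            if (addr >= MODULES_VADDR && addr < MODULES_END)
                    return true;
            /* the tail of the last static-text 2MB page */
            if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
                    return true;
            return false;
    }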

Speaking of which, it might be nice to update
Documentation/x86/x86_64/mm.rst with some hints that this area exists.

>  
> @@ -1414,6 +1414,8 @@ void mark_rodata_ro(void)
>  				(void *)rodata_end, (void *)_sdata);
>  
>  	debug_checkwx();
> +	register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
> +			      PMD_ALIGN((unsigned long)_etext));
>  }
>  
>  int kern_addr_valid(unsigned long addr)
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 9b2042313c12..7365cf9c4e7f 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -132,11 +132,15 @@ extern void vm_unmap_aliases(void);
>  #ifdef CONFIG_MMU
>  extern void __init vmalloc_init(void);
>  extern unsigned long vmalloc_nr_pages(void);
> +void register_text_tail_vm(unsigned long start, unsigned long end);
>  #else
>  static inline void vmalloc_init(void)
>  {
>  }
>  static inline unsigned long vmalloc_nr_pages(void) { return 0; }
> +void register_text_tail_vm(unsigned long start, unsigned long end)
> +{
> +}
>  #endif

This looks like it should be in the previous patch.

>  
>  extern void *vmalloc(unsigned long size) __alloc_size(1);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs
  2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
                   ` (6 preceding siblings ...)
  2022-11-02 20:45 ` Luis Chamberlain
@ 2022-11-02 22:29 ` Edgecombe, Rick P
  2022-11-03 21:13   ` Song Liu
  7 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-11-02 22:29 UTC (permalink / raw)
  To: linux-mm, song, bpf; +Cc: hch, mcgrof, peterz, akpm, x86, Hansen, Dave

On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> This set enables bpf programs and bpf dispatchers to share huge pages
> with
> new API:
>   vmalloc_exec()
>   vfree_exec()
>   vcopy_exec()
> 
> The idea is similar to Peter's suggestion in [1].
> 
> vmalloc_exec() manages a set of PMD_SIZE RO+X memory, and allocates
> these
> memory to its users. vfree_exec() is used to free memory allocated by
> vmalloc_exec(). vcopy_exec() is used to update memory allocated by
> vmalloc_exec().
> 
> Memory allocated by vmalloc_exec() is RO+X, so this doesnot violate
> W^X.
> The caller has to update the content with text_poke like mechanism.
> Specifically, vcopy_exec() is provided to update memory allocated by
> vmalloc_exec(). vcopy_exec() also makes sure the update stays in the
> boundary of one chunk allocated by vmalloc_exec(). Please refer to
> patch
> 1/5 for more details of
> 
> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> 
> Patch 4/5 and 5/5 allows static kernel text (_stext to _etext) to
> share
> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved
> by
> allocating PMD_SIZE pages to roundup(_etext, PMD_SIZE), and then use
> _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.

It might help to spell out what the benefits of this are. My
understanding is that (to my surprise) we actually haven't seen a
performance improvement with using 2MB pages for JITs. The main
performance benefit you saw on your previous version was from reduced
fragmentation of the direct map IIUC. This was from the effect of
reusing the same pages for JITs so that new ones don't need to be
broken.

The other benefit of this thing is reduced shootdowns. It can load a JIT
with only a local TLB flush on average, which should help systems with very
high CPU counts by some unknown amount.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
@ 2022-11-02 23:41   ` Luis Chamberlain
  2022-11-03 15:51     ` Mike Rapoport
  2022-11-07  6:40     ` Aaron Lu
  0 siblings, 2 replies; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-02 23:41 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	dave.hansen, rppt, zhengjun.xing, kbusch, p.raghav, dave, vbabka,
	mgorman, willy, torvalds, mcgrof

On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> vmalloc_exec is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].

This is all great, but we need to clarify *why* we would go through the
trouble. So if folks are not too excited about this series, that's probably
why. IMHO it lacks substance for rationale, **and** implies a few gains
without any *clear* performance metrics. I have 0 experience with mm so I'd
like others' feedback on this -- I'm just trying to decipher the rationale
from the prior "bpf prog pack" efforts.

I'm sensing that the wires are a bit crossed in the messaging here: we need
to provide a better full picture of the rationale, and right now that is
being missed and the work is being undersold. If my assessment is accurate
though, the bpf prog pack strategy of sharing huge pages may prove useful
long term for things other than just modules / ftrace / kprobes.

I was surprised to see that this patch series' upgrade from RFC to proper
PATCH form now completely fails to mention any of the original motivations
behind the "BPF prog pack", which you are making a truly heroic effort to
generalize, as the problem is hard. Let me try to help with that. The
rationale for the old BPF prog pack is documented as follows:

* Most BPF programs are pretty small. Allocating a hole page for each
* program is sometime a waste. Many small bpf program also adds pressure
* to instruction TLB. To solve this issue, we introduce a BPF program pack
* allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
* to host BPF programs.

Previously you have also stated in earlier versions of this patch set:

  "Most BPF programs are small, but they consume a page each. For systems
   with busy traffic and many BPF programs, this could also add significant
   pressure to instruction TLB. High iTLB pressure usually causes slow down
   for the whole system, which includes visible performance
   degradation for production workloads."

So it is implied here that one of the benefits is to help reduce iTLB
misses, but that's it. We have no visible numbers to look at, and for what...
Reducing iTLB misses doesn't always correlate directly with improving things,
but if the code change is small enough it obviously makes sense to apply. If
the change is a bit more intrusive, as in this patch series, a bit more
rationale should be provided.

Other than the "performance aspects" of your patchset, the *main* reason I
am engaged and like it is that it reduces the nasty mess of semantics around
dealing with special permissions on pages, which we see in modules and a few
other places that today completely open code it. That proves error prone,
and I'm glad to see efforts to generalize that nastiness. So
please ensure this is added as part of the documented rationale. Even
if the iTLB miss ratio improvement is not astronomical I believe that
the gains in sanity on improving semantics on special pages and sharing code
make it well worthwhile. The iTLB miss ratio improvement is just a small
cherry on top.

Going back to performance aspects, when Linus poked for more details about
this, you elaborated further:

  "we have seen direct map fragmentation causing visible
   performance drop for our major services. This is the shadow 
   production benchmark, so it is not possible to run it out of 
   our data centers. Tracing showed that BPF program was the top 
   trigger of these direct map splits."

And the only other metric we have is:

  "For our web service production benchmark, bpf_prog_pack on 4kB pages
   gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."

These metrics are completely arbitrary and opaque to us. We need
something tangible and reproducible and I have been suggesting that
from early on...

I'm under the impression that the real missed, undocumented, major value-add
here is that the old "BPF prog pack" strategy helps to reduce the direct map
fragmentation caused by heavy use of the eBPF JIT programs and this in
turn helps your overall random system performance (regardless of what
it is you do). As I see it then the eBPF prog pack is just one strategy to
try to mitigate memory fragmentation of the direct map caused by the eBPF
JIT programs, so the "slow down" your team has observed should be due to the
eventual fragmentation caused on the direct map *while* eBPF programs
get heavily used.

Mike Rapoport had presented about the Direct map fragmentation problem
at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
evaluation on whether using 2M/1G pages aggressively for the kernel direct map
helps performance [1] ends up generally recommending huge pages. The work by Xing
though was about using huge pages *alone*, not using a strategy such as in the
"bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
and that I think is the real golden nugget here.

I contend therefore that the theoretical reduction of iTLB misses by using
huge pages for "bpf prog pack" is not what gets your systems to perform
somehow better. It should be simply that it reduces fragmentation and
*this* generally can help with performance long term. If this is accurate
then let's please separate the two aspects to this.

There are two aspects that, from a performance perspective, I would like to
see actually mentioned in the commit logs:

1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution
   Vs not using it at all:

   I'd tried the below but didn't see much difference:

   perf stat --repeat 10 -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-stores-misses,iTLB-loads,iTLB-load-misses,page-faults \
   	--pre 'modprobe -r test_bpf' -- modprobe test_bpf

   This is likely because the system I have has a huge cache and I was not
   contending it with another heavy, memory-intensive workload. So it may be
   worth running something like the above *while* some other memory-intensive
   benchmark kicks into gear in a loop. Another reason may be that test_bpf
   runs its tests sequentially instead of in parallel, so it would be good to
   get eBPF folks' feedback on what else could be done to really hammer the
   eBPF JIT on a system. That is why I had suggested long ago maybe a custom
   new selftest which stresses only the eBPF JIT, and had given an example
   multithreaded selftest.

   If we ever get modules support, I was hoping to see whether something like
   the below (if the fs module .text section does end up shared) *may*
   reduce the iTLB miss ratio.

   perf stat --repeat 10 -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-stores-misses,iTLB-loads,iTLB-load-misses,page-faults \
	--pre 'make -s mrproper defconfig' -- make -s -j$(nproc) bzImage

   Maybe something as simple as systemd-analyze may show a boot time reduction
   if the iTLB miss ratio is reduced by sharing most module code in a huge
   page.

2) Estimate in reduction on direct map fragmentation by using the "bpf
   prog pack" or this generalized solution:

   For this I'd expect a benchmark similar to the workload you guys run, or
   something memory intensive, to run while eBPF JITs are heavily used, and
   then, after a certain amount of time, somehow compute how fragmented
   memory is. The only sensible thing I can think of to measure memory
   fragmentation is to look at the memory compaction index in
   /sys/kernel/debug/extfrag/extfrag_index, but I highly welcome others'
   ideas as I'm a mm n00b.
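
   To make 1) and 2) concrete, a rough, combined sketch could look like the
   below. This is only illustrative; the background load, the loop count and
   the choice of test_bpf as the JIT hammer are assumptions on my part, not a
   recommendation:

	# keep a memory-intensive load running in the background
	stress-ng --vm 4 --vm-bytes 75% --timeout 30m &
	STRESS_PID=$!

	# snapshot fragmentation, hammer the JIT under perf, snapshot again
	cat /sys/kernel/debug/extfrag/extfrag_index > /tmp/extfrag.before
	perf stat -e iTLB-loads,iTLB-load-misses,dTLB-loads,dTLB-load-misses -- \
		sh -c 'for i in $(seq 50); do modprobe test_bpf; modprobe -r test_bpf; done'
	cat /sys/kernel/debug/extfrag/extfrag_index > /tmp/extfrag.after
	diff /tmp/extfrag.before /tmp/extfrag.after

	kill $STRESS_PID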

[0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

> +static void move_vmap_to_free_text_tree(void *addr)
> +{
> +	struct vmap_area *va;
> +
> +	/* remove from vmap_area_root */
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> +	if (WARN_ON_ONCE(!va)) {

This seems to be hopeful, but we are purposely allowing the allocated
memory to be used elsewhere, are we not? Can you trigger the WARN_ON_ONCE()
with stress-ng as described in commit 82dd23e84be3e ("mm/vmalloc.c:
preload a CPU with one object for split purpose") while using the eBPF JIT
or loading/unloading a module in a loop?
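
Something along these lines is what I have in mind; the exact stress-ng
knobs below are just a guess, not what that commit used:

	# keep the system under memory pressure (the condition that commit targets)
	stress-ng --vm 8 --vm-bytes 80% --timeout 10m &

	# ... while churning JIT allocations by (un)loading test_bpf in a loop
	while true; do
		modprobe test_bpf
		modprobe -r test_bpf
	done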

> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> +	struct vmap_area *va, *tmp;
> +	unsigned long addr;
> +	enum fit_type type;
> +	int ret;
> +
> +	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> +	if (unlikely(!va))
> +		return NULL;
> +
> +again:
> +	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> +	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> +	if (!tmp) {
> +		unsigned long alloc_size;
> +		void *ptr;
> +
> +		spin_unlock(&free_text_area_lock);
> +
> +		/*
> +		 * Not enough continuous space in free_text_area_root, try
> +		 * allocate more memory. The memory is first added to
> +		 * vmap_area_root, and then moved to free_text_area_root.
> +		 */
> +		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> +		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
> +					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
> +					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> +					   NUMA_NO_NODE, __builtin_return_address(0));

I thought Peter had suggested keeping the guard?

Once you get modules going you may want to update the comment about
modules on __vmalloc_node_range().

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-02 23:41   ` Luis Chamberlain
@ 2022-11-03 15:51     ` Mike Rapoport
  2022-11-03 18:59       ` Luis Chamberlain
  2022-11-07  6:40     ` Aaron Lu
  1 sibling, 1 reply; 29+ messages in thread
From: Mike Rapoport @ 2022-11-03 15:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, zhengjun.xing, kbusch, p.raghav,
	dave, vbabka, mgorman, willy, torvalds

Hi Luis,

Thanks for looping me in.

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > proposal by Peter in [1].
> 
> This is allg reat but we need to clarify *why* we would go through the
> trouble.  So if folks are not to excited about this series, that's
> probably why. IMHO it lacks substance for rationale, **and** implies a few
> gains without any *clear* performance metrics. I have 0 experience with
> mm so I'd like other's feedback on my this -- I'm just trying to do
> decipher rationale from prior "bpf prog pack" efforts.
> 
> I'm sensing that the cables in messaging are a bit crossed here and we need
> to provide a bit better full picture for rationale and this is being
> completely missed and this work is being undersold.  If my assessment is
> accurate though, the bpf prog pack strategy with sharing huge pages may prove
> useful long term for other things than just modules / ftrace / kprobes.
> 
> I was surprised to see this entire patch series upgrade from RFC to proper
> PATCH form now completely fails to mention any of the original motivations
> behind the "BPF prog pack", which you are doing a true heroic effort to try to
> generalize as the problem is hard. Let me try to help with that. The rationale
> for the old BPF prog pack is documented as follows:
> 
> * Most BPF programs are pretty small. Allocating a hole page for each
> * program is sometime a waste. Many small bpf program also adds pressure
> * to instruction TLB. To solve this issue, we introduce a BPF program pack
> * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> * to host BPF programs.
> 
> Previously you have also stated in earlier versions of this patch set:
> 
>   "Most BPF programs are small, but they consume a page each. For systems
>    with busy traffic and many BPF programs, this could also add significant
>    pressure to instruction TLB. High iTLB pressure usually causes slow down
>    for the whole system, which includes visible performance
>    degradation for production workloads."
> 
> So it is implied here that one of the benefits is to help reduce iTLB misses.
> But that's it. We have no visible numbers to look at and for what... But
> reducing iTLB misses doesn't always have a complete direct correlation
> with improving things, but if the code change is small enough it obviously
> makes sense to apply. If the change is a bit more intrusive, as in this
> patch series a bit more rationale should be provided.
> 
> Other than the "performance aspects" of your patchset, the *main* reason
> I am engaged and like it is it reduces the nasty mess of semantics on
> dealing with special permissions on pages which we see in modules and a
> few other places which today completely open code it. That proves error
> prone and I'm glad to see efforts to generalize that nastiness. So
> please ensure this is added as part of the documented rationale. Even
> if the iTLB miss ratio improvement is not astronomical I believe that
> the gains in sanity on improving semantics on special pages and sharing code
> make it well worthwhile. The iTLB miss ratio improvement is just a small
> cherry on top.
> 
> Going back to performance aspects, when Linus had poked for more details
> about this your have elaborated further:
> 
>   "we have seen direct map fragmentation causing visible
>    performance drop for our major services. This is the shadow 
>    production benchmark, so it is not possible to run it out of 
>    our data centers. Tracing showed that BPF program was the top 
>    trigger of these direct map splits."
> 
> And the only other metric we have is:
> 
>   "For our web service production benchmark, bpf_prog_pack on 4kB pages
>    gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> 
> These metrics are completely arbitrary and opaque to us. We need
> something tangible and reproducible and I have been suggesting that
> from early on...
> 
> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of the eBPF JIT programs and this in
> turn helps your overall random system performance (regardless of what
> it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the the eBPF
> JIT programs, so the "slow down" your team has obvserved should be due to the
> eventual fragmentation caused on the direct map *while* eBPF programs
> get heavily used.

I believe that while the eBPF prog pack is helpful in mitigation of the
direct map fragmentation caused by the eBPF JIT programs, the same strategy
of allocating a large page, splitting its PMD entry and then reusing the
memory for smaller allocations can be (and should be) generalized to other
use cases that require non-default permissions in the page table.  Most
prominent use cases are those that allocate memory for code, but the same
approach is relevant for other cases, like secretmem or page table
protection with PKS.

A while ago I suggested handling such caching of large pages at the
page allocator level, but when we discussed it at LSF/MM/BPF, the prevailing
opinion was that the added value does not justify changes to the page
allocator, and it was suggested to handle such caching elsewhere.

I had to put this project on a backburner for $VARIOUS_REASONS, but I still
think that we need a generic allocator for memory with non-default
permissions in the direct map and that code allocation should build on that
allocator.

All that said, the direct map fragmentation problem is currently relevant
only to x86 because it's the only architecture that supports splitting of
the large pages in the direct map.
 
> Mike Rapoport had presented about the Direct map fragmentation problem
> at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> help performance [1] ends up generally recommending huge pages. The work by Xing
> though was about using huge pages *alone*, not using a strategy such as in the
> "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> and that I think is the real golden nugget here.
> 
> I contend therefore that the theoretical reduction of iTLB misses by using
> huge pages for "bpf prog pack" is not what gets your systems to perform
> somehow better. It should be simply that it reduces fragmentation and
> *this* generally can help with performance long term. If this is accurate
> then let's please separate the two aspects to this.

The direct map fragmentation is the reason for a higher TLB miss rate, both
for iTLB and dTLB. Whenever a large page in the direct map is split, all
kernel accesses via the direct map will use small pages, which requires
dealing with 512 page table entries instead of one for a 2M range.

Since small pages in the direct map are never collapsed back to large
pages, a long-living system that heavily uses eBPF programs will end up with
its direct map severely fragmented, a higher TLB miss rate and worse overall
performance.

> There's two aspects to what I would like to see from a performance
> perspective then actually mentioned in the commit logs:
> 
> 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution
>    Vs not using it at all:

... 
 
> 2) Estimate in reduction on direct map fragmentation by using the "bpf
>    prog pack" or this generalized solution:
> 
>    For this I'd expect a benchmark similar to the workload you guys
>    run or something memory intensive, as eBPF JITs are heavily used,
>    and after a certain amount of time somehow compute how fragmented
>    memory is. The only sensible thing I can think to measure memory
>    fragmentation is to look at the memory compaction index
>    /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's
>    ideas as I'm a mm n00b.

The direct map fragmentation can be tracked with 

	grep DirectMap /proc/meminfo
	grep direct_map /proc/vmstat

and by looking at /sys/kernel/debug/page_tables/kernel
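
For example (just a sketch; substitute whatever JIT-heavy workload you have),
the deltas before and after a run show how much the direct map got split:

	grep DirectMap /proc/meminfo  > /tmp/directmap.before
	grep direct_map /proc/vmstat  > /tmp/splits.before
	# ... run the BPF/JIT-heavy workload here ...
	grep DirectMap /proc/meminfo  > /tmp/directmap.after
	grep direct_map /proc/vmstat  > /tmp/splits.after
	diff /tmp/directmap.before /tmp/directmap.after
	diff /tmp/splits.before /tmp/splits.after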
 
> [0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
> 
>   Luis

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-03 15:51     ` Mike Rapoport
@ 2022-11-03 18:59       ` Luis Chamberlain
  2022-11-03 21:19         ` Edgecombe, Rick P
  2022-11-07  6:58         ` Mike Rapoport
  0 siblings, 2 replies; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-03 18:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, zhengjun.xing, kbusch, p.raghav,
	dave, vbabka, mgorman, willy, torvalds, a.manzanares

On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> Hi Luis,
> 
> Thanks for looping me in.
> 
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > > proposal by Peter in [1].
> > 
> > This is allg reat but we need to clarify *why* we would go through the
> > trouble.  So if folks are not to excited about this series, that's
> > probably why. IMHO it lacks substance for rationale, **and** implies a few
> > gains without any *clear* performance metrics. I have 0 experience with
> > mm so I'd like other's feedback on my this -- I'm just trying to do
> > decipher rationale from prior "bpf prog pack" efforts.
> > 
> > I'm sensing that the cables in messaging are a bit crossed here and we need
> > to provide a bit better full picture for rationale and this is being
> > completely missed and this work is being undersold.  If my assessment is
> > accurate though, the bpf prog pack strategy with sharing huge pages may prove
> > useful long term for other things than just modules / ftrace / kprobes.
> > 
> > I was surprised to see this entire patch series upgrade from RFC to proper
> > PATCH form now completely fails to mention any of the original motivations
> > behind the "BPF prog pack", which you are doing a true heroic effort to try to
> > generalize as the problem is hard. Let me try to help with that. The rationale
> > for the old BPF prog pack is documented as follows:
> > 
> > * Most BPF programs are pretty small. Allocating a hole page for each
> > * program is sometime a waste. Many small bpf program also adds pressure
> > * to instruction TLB. To solve this issue, we introduce a BPF program pack
> > * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> > * to host BPF programs.
> > 
> > Previously you have also stated in earlier versions of this patch set:
> > 
> >   "Most BPF programs are small, but they consume a page each. For systems
> >    with busy traffic and many BPF programs, this could also add significant
> >    pressure to instruction TLB. High iTLB pressure usually causes slow down
> >    for the whole system, which includes visible performance
> >    degradation for production workloads."
> > 
> > So it is implied here that one of the benefits is to help reduce iTLB misses.
> > But that's it. We have no visible numbers to look at and for what... But
> > reducing iTLB misses doesn't always have a complete direct correlation
> > with improving things, but if the code change is small enough it obviously
> > makes sense to apply. If the change is a bit more intrusive, as in this
> > patch series a bit more rationale should be provided.
> > 
> > Other than the "performance aspects" of your patchset, the *main* reason
> > I am engaged and like it is it reduces the nasty mess of semantics on
> > dealing with special permissions on pages which we see in modules and a
> > few other places which today completely open code it. That proves error
> > prone and I'm glad to see efforts to generalize that nastiness. So
> > please ensure this is added as part of the documented rationale. Even
> > if the iTLB miss ratio improvement is not astronomical I believe that
> > the gains in sanity on improving semantics on special pages and sharing code
> > make it well worthwhile. The iTLB miss ratio improvement is just a small
> > cherry on top.
> > 
> > Going back to performance aspects, when Linus had poked for more details
> > about this your have elaborated further:
> > 
> >   "we have seen direct map fragmentation causing visible
> >    performance drop for our major services. This is the shadow 
> >    production benchmark, so it is not possible to run it out of 
> >    our data centers. Tracing showed that BPF program was the top 
> >    trigger of these direct map splits."
> > 
> > And the only other metric we have is:
> > 
> >   "For our web service production benchmark, bpf_prog_pack on 4kB pages
> >    gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> > 
> > These metrics are completely arbitrary and opaque to us. We need
> > something tangible and reproducible and I have been suggesting that
> > from early on...
> > 
> > I'm under the impression that the real missed, undocumented, major value-add
> > here is that the old "BPF prog pack" strategy helps to reduce the direct map
> > fragmentation caused by heavy use of the eBPF JIT programs and this in
> > turn helps your overall random system performance (regardless of what
> > it is you do). As I see it then the eBPF prog pack is just one strategy to
> > try to mitigate memory fragmentation on the direct map caused by the the eBPF
> > JIT programs, so the "slow down" your team has obvserved should be due to the
> > eventual fragmentation caused on the direct map *while* eBPF programs
> > get heavily used.
> 
> I believe that while the eBPF prog pack is helpful in mitigation of the
> direct map fragmentation caused by the eBPF JIT programs, the same strategy
> of allocating a large page, splitting its PMD entry and then reusing the
> memory for smaller allocations can be (and should be) generalized to other
> use cases that require non-default permissions in the page table.  Most
> prominent use cases are those that allocate memory for code, but the same
> approach is relevant for other cases, like secretmem or page table
> protection with PKS.
> 
> A while ago I've suggested to handle such caching of large pages at the
> page allocator level, but when we discussed it at LSF/MM/BPF, prevailing
> opinion was that added value does not justify changes to the page
> allocator and it was suggested to handle such caching elsewhere. 

I saw that in the LWN coverage.

> I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> think that we need a generic allocator for memory with non-default
> permissions in the direct map and that code allocation should build on that
> allocator.

It seems this generalization of the bpf prog pack to possibly be used
for modules / kprobes / ftrace is a small step in that direction.

> All that said, the direct map fragmentation problem is currently relevant
> only to x86 because it's the only architecture that supports splitting of
> the large pages in the direct map.

I was thinking even more long term too, using this as a proof of concept. If
this practice in general helps with fragmentation, it could later be used for
experimentation with compound pages, as a way to reduce possible
fragmentation.

> > Mike Rapoport had presented about the Direct map fragmentation problem
> > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> > evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> > help performance [1] ends up generally recommending huge pages. The work by Xing
> > though was about using huge pages *alone*, not using a strategy such as in the
> > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> > and that I think is the real golden nugget here.
> > 
> > I contend therefore that the theoretical reduction of iTLB misses by using
> > huge pages for "bpf prog pack" is not what gets your systems to perform
> > somehow better. It should be simply that it reduces fragmentation and
> > *this* generally can help with performance long term. If this is accurate
> > then let's please separate the two aspects to this.
> 
> The direct map fragmentation is the reason for higher TLB miss rate, both
> for iTLB and dTLB.

OK, so then whatever benchmark is running in tandem while the eBPF JIT is
hammered should *also* be measured with perf for iTLB and dTLB misses. I.e.,
the patch can provide such results as justification.

> Whenever a large page in the direct map is split, all
> kernel accesses via the direct map will use small pages which requires
> dealing with 512 page table entries instead of one for 2M range.
> 
> Since small pages in the direct map are never collapsed back to large
> pages, long living system that heavily uses eBPF programs will have its
> direct map severely fragmented, higher TLB miss rate and worse overall
> performance. 

Shouldn't compaction help with those situations?

> > There's two aspects to what I would like to see from a performance
> > perspective then actually mentioned in the commit logs:
> > 
> > 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution
> >    Vs not using it at all:
> 
> ... 
>  
> > 2) Estimate in reduction on direct map fragmentation by using the "bpf
> >    prog pack" or this generalized solution:
> > 
> >    For this I'd expect a benchmark similar to the workload you guys
> >    run or something memory intensive, as eBPF JITs are heavily used,
> >    and after a certain amount of time somehow compute how fragmented
> >    memory is. The only sensible thing I can think to measure memory
> >    fragmentation is to look at the memory compaction index
> >    /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's
> >    ideas as I'm a mm n00b.
> 
> The direct map fragmentation can be tracked with 
> 
> 	grep DirectMap /proc/meminfo
> 	grep direct_map /proc/vmstat
> 
> and by looking at /sys/kernel/debug/page_tables/kernel

Thanks!

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec()
  2022-11-02 22:21   ` Edgecombe, Rick P
@ 2022-11-03 21:03     ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-11-03 21:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-mm, bpf, hch, mcgrof, peterz, akpm, x86, Hansen, Dave

On Wed, Nov 2, 2022 at 3:22 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> > diff --git a/arch/x86/kernel/alternative.c
> > b/arch/x86/kernel/alternative.c
> > index 5cadcea035e0..73d89774ace3 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1270,6 +1270,18 @@ void *text_poke_copy(void *addr, const void
> > *opcode, size_t len)
> >         return addr;
> >  }
> >
> > +void *arch_vcopy_exec(void *dst, void *src, size_t len)
> > +{
> > +       if (text_poke_copy(dst, src, len) == NULL)
> > +               return ERR_PTR(-EINVAL);
> > +       return dst;
> > +}
>
> Except for this, there are no more users of text_poke_copy() right?
> Should it just be replaced with arch_vcopy_exec()?

I guess this is not really necessary, as we may have other use
cases for text_poke_copy(), and the text_poke_* calls make a good
API.

I won't object if folks agree removing text_poke_copy() for now is a
better approach.

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm
  2022-11-02 22:24   ` Edgecombe, Rick P
@ 2022-11-03 21:04     ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-11-03 21:04 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-mm, bpf, hch, mcgrof, peterz, akpm, x86, Hansen, Dave

On Wed, Nov 2, 2022 at 3:24 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> > Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
> > [round_up(_etext, 4kb), round_up(_etext, 2MB)] with
> > register_text_tail_vm
> > so that we can use this part of memory for dynamic kernel text (BPF
> > programs, etc.).
> >
> > Here is an example:
> >
> > [root@eth50-1 ~]# grep _etext /proc/kallsyms
> > ffffffff82202a08 T _etext
> >
> > [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> > ffffffff8220f920 t
> > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> > ffffffff8220fa28 t
> > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> > ffffffff8220fad4 t
> > bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> >
> > [root@eth50-1 ~]#  grep 0xffffffff82200000
> > /sys/kernel/debug/page_tables/kernel
> > 0xffffffff82200000-
> > 0xffffffff82400000     2M     ro   PSE         x  pmd
> >
> > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
> > and
> > bpf programs.
> >
> > Signed-off-by: Song Liu <song@kernel.org>
> > ---
> >  arch/x86/include/asm/pgtable_64_types.h | 1 +
> >  arch/x86/mm/init_64.c                   | 4 +++-
> >  include/linux/vmalloc.h                 | 4 ++++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable_64_types.h
> > b/arch/x86/include/asm/pgtable_64_types.h
> > index 04f36063ad54..c0f9cceb109a 100644
> > --- a/arch/x86/include/asm/pgtable_64_types.h
> > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > @@ -101,6 +101,7 @@ extern unsigned int ptrs_per_p4d;
> >  #define PUD_MASK     (~(PUD_SIZE - 1))
> >  #define PGDIR_SIZE   (_AC(1, UL) << PGDIR_SHIFT)
> >  #define PGDIR_MASK   (~(PGDIR_SIZE - 1))
> > +#define PMD_ALIGN(x) (((unsigned long)(x) + (PMD_SIZE - 1)) &
> > PMD_MASK)
> >
> >  /*
> >   * See Documentation/x86/x86_64/mm.rst for a description of the
> > memory map.
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 3f040c6e5d13..5b42fc0c6099 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1373,7 +1373,7 @@ void mark_rodata_ro(void)
> >       unsigned long start = PFN_ALIGN(_text);
> >       unsigned long rodata_start = PFN_ALIGN(__start_rodata);
> >       unsigned long end = (unsigned long)__end_rodata_hpage_align;
> > -     unsigned long text_end = PFN_ALIGN(_etext);
> > +     unsigned long text_end = PMD_ALIGN(_etext);
> >       unsigned long rodata_end = PFN_ALIGN(__end_rodata);
> >       unsigned long all_end;
>
> Check out is_errata93(). Right now it assumes all text is between text-
> etext and MODULES_VADDR-MODULES_END. It's a quite old errata, but it
> would be nice if we had a is_text_addr() helper or something. To help
> keep track of the places where text might pop up.
>
> Speaking of which, it might be nice to update
> Documentation/x86/x86_64/mm.rst with some hints that this area exists.
>
> >
> > @@ -1414,6 +1414,8 @@ void mark_rodata_ro(void)
> >                               (void *)rodata_end, (void *)_sdata);
> >
> >       debug_checkwx();
> > +     register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
> > +                           PMD_ALIGN((unsigned long)_etext));
> >  }
> >
> >  int kern_addr_valid(unsigned long addr)
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 9b2042313c12..7365cf9c4e7f 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -132,11 +132,15 @@ extern void vm_unmap_aliases(void);
> >  #ifdef CONFIG_MMU
> >  extern void __init vmalloc_init(void);
> >  extern unsigned long vmalloc_nr_pages(void);
> > +void register_text_tail_vm(unsigned long start, unsigned long end);
> >  #else
> >  static inline void vmalloc_init(void)
> >  {
> >  }
> >  static inline unsigned long vmalloc_nr_pages(void) { return 0; }
> > +void register_text_tail_vm(unsigned long start, unsigned long end)
> > +{
> > +}
> >  #endif
>
> This looks like it should be in the previous patch.

Good catch! I will fix it in the next version.

Thanks,
Song

>
> >
> >  extern void *vmalloc(unsigned long size) __alloc_size(1);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs
  2022-11-02 22:29 ` Edgecombe, Rick P
@ 2022-11-03 21:13   ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-11-03 21:13 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-mm, bpf, hch, mcgrof, peterz, akpm, x86, Hansen, Dave

On Wed, Nov 2, 2022 at 3:30 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> > This set enables bpf programs and bpf dispatchers to share huge pages
> > with
> > new API:
> >   vmalloc_exec()
> >   vfree_exec()
> >   vcopy_exec()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > vmalloc_exec() manages a set of PMD_SIZE RO+X memory, and allocates
> > these
> > memory to its users. vfree_exec() is used to free memory allocated by
> > vmalloc_exec(). vcopy_exec() is used to update memory allocated by
> > vmalloc_exec().
> >
> > Memory allocated by vmalloc_exec() is RO+X, so this doesnot violate
> > W^X.
> > The caller has to update the content with text_poke like mechanism.
> > Specifically, vcopy_exec() is provided to update memory allocated by
> > vmalloc_exec(). vcopy_exec() also makes sure the update stays in the
> > boundary of one chunk allocated by vmalloc_exec(). Please refer to
> > patch
> > 1/5 for more details of
> >
> > Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> >
> > Patch 4/5 and 5/5 allows static kernel text (_stext to _etext) to
> > share
> > PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved
> > by
> > allocating PMD_SIZE pages to roundup(_etext, PMD_SIZE), and then use
> > _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
>
> It might help to spell out what the benefits of this are. My
> understanding is that (to my surprise) we actually haven't seen a
> performance improvement with using 2MB pages for JITs. The main
> performance benefit you saw on your previous version was from reduced
> fragmentation of the direct map IIUC. This was from the effect of
> reusing the same pages for JITs so that new ones don't need to be
> broken.
>
> The other benefit of this thing is reduced shootdowns. It can load a
> JIT with about only a local TLB flush on average, which should help
> really high cpu systems some unknown amount.

Thanks for pointing out the missing information.

I don't have a benchmark that uses very big BPF programs, so the
results I have don't show much benefit from fewer iTLB misses.

Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-03 18:59       ` Luis Chamberlain
@ 2022-11-03 21:19         ` Edgecombe, Rick P
  2022-11-03 21:41           ` Song Liu
  2022-11-04  0:18           ` Luis Chamberlain
  2022-11-07  6:58         ` Mike Rapoport
  1 sibling, 2 replies; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-11-03 21:19 UTC (permalink / raw)
  To: rppt, mcgrof
  Cc: p.raghav, peterz, bpf, dave, willy, linux-mm, song, hch, vbabka,
	zhengjun.xing, x86, akpm, Torvalds, Linus, Hansen, Dave, kbusch,
	mgorman, a.manzanares

On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > Mike Rapoport had presented about the Direct map fragmentation
> > > problem
> > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > ftrace /
> > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > performance
> > > evaluation on whether using 2M/1G pages aggressively for the
> > > kernel direct map
> > > help performance [1] ends up generally recommending huge pages.
> > > The work by Xing
> > > though was about using huge pages *alone*, not using a strategy
> > > such as in the
> > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > programs,
> > > and that I think is the real golden nugget here.
> > > 
> > > I contend therefore that the theoretical reduction of iTLB misses
> > > by using
> > > huge pages for "bpf prog pack" is not what gets your systems to
> > > perform
> > > somehow better. It should be simply that it reduces fragmentation
> > > and
> > > *this* generally can help with performance long term. If this is
> > > accurate
> > > then let's please separate the two aspects to this.
> > 
> > The direct map fragmentation is the reason for higher TLB miss
> > rate, both
> > for iTLB and dTLB.
> 
> OK so then whatever benchmark is running in tandem as eBPF JIT is
> hammered
> should *also* be measured with perf for iTLB and dTLB. ie, the patch
> can
> provide such results as a justifications.

Song had done some tests on the old prog pack version that to me seemed
to indicate most (or possibly all) of the benefit was direct map
fragmentation reduction. This was surprised me, since 2MB kernel text
has shown to be beneficial.

Otherwise +1 to all these comments. This should be clear about what the
benefits are. I would add that this is also much nicer about TLB
shootdowns than the existing way of loading text, and it saves some memory.

So I think there are sort of four areas of improvements:
1. Direct map fragmentation reduction (dTLB miss improvements). This
sort of does it as a side effect in this series, and the solution Mike
is talking about is a more general, probably better one.
2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
solution for this, but surprisingly it doesn't seem to be useful for
JITs. (modules testing TBD)
3. Loading text to reused allocation with per-cpu mappings. This
reduces TLB shootdowns, which are a short term load and teardown time
performance drag. My understanding is this is more of a problem on
bigger systems with many CPUs. This series does a decent job at this,
but the solution is not compatible with modules. Maybe ok since modules
don't load as often as JITs.
4. Having BPF progs share pages. This saves memory. This series could
probably easily get a number for how much.
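
For 4., a back-of-the-envelope number could probably be pulled from bpftool.
This is only a sketch and assumes the usual "jited NNNB" line in the
plain-text output; I haven't verified it:

	# sum JITed image sizes of loaded progs vs. one full 4k page per prog
	bpftool prog show | awk '
		{ for (i = 1; i <= NF; i++)
			if ($i == "jited") { gsub("B", "", $(i+1)); sum += $(i+1); n++ } }
		END { printf "%d progs, %d JITed bytes vs %d bytes at a page each\n",
			n, sum, n * 4096 }'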


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-03 21:19         ` Edgecombe, Rick P
@ 2022-11-03 21:41           ` Song Liu
  2022-11-03 23:33             ` Luis Chamberlain
  2022-11-04  0:18           ` Luis Chamberlain
  1 sibling, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-11-03 21:41 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, mcgrof, p.raghav, peterz, bpf, dave, willy, linux-mm, hch,
	vbabka, zhengjun.xing, x86, akpm, Torvalds, Linus, Hansen, Dave,
	kbusch, mgorman, a.manzanares

On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > problem
> > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > ftrace /
> > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > performance
> > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > kernel direct map
> > > > help performance [1] ends up generally recommending huge pages.
> > > > The work by Xing
> > > > though was about using huge pages *alone*, not using a strategy
> > > > such as in the
> > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > programs,
> > > > and that I think is the real golden nugget here.
> > > >
> > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > by using
> > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > perform
> > > > somehow better. It should be simply that it reduces fragmentation
> > > > and
> > > > *this* generally can help with performance long term. If this is
> > > > accurate
> > > > then let's please separate the two aspects to this.
> > >
> > > The direct map fragmentation is the reason for higher TLB miss
> > > rate, both
> > > for iTLB and dTLB.
> >
> > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > hammered
> > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > can
> > provide such results as a justifications.
>
> Song had done some tests on the old prog pack version that to me seemed
> to indicate most (or possibly all) of the benefit was direct map
> fragmentation reduction. This was surprised me, since 2MB kernel text
> has shown to be beneficial.
>
> Otherwise +1 to all these comments. This should be clear about what the
> benefits are. I would add, that this is also much nicer about TLB
> shootdowns than the existing way of loading text and saves some memory.
>
> So I think there are sort of four areas of improvements:
> 1. Direct map fragmentation reduction (dTLB miss improvements). This
> sort of does it as a side effect in this series, and the solution Mike
> is talking about is a more general, probably better one.
> 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
> solution for this, but surprisingly it doesn't seem to be useful for
> JITs. (modules testing TBD)
> 3. Loading text to reused allocation with per-cpu mappings. This
> reduces TLB shootdowns, which are a short term load and teardown time
> performance drag. My understanding is this is more of a problem on
> bigger systems with many CPUs. This series does a decent job at this,
> but the solution is not compatible with modules. Maybe ok since modules
> don't load as often as JITs.
> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.
>

Hi Luis, Rick, and Mike,

Thanks a lot for helping me organize this information. Totally agree with
all these comments. I will add more data to the next revision.

Besides the motivation improvement, could you please also share your
comments on:
1. The logic/design of the vmalloc_exec() et. al. APIs;
2. The naming of these functions. Does execmem_[alloc|free|fill|cpy]
  (as suggested by Christoph) sound good?

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-03 21:41           ` Song Liu
@ 2022-11-03 23:33             ` Luis Chamberlain
  0 siblings, 0 replies; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-03 23:33 UTC (permalink / raw)
  To: Song Liu
  Cc: Edgecombe, Rick P, rppt, p.raghav, peterz, bpf, dave, willy,
	linux-mm, hch, vbabka, zhengjun.xing, x86, akpm, Torvalds, Linus,
	Hansen, Dave, kbusch, mgorman, a.manzanares

On Thu, Nov 03, 2022 at 02:41:42PM -0700, Song Liu wrote:
> On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> Besides the motivation improvement, could you please also share your
> comments on:
> 1. The logic/design of the vmalloc_exec() et. al. APIs;

I've provided the feedback that I can so far, as I'm new to mm.
The best I can do is provide a better rationale so that mm folks
can understand the motivation.

> 2. The naming of these functions. Does  execmem_[alloc|free|fill|cpy]
>   (as suggested by Chritoph) sound good?

That seems sensible.

There may be other users later too, secmm_alloc() for instance. So we
just gotta keep that in mind.

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-03 21:19         ` Edgecombe, Rick P
  2022-11-03 21:41           ` Song Liu
@ 2022-11-04  0:18           ` Luis Chamberlain
  2022-11-04  3:29             ` Luis Chamberlain
  1 sibling, 1 reply; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-04  0:18 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, p.raghav, peterz, bpf, dave, willy, linux-mm, song, hch,
	vbabka, zhengjun.xing, x86, akpm, Torvalds, Linus, Hansen, Dave,
	kbusch, mgorman, a.manzanares

On Thu, Nov 03, 2022 at 09:19:25PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > problem
> > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > ftrace /
> > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > performance
> > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > kernel direct map
> > > > help performance [1] ends up generally recommending huge pages.
> > > > The work by Xing
> > > > though was about using huge pages *alone*, not using a strategy
> > > > such as in the
> > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > programs,
> > > > and that I think is the real golden nugget here.
> > > > 
> > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > by using
> > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > perform
> > > > somehow better. It should be simply that it reduces fragmentation
> > > > and
> > > > *this* generally can help with performance long term. If this is
> > > > accurate
> > > > then let's please separate the two aspects to this.
> > > 
> > > The direct map fragmentation is the reason for higher TLB miss
> > > rate, both
> > > for iTLB and dTLB.
> > 
> > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > hammered
> > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > can
> > provide such results as a justifications.
> 
> Song had done some tests on the old prog pack version that to me seemed
> to indicate most (or possibly all) of the benefit was direct map
> fragmentation reduction.

That matches my observations, but I also provided quite a few hints as to
*why* I think that is. I suggested lib/test_kmod.c as an example of a beefy
multithreaded selftest which really kicks the hell out of the kernel
with whatever crap you want to run. That is precisely how I uncovered
an odd kmod bug that had been lingering for years.

> This was surprised me, since 2MB kernel text
> has shown to be beneficial.
> 
> Otherwise +1 to all these comments. This should be clear about what the
> benefits are. I would add, that this is also much nicer about TLB
> shootdowns than the existing way of loading text and saves some memory.
> 
> So I think there are sort of four areas of improvements:
> 1. Direct map fragmentation reduction (dTLB miss improvements).

The dTLB gains should be on the benchmark which runs in tandem with the
eBPF-JIT-monster-selftest, not on the eBPF-JIT-monster-selftest itself, right?

> This
> sort of does it as a side effect in this series, and the solution Mike
> is talking about is a more general, probably better one.
> 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
> solution for this, but surprisingly it doesn't seem to be useful for
> JITs. (modules testing TBD)

Yes I'm super eager to get this tested. In fact I wonder if one can
boot Linux with less memory too...

> 3. Loading text to reused allocation with per-cpu mappings. This
> reduces TLB shootdowns, which are a short term load and teardown time
> performance drag. My understanding is this is more of a problem on
> bigger systems with many CPUs. This series does a decent job at this,
> but the solution is not compatible with modules. Maybe ok since modules
> don't load as often as JITs.

There are some tests, like fstests, which make heavy use of module
removal. But as a side effect, I do indeed like to reboot to have
a fresh system before running fstests. I guess fstests should also run
with heavily fragmented memory, as a corner case.

> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.

Once that does hit modules / kprobes / ftrace, the impact is obviously much,
much greater.

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-04  0:18           ` Luis Chamberlain
@ 2022-11-04  3:29             ` Luis Chamberlain
  0 siblings, 0 replies; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-04  3:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, p.raghav, peterz, bpf, dave, willy, linux-mm, song, hch,
	vbabka, zhengjun.xing, x86, akpm, Torvalds, Linus, Hansen, Dave,
	kbusch, mgorman, a.manzanares

On Thu, Nov 03, 2022 at 05:18:51PM -0700, Luis Chamberlain wrote:
> On Thu, Nov 03, 2022 at 09:19:25PM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > > problem
> > > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > > ftrace /
> > > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > > performance
> > > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > > kernel direct map
> > > > > help performance [1] ends up generally recommending huge pages.
> > > > > The work by Xing
> > > > > though was about using huge pages *alone*, not using a strategy
> > > > > such as in the
> > > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > > programs,
> > > > > and that I think is the real golden nugget here.
> > > > > 
> > > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > > by using
> > > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > > perform
> > > > > somehow better. It should be simply that it reduces fragmentation
> > > > > and
> > > > > *this* generally can help with performance long term. If this is
> > > > > accurate
> > > > > then let's please separate the two aspects to this.
> > > > 
> > > > The direct map fragmentation is the reason for higher TLB miss
> > > > rate, both
> > > > for iTLB and dTLB.
> > > 
> > > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > > hammered
> > > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > > can
> > > provide such results as a justifications.
> > 
> > Song had done some tests on the old prog pack version that to me seemed
> > to indicate most (or possibly all) of the benefit was direct map
> > fragmentation reduction.
> 
> Matches my observations but I also provided quite a bit of hints as to
> *why* I think that is. I suggested lib/test_kmod.c as an example beefy
> multithreaded selftests which really kicks the hell out of the kernel
> with whatever crap you want to run. That is precicely how I uncovered
> some odd kmod bug lingering for years.

*and*, *perhaps*... it may be that you need another memory intensive benchmark
to run in tandem, one which mimics the behaviour of the internal "shadow
production benchmark", whatever that is.

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-02 23:41   ` Luis Chamberlain
  2022-11-03 15:51     ` Mike Rapoport
@ 2022-11-07  6:40     ` Aaron Lu
  2022-11-07 17:39       ` Luis Chamberlain
  2022-11-07 18:30       ` Song Liu
  1 sibling, 2 replies; 29+ messages in thread
From: Aaron Lu @ 2022-11-07  6:40 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, rppt, zhengjun.xing, kbusch,
	p.raghav, dave, vbabka, mgorman, willy, torvalds, Hyeonggon Yoo

Hello,

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:

... ...

> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of the eBPF JIT programs and this in
> turn helps your overall random system performance (regardless of what
> it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the the eBPF
> JIT programs, so the "slow down" your team has obvserved should be due to the
> eventual fragmentation caused on the direct map *while* eBPF programs
> get heavily used.
> 
> Mike Rapoport had presented about the Direct map fragmentation problem
> at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> help performance [1] ends up generally recommending huge pages. The work by Xing
> though was about using huge pages *alone*, not using a strategy such as in the
> "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> and that I think is the real golden nugget here.

I'm interested in how this patchset (further) improves direct map
fragmentation, so I would like to evaluate it to see if my previous work to
merge small mappings back in the architecture layer [1] is still necessary.

I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed,
so I took one step back and evaluated the existing bpf_prog_pack. I'm
aware that this patchset would make things even better by using order-9
pages to back the vmalloced range.

I used the sample bpf prog samples/bpf/sockex1 because it looks easy to
run; feel free to let me know a better way to evaluate this.
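
A round of N concurrent instances looks roughly like the below (illustrative
sketch only; the exact invocation and the sleep are my shorthand):

	grep Direct /proc/meminfo                    # snapshot before
	for i in $(seq 512); do
		./samples/bpf/sockex1 > /dev/null &  # each instance JITs its own copy
	done
	sleep 10
	grep Direct /proc/meminfo                    # snapshot after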

- In kernels before bpf_prog_pack (v5.17 and earlier), this prog would
cause 3 pages to change protection from RW+NX to RO+X; and if the three
pages are far apart, each would cause a level 3 split and then a level 2
split. In reality, allocated pages tend to stay close physically, so the
actual result will not be this bad.

- After bpf_prog_pack, loading this prog will most likely require
no new page protection change as long as the existing pack pool has space
for it (the common case). The actual space required for this bpf prog that
needs special protection is 64 * 2 + 192 bytes; it would take 6144 such
progs to use up the cache and trigger another 2MB alloc, which can
potentially cause a direct map split. 6144 seems like a pretty large number
to me, so I think the direct map splits due to bpf are greatly improved
(if not totally solved).

Here are test results on a 8G x86_64 VM.
(on x86_64, 4k is PTE mapping, 2M is PMD mapping and 1G is PUD mapping)
- v5.17
1) right after boot
$ grep Direct /proc/meminfo
DirectMap4k:       87900 kB
DirectMap2M:     5154816 kB
DirectMap1G:     5242880 kB

2) after running 512 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:      462684 kB
DirectMap2M:     4780032 kB
DirectMap1G:     5242880 kB
The PUD mappings survived; some PMD mappings were split.

3) after running 1024 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:      884572 kB
DirectMap2M:     6455296 kB
DirectMap1G:     3145728 kB
2 PUD mappings and some PMD mappings were split.

4) after running 2048 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:     1654620 kB
DirectMap2M:     6733824 kB
DirectMap1G:     2097152 kB
Another PUD mapping and some PMD mappings were split.
At the end, 2 PUD mappings survived.

- v6.1-rc3
The direct map numbers don't change, whether right after boot or after
running 512/1024/2048 sockex1 instances. I also tried running 5120
instances, but that consumed all available memory at about 4000 instances,
and even then the direct map numbers didn't change, i.e. not a single split
happened. This is understandable because, as I calculated above, it would
take 6144 such progs to cause another alloc/split. Here are its numbers:
$ grep Direct /proc/meminfo
DirectMap4k:       22364 kB
DirectMap2M:     3123200 kB
DirectMap1G:     7340032 kB

Consider that a production system will have its memory mostly consumed, and
when pages are allocated they can be farther apart than on a system right
after boot, causing more mapping splits. So I also tested by first consuming
all memory with page cache (by reading some sparse files) and then running
this test.
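
Filling the page cache that way can be as simple as the below (sizes are
arbitrary, just enough to consume most of this 8G VM; the files must live on
a disk-backed filesystem, not tmpfs):

	for i in $(seq 4); do
		truncate -s 2G /var/tmp/sparse.$i        # create a sparse file
		cat /var/tmp/sparse.$i > /dev/null       # read it to populate the page cache
	done
	grep -E 'MemFree|^Cached' /proc/meminfo          # most memory should now be page cache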

I expect v5.17 to do worse this time. Here it is:

- v5.17
1) right after boot
$ grep Direct /proc/meminfo
DirectMap4k:       94044 kB
DirectMap2M:     4100096 kB
DirectMap1G:     6291456 kB
More mappings are in PUD for this boot.

2) after running 512 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:      538460 kB
DirectMap2M:     7849984 kB
DirectMap1G:     2097152 kB
4 PUD mappings and some PMD mappings were split this time, more than
last time.

3) after running 1024 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:     1083228 kB
DirectMap2M:     7305216 kB
DirectMap1G:     2097152 kB
Some PMD mappings split.

4) after running 2048 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:     2340700 kB
DirectMap2M:     6047744 kB
DirectMap1G:     2097152 kB
The end result is about the same as before.

- v6.1-rc3
There is no difference because I can't trigger another pack alloc before
the system is OOMed.

Conclusion: I think bpf_prog_pack is very good at reducing direct map
fragmentation, and this patchset can further improve the situation on
large machines (with huge amounts of memory), or with more large bpf progs
loaded, etc.

Some imperfect things I can think of (not related to this patchset) are:
1 Once a split has happened, it is permanent. This may not be a big deal
now with bpf_prog_pack and this patchset, because the need to allocate a
new order-9 page, and thus cause a potential split, should arise much, much
less often;
2 When a new order-9 page has to be allocated, there is no way to tell
the allocator to allocate this order-9 page from an already split PUD
range, to avoid another PUD mapping split;
3 As Mike and others have mentioned, there are other users that can also
cause direct map splits.

[1]: https://lore.kernel.org/lkml/20220808145649.2261258-1-aaron.lu@intel.com/

Regards,
Aaron

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-03 18:59       ` Luis Chamberlain
  2022-11-03 21:19         ` Edgecombe, Rick P
@ 2022-11-07  6:58         ` Mike Rapoport
  2022-11-07 17:26           ` Luis Chamberlain
  1 sibling, 1 reply; 29+ messages in thread
From: Mike Rapoport @ 2022-11-07  6:58 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, zhengjun.xing, kbusch, p.raghav,
	dave, vbabka, mgorman, willy, torvalds, a.manzanares

On Thu, Nov 03, 2022 at 11:59:48AM -0700, Luis Chamberlain wrote:
> On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> 
> > I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> > think that we need a generic allocator for memory with non-default
> > permissions in the direct map and that code allocation should build on that
> > allocator.
> 
> It seems this generalization of the bpf prog pack to possibly be used
> for modules / kprobes / ftrace is a small step in that direction.
> 
> > All that said, the direct map fragmentation problem is currently relevant
> > only to x86 because it's the only architecture that supports splitting of
> > the large pages in the direct map.
> 
> I was thinking even more long term too, using this as a proof of concept. If
> this practice in general helps with fragmentation, could it be used for
> experimentation with compound pages later, as a way to reduce possible
> fragmentation.

As Rick already mentioned, these patches help with the direct map
fragmentation only indirectly. With these patches memory is freed in
PMD_SIZE chunks, so the changes to the direct map in
vm_remove_mappings() happen in PMD_SIZE units, and this is pretty much
the only effect of this series on the direct map layout.

A bit unrelated, but I'm wondering now if we want to have the direct map
alias of the pages allocated for code also to be read-only...

> > Whenever a large page in the direct map is split, all
> > kernel accesses via the direct map will use small pages, which requires
> > dealing with 512 page table entries instead of one for a 2M range.
> > 
> > Since small pages in the direct map are never collapsed back to large
> > pages, a long-living system that heavily uses eBPF programs will have its
> > direct map severely fragmented, a higher TLB miss rate and worse overall
> > performance.
> 
> Shouldn't compaction help with those situations?

Compaction helps to reduce fragmentation of the physical memory: it tries
to bring free physical pages next to each other to create large contiguous
chunks, but it does not change the virtual addresses that the users of the
underlying data see.

Changing the permissions of a small page in the direct map causes a
"discontinuity" in the virtual space. E.g. if we have 2M mapped RW with a
single PMD, changing several pages in the middle of those 2M to R+X will
require remapping that range with 512 PTEs.
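
To make that concrete, here is a minimal sketch (purely illustrative, not
code from this series or from the kernel sources; the split_example()
helper is hypothetical) of what triggers such a split on x86. The
set_memory_*() calls take a linear address and a number of pages:

	#include <linux/gfp.h>
	#include <linux/set_memory.h>

	static void split_example(void)
	{
		/*
		 * 4 pages from the page allocator; their linear addresses
		 * live in the direct map, normally covered by a 2M (or 1G)
		 * mapping.
		 */
		unsigned long addr = __get_free_pages(GFP_KERNEL, 2);

		if (!addr)
			return;

		/*
		 * The new protection of just these 4 pages can no longer be
		 * expressed by the covering PMD, so the 2M mapping has to be
		 * replaced by 512 4k PTEs.
		 */
		set_memory_ro(addr, 4);
		set_memory_x(addr, 4);

		/* Restoring the protection does not merge the mapping back. */
		set_memory_nx(addr, 4);
		set_memory_rw(addr, 4);
		free_pages(addr, 2);
	}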
 
> Thanks!
> 
>   Luis

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-07  6:58         ` Mike Rapoport
@ 2022-11-07 17:26           ` Luis Chamberlain
  0 siblings, 0 replies; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-07 17:26 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, zhengjun.xing, kbusch, p.raghav,
	dave, vbabka, mgorman, willy, torvalds, a.manzanares

On Mon, Nov 07, 2022 at 08:58:17AM +0200, Mike Rapoport wrote:
> On Thu, Nov 03, 2022 at 11:59:48AM -0700, Luis Chamberlain wrote:
> > On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> > 
> > > I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> > > think that we need a generic allocator for memory with non-default
> > > permissions in the direct map and that code allocation should build on that
> > > allocator.
> > 
> > It seems this generalization of the bpf prog pack to possibly be used
> > for modules / kprobes / ftrace is a small step in that direction.
> > 
> > > All that said, the direct map fragmentation problem is currently relevant
> > > only to x86 because it's the only architecture that supports splitting of
> > > the large pages in the direct map.
> > 
> > I was thinking even more long term too, using this as a proof of concept. If
> > this practice in general helps with fragmentation, could it be used for
> > experimentation with compound pages later, as a way to reduce possible
> > fragmentation.
> 
> As Rick already mentioned, these patches help with the direct map
> fragmentation only indirectly. With these patches memory is freed in
> PMD_SIZE chunks, so the changes to the direct map in
> vm_remove_mappings() happen in PMD_SIZE units, and this is pretty much
> the only effect of this series on the direct map layout.

I understand that is what *this* series does. I was wondering whether a
similar scheme may be worth studying to see if it helps with aggregating,
say, something like 32 x 64 KiB (order-4) compound page allocations, as a
way to address possible fragmentation concerns in a future world where
that may be useful (completely unrelated to page permissions).

> A bit unrelated, but I'm wondering now if we want to have the direct map
> alias of the pages allocated for code also to be read-only...
> 
> > > Whenever a large page in the direct map is split, all
> > > kernel accesses via the direct map will use small pages, which requires
> > > dealing with 512 page table entries instead of one for a 2M range.
> > > 
> > > Since small pages in the direct map are never collapsed back to large
> > > pages, a long-living system that heavily uses eBPF programs will have its
> > > direct map severely fragmented, a higher TLB miss rate and worse overall
> > > performance.
> > 
> > Shouldn't compaction help with those situations?
> 
> Compaction helps to reduce fragmentation of the physical memory: it tries
> to bring free physical pages next to each other to create large contiguous
> chunks, but it does not change the virtual addresses that the users of the
> underlying data see.

Sorry, I understood that 'bpf prog pack' only used *one* 2 MiB huge
page for *all* eBPF JIT programs, and so I thought the fragmentation issue
prior to 'bpf prog pack' was the fact that, as long as eBPF programs are
still alive, we have fragmentation. In the world without 'bpf prog pack'
I thought no huge pages were used.

> Changing the permissions of a small page in the direct map causes a
> "discontinuity" in the virtual space. E.g. if we have 2M mapped RW with a
> single PMD, changing several pages in the middle of those 2M to R+X will
> require remapping that range with 512 PTEs.

Makes sense, thanks.

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-07  6:40     ` Aaron Lu
@ 2022-11-07 17:39       ` Luis Chamberlain
  2022-11-07 18:35         ` Song Liu
  2022-11-07 18:30       ` Song Liu
  1 sibling, 1 reply; 29+ messages in thread
From: Luis Chamberlain @ 2022-11-07 17:39 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, rppt, zhengjun.xing, kbusch,
	p.raghav, dave, vbabka, mgorman, willy, torvalds, Hyeonggon Yoo

On Mon, Nov 07, 2022 at 02:40:14PM +0800, Aaron Lu wrote:
> Hello,
> 
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> 
> ... ...
> 
> > I'm under the impression that the real missed, undocumented, major value-add
> > here is that the old "BPF prog pack" strategy helps to reduce the direct map
> > fragmentation caused by heavy use of the eBPF JIT programs and this in
> > turn helps your overall random system performance (regardless of what
> > it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the eBPF
> JIT programs, so the "slow down" your team has observed should be due to the
> > eventual fragmentation caused on the direct map *while* eBPF programs
> > get heavily used.
> > 
> > Mike Rapoport had presented about the Direct map fragmentation problem
> > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> > evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> helps performance [1] ends up generally recommending huge pages. The work by Xing
> > though was about using huge pages *alone*, not using a strategy such as in the
> > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> > and that I think is the real golden nugget here.
> 
> I'm interested in how this patchset (further) reduces direct map
> fragmentation, so I would like to evaluate it to see if my previous work to
> merge small mappings back in the architecture layer [1] is still necessary.

You gotta apply it to 6.0.5, which had a large change go in for eBPF
that was not present in 6.0.

> Conclusion: I think bpf_prog_pack is very good at reducing direct map
> fragmentation, and this patchset can further improve the situation on
> large machines (with huge amounts of memory) or with more/larger bpf progs
> loaded, etc.

Fantastic. Thanks for the analysis, so yet another set of metrics which
I'd hope can be applied to this patch set as this effort is generalized.

Now imagine the effort taking modules / ftrace / kprobes into consideration as well.

> Some imperfect things I can think of are (not related to this patchset):
> 1 Once a split has happened, it stays split. This may not be a big deal
> now with bpf_prog_pack and this patchset because the need to allocate a
> new order-9 page and thus cause a potential split should happen much, much
> less often;

Not sure I follow, are you suggesting an order-9 (512 bytes) allocation would
trigger a split of the reserved, say, 2 MiB huge page?

> 2 When a new order-9 page has to be allocated, there is no way to tell
> the allocator to allocate this order-9 page from an already split PUD
> range to avoid another PUD mapping split;
> 3 As Mike and others have mentioned, there are other users that can also
> cause direct map splits.

Hence the effort to generalize.

  Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-07  6:40     ` Aaron Lu
  2022-11-07 17:39       ` Luis Chamberlain
@ 2022-11-07 18:30       ` Song Liu
  1 sibling, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-11-07 18:30 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Luis Chamberlain, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, rppt, zhengjun.xing, kbusch,
	p.raghav, dave, vbabka, mgorman, willy, torvalds, Hyeonggon Yoo

Hi Aaron,

On Sun, Nov 6, 2022 at 10:40 PM Aaron Lu <aaron.lu@intel.com> wrote:
>
[...]
> > and that I think is the real golden nugget here.
>
> I'm interested in how this patchset (further) reduces direct map
> fragmentation, so I would like to evaluate it to see if my previous work to
> merge small mappings back in the architecture layer [1] is still necessary.
>
> I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed,
> so I took one step back and evaluated the existing bpf_prog_pack. I'm
> aware this patchset would make things even better by using order-9
> pages to back the vmalloced range.

The patchset was based on bpf-next tree:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/

>
> I used the sample bpf prog samples/bpf/sockex1 because it looks easy to
> run; feel free to let me know a better way to evaluate this.
>
> - In kernels before bpf_prog_pack (v5.17 and earlier), this prog would
> cause 3 pages to change protection to RO+X from RW+NX; and if the three
> pages are far apart, each would cause a level 3 split and then a level 2
> split. In reality, allocated pages tend to stay close physically, so the
> actual result will not be this bad.
[...]
>
> - v6.1-rc3
> There is no difference because I can't trigger another pack alloc before
> system is OOMed.
>
> Conclusion: I think bpf_prog_pack is very good at reducing direct map
> fragmentation, and this patchset can further improve the situation on
> large machines (with huge amounts of memory) or with more/larger bpf progs
> loaded, etc.

Thanks a lot for these experiments! I will include the data in the next
version of the set.

>
> Some imperfect things I can think of are (not related to this patchset):
> 1 Once a split has happened, it stays split. This may not be a big deal
> now with bpf_prog_pack and this patchset because the need to allocate a
> new order-9 page and thus cause a potential split should happen much, much
> less often;

I think we will need to regroup the direct map in some scenarios, but I
am not aware of such workloads.

> 2 When a new order-9 page has to be allocated, there is no way to tell
> the allocator to allocate this order-9 page from an already split PUD
> range to avoid another PUD mapping split;

This would be a good improvement.

> 3 As Mike and others have mentioned, there are other users that can also
> cause direct map splits.

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  2022-11-07 17:39       ` Luis Chamberlain
@ 2022-11-07 18:35         ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-11-07 18:35 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Aaron Lu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, dave.hansen, rppt, zhengjun.xing, kbusch,
	p.raghav, dave, vbabka, mgorman, willy, torvalds, Hyeonggon Yoo

On Mon, Nov 7, 2022 at 9:39 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
>
> > Some imperfect things I can think of are (not related to this patchset):
> > 1 Once a split has happened, it stays split. This may not be a big deal
> > now with bpf_prog_pack and this patchset because the need to allocate a
> > new order-9 page and thus cause a potential split should happen much, much
> > less often;
>
> Not sure I follow, are you suggesting an order-9 (512 bytes) allocation would
> trigger a split of the reserved, say, 2 MiB huge page?

I think by order-9 allocation, Aaron meant 2MiB huge pages. The issue is that
a direct map split is a one-way operation. If we set_memory_x() on one 4kB page
out of a 1GB direct map, we will split it into 511x 2MiB pages and 512x 4kB
pages. There is currently no way to regroup the 1GB page after
set_memory_nx() on the page.

Thanks,
Song

>
> > 2 When a new order-9 page has to be allocated, there is no way to tell
> > the allocator to allocate this order-9 page from an already split PUD
> > range to avoid another PUD mapping split;
> > 3 As Mike and others have mentioned, there are other users that can also
> > cause direct map splits.
>
> Hence the effort to generalize.
>
>   Luis

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-11-07 18:35 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-31 22:25 [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
2022-11-02 23:41   ` Luis Chamberlain
2022-11-03 15:51     ` Mike Rapoport
2022-11-03 18:59       ` Luis Chamberlain
2022-11-03 21:19         ` Edgecombe, Rick P
2022-11-03 21:41           ` Song Liu
2022-11-03 23:33             ` Luis Chamberlain
2022-11-04  0:18           ` Luis Chamberlain
2022-11-04  3:29             ` Luis Chamberlain
2022-11-07  6:58         ` Mike Rapoport
2022-11-07 17:26           ` Luis Chamberlain
2022-11-07  6:40     ` Aaron Lu
2022-11-07 17:39       ` Luis Chamberlain
2022-11-07 18:35         ` Song Liu
2022-11-07 18:30       ` Song Liu
2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
2022-11-02 22:21   ` Edgecombe, Rick P
2022-11-03 21:03     ` Song Liu
2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
2022-10-31 22:25 ` [PATCH bpf-next v1 RESEND 5/5] x86: use register_text_tail_vm Song Liu
2022-11-02 22:24   ` Edgecombe, Rick P
2022-11-03 21:04     ` Song Liu
2022-11-01 11:26 ` [PATCH bpf-next v1 RESEND 0/5] vmalloc_exec for modules and BPF programs Christoph Hellwig
2022-11-01 15:10   ` Song Liu
2022-11-02 20:45 ` Luis Chamberlain
2022-11-02 22:29 ` Edgecombe, Rick P
2022-11-03 21:13   ` Song Liu
