* [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
@ 2022-11-07 22:39 Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
                   ` (9 more replies)
  0 siblings, 10 replies; 93+ messages in thread
From: Song Liu @ 2022-11-07 22:39 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof,
	Song Liu

This patchset tries to address the following issues:

1. Direct map fragmentation

On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
RO+X. These set_memory_* calls cause 1GB page table entries to be split
into 2MB and 4kB ones. This fragmentation in the direct map results in
bigger and slower page tables, and puts pressure on both the instruction
and data TLB.

Our previous work on bpf_prog_pack tries to address this issue from the
BPF program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack
has greatly reduced direct map fragmentation from BPF programs.

2. iTLB pressure from BPF program

Dynamic kernel text such as modules and BPF programs (even with the
current bpf_prog_pack) uses 4kB pages on x86. When the total size of
modules and BPF programs is big, we can see a visible performance drop
caused by a high iTLB miss rate.

3. TLB shootdown for short-living BPF programs

Before bpf_prog_pack, loading and unloading BPF programs required a global
TLB shootdown. This patchset (like bpf_prog_pack) replaces it with a local
TLB flush.

4. Reduce memory usage by BPF programs (in some cases)

Most BPF programs and various trampolines are small, yet they often
occupy a whole page. On a random server in our fleet, 50% of the loaded
BPF programs are less than 500 bytes in size, and 75% of them are less
than 2kB. Allowing these BPF programs to share 2MB pages would yield some
memory saving for systems with many BPF programs. For systems with only a
small number of BPF programs, this patchset may waste a little memory by
allocating one 2MB page but using only part of it.


Based on our experiments [5], we measured a 0.5% performance improvement
from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
The difference comes from the fact that bpf_prog_pack uses 512x 4kB pages
instead of 1x 2MB page, so bpf_prog_pack as-is doesn't resolve #2 above.

This patchset replaces bpf_prog_pack with a better API and makes it
available for other dynamic kernel text, such as modules, ftrace, and
kprobes.


This set enables bpf programs and bpf dispatchers to share huge pages with
a new API:
  execmem_alloc()
  execmem_free()
  execmem_fill()

The idea is similar to Peter's suggestion in [1].

execmem_alloc() manages a set of PMD_SIZE RO+X memory regions, and hands
this memory out to its users. execmem_free() is used to free memory
allocated by execmem_alloc(). execmem_fill() is used to update memory
allocated by execmem_alloc().

Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
The caller has to update the content with a text_poke-like mechanism.
Specifically, execmem_fill() is provided to update memory allocated by
execmem_alloc(). execmem_fill() also makes sure the update stays within
the boundary of one chunk allocated by execmem_alloc(). Please refer to
patch 1/5 for more details.

Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.

Patch 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
mapping PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel
text.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
[3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
[4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
[5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

Changes PATCH v1 => v2:
1. Rename the APIs as execmem_* (Christoph Hellwig)
2. Add more information about the motivation of this work (and follow-up
   work for kernel modules, various trampolines, etc.).
   (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
3. Include experimental results from previous bpf_prog_pack work and from
   the community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)

Changes RFC v2 => PATCH v1:
1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
   also ensures vcopy_exec() is only used to update memory from one single
   vmalloc_exec() call. (Christoph Hellwig)
2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrappers for the
   text_poke()-like logic.
3. Drop changes for kernel modules and focus on BPF side changes.

Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
   work fine with BPF programs (patches 1, 2, 4), but the module side
   (patch 3) still needs some work.

Song Liu (5):
  vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  x86/alternative: support execmem_alloc() and execmem_free()
  bpf: use execmem_alloc for bpf program and bpf dispatcher
  vmalloc: introduce register_text_tail_vm()
  x86: use register_text_tail_vm

 arch/x86/include/asm/pgtable_64_types.h |   1 +
 arch/x86/kernel/alternative.c           |  12 +
 arch/x86/mm/init_64.c                   |   4 +-
 arch/x86/net/bpf_jit_comp.c             |  23 +-
 include/linux/bpf.h                     |   3 -
 include/linux/filter.h                  |   5 -
 include/linux/vmalloc.h                 |   9 +
 kernel/bpf/core.c                       | 180 +-----------
 kernel/bpf/dispatcher.c                 |  11 +-
 mm/nommu.c                              |  12 +
 mm/vmalloc.c                            | 354 ++++++++++++++++++++++++
 11 files changed, 412 insertions(+), 202 deletions(-)

--
2.30.2


^ permalink raw reply	[flat|nested] 93+ messages in thread

* [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
@ 2022-11-07 22:39 ` Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 2/5] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-07 22:39 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof,
	Song Liu

execmem_alloc is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new tree of vmap_area, the free_text_area_* tree, is introduced in
addition to free_vmap_area_* and vmap_area_*. execmem_alloc allocates
pages from free_text_area_*. When there isn't enough space left in
free_text_area_*, new PMD_SIZE page(s) are allocated from free_vmap_area_*
and added to free_text_area_*. To be more accurate, the vmap_area is first
added to the vmap_area_* tree and then moved to free_text_area_*. This
extra move simplifies the logic of execmem_alloc.

vmap_areas in the free_text_area_* tree are backed with memory, but we
need subtree_max_size for tree operations. Therefore, the vm_struct for
these vmap_areas is stored in a separate list, all_text_vm.

The new tree allows separate handling of allocations smaller than
PAGE_SIZE, as the current vmalloc code mostly assumes PAGE_SIZE-aligned
allocations. This version of execmem_alloc can handle bpf programs, which
use 64-byte-aligned allocations, and modules, which use PAGE_SIZE-aligned
allocations.

Memory allocated by execmem_alloc() is set to RO+X before returning to the
caller. Therefore, the caller cannot write directly to the memory.
Instead, the caller is required to use execmem_fill() to update the
memory. For the safety and security of X memory, execmem_fill() checks
that the data being updated always stays within the memory allocated by
one execmem_alloc() call. execmem_fill() uses a text_poke-like mechanism
and requires arch support. Specifically, the arch needs to implement
arch_fill_execmem().

In execmem_free(), the memory is first erased with
arch_invalidate_execmem(). Then, the memory is added to free_text_area_*.
If this free creates a big enough contiguous free space (> PMD_SIZE),
execmem_free() will try to free the backing vm_struct.

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |   5 +
 mm/nommu.c              |  12 ++
 mm/vmalloc.c            | 320 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 337 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..30aa8c187d40 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller) __alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *execmem_alloc(unsigned long size, unsigned long align) __alloc_size(1);
+void *execmem_fill(void *dst, void *src, size_t len);
+void execmem_free(void *addr);
+void *arch_fill_execmem(void *dst, void *src, size_t len);
+int arch_invalidate_execmem(void *ptr, size_t len);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..e3039fd4f65b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 }
 EXPORT_SYMBOL(vm_map_pages_zero);
 
+void *execmem_alloc(unsigned long size, unsigned long align)
+{
+	return NULL;
+}
+
+void *execmem_fill(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+void execmem_free(void *addr) { }
+
 /*
  *  sys_brk() for the most part doesn't need the global kernel
  *  lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..6cc72c795ee5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,11 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+#ifndef PMD_ALIGN
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#endif
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -769,6 +774,38 @@ static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * free_text_area for execmem_alloc()
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augment red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains a maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find a lowest match of free area.
+ *
+ * vmap_area in this tree are backed by RO+X memory, but they do
+ * not have valid vm pointer (because we need subtree_max_size).
+ * The vm for these vmap_area are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
 /*
  * Preload a CPU with one object for "no edge" split case. The
  * aim is to get rid of allocations from the atomic context, thus
@@ -3313,6 +3350,289 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define EXEC_MEM_START MODULES_VADDR
+#define EXEC_MEM_END MODULES_END
+#else
+#define EXEC_MEM_START VMALLOC_START
+#define EXEC_MEM_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+	struct vmap_area *va;
+
+	/* remove from vmap_area_root */
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* make the memory RO+X */
+	memset(addr, 0, va->va_end - va->va_start);
+	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+	/* add to all_text_vm */
+	va->vm->next = all_text_vm;
+	all_text_vm = va->vm;
+
+	/* add to free_text_area_root */
+	spin_lock(&free_text_area_lock);
+	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+	spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * execmem_alloc - allocate virtually contiguous RO+X memory
+ * @size:    allocation size
+ * @align:   desired alignment of the allocated memory
+ *
+ * This is used to allocate dynamic kernel text, such as module text, BPF
+ * programs, etc. Users need to use a text_poke-like mechanism (such as
+ * execmem_fill) to update the memory allocated by execmem_alloc.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *execmem_alloc(unsigned long size, unsigned long align)
+{
+	struct vmap_area *va, *tmp;
+	unsigned long addr;
+	enum fit_type type;
+	int ret;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!va))
+		return NULL;
+
+again:
+	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+	if (!tmp) {
+		unsigned long alloc_size;
+		void *ptr;
+
+		spin_unlock(&free_text_area_lock);
+
+		/*
+		 * Not enough continuous space in free_text_area_root, try
+		 * allocate more memory. The memory is first added to
+		 * vmap_area_root, and then moved to free_text_area_root.
+		 */
+		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, EXEC_MEM_START,
+					   EXEC_MEM_END, GFP_KERNEL, PAGE_KERNEL,
+					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (unlikely(!ptr))
+			goto err_out;
+
+		move_vmap_to_free_text_tree(ptr);
+		goto again;
+	}
+
+	addr = roundup(tmp->va_start, align);
+	type = classify_va_fit_type(tmp, addr, size);
+	if (WARN_ON_ONCE(type == NOTHING_FIT))
+		goto err_out;
+
+	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+				    tmp, addr, size);
+	if (ret)
+		goto err_out;
+
+	spin_unlock(&free_text_area_lock);
+
+	va->va_start = addr;
+	va->va_end = addr + size;
+	va->vm = NULL;
+
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	return (void *)addr;
+
+err_out:
+	spin_unlock(&free_text_area_lock);
+	kmem_cache_free(vmap_area_cachep, va);
+	return NULL;
+}
+
+void __weak *arch_fill_execmem(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+int __weak arch_invalidate_execmem(void *ptr, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * execmem_fill - Copy text to RO+X memory allocated by execmem_alloc()
+ * @dst:  pointer to memory allocated by execmem_alloc()
+ * @src:  pointer to data being copied from
+ * @len:  number of bytes to be copied
+ *
+ * execmem_fill() will only update memory allocated by a single
+ * execmem_alloc() call. If dst + len goes beyond the boundary of one
+ * allocation, execmem_fill() is aborted.
+ *
+ * Return: @dst on success, or an ERR_PTR on failure.
+ */
+void *execmem_fill(void *dst, void *src, size_t len)
+{
+	struct vmap_area *va;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
+
+	/*
+	 * If no va, or va has a vm attached, this memory is not allocated
+	 * by execmem_alloc().
+	 */
+	if (WARN_ON_ONCE(!va) || WARN_ON_ONCE(va->vm))
+		goto err_out;
+	if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
+		goto err_out;
+
+	spin_unlock(&vmap_area_lock);
+
+	return arch_fill_execmem(dst, src, len);
+
+err_out:
+	spin_unlock(&vmap_area_lock);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+	struct vm_struct *vm, *prev_vm;
+
+	lockdep_assert_held(&free_text_area_lock);
+
+	vm = all_text_vm;
+	while (vm) {
+		unsigned long vm_addr = (unsigned long)vm->addr;
+
+		/* vm is within this free space, we can free it */
+		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+			goto unlink_vm;
+		vm = vm->next;
+	}
+	return NULL;
+
+unlink_vm:
+	if (all_text_vm == vm) {
+		all_text_vm = vm->next;
+	} else {
+		prev_vm = all_text_vm;
+		while (prev_vm->next != vm)
+			prev_vm = prev_vm->next;
+		prev_vm->next = vm->next;
+	}
+	return vm;
+}
+
+/**
+ * execmem_free - Release memory allocated by execmem_alloc()
+ * @addr:  Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void execmem_free(void *addr)
+{
+	unsigned long free_start, free_end, free_addr;
+	struct vm_struct *vm;
+	struct vmap_area *va;
+
+	might_sleep();
+
+	if (!addr)
+		return;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	WARN_ON_ONCE(va->vm);
+
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* Invalidate text in the region */
+	arch_invalidate_execmem(addr, va->va_end - va->va_start);
+
+	spin_lock(&free_text_area_lock);
+	va = merge_or_add_vmap_area_augment(va,
+		&free_text_area_root, &free_text_area_list);
+
+	if (WARN_ON_ONCE(!va))
+		goto out;
+
+	free_start = PMD_ALIGN(va->va_start);
+	free_end = PMD_ALIGN_DOWN(va->va_end);
+
+	/*
+	 * Only try to free vm when there is at least one PMD_SIZE free
+	 * continuous memory.
+	 */
+	if (free_start >= free_end)
+		goto out;
+
+	/*
+	 * TODO: It is possible that multiple vm are ready to be freed
+	 * after one execmem_free(). But we free at most one vm for now.
+	 */
+	vm = find_and_unlink_text_vm(free_start, free_end);
+	if (!vm)
+		goto out;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+	if (unlikely(!va))
+		goto out_save_vm;
+
+	free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+				      vm->size, 1, (unsigned long)vm->addr,
+				      (unsigned long)vm->addr + vm->size);
+
+	if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+		goto out_save_vm;
+
+	va->va_start = (unsigned long)vm->addr;
+	va->va_end = va->va_start + vm->size;
+	va->vm = vm;
+	spin_unlock(&free_text_area_lock);
+
+	set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+	set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+	/* put the va to vmap_area_root, and then free it with vfree */
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	vfree(vm->addr);
+	return;
+
+out_save_vm:
+	/*
+	 * vm is removed from all_text_vm, but not freed. Add it back,
+	 * so that we can use or free it later.
+	 */
+	vm->next = all_text_vm;
+	all_text_vm = vm;
+out:
+	spin_unlock(&free_text_area_lock);
+}
+
 /**
  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
  * @size:      allocation size
-- 
2.30.2




* [PATCH bpf-next v2 2/5] x86/alternative: support execmem_alloc() and execmem_free()
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
@ 2022-11-07 22:39 ` Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 3/5] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-07 22:39 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof,
	Song Liu

Implement arch_fill_execmem() and arch_invalidate_execmem() to support
execmem_alloc.

arch_fill_execmem() copies dynamic kernel text (such as BPF programs) to
RO+X memory region allocated by execmem_alloc().

arch_invalidate_execmem() fills the memory with 0xcc (int3) when it is
released by execmem_free().

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/kernel/alternative.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cadcea035e0..e38829f19a81 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1270,6 +1270,18 @@ void *text_poke_copy(void *addr, const void *opcode, size_t len)
 	return addr;
 }
 
+void *arch_fill_execmem(void *dst, void *src, size_t len)
+{
+	if (text_poke_copy(dst, src, len) == NULL)
+		return ERR_PTR(-EINVAL);
+	return dst;
+}
+
+int arch_invalidate_execmem(void *ptr, size_t len)
+{
+	return IS_ERR_OR_NULL(text_poke_set(ptr, 0xcc, len));
+}
+
 /**
  * text_poke_set - memset into (an unused part of) RX memory
  * @addr: address to modify
-- 
2.30.2




* [PATCH bpf-next v2 3/5] bpf: use execmem_alloc for bpf program and bpf dispatcher
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 2/5] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
@ 2022-11-07 22:39 ` Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-07 22:39 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof,
	Song Liu

Use execmem_alloc, execmem_free, and execmem_fill instead of
bpf_prog_pack_alloc, bpf_prog_pack_free, and bpf_arch_text_copy.

execmem_free doesn't require extra size information. Therefore, the free
and error handling path can be simplified.

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c |  23 +----
 include/linux/bpf.h         |   3 -
 include/linux/filter.h      |   5 -
 kernel/bpf/core.c           | 180 +++---------------------------------
 kernel/bpf/dispatcher.c     |  11 +--
 5 files changed, 21 insertions(+), 201 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index cec5195602bc..43b93570d8f9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -229,11 +229,6 @@ static void jit_fill_hole(void *area, unsigned int size)
 	memset(area, 0xcc, size);
 }
 
-int bpf_arch_text_invalidate(void *dst, size_t len)
-{
-	return IS_ERR_OR_NULL(text_poke_set(dst, 0xcc, len));
-}
-
 struct jit_context {
 	int cleanup_addr; /* Epilogue code offset */
 
@@ -2509,11 +2504,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		if (proglen <= 0) {
 out_image:
 			image = NULL;
-			if (header) {
-				bpf_arch_text_copy(&header->size, &rw_header->size,
-						   sizeof(rw_header->size));
+			if (header)
 				bpf_jit_binary_pack_free(header, rw_header);
-			}
+
 			/* Fall back to interpreter mode */
 			prog = orig_prog;
 			if (extra_pass) {
@@ -2563,8 +2556,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		if (!prog->is_func || extra_pass) {
 			/*
 			 * bpf_jit_binary_pack_finalize fails in two scenarios:
-			 *   1) header is not pointing to proper module memory;
-			 *   2) the arch doesn't support bpf_arch_text_copy().
+			 *   1) header is not pointing to memory allocated by
+			 *      execmem_alloc;
+			 *   2) the arch doesn't support execmem_fill().
 			 *
 			 * Both cases are serious bugs and justify WARN_ON.
 			 */
@@ -2610,13 +2604,6 @@ bool bpf_jit_supports_kfunc_call(void)
 	return true;
 }
 
-void *bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
-	if (text_poke_copy(dst, src, len) == NULL)
-		return ERR_PTR(-EINVAL);
-	return dst;
-}
-
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 798aec816970..e9f4806cec93 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2721,9 +2721,6 @@ enum bpf_text_poke_type {
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
 
-void *bpf_arch_text_copy(void *dst, void *src, size_t len);
-int bpf_arch_text_invalidate(void *dst, size_t len);
-
 struct btf_id_set;
 bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index efc42a6e3aed..98e28126c24b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1023,8 +1023,6 @@ extern long bpf_jit_limit_max;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size);
-
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
@@ -1037,9 +1035,6 @@ void bpf_jit_free(struct bpf_prog *fp);
 struct bpf_binary_header *
 bpf_jit_binary_pack_hdr(const struct bpf_prog *fp);
 
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns);
-void bpf_prog_pack_free(struct bpf_binary_header *hdr);
-
 static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 {
 	return list_empty(&fp->aux->ksym.lnode) ||
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9c16338bcbe8..86b4f640e3c0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -806,149 +806,6 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	return slot;
 }
 
-/*
- * BPF program pack allocator.
- *
- * Most BPF programs are pretty small. Allocating a hole page for each
- * program is sometime a waste. Many small bpf program also adds pressure
- * to instruction TLB. To solve this issue, we introduce a BPF program pack
- * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
- * to host BPF programs.
- */
-#define BPF_PROG_CHUNK_SHIFT	6
-#define BPF_PROG_CHUNK_SIZE	(1 << BPF_PROG_CHUNK_SHIFT)
-#define BPF_PROG_CHUNK_MASK	(~(BPF_PROG_CHUNK_SIZE - 1))
-
-struct bpf_prog_pack {
-	struct list_head list;
-	void *ptr;
-	unsigned long bitmap[];
-};
-
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size)
-{
-	memset(area, 0, size);
-}
-
-#define BPF_PROG_SIZE_TO_NBITS(size)	(round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
-
-static DEFINE_MUTEX(pack_mutex);
-static LIST_HEAD(pack_list);
-
-/* PMD_SIZE is not available in some special config, e.g. ARCH=arm with
- * CONFIG_MMU=n. Use PAGE_SIZE in these cases.
- */
-#ifdef PMD_SIZE
-#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())
-#else
-#define BPF_PROG_PACK_SIZE PAGE_SIZE
-#endif
-
-#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
-
-static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	struct bpf_prog_pack *pack;
-
-	pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)),
-		       GFP_KERNEL);
-	if (!pack)
-		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
-	if (!pack->ptr) {
-		kfree(pack);
-		return NULL;
-	}
-	bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE);
-	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
-	list_add_tail(&pack->list, &pack_list);
-
-	set_vm_flush_reset_perms(pack->ptr);
-	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	return pack;
-}
-
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
-	struct bpf_prog_pack *pack;
-	unsigned long pos;
-	void *ptr = NULL;
-
-	mutex_lock(&pack_mutex);
-	if (size > BPF_PROG_PACK_SIZE) {
-		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
-		if (ptr) {
-			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
-			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
-			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
-		}
-		goto out;
-	}
-	list_for_each_entry(pack, &pack_list, list) {
-		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-						 nbits, 0);
-		if (pos < BPF_PROG_CHUNK_COUNT)
-			goto found_free_area;
-	}
-
-	pack = alloc_new_pack(bpf_fill_ill_insns);
-	if (!pack)
-		goto out;
-
-	pos = 0;
-
-found_free_area:
-	bitmap_set(pack->bitmap, pos, nbits);
-	ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
-
-out:
-	mutex_unlock(&pack_mutex);
-	return ptr;
-}
-
-void bpf_prog_pack_free(struct bpf_binary_header *hdr)
-{
-	struct bpf_prog_pack *pack = NULL, *tmp;
-	unsigned int nbits;
-	unsigned long pos;
-
-	mutex_lock(&pack_mutex);
-	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
-		goto out;
-	}
-
-	list_for_each_entry(tmp, &pack_list, list) {
-		if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) {
-			pack = tmp;
-			break;
-		}
-	}
-
-	if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
-		goto out;
-
-	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
-	pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT;
-
-	WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
-		  "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
-
-	bitmap_clear(pack->bitmap, pos, nbits);
-	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
-		list_del(&pack->list);
-		module_memfree(pack->ptr);
-		kfree(pack);
-	}
-out:
-	mutex_unlock(&pack_mutex);
-}
-
 static atomic_long_t bpf_jit_current;
 
 /* Can be overridden by an arch's JIT compiler if it has a custom,
@@ -1048,6 +905,9 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 	bpf_jit_uncharge_modmem(size);
 }
 
+#define BPF_PROG_EXEC_ALIGN	64
+#define BPF_PROG_EXEC_MASK	(~(BPF_PROG_EXEC_ALIGN - 1))
+
 /* Allocate jit binary from bpf_prog_pack allocator.
  * Since the allocated memory is RO+X, the JIT engine cannot write directly
  * to the memory. To solve this problem, a RW buffer is also allocated at
@@ -1070,11 +930,11 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 		     alignment > BPF_IMAGE_ALIGNMENT);
 
 	/* add 16 bytes for a random section of illegal instructions */
-	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_CHUNK_SIZE);
+	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_EXEC_ALIGN);
 
 	if (bpf_jit_charge_modmem(size))
 		return NULL;
-	ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns);
+	ro_header = execmem_alloc(size, BPF_PROG_EXEC_ALIGN);
 	if (!ro_header) {
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
@@ -1082,8 +942,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 
 	*rw_header = kvmalloc(size, GFP_KERNEL);
 	if (!*rw_header) {
-		bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
-		bpf_prog_pack_free(ro_header);
+		execmem_free(ro_header);
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
 	}
@@ -1093,7 +952,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 	(*rw_header)->size = size;
 
 	hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),
-		     BPF_PROG_CHUNK_SIZE - sizeof(*ro_header));
+		     BPF_PROG_EXEC_ALIGN - sizeof(*ro_header));
 	start = prandom_u32_max(hole) & ~(alignment - 1);
 
 	*image_ptr = &ro_header->image[start];
@@ -1109,12 +968,12 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
 {
 	void *ptr;
 
-	ptr = bpf_arch_text_copy(ro_header, rw_header, rw_header->size);
+	ptr = execmem_fill(ro_header, rw_header, rw_header->size);
 
 	kvfree(rw_header);
 
 	if (IS_ERR(ptr)) {
-		bpf_prog_pack_free(ro_header);
+		execmem_free(ro_header);
 		return PTR_ERR(ptr);
 	}
 	return 0;
@@ -1124,18 +983,13 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
  *   1) when the program is freed after;
  *   2) when the JIT engine fails (before bpf_jit_binary_pack_finalize).
  * For case 2), we need to free both the RO memory and the RW buffer.
- *
- * bpf_jit_binary_pack_free requires proper ro_header->size. However,
- * bpf_jit_binary_pack_alloc does not set it. Therefore, ro_header->size
- * must be set with either bpf_jit_binary_pack_finalize (normal path) or
- * bpf_arch_text_copy (when jit fails).
  */
 void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
 			      struct bpf_binary_header *rw_header)
 {
-	u32 size = ro_header->size;
+	u32 size = rw_header ? rw_header->size : ro_header->size;
 
-	bpf_prog_pack_free(ro_header);
+	execmem_free(ro_header);
 	kvfree(rw_header);
 	bpf_jit_uncharge_modmem(size);
 }
@@ -1146,7 +1000,7 @@ bpf_jit_binary_pack_hdr(const struct bpf_prog *fp)
 	unsigned long real_start = (unsigned long)fp->bpf_func;
 	unsigned long addr;
 
-	addr = real_start & BPF_PROG_CHUNK_MASK;
+	addr = real_start & BPF_PROG_EXEC_MASK;
 	return (void *)addr;
 }
 
@@ -2736,16 +2590,6 @@ int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return -ENOTSUPP;
 }
 
-void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
-	return ERR_PTR(-ENOTSUPP);
-}
-
-int __weak bpf_arch_text_invalidate(void *dst, size_t len)
-{
-	return -ENOTSUPP;
-}
-
 DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 EXPORT_SYMBOL(bpf_stats_enabled_key);
 
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
index 04f0a045dcaa..c41dff3379db 100644
--- a/kernel/bpf/dispatcher.c
+++ b/kernel/bpf/dispatcher.c
@@ -126,11 +126,11 @@ static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
 	tmp = d->num_progs ? d->rw_image + noff : NULL;
 	if (new) {
 		/* Prepare the dispatcher in d->rw_image. Then use
-		 * bpf_arch_text_copy to update d->image, which is RO+X.
+		 * execmem_fill to update d->image, which is RO+X.
 		 */
 		if (bpf_dispatcher_prepare(d, new, tmp))
 			return;
-		if (IS_ERR(bpf_arch_text_copy(new, tmp, PAGE_SIZE / 2)))
+		if (IS_ERR(execmem_fill(new, tmp, PAGE_SIZE / 2)))
 			return;
 	}
 
@@ -152,15 +152,12 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
 
 	mutex_lock(&d->mutex);
 	if (!d->image) {
-		d->image = bpf_prog_pack_alloc(PAGE_SIZE, bpf_jit_fill_hole_with_zero);
+		d->image = execmem_alloc(PAGE_SIZE, PAGE_SIZE /* align */);
 		if (!d->image)
 			goto out;
 		d->rw_image = bpf_jit_alloc_exec(PAGE_SIZE);
 		if (!d->rw_image) {
-			u32 size = PAGE_SIZE;
-
-			bpf_arch_text_copy(d->image, &size, sizeof(size));
-			bpf_prog_pack_free((struct bpf_binary_header *)d->image);
+			execmem_free((struct bpf_binary_header *)d->image);
 			d->image = NULL;
 			goto out;
 		}
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH bpf-next v2 4/5] vmalloc: introduce register_text_tail_vm()
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (2 preceding siblings ...)
  2022-11-07 22:39 ` [PATCH bpf-next v2 3/5] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
@ 2022-11-07 22:39 ` Song Liu
  2022-11-07 22:39 ` [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm Song Liu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-07 22:39 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof,
	Song Liu

Allow arch code to register some memory to be used by execmem_alloc().
One possible use case is to allocate PMD pages for kernel text up to
PMD_ALIGN(_etext), and use (_etext, PMD_ALIGN(_etext)) for
execmem_alloc. Currently, only one such region is supported.

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |  4 ++++
 mm/vmalloc.c            | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 30aa8c187d40..2babbe3031a5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -132,11 +132,15 @@ extern void vm_unmap_aliases(void);
 #ifdef CONFIG_MMU
 extern void __init vmalloc_init(void);
 extern unsigned long vmalloc_nr_pages(void);
+void register_text_tail_vm(unsigned long start, unsigned long end);
 #else
 static inline void vmalloc_init(void)
 {
 }
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
+static inline void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+}
 #endif
 
 extern void *vmalloc(unsigned long size) __alloc_size(1);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6cc72c795ee5..cb54bce3725e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -77,6 +77,9 @@ static const bool vmap_allow_huge = false;
 #endif
 #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
 
+static struct vm_struct text_tail_vm;
+static struct vmap_area text_tail_va;
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -655,6 +658,8 @@ int is_vmalloc_or_module_addr(const void *x)
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
 	if (addr >= MODULES_VADDR && addr < MODULES_END)
 		return 1;
+	if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
+		return 1;
 #endif
 	return is_vmalloc_addr(x);
 }
@@ -2439,6 +2444,35 @@ static void vmap_init_free_space(void)
 	}
 }
 
+/*
+ * register_text_tail_vm() allows arch code to register memory regions
+ * for execmem_alloc. Unlike regular memory regions used by execmem_alloc,
+ * this region is never freed by vfree_exec.
+ *
+ * One possible use case is to allocate PMD pages for kernel text up to
+ * PMD_ALIGN(_etext), and use (_etext, PMD_ALIGN(_etext)) for
+ * execmem_alloc.
+ */
+void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+	struct vmap_area *va;
+
+	/* only support one region */
+	if (WARN_ON_ONCE(text_tail_vm.addr))
+		return;
+
+	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+	if (WARN_ON_ONCE(!va))
+		return;
+	text_tail_vm.addr = (void *)start;
+	text_tail_vm.size = end - start;
+	text_tail_va.va_start = start;
+	text_tail_va.va_end = end;
+	text_tail_va.vm = &text_tail_vm;
+	memcpy(va, &text_tail_va, sizeof(*va));
+	insert_vmap_area_augment(va, NULL, &free_text_area_root, &free_text_area_list);
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
-- 
2.30.2




* [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (3 preceding siblings ...)
  2022-11-07 22:39 ` [PATCH bpf-next v2 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
@ 2022-11-07 22:39 ` Song Liu
  2022-11-08 19:04   ` Edgecombe, Rick P
  2022-11-07 22:55 ` [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Luis Chamberlain
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-07 22:39 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof,
	Song Liu

Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
[round_up(_etext, 4kB), round_up(_etext, 2MB)] with register_text_tail_vm
so that we can use this part of memory for dynamic kernel text (BPF
programs, etc.).

Here is an example:

[root@eth50-1 ~]# grep _etext /proc/kallsyms
ffffffff82202a08 T _etext

[root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]

[root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd

ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, and
bpf programs.

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h | 1 +
 arch/x86/mm/init_64.c                   | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04f36063ad54..c0f9cceb109a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -101,6 +101,7 @@ extern unsigned int ptrs_per_p4d;
 #define PUD_MASK	(~(PUD_SIZE - 1))
 #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
+#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
 
 /*
  * See Documentation/x86/x86_64/mm.rst for a description of the memory map.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..5b42fc0c6099 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1373,7 +1373,7 @@ void mark_rodata_ro(void)
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
-	unsigned long text_end = PFN_ALIGN(_etext);
+	unsigned long text_end = PMD_ALIGN(_etext);
 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
 	unsigned long all_end;
 
@@ -1414,6 +1414,8 @@ void mark_rodata_ro(void)
 				(void *)rodata_end, (void *)_sdata);
 
 	debug_checkwx();
+	register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
+			      PMD_ALIGN((unsigned long)_etext));
 }
 
 int kern_addr_valid(unsigned long addr)
-- 
2.30.2




* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (4 preceding siblings ...)
  2022-11-07 22:39 ` [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm Song Liu
@ 2022-11-07 22:55 ` Luis Chamberlain
  2022-11-07 23:13   ` Song Liu
  2022-11-08 11:27 ` Mike Rapoport
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-07 22:55 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, rppt, dave, torvalds

On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
> 
> 1. Direct map fragmentation
> 
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> and slower page table, and pressure for both instruction and data TLB.
> 
> Our previous work in bpf_prog_pack tries to address this issue from BPF
> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> greatly reduced direct map fragmentation from BPF programs.

You should be able to paste the results there into the respective commit
from non-bpf-prog-pack to the new generalized solution here.

> 2. iTLB pressure from BPF program
> 
> Dynamic kernel text such as modules and BPF programs (even with current
> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> BPF program is big, we can see visible performance drop caused by high
> iTLB miss rate.

This is arbitrary; please provide some real stats in the commit, with
some reproducible benchmark.

> 3. TLB shootdown for short-living BPF programs
> 
> Before bpf_prog_pack loading and unloading BPF programs requires global
> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> TLB flush.
> 
> 4. Reduce memory usage by BPF programs (in some cases)
> 
> Most BPF programs and various trampolines are small, and they often
> occupies a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 byte in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.

Should be easy to provide some real numbers with at least selftests, and
put them in the commit as well.

> Based on our experiments [5], we measured 0.5% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> 
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.

And likewise here, please no arbitrary internal benchmark, real numbers.

  Luis


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 22:55 ` [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Luis Chamberlain
@ 2022-11-07 23:13   ` Song Liu
  2022-11-07 23:39     ` Luis Chamberlain
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-07 23:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, rppt, dave, torvalds

Hi Luis,

On Mon, Nov 7, 2022 at 2:55 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> >
> > 1. Direct map fragmentation
> >
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> > and slower page table, and pressure for both instruction and data TLB.
> >
> > Our previous work in bpf_prog_pack tries to address this issue from BPF
> > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> > greatly reduced direct map fragmentation from BPF programs.
>
> You should be able to paste the results there into the respective commit
> from non-bpf-prog-pack to the new generalized solution here.
>
> > 2. iTLB pressure from BPF program
> >
> > Dynamic kernel text such as modules and BPF programs (even with current
> > bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> > BPF program is big, we can see visible performance drop caused by high
> > iTLB miss rate.
>
> This is arbitrary; please provide some real stats in the commit, with
> some reproducible benchmark.
>
> > 3. TLB shootdown for short-living BPF programs
> >
> > Before bpf_prog_pack loading and unloading BPF programs requires global
> > TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> > TLB flush.
> >
> > 4. Reduce memory usage by BPF programs (in some cases)
> >
> > Most BPF programs and various trampolines are small, and they often
> > occupies a whole page. From a random server in our fleet, 50% of the
> > loaded BPF programs are less than 500 byte in size, and 75% of them are
> > less than 2kB in size. Allowing these BPF programs to share 2MB pages
> > would yield some memory saving for systems with many BPF programs. For
> > systems with only small number of BPF programs, this patch may waste a
> > little memory by allocating one 2MB page, but using only part of it.
>
> Should be easy to provide some real numbers with at least selftests, and
> put them in the commit as well.
>
> > Based on our experiments [5], we measured 0.5% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> > The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> >
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace, kprobe.
>
> And likewise here, please no arbitrary internal benchmark, real numbers.

The benchmark used here is identical to our web service, which runs on
many, many servers, so it represents the workload that we care about a lot.
Unfortunately, it is not possible to run it outside of our data centers.

We can build some artificial workloads and probably get much higher
performance improvements. But these workloads may not represent real
world use cases.

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 23:13   ` Song Liu
@ 2022-11-07 23:39     ` Luis Chamberlain
  2022-11-08  0:13       ` Edgecombe, Rick P
  2022-11-08 18:12       ` Song Liu
  0 siblings, 2 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-07 23:39 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, rppt, dave, torvalds

On Mon, Nov 07, 2022 at 03:13:59PM -0800, Song Liu wrote:
> The benchmark used here is identical to our web service, which runs on
> many, many servers, so it represents the workload that we care about a lot.
> Unfortunately, it is not possible to run it outside of our data centers.

I am not asking for that; I am asking you to pick any similar
benchmark which can run in parallel and may yield similar results.

> We can build some artificial workloads and probably get much higher
> performance improvements. But these workloads may not represent real
> world use cases.

You can very likely use some existing benchmark.

The direct map fragmentation stuff doesn't require much effort, as
was demonstrated by Aaron, you can easily do that or more by
running all selftests or just the test_bpf. This I buy.

I'm not buying the iTLB gains, as I can't even reproduce them myself for
the eBPF JIT, though I did test iTLB misses when using the eBPF JIT.
Perhaps you mean iTLB gains for other memory-intensive applications
running in tandem?

And none of your patches mentions the gains of this effort helping
with the long term advantage of centralizing the semantics for
permissions on memory.

  Luis


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 23:39     ` Luis Chamberlain
@ 2022-11-08  0:13       ` Edgecombe, Rick P
  2022-11-08  2:45         ` Luis Chamberlain
  2022-11-08 18:20         ` Song Liu
  2022-11-08 18:12       ` Song Liu
  1 sibling, 2 replies; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-08  0:13 UTC (permalink / raw)
  To: song, mcgrof
  Cc: peterz, bpf, rppt, dave, Torvalds, Linus, linux-mm, hch, x86,
	akpm, Lu, Aaron

On Mon, 2022-11-07 at 15:39 -0800, Luis Chamberlain wrote:
> On Mon, Nov 07, 2022 at 03:13:59PM -0800, Song Liu wrote:
> > The benchmark used here is identical to our web service, which runs on
> > many, many servers, so it represents the workload that we care about a lot.
> > Unfortunately, it is not possible to run it outside of our data centers.
> 
> I am not asking for that; I am asking you to pick any similar
> benchmark which can run in parallel and may yield similar results.
> 
> > We can build some artificial workloads and probably get much higher
> > performance improvements. But these workloads may not represent real
> > world use cases.
> 
> You can very likely use some existing benchmark.
> 
> The direct map fragmentation stuff doesn't require much effort, as
> was demonstrated by Aaron, you can easily do that or more by
> running all selftests or just the test_bpf. This I buy.
> 
> I'm not buying the iTLB gains, as I can't even reproduce them myself
> for the eBPF JIT, though I did test iTLB misses when using the eBPF
> JIT. Perhaps you mean iTLB gains for other memory-intensive
> applications running in tandem?

Song, didn't you find that there weren't iTLB gains (or that they were
in the noise)? What is this about a visible performance drop from iTLB
misses?

IIRC there was a test done where bpf_prog_pack mapped things at 4k, but in
2MB chunks, so it would re-use pages like the 2MB mapped version. And
it didn't see much improvement over the 2MB mapped version. Did I
remember that wrong?

> 
> And none of your patches mentions the gains of this effort helping
> with the long term advantage of centralizing the semantics for
> permissions on memory.

Another good point. Although this brings up that this "execmem"
interface handles just one type of permission.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08  0:13       ` Edgecombe, Rick P
@ 2022-11-08  2:45         ` Luis Chamberlain
  2022-11-08 18:20         ` Song Liu
  1 sibling, 0 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-08  2:45 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: song, peterz, bpf, rppt, dave, Torvalds, Linus, linux-mm, hch,
	x86, akpm, Lu, Aaron

On Tue, Nov 08, 2022 at 12:13:09AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-11-07 at 15:39 -0800, Luis Chamberlain wrote:
> > And none of your patches mentions the gains of this effort helping
> > with the long term advantage of centralizing the semantics for
> > permissions on memory.
> 
> Another good point. Although this brings up that this interface
> "execmem" does just handle one type of permission.

Yes, but sharing this long term across eBPF + kprobes + ftrace +
modules under execmem would be a huge benefit.

  Luis


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (5 preceding siblings ...)
  2022-11-07 22:55 ` [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Luis Chamberlain
@ 2022-11-08 11:27 ` Mike Rapoport
  2022-11-08 12:38   ` Aaron Lu
                     ` (2 more replies)
  2022-11-08 11:44 ` Christophe Leroy
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 93+ messages in thread
From: Mike Rapoport @ 2022-11-08 11:27 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, mcgrof

Hi Song,
 
On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
> 
> 1. Direct map fragmentation
> 
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> and slower page table, and pressure for both instruction and data TLB.
>
> Our previous work in bpf_prog_pack tries to address this issue from BPF
> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> greatly reduced direct map fragmentation from BPF programs.

Usage of set_memory_* APIs with memory allocated from vmalloc/modules
virtual range does not change the direct map, but only updates the
permissions in vmalloc range. The direct map splits occur in
vm_remove_mappings() when the memory is *freed*.

That said, both bpf_prog_pack and these patches do reduce the
fragmentation, but this happens because the memory is freed to the system
in 2M chunks and there are no splits of 2M pages. Besides, since the same
2M page is used for many BPF programs, there should be way fewer vfree() calls.
 
> 2. iTLB pressure from BPF program
> 
> Dynamic kernel text such as modules and BPF programs (even with current
> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> BPF program is big, we can see visible performance drop caused by high
> iTLB miss rate.

Like Luis mentioned several times already, it would be nice to see numbers.
 
> 3. TLB shootdown for short-living BPF programs
> 
> Before bpf_prog_pack loading and unloading BPF programs requires global
> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> TLB flush.
> 
> 4. Reduce memory usage by BPF programs (in some cases)
> 
> Most BPF programs and various trampolines are small, and they often
> occupies a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 byte in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.

I'm not convinced there are memory savings here. Unless you have hundreds
of BPF programs, most of the 2M page will be wasted, won't it?
So for systems that have moderate use of BPF, most of the 2M page will be
unused, right?
 
> Based on our experiments [5], we measured 0.5% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> 
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.
 
The proposed execmem_alloc() looks to me very much tailored for x86 to be
used as a replacement for module_alloc(). Some architectures have
module_alloc() that is quite different from the default or x86 version, so
I'd expect at least some explanation how modules etc can use execmem_ APIs
without breaking !x86 architectures.
 
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
>   execmem_alloc()
>   execmem_free()
>   execmem_fill()
> 
> The idea is similar to Peter's suggestion in [1].
> 
> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> memory to its users. execmem_free() is used to free memory allocated by
> execmem_alloc(). execmem_fill() is used to update memory allocated by
> execmem_alloc().
> 
> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> The caller has to update the content with text_poke like mechanism.
> Specifically, execmem_fill() is provided to update memory allocated by
> execmem_alloc(). execmem_fill() also makes sure the update stays in the
> boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> 1/5 for more details.

Unless I'm mistaken, a failure to allocate a PMD_SIZE page will fail text
allocation altogether. That means that if somebody tries to load a BPF
program on a busy, long-lived system, they are quite likely to fail, because
the high-order free lists might already be exhausted although there is still
plenty of free memory.

Did you consider a fallback for small pages if the high order allocation
fails?

-- 
Sincerely yours,
Mike.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (6 preceding siblings ...)
  2022-11-08 11:27 ` Mike Rapoport
@ 2022-11-08 11:44 ` Christophe Leroy
  2022-11-08 18:47   ` Song Liu
  2022-11-08 11:48 ` Mike Rapoport
  2022-11-15  1:30 ` Song Liu
  9 siblings, 1 reply; 93+ messages in thread
From: Christophe Leroy @ 2022-11-08 11:44 UTC (permalink / raw)
  To: Song Liu, bpf, linux-mm
  Cc: akpm, x86, peterz, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof



Le 07/11/2022 à 23:39, Song Liu a écrit :
> This patchset tries to address the following issues:
> 
> 1. Direct map fragmentation
> 
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> and slower page table, and pressure for both instruction and data TLB.
> 
> Our previous work in bpf_prog_pack tries to address this issue from BPF
> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> greatly reduced direct map fragmentation from BPF programs.
> 
> 2. iTLB pressure from BPF program
> 
> Dynamic kernel text such as modules and BPF programs (even with current
> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> BPF program is big, we can see visible performance drop caused by high
> iTLB miss rate.
> 
> 3. TLB shootdown for short-living BPF programs
> 
> Before bpf_prog_pack loading and unloading BPF programs requires global
> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> TLB flush.
> 
> 4. Reduce memory usage by BPF programs (in some cases)
> 
> Most BPF programs and various trampolines are small, and they often
> occupies a whole page. From a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 byte in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory saving for systems with many BPF programs. For
> systems with only small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page, but using only part of it.
> 
> 
> Based on our experiments [5], we measured 0.5% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> 
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.
> 
> 
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
>    execmem_alloc()
>    execmem_free()
>    execmem_fill()
> 
> The idea is similar to Peter's suggestion in [1].
> 
> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> memory to its users. execmem_free() is used to free memory allocated by
> execmem_alloc(). execmem_fill() is used to update memory allocated by
> execmem_alloc().
> 
> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> The caller has to update the content with text_poke like mechanism.
> Specifically, execmem_fill() is provided to update memory allocated by
> execmem_alloc(). execmem_fill() also makes sure the update stays in the
> boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> 1/5 for more details.
> 
> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> 
> Patch 4/5 and 5/5 allows static kernel text (_stext to _etext) to share
> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
> allocating PMD_SIZE pages to roundup(_etext, PMD_SIZE), and then use
> _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.

Would it be possible to have something more generic than being tied to
PMD_SIZE?

On powerpc 8xx, PMD_SIZE is 4MB and hugepages are 512kB and 8MB.

Christophe


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (7 preceding siblings ...)
  2022-11-08 11:44 ` Christophe Leroy
@ 2022-11-08 11:48 ` Mike Rapoport
  2022-11-15  1:30 ` Song Liu
  9 siblings, 0 replies; 93+ messages in thread
From: Mike Rapoport @ 2022-11-08 11:48 UTC (permalink / raw)
  To: linux-arch
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, aaron.lu, mcgrof

(added linux-arch list)

On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> This patchset tries to address the following issues:
> 
> 1. Direct map fragmentation
> 
> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> and slower page table, and pressure for both instruction and data TLB.
> 
> Our previous work in bpf_prog_pack tries to address this issue from BPF
> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> greatly reduced direct map fragmentation from BPF programs.
> 
> 2. iTLB pressure from BPF program
> 
> Dynamic kernel text such as modules and BPF programs (even with current
> bpf_prog_pack) use 4kB pages on x86. When the total size of modules and
> BPF programs is big, we can see a visible performance drop caused by a
> high iTLB miss rate.
> 
> 3. TLB shootdown for short-living BPF programs
> 
> Before bpf_prog_pack, loading and unloading BPF programs required a global
> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> TLB flush.
> 
> 4. Reduce memory usage by BPF programs (in some cases)
> 
> Most BPF programs and various trampolines are small, and they often
> occupy a whole page. On a random server in our fleet, 50% of the
> loaded BPF programs are less than 500 bytes in size, and 75% of them are
> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> would yield some memory savings for systems with many BPF programs. For
> systems with only a small number of BPF programs, this patch may waste a
> little memory by allocating one 2MB page but using only part of it.
> 
> 
> Based on our experiments [5], we measured 0.5% performance improvement
> from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> 1x 2MB page; bpf_prog_pack as-is doesn't resolve #2 above.
> 
> This patchset replaces bpf_prog_pack with a better API and makes it
> available for other dynamic kernel text, such as modules, ftrace, kprobe.
> 
> 
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
>   execmem_alloc()
>   execmem_free()
>   execmem_fill()
> 
> The idea is similar to Peter's suggestion in [1].
> 
> execmem_alloc() manages a set of PMD_SIZE RO+X memory chunks and allocates
> them to its users. execmem_free() is used to free memory allocated by
> execmem_alloc(). execmem_fill() is used to update memory allocated by
> execmem_alloc().
> 
> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> The caller has to update the content with a text_poke()-like mechanism.
> Specifically, execmem_fill() is provided to update memory allocated by
> execmem_alloc(). execmem_fill() also makes sure the update stays within the
> boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> 1/5 for more details.
> 
> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> 
> Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
> allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
> the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
> 
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
> [3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
> [4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
> 
> Changes PATCH v1 => v2:
> 1. Rename the APIs as execmem_* (Christoph Hellwig)
> 2. Add more information about the motivation of this work (and follow up
>    works in for kernel modules, various trampolines, etc).
>    (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
> 3. Include experimental results from previous bpf_prog_pack and the
>    community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)
> 
> Changes RFC v2 => PATCH v1:
> 1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
>    also ensures vcopy_exec() is only used to update memory from one single
>    vmalloc_exec() call. (Christoph Hellwig)
> 2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrapper for the
>    text_poke() like logic.
> 3. Drop changes for kernel modules and focus on BPF side changes.
> 
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>    work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>    still need some work.
> 
> Song Liu (5):
>   vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
>   x86/alternative: support execmem_alloc() and execmem_free()
>   bpf: use execmem_alloc for bpf program and bpf dispatcher
>   vmalloc: introduce register_text_tail_vm()
>   x86: use register_text_tail_vm
> 
>  arch/x86/include/asm/pgtable_64_types.h |   1 +
>  arch/x86/kernel/alternative.c           |  12 +
>  arch/x86/mm/init_64.c                   |   4 +-
>  arch/x86/net/bpf_jit_comp.c             |  23 +-
>  include/linux/bpf.h                     |   3 -
>  include/linux/filter.h                  |   5 -
>  include/linux/vmalloc.h                 |   9 +
>  kernel/bpf/core.c                       | 180 +-----------
>  kernel/bpf/dispatcher.c                 |  11 +-
>  mm/nommu.c                              |  12 +
>  mm/vmalloc.c                            | 354 ++++++++++++++++++++++++
>  11 files changed, 412 insertions(+), 202 deletions(-)
> 
> --
> 2.30.2

-- 
Sincerely yours,
Mike.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 11:27 ` Mike Rapoport
@ 2022-11-08 12:38   ` Aaron Lu
  2022-11-09  6:55     ` Christoph Hellwig
  2022-11-08 16:51   ` Edgecombe, Rick P
  2022-11-08 18:41   ` Song Liu
  2 siblings, 1 reply; 93+ messages in thread
From: Aaron Lu @ 2022-11-08 12:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, mcgrof

Hi Mike,

On Tue, Nov 08, 2022 at 01:27:31PM +0200, Mike Rapoport wrote:
> Hi Song,
>  
> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> > 
> > 1. Direct map fragmentation
> > 
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> > and slower page table, and pressure for both instruction and data TLB.
> >
> > Our previous work in bpf_prog_pack tries to address this issue from BPF
> > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> > greatly reduced direct map fragmentation from BPF programs.
> 
> Usage of set_memory_* APIs with memory allocated from vmalloc/modules
> virtual range does not change the direct map, but only updates the
> permissions in vmalloc range. The direct map splits occur in
> vm_remove_mappings() when the memory is *freed*.

set_memory_nx/x() on a vmalloced range will not affect direct map but
set_memory_ro/rw() will. set_memory_ro/rw() cares about other alias
mappings and will do the same permission change for that alias mapping,
e.g. direct mapping.

For this reason, the bpf prog load can trigger a direct map split. A
sample callstack on x86_64 VM looks like this:

[   40.602450] address=0xffffffffc01e2000 numpages=1 set=0x0 clr=0x2 alias=1
[   40.614566] address=0xffff88816ee1e000 numpages=1 set=0x0 clr=0x2 alias=0
[   40.627641] split: address=0xffff88816ee1e000, level=2
[   40.627981] CPU: 15 PID: 534 Comm: sockex1 Not tainted 5.17.0-dirty #28
[   40.628421] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[   40.628996] Call Trace:
[   40.629161]  <TASK>
[   40.629304]  dump_stack_lvl+0x45/0x59
[   40.629550]  __change_page_attr_set_clr+0x718/0x8d4
[   40.629872]  ? static_protections+0x1c8/0x1fd
[   40.630160]  ? dump_stack_lvl+0x54/0x59
[   40.630418]  __change_page_attr_set_clr+0x7ff/0x8d4
[   40.630739]  ? _raw_spin_unlock+0x14/0x30
[   40.631004]  ? __purge_vmap_area_lazy+0x323/0x720
[   40.631316]  ? _raw_spin_unlock_irqrestore+0x18/0x40
[   40.631646]  change_page_attr_set_clr.cold+0x2f/0x164
[   40.631979]  set_memory_ro+0x26/0x30
[   40.632215]  bpf_int_jit_compile+0x4a1/0x4e0
[   40.632502]  bpf_prog_select_runtime+0xad/0xf0
[   40.632794]  bpf_prog_load+0x6a1/0xa20
[   40.633044]  ? _raw_spin_trylock_bh+0x1/0x50
[   40.633328]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[   40.633656]  ? free_debug_processing+0x1f8/0x2c0
[   40.633964]  ? __slab_free+0x2f0/0x4f0
[   40.634214]  ? trace_hardirqs_on+0x2b/0xf0
[   40.634492]  __sys_bpf+0xb20/0x2750
[   40.634726]  ? __might_fault+0x1e/0x20
[   40.634978]  __x64_sys_bpf+0x1c/0x20
[   40.635216]  do_syscall_64+0x38/0x90
[   40.635457]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   40.635792] RIP: 0033:0x7fd4f2cacfbd
[   40.636030] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 ce 0e 00 f7 d8 64 89 01 48
[   40.637253] RSP: 002b:00007ffddf20b2d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[   40.637752] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd4f2cacfbd
[   40.638220] RDX: 0000000000000080 RSI: 00007ffddf20b360 RDI: 0000000000000005
[   40.638689] RBP: 0000000000436c48 R08: 0000000000000000 R09: 0000000000000000
[   40.639156] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000080
[   40.639627] R13: 00007ffddf20b360 R14: 0000000000000000 R15: 0000000000000000
[   40.640096]  </TASK>

Regards,
Aaron


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 11:27 ` Mike Rapoport
  2022-11-08 12:38   ` Aaron Lu
@ 2022-11-08 16:51   ` Edgecombe, Rick P
  2022-11-08 18:50     ` Song Liu
  2022-11-09 11:17     ` Mike Rapoport
  2022-11-08 18:41   ` Song Liu
  2 siblings, 2 replies; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-08 16:51 UTC (permalink / raw)
  To: song, rppt; +Cc: peterz, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote:
> > Based on our experiments [5], we measured 0.5% performance
> > improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to
> > 0.7%.
> > The difference is because bpf_prog_pack uses 512x 4kB pages instead
> > of
> > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> > 
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace,
> > kprobe.
> 
>  
> The proposed execmem_alloc() looks to me very much tailored for x86
> to be
> used as a replacement for module_alloc(). Some architectures have
> module_alloc() that is quite different from the default or x86
> version, so
> I'd expect at least some explanation how modules etc can use execmem_
> APIs
> without breaking !x86 architectures.

I think this is fair, but I think we should ask ourselves - how
much should we do in one step?

For non-text_poke() architectures, the way you can make it work is have
the API look like:
execmem_alloc()  <- Does the allocation, but not necessarily usable yet
execmem_write()  <- Loads the mapping, doesn't work after finish()
execmem_finish() <- Makes the mapping live (loaded, executable, ready)

So for text_poke():
execmem_alloc()  <- reserves the mapping
execmem_write()  <- text_pokes() to the mapping
execmem_finish() <- does nothing

And non-text_poke():
execmem_alloc()  <- Allocates a regular RW vmalloc allocation
execmem_write()  <- Writes normally to it
execmem_finish() <- does set_memory_ro()/set_memory_x() on it

Non-text_poke() only gets the benefits of centralized logic, but the
interface works for both. This is pretty much what the perm_alloc() RFC
did to make it work with other arch's and modules. But to fit with the
existing modules code (which is actually spread all over) and also
handle RO sections, it also needed some additional bells and whistles.

So the question I'm trying to ask is, how much should we target for the
next step? I first thought that this functionality was so intertwined,
it would be too hard to do iteratively. So if we want to try
iteratively, I'm ok if it doesn't solve everything.




* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 23:39     ` Luis Chamberlain
  2022-11-08  0:13       ` Edgecombe, Rick P
@ 2022-11-08 18:12       ` Song Liu
  1 sibling, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-08 18:12 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, rppt, dave, torvalds

On Mon, Nov 7, 2022 at 3:39 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 07, 2022 at 03:13:59PM -0800, Song Liu wrote:
> > The benchmark used here is identical on our web service, which runs on
> > many many servers, so it represents the workload that we care a lot.
> > Unfortunately, it is not possible to run it out of our data centers.
>
> I am not asking for that, I am asking for you to pick any similar
> benchark which can run in paralellel which may yield similar results.
>
> > We can build some artificial workloads and probably get much higher
> > performance improvements. But these workload may not represent real
> > world use cases.
>
> You can very likely use some existing benchmark.
>
> The direct map fragmentation stuff doesn't require much effort, as
> was demonstrated by Aaron, you can easily do that or more by
> running all selftests or just the test_bpf. This I buy.
>
> I'm not buying the iTLB gains as I can't even reproduce them myself for
> eBPF JIT, but I tested against iTLB when using eBPF JIT, perhaps you
> mean iTLB gains for other memory intensive applications running in
> tandem?

Right. In most of the cases, BPF programs are not the main workload.
The main workloads are user space services with big .text sections.
We measure performance in terms of main service throughput under
some latency requirements. For benchmarks with small .text sections,
the benefit from iTLB gains is not significant.

>
> And none of your patches mentions the gains of this effort helping
> with the long term advantage of centralizing the semantics for
> permissions on memory.

Right.. I will add that.

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08  0:13       ` Edgecombe, Rick P
  2022-11-08  2:45         ` Luis Chamberlain
@ 2022-11-08 18:20         ` Song Liu
  1 sibling, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-08 18:20 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: mcgrof, peterz, bpf, rppt, dave, Torvalds, Linus, linux-mm, hch,
	x86, akpm, Lu, Aaron

On Mon, Nov 7, 2022 at 4:13 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-11-07 at 15:39 -0800, Luis Chamberlain wrote:
> > On Mon, Nov 07, 2022 at 03:13:59PM -0800, Song Liu wrote:
> > > The benchmark used here is identical on our web service, which runs
> > > on
> > > many many servers, so it represents the workload that we care a
> > > lot.
> > > Unfortunately, it is not possible to run it out of our data
> > > centers.
> >
> > I am not asking for that, I am asking for you to pick any similar
> > benchark which can run in paralellel which may yield similar results.
> >
> > > We can build some artificial workloads and probably get much higher
> > > performance improvements. But these workload may not represent real
> > > world use cases.
> >
> > You can very likely use some existing benchmark.
> >
> > The direct map fragmentation stuff doesn't require much effort, as
> > was demonstrated by Aaron, you can easily do that or more by
> > running all selftests or just the test_bpf. This I buy.
> >
> > I'm not buying the iTLB gains as I can't even reproduce them myself
> > for
> > eBPF JIT, but I tested against iTLB when using eBPF JIT, perhaps you
> > mean iTLB gains for other memory intensive applications running in
> > tandem?
>
Song, didn't you find that the iTLB gains weren't there (or were in the
noise)? What is this about a visible performance drop from iTLB misses?
>
> IIRC there was a test done where progpack mapped things at 4k, but in
> 2MB chunks, so it would re-use pages like the 2MB mapped version. And
> it didn't see much improvement over the 2MB mapped version. Did I
> remember that wrong?

There is still a small gain (~0.2%) for this benchmark.

Thanks,
Song
>
> >
> > And none of your patches mentions the gains of this effort helping
> > with the long term advantage of centralizing the semantics for
> > permissions on memory.
>
> Another good point. Although this brings up that this interface
> "execmem" does just handle one type of permission.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 11:27 ` Mike Rapoport
  2022-11-08 12:38   ` Aaron Lu
  2022-11-08 16:51   ` Edgecombe, Rick P
@ 2022-11-08 18:41   ` Song Liu
  2022-11-08 19:43     ` Christophe Leroy
  2022-11-13  9:58     ` Mike Rapoport
  2 siblings, 2 replies; 93+ messages in thread
From: Song Liu @ 2022-11-08 18:41 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, mcgrof

On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi Song,
>
> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> > This patchset tries to address the following issues:
> >
> > 1. Direct map fragmentation
> >
> > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> > and slower page table, and pressure for both instruction and data TLB.
> >
> > Our previous work in bpf_prog_pack tries to address this issue from BPF
> > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> > greatly reduced direct map fragmentation from BPF programs.
>
> Usage of set_memory_* APIs with memory allocated from vmalloc/modules
> virtual range does not change the direct map, but only updates the
> permissions in vmalloc range. The direct map splits occur in
> vm_remove_mappings() when the memory is *freed*.
>
> That said, both bpf_prog_pack and these patches do reduce the
> fragmentation, but this happens because the memory is freed to the system
> in 2M chunks and there are no splits of 2M pages. Besides, since the same
> 2M page is used for many BPF programs there should be far fewer vfree() calls.
>
> > 2. iTLB pressure from BPF program
> >
> > Dynamic kernel text such as modules and BPF programs (even with current
> > bpf_prog_pack) use 4kB pages on x86. When the total size of modules and
> > BPF programs is big, we can see a visible performance drop caused by a
> > high iTLB miss rate.
>
> Like Luis mentioned several times already, it would be nice to see numbers.
>
> > 3. TLB shootdown for short-living BPF programs
> >
> > Before bpf_prog_pack, loading and unloading BPF programs required a global
> > TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> > TLB flush.
> >
> > 4. Reduce memory usage by BPF programs (in some cases)
> >
> > Most BPF programs and various trampolines are small, and they often
> > occupy a whole page. On a random server in our fleet, 50% of the
> > loaded BPF programs are less than 500 bytes in size, and 75% of them are
> > less than 2kB in size. Allowing these BPF programs to share 2MB pages
> > would yield some memory savings for systems with many BPF programs. For
> > systems with only a small number of BPF programs, this patch may waste a
> > little memory by allocating one 2MB page, but using only part of it.
>
> I'm not convinced there are memory savings here. Unless you have hundreds
> of BPF programs, most of 2M page will be wasted, won't it?
> So for systems that have moderate use of BPF most of the 2M page will be
> unused, right?

There will be some memory waste in such cases. But it will get better with:
1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
section (_stext to _etext);
2) modules, ftrace, kprobe will also share this 2MB page;
3) There are bigger BPF programs in many use cases.

>
> > Based on our experiments [5], we measured 0.5% performance improvement
> > from bpf_prog_pack. This patchset further boosts the improvement to 0.7%.
> > The difference is because bpf_prog_pack uses 512x 4kB pages instead of
> > 1x 2MB page; bpf_prog_pack as-is doesn't resolve #2 above.
> >
> > This patchset replaces bpf_prog_pack with a better API and makes it
> > available for other dynamic kernel text, such as modules, ftrace, kprobe.
>
> The proposed execmem_alloc() looks to me very much tailored for x86 to be
> used as a replacement for module_alloc(). Some architectures have
> module_alloc() that is quite different from the default or x86 version, so
> I'd expect at least some explanation how modules etc can use execmem_ APIs
> without breaking !x86 architectures.
>
> > This set enables bpf programs and bpf dispatchers to share huge pages with
> > new API:
> >   execmem_alloc()
> >   execmem_free()
> >   execmem_fill()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > execmem_alloc() manages a set of PMD_SIZE RO+X memory chunks and allocates
> > them to its users. execmem_free() is used to free memory allocated by
> > execmem_alloc(). execmem_fill() is used to update memory allocated by
> > execmem_alloc().
> >
> > Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> > The caller has to update the content with a text_poke()-like mechanism.
> > Specifically, execmem_fill() is provided to update memory allocated by
> > execmem_alloc(). execmem_fill() also makes sure the update stays within the
> > boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> > 1/5 for more details.
>
> Unless I'm mistaken, a failure to allocate PMD_SIZE page will fail text
> allocation altogether. That means that if somebody tries to load a BPF
> program on a busy long-lived system, they are quite likely to fail because
> high order free lists might be already exhausted although there is still
> plenty of free memory.
>
> Did you consider a fallback for small pages if the high order allocation
> fails?

I think __vmalloc_node_range() already has the fallback mechanism
(see the end of the function).

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 11:44 ` Christophe Leroy
@ 2022-11-08 18:47   ` Song Liu
  2022-11-08 19:32     ` Christophe Leroy
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-08 18:47 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

On Tue, Nov 8, 2022 at 3:44 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
[...]
> >
> > execmem_alloc() manages a set of PMD_SIZE RO+X memory chunks and allocates
> > them to its users. execmem_free() is used to free memory allocated by
> > execmem_alloc(). execmem_fill() is used to update memory allocated by
> > execmem_alloc().
> >
> > Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> > The caller has to update the content with a text_poke()-like mechanism.
> > Specifically, execmem_fill() is provided to update memory allocated by
> > execmem_alloc(). execmem_fill() also makes sure the update stays within the
> > boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> > 1/5 for more details.
> >
> > Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> >
> > Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
> > PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
> > allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
> > the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
>
> Would it be possible to have something more generic than being stuck to
> PMD_SIZE?
>
> On powerpc 8xx, PMD_SIZE is 4MB and hugepages are 512kB and 8MB.

Currently, __vmalloc_node_range() tries to allocate huge pages when
size_per_node >= PMD_SIZE.

How do we handle this in powerpc 8xx? I guess we can use the same logic
here?

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 16:51   ` Edgecombe, Rick P
@ 2022-11-08 18:50     ` Song Liu
  2022-11-09 11:17     ` Mike Rapoport
  1 sibling, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-08 18:50 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, Nov 8, 2022 at 8:51 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote:
> > > Based on our experiments [5], we measured 0.5% performance
> > > improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to
> > > 0.7%.
> > > The difference is because bpf_prog_pack uses 512x 4kB pages instead
> > > of
> > > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> > >
> > > This patchset replaces bpf_prog_pack with a better API and makes it
> > > available for other dynamic kernel text, such as modules, ftrace,
> > > kprobe.
> >
> >
> > The proposed execmem_alloc() looks to me very much tailored for x86
> > to be
> > used as a replacement for module_alloc(). Some architectures have
> > module_alloc() that is quite different from the default or x86
> > version, so
> > I'd expect at least some explanation how modules etc can use execmem_
> > APIs
> > without breaking !x86 architectures.
>
> I think this is fair, but I think we should ask ourselves - how
> much should we do in one step?
>
> For non-text_poke() architectures, the way you can make it work is have
> the API look like:
> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> execmem_write()  <- Loads the mapping, doesn't work after finish()
> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
>
> So for text_poke():
> execmem_alloc()  <- reserves the mapping
> execmem_write()  <- text_pokes() to the mapping
> execmem_finish() <- does nothing
>
> And non-text_poke():
> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> execmem_write()  <- Writes normally to it
> execmem_finish() <- does set_memory_ro()/set_memory_x() on it

Yeah, some fallback mechanism like this is missing in the current version.
It is not a problem for BPF programs, as we call it from arch code.
But we do need better APIs for modules.

Thanks,
Song
>
> Non-text_poke() only gets the benefits of centralized logic, but the
> interface works for both. This is pretty much what the perm_alloc() RFC
> did to make it work with other arch's and modules. But to fit with the
> existing modules code (which is actually spread all over) and also
> handle RO sections, it also needed some additional bells and whistles.
>
> So the question I'm trying to ask is, how much should we target for the
> next step? I first thought that this functionality was so intertwined,
> it would be too hard to do iteratively. So if we want to try
> iteratively, I'm ok if it doesn't solve everything.
>
>


* Re: [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm
  2022-11-07 22:39 ` [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm Song Liu
@ 2022-11-08 19:04   ` Edgecombe, Rick P
  2022-11-08 22:15     ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-08 19:04 UTC (permalink / raw)
  To: linux-mm, song, bpf; +Cc: hch, mcgrof, peterz, Lu, Aaron, akpm, x86, rppt

On Mon, 2022-11-07 at 14:39 -0800, Song Liu wrote:
> Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
> [round_up(_etext, 4kB), round_up(_etext, 2MB)] with register_text_tail_vm
> so that we can use this part of memory for dynamic kernel text (BPF
> programs, etc.).
> 
> Here is an example:
> 
> [root@eth50-1 ~]# grep _etext /proc/kallsyms
> ffffffff82202a08 T _etext
> 
> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> ffffffff8220f920 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> ffffffff8220fa28 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> ffffffff8220fad4 t
> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> 
> [root@eth50-1 ~]#  grep 0xffffffff82200000
> /sys/kernel/debug/page_tables/kernel
> 0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x 
> pmd
> 
> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
> and
> bpf programs.
> 
> Signed-off-by: Song Liu <song@kernel.org>

Please update Documentation/x86/x86_64/mm.txt and teach places that
check if an address is text about it.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 18:47   ` Song Liu
@ 2022-11-08 19:32     ` Christophe Leroy
  0 siblings, 0 replies; 93+ messages in thread
From: Christophe Leroy @ 2022-11-08 19:32 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof



Le 08/11/2022 à 19:47, Song Liu a écrit :
> On Tue, Nov 8, 2022 at 3:44 AM Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
> [...]
>>>
>>> execmem_alloc() manages a set of PMD_SIZE RO+X memory chunks and allocates
>>> them to its users. execmem_free() is used to free memory allocated by
>>> execmem_alloc(). execmem_fill() is used to update memory allocated by
>>> execmem_alloc().
>>>
>>> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
>>> The caller has to update the content with a text_poke()-like mechanism.
>>> Specifically, execmem_fill() is provided to update memory allocated by
>>> execmem_alloc(). execmem_fill() also makes sure the update stays within the
>>> boundary of one chunk allocated by execmem_alloc(). Please refer to patch
>>> 1/5 for more details.
>>>
>>> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
>>>
>>> Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
>>> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
>>> allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
>>> the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
>>
>> Would it be possible to have something more generic than being stuck to
>> PMD_SIZE?
>>
>> On powerpc 8xx, PMD_SIZE is 4MB and hugepages are 512kB and 8MB.
> 
> Currently, __vmalloc_node_range() tries to allocate huge pages when
> size_per_node >= PMD_SIZE

Ah right, that reminds me that for powerpc 8xx, 8MB vmalloc mapping is 
not implemented yet and arch_vmap_pmd_supported() returns false. 
However, 512kB mapping is implemented, through 
arch_vmap_pte_supported_shift().

In __vmalloc_node_range() that's the part below:

	if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
		shift = PMD_SHIFT;
	else
		shift = arch_vmap_pte_supported_shift(size_per_node);


> 
> How do we handle this in powerpc 8xx? I guess we can use the same logic
> here?

arch/powerpc/include/asm/nohash/32/mmu-8xx.h has :

static inline int arch_vmap_pte_supported_shift(unsigned long size)
{
	if (size >= SZ_512K)
		return 19;
	else if (size >= SZ_16K)
		return 14;
	else
		return PAGE_SHIFT;
}
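Putting the two quoted pieces together, the shift selection can be modeled in
user space like this (a sketch; the 8xx PMD_SHIFT value and the size constants
are my assumptions here, not taken from the tree):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  22                    /* assumed: 4MB PMD on 8xx */
#define PMD_SIZE   (1UL << PMD_SHIFT)
#define SZ_16K     (16UL * 1024)
#define SZ_512K    (512UL * 1024)

/* 8xx today: no PMD-sized vmalloc mappings */
static int arch_vmap_pmd_supported(void)
{
	return 0;
}

/* model of the 8xx arch_vmap_pte_supported_shift() quoted above */
static int arch_vmap_pte_supported_shift(unsigned long size)
{
	if (size >= SZ_512K)
		return 19;                /* 512kB mapping */
	else if (size >= SZ_16K)
		return 14;                /* 16kB mapping */
	return PAGE_SHIFT;
}

/* the selection step from __vmalloc_node_range() */
static int pick_shift(unsigned long size_per_node)
{
	if (arch_vmap_pmd_supported() && size_per_node >= PMD_SIZE)
		return PMD_SHIFT;
	return arch_vmap_pte_supported_shift(size_per_node);
}
```

So even without arch_vmap_pmd_supported(), an execmem area of 512kB or more
would still get 512kB mappings on 8xx through the fallback path.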


Christophe


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 18:41   ` Song Liu
@ 2022-11-08 19:43     ` Christophe Leroy
  2022-11-08 21:40       ` Song Liu
  2022-11-13  9:58     ` Mike Rapoport
  1 sibling, 1 reply; 93+ messages in thread
From: Christophe Leroy @ 2022-11-08 19:43 UTC (permalink / raw)
  To: Song Liu, Mike Rapoport
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, mcgrof



Le 08/11/2022 à 19:41, Song Liu a écrit :
> On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote:
>>
>> Hi Song,
>>
>> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
>>> This patchset tries to address the following issues:
>>>
>>> 1. Direct map fragmentation
>>>
>>> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
>>> RO+X. These set_memory_* calls cause 1GB page table entries to be split
>>> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
>>> and slower page table, and pressure for both instruction and data TLB.
>>>
>>> Our previous work in bpf_prog_pack tries to address this issue from BPF
>>> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
>>> greatly reduced direct map fragmentation from BPF programs.
>>
>> Usage of set_memory_* APIs with memory allocated from vmalloc/modules
>> virtual range does not change the direct map, but only updates the
>> permissions in vmalloc range. The direct map splits occur in
>> vm_remove_mappings() when the memory is *freed*.
>>
>> That said, both bpf_prog_pack and these patches do reduce the
>> fragmentation, but this happens because the memory is freed to the system
>> in 2M chunks and there are no splits of 2M pages. Besides, since the same
>> 2M page is used for many BPF programs there should be way fewer vfree() calls.
>>
>>> 2. iTLB pressure from BPF program
>>>
>>> Dynamic kernel text such as modules and BPF programs (even with current
>>> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
>>> BPF program is big, we can see visible performance drop caused by high
>>> iTLB miss rate.
>>
>> Like Luis mentioned several times already, it would be nice to see numbers.
>>
>>> 3. TLB shootdown for short-living BPF programs
>>>
>>> Before bpf_prog_pack, loading and unloading BPF programs required a global
>>> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
>>> TLB flush.
>>>
>>> 4. Reduce memory usage by BPF programs (in some cases)
>>>
>>> Most BPF programs and various trampolines are small, and they often
>>> occupy a whole page. From a random server in our fleet, 50% of the
>>> loaded BPF programs are less than 500 bytes in size, and 75% of them are
>>> less than 2kB in size. Allowing these BPF programs to share 2MB pages
>>> would yield some memory saving for systems with many BPF programs. For
>>> systems with only a small number of BPF programs, this patch may waste a
>>> little memory by allocating one 2MB page, but using only part of it.
>>
>> I'm not convinced there are memory savings here. Unless you have hundreds
>> of BPF programs, most of 2M page will be wasted, won't it?
>> So for systems that have moderate use of BPF most of the 2M page will be
>> unused, right?
> 
> There will be some memory waste in such cases. But it will get better with:
> 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
> section (_stext to _etext);
> 2) modules, ftrace, kprobe will also share this 2MB page;
> 3) There are bigger BPF programs in many use cases.

And what I love with this series (for powerpc/32) is that we will likely 
now be able to have bpf, ftrace, kprobe without the performance cost of 
CONFIG_MODULES.

Today, CONFIG_MODULES means page mapping, which means handling of kernel 
page in ITLB miss handlers.

By using some of the space between end of rodata and start of inittext, 
we are able to use ROX linear memory which is mapped by blocks. It means 
there is no need to handle kernel text in ITLB handler (You can look at 
https://elixir.bootlin.com/linux/v6.1-rc3/source/arch/powerpc/kernel/head_8xx.S#L191 
to better understand what I'm talking about).

Thanks
Christophe


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 19:43     ` Christophe Leroy
@ 2022-11-08 21:40       ` Song Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-08 21:40 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Mike Rapoport, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, aaron.lu, mcgrof

On Tue, Nov 8, 2022 at 11:43 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 08/11/2022 à 19:41, Song Liu a écrit :
> > On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote:
> >>
> >> Hi Song,
> >>
> >> On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> >>> This patchset tries to address the following issues:
> >>>
> >>> 1. Direct map fragmentation
> >>>
> >>> On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> >>> RO+X. These set_memory_* calls cause 1GB page table entries to be split
> >>> into 2MB and 4kB ones. This fragmentation in direct map results in bigger
> >>> and slower page table, and pressure for both instruction and data TLB.
> >>>
> >>> Our previous work in bpf_prog_pack tries to address this issue from BPF
> >>> program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> >>> greatly reduced direct map fragmentation from BPF programs.
> >>
> >> Usage of set_memory_* APIs with memory allocated from vmalloc/modules
> >> virtual range does not change the direct map, but only updates the
> >> permissions in vmalloc range. The direct map splits occur in
> >> vm_remove_mappings() when the memory is *freed*.
> >>
> >> That said, both bpf_prog_pack and these patches do reduce the
> >> fragmentation, but this happens because the memory is freed to the system
> >> in 2M chunks and there are no splits of 2M pages. Besides, since the same
> >> 2M page used for many BPF programs there should be way less vfree() calls.
> >>
> >>> 2. iTLB pressure from BPF program
> >>>
> >>> Dynamic kernel text such as modules and BPF programs (even with current
> >>> bpf_prog_pack) use 4kB pages on x86, when the total size of modules and
> >>> BPF program is big, we can see visible performance drop caused by high
> >>> iTLB miss rate.
> >>
> >> Like Luis mentioned several times already, it would be nice to see numbers.
> >>
> >>> 3. TLB shootdown for short-living BPF programs
> >>>
> >>> Before bpf_prog_pack loading and unloading BPF programs requires global
> >>> TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> >>> TLB flush.
> >>>
> >>> 4. Reduce memory usage by BPF programs (in some cases)
> >>>
> >>> Most BPF programs and various trampolines are small, and they often
> >>> occupies a whole page. From a random server in our fleet, 50% of the
> >>> loaded BPF programs are less than 500 byte in size, and 75% of them are
> >>> less than 2kB in size. Allowing these BPF programs to share 2MB pages
> >>> would yield some memory saving for systems with many BPF programs. For
> >>> systems with only small number of BPF programs, this patch may waste a
> >>> little memory by allocating one 2MB page, but using only part of it.
> >>
> >> I'm not convinced there are memory savings here. Unless you have hundreds
> >> of BPF programs, most of 2M page will be wasted, won't it?
> >> So for systems that have moderate use of BPF most of the 2M page will be
> >> unused, right?
> >
> > There will be some memory waste in such cases. But it will get better with:
> > 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
> > section (_stext to _etext);
> > 2) modules, ftrace, kprobe will also share this 2MB page;
> > 3) There are bigger BPF programs in many use cases.
>
> And what I love with this series (for powerpc/32) is that we will likely
> now be able to have bpf, ftrace, kprobe without the performance cost of
> CONFIG_MODULES.

Yeah, I remember reading emails about using tracing tools without
CONFIG_MODULES. We still need more work (beyond this set) to make that
happen for powerpc/32. For example, the current powerpc bpf_jit doesn't
support JITing into ROX memory.

Song


>
> Today, CONFIG_MODULES means page mapping, which means handling of kernel
> page in ITLB miss handlers.
>
> By using some of the space between end of rodata and start of inittext,
> we are able to use ROX linear memory which is mapped by blocks. It means
> there is no need to handle kernel text in ITLB handler (You can look at
> https://elixir.bootlin.com/linux/v6.1-rc3/source/arch/powerpc/kernel/head_8xx.S#L191
> to better understand what I'm talking about).
>
> Thanks
> Christophe


* Re: [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm
  2022-11-08 19:04   ` Edgecombe, Rick P
@ 2022-11-08 22:15     ` Song Liu
  2022-11-15 17:28       ` Edgecombe, Rick P
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-08 22:15 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-mm, bpf, hch, mcgrof, peterz, Lu, Aaron, akpm, x86, rppt

On Tue, Nov 8, 2022 at 11:04 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-11-07 at 14:39 -0800, Song Liu wrote:
> > Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
> > [round_up(_etext, 4kb), round_up(_etext, 2MB)] with
> > register_text_tail_vm
> > so that we can use this part of memory for dynamic kernel text (BPF
> > programs, etc.).
> >
> > Here is an example:
> >
> > [root@eth50-1 ~]# grep _etext /proc/kallsyms
> > ffffffff82202a08 T _etext
> >
> > [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> > ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> > ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> > ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> >
> > [root@eth50-1 ~]# grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
> > 0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x pmd
> >
> > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text and
> > bpf programs.
> >
> > Signed-off-by: Song Liu <song@kernel.org>
>
> Please update Documentation/x86/x86_64/mm.txt and teach places that
> check if an address is text about it.

For mm.rst, I got something like:

=========================== 8< ===========================

diff --git i/Documentation/x86/x86_64/mm.rst w/Documentation/x86/x86_64/mm.rst
index 9798676bb0bf..ac041b7d3965 100644
--- i/Documentation/x86/x86_64/mm.rst
+++ w/Documentation/x86/x86_64/mm.rst
@@ -62,7 +62,7 @@ Complete virtual memory map with 4-level page tables
    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel and module text mapping, mapped to physical address 0
    ffffffff80000000 |-2048    MB |                  |         |
    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
    ffffffffff000000 |  -16    MB |                  |         |
@@ -121,7 +121,7 @@ Complete virtual memory map with 5-level page tables
    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel and module text mapping, mapped to physical address 0
    ffffffff80000000 |-2048    MB |                  |         |
    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
    ffffffffff000000 |  -16    MB |                  |         |

=========================== 8< ===========================

Is this good enough?

I added an extra check in is_vmalloc_or_module_addr() (4/5). Where else do we
need similar logic?

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 12:38   ` Aaron Lu
@ 2022-11-09  6:55     ` Christoph Hellwig
  2022-11-09 11:05       ` Peter Zijlstra
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Hellwig @ 2022-11-09  6:55 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Mike Rapoport, Song Liu, bpf, linux-mm, akpm, x86, peterz, hch,
	rick.p.edgecombe, mcgrof

On Tue, Nov 08, 2022 at 08:38:32PM +0800, Aaron Lu wrote:
> set_memory_nx/x() on a vmalloced range will not affect direct map but
> set_memory_ro/rw() will.

Which seems a little odd.  Is there any good reason to not also propagate
the NX bit?


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09  6:55     ` Christoph Hellwig
@ 2022-11-09 11:05       ` Peter Zijlstra
  0 siblings, 0 replies; 93+ messages in thread
From: Peter Zijlstra @ 2022-11-09 11:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Aaron Lu, Mike Rapoport, Song Liu, bpf, linux-mm, akpm, x86,
	rick.p.edgecombe, mcgrof

On Wed, Nov 09, 2022 at 07:55:12AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 08, 2022 at 08:38:32PM +0800, Aaron Lu wrote:
> > set_memory_nx/x() on a vmalloced range will not affect direct map but
> > set_memory_ro/rw() will.
> 
> Which seems a little odd.  Is there any good reason to not also propagate
> the NX bit?

Less executable ranges more better. That is, the direct map is *always*
NX.




* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 16:51   ` Edgecombe, Rick P
  2022-11-08 18:50     ` Song Liu
@ 2022-11-09 11:17     ` Mike Rapoport
  2022-11-09 17:04       ` Edgecombe, Rick P
  2022-11-09 17:43       ` Song Liu
  1 sibling, 2 replies; 93+ messages in thread
From: Mike Rapoport @ 2022-11-09 11:17 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: song, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote:
> > > Based on our experiments [5], we measured 0.5% performance
> > > improvement
> > > from bpf_prog_pack. This patchset further boosts the improvement to
> > > 0.7%.
> > > The difference is because bpf_prog_pack uses 512x 4kB pages instead
> > > of
> > > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> > > 
> > > This patchset replaces bpf_prog_pack with a better API and makes it
> > > available for other dynamic kernel text, such as modules, ftrace,
> > > kprobe.
> > 
> >  
> > The proposed execmem_alloc() looks to me very much tailored for x86
> > to be
> > used as a replacement for module_alloc(). Some architectures have
> > module_alloc() that is quite different from the default or x86
> > version, so
> > I'd expect at least some explanation how modules etc can use execmem_
> > APIs
> > without breaking !x86 architectures.
> 
> I think this is fair, but I think we should ask ask ourselves - how
> much should we do in one step?

I think that at least we need an evidence that execmem_alloc() etc can be
actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
for him at all, so having a core MM API for code allocation that only works
with BPF on x86 seems not right to me.
 
> For non-text_poke() architectures, the way you can make it work is have
> the API look like:
> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> execmem_write()  <- Loads the mapping, doesn't work after finish()
> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> 
> So for text_poke():
> execmem_alloc()  <- reserves the mapping
> execmem_write()  <- text_pokes() to the mapping
> execmem_finish() <- does nothing
> 
> And non-text_poke():
> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> execmem_write()  <- Writes normally to it
> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> 
> Non-text_poke() only gets the benefits of centralized logic, but the
> interface works for both. This is pretty much what the perm_alloc() RFC
> did to make it work with other arch's and modules. But to fit with the
> existing modules code (which is actually spread all over) and also
> handle RO sections, it also needed some additional bells and whistles.

I'm less concerned about non-text_poke() part, but rather about
restrictions where code and data can live on different architectures and
whether these restrictions won't lead to inability to use the centralized
logic on, say, arm64 and powerpc.

For instance, if we use execmem_alloc() for modules, it means that data
sections should be allocated separately with plain vmalloc(). Will this
work universally? Or this will require special care with additional
complexity in the modules code?
 
> So the question I'm trying to ask is, how much should we target for the
> next step? I first thought that this functionality was so intertwined,
> it would be too hard to do iteratively. So if we want to try
> iteratively, I'm ok if it doesn't solve everything.
 
With execmem_alloc() as the first step I'm failing to see the large
picture. If we want to use it for modules, how will we allocate RO data?
with similar rodata_alloc() that uses yet another tree in vmalloc? 
How the caching of large pages in vmalloc can be made useful for use cases
like secretmem and PKS?

-- 
Sincerely yours,
Mike.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 11:17     ` Mike Rapoport
@ 2022-11-09 17:04       ` Edgecombe, Rick P
  2022-11-09 17:53         ` Song Liu
  2022-11-13 10:34         ` Mike Rapoport
  2022-11-09 17:43       ` Song Liu
  1 sibling, 2 replies; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-09 17:04 UTC (permalink / raw)
  To: rppt; +Cc: peterz, bpf, linux-mm, song, hch, x86, akpm, mcgrof, Lu, Aaron

On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote:
> On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2022-11-08 at 13:27 +0200, Mike Rapoport wrote:
> > > > Based on our experiments [5], we measured 0.5% performance
> > > > improvement
> > > > from bpf_prog_pack. This patchset further boosts the
> > > > improvement to
> > > > 0.7%.
> > > > The difference is because bpf_prog_pack uses 512x 4kB pages
> > > > instead
> > > > of
> > > > 1x 2MB page, bpf_prog_pack as-is doesn't resolve #2 above.
> > > > 
> > > > This patchset replaces bpf_prog_pack with a better API and
> > > > makes it
> > > > available for other dynamic kernel text, such as modules,
> > > > ftrace,
> > > > kprobe.
> > > 
> > >   
> > > The proposed execmem_alloc() looks to me very much tailored for
> > > x86
> > > to be
> > > used as a replacement for module_alloc(). Some architectures have
> > > module_alloc() that is quite different from the default or x86
> > > version, so
> > > I'd expect at least some explanation how modules etc can use
> > > execmem_
> > > APIs
> > > without breaking !x86 architectures.
> > 
> > I think this is fair, but I think we should ask ask ourselves - how
> > much should we do in one step?
> 
> I think that at least we need an evidence that execmem_alloc() etc
> can be
> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't
> work
> for him at all, so having a core MM API for code allocation that only
> works
> with BPF on x86 seems not right to me.

Those module changes wouldn't work on non-x86 either. Most of the modules
code is cross-arch, so this kind of has to work for non-text_poke(), or
modules needs to be refactored.

>  
> > For non-text_poke() architectures, the way you can make it work is
> > have
> > the API look like:
> > execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> > execmem_write()  <- Loads the mapping, doesn't work after finish()
> > execmem_finish() <- Makes the mapping live (loaded, executable,
> > ready)
> > 
> > So for text_poke():
> > execmem_alloc()  <- reserves the mapping
> > execmem_write()  <- text_pokes() to the mapping
> > execmem_finish() <- does nothing
> > 
> > And non-text_poke():
> > execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> > execmem_write()  <- Writes normally to it
> > execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> > 
> > Non-text_poke() only gets the benefits of centralized logic, but
> > the
> > interface works for both. This is pretty much what the perm_alloc()
> > RFC
> > did to make it work with other arch's and modules. But to fit with
> > the
> > existing modules code (which is actually spread all over) and also
> > handle RO sections, it also needed some additional bells and
> > whistles.
> 
> I'm less concerned about non-text_poke() part, but rather about
> restrictions where code and data can live on different architectures
> and
> whether these restrictions won't lead to inability to use the
> centralized
> logic on, say, arm64 and powerpc.
> 
> For instance, if we use execmem_alloc() for modules, it means that
> data
> sections should be allocated separately with plain vmalloc(). Will
> this
> work universally? Or this will require special care with additional
> complexity in the modules code?

Good point. If the module data was still in the modules range, I would
hope it would still work, but there are a lot of architectures to
check. Some might care if the data is really close to the text. I'm not
sure.

The perm_alloc() stuff did some hacks to force the allocations close to
each other out of paranoia about this. Basically started with one
allocation, but then tracked the pieces separately so arch's could
separate them if they wanted. But I wondered if it was really needed.

>  
> > So the question I'm trying to ask is, how much should we target for
> > the
> > next step? I first thought that this functionality was so
> > intertwined,
> > it would be too hard to do iteratively. So if we want to try
> > iteratively, I'm ok if it doesn't solve everything.
> 
>  
> With execmem_alloc() as the first step I'm failing to see the large
> picture. If we want to use it for modules, how will we allocate RO
> data?

Similar to the perm_alloc() hacks?

> with similar rodata_alloc() that uses yet another tree in vmalloc? 

It would have to group them together at least. Not sure if it needs a
separate tree or not. I would think permission flags would be better
than a new function for each memory type.

> How the caching of large pages in vmalloc can be made useful for use
> cases
> like secretmem and PKS?

This part is easy I think. If we had an unmapped page allocator it
could just feed this. Do you have any idea when you might pick up that
stuff again?

To answer my own question, I think a good first step would be to make
the interface also work for non-text_poke(), so it could really be
cross-arch, then use it for everything except modules. The benefit to the
other archs at that point is centralized handling of loading text.
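A user-space model of the alloc/write/finish contract discussed in this
thread might look like the following (all names and semantics here are
hypothetical, sketching the proposal rather than any in-tree API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct execmem {
	void *buf;
	size_t size;
	int sealed;          /* set once the mapping is made live */
};

/* execmem_alloc(): on a non-text_poke() arch, a plain RW allocation */
static struct execmem *em_alloc(size_t size)
{
	struct execmem *em = malloc(sizeof(*em));
	em->buf = malloc(size);
	em->size = size;
	em->sealed = 0;
	return em;
}

/* execmem_write(): must stay in bounds and refuse sealed mappings */
static int em_write(struct execmem *em, size_t off, const void *src, size_t len)
{
	if (em->sealed || off + len > em->size)
		return -1;
	memcpy((char *)em->buf + off, src, len);
	return 0;
}

/* execmem_finish(): would do set_memory_ro()/set_memory_x() for real */
static void em_finish(struct execmem *em)
{
	em->sealed = 1;
}
```

On a text_poke() arch, em_alloc() would instead hand out RO+X memory,
em_write() would go through text_poke(), and em_finish() would be a no-op;
the caller-visible contract stays the same either way.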


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 11:17     ` Mike Rapoport
  2022-11-09 17:04       ` Edgecombe, Rick P
@ 2022-11-09 17:43       ` Song Liu
  2022-11-09 21:23           ` Christophe Leroy
  2022-11-13 10:42         ` Mike Rapoport
  1 sibling, 2 replies; 93+ messages in thread
From: Song Liu @ 2022-11-09 17:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
>
[...]

> > >
> > > The proposed execmem_alloc() looks to me very much tailored for x86
> > > to be
> > > used as a replacement for module_alloc(). Some architectures have
> > > module_alloc() that is quite different from the default or x86
> > > version, so
> > > I'd expect at least some explanation how modules etc can use execmem_
> > > APIs
> > > without breaking !x86 architectures.
> >
> > I think this is fair, but I think we should ask ask ourselves - how
> > much should we do in one step?
>
> I think that at least we need an evidence that execmem_alloc() etc can be
> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> for him at all, so having a core MM API for code allocation that only works
> with BPF on x86 seems not right to me.

While using execmem_alloc() et al. in module support is difficult, folks are
making progress with it. For example, the prototype would have been more
difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
(introduced by Christophe).

We also have other users that we can onboard soon: BPF trampoline on
x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
s390.

>
> > For non-text_poke() architectures, the way you can make it work is have
> > the API look like:
> > execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> > execmem_write()  <- Loads the mapping, doesn't work after finish()
> > execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >
> > So for text_poke():
> > execmem_alloc()  <- reserves the mapping
> > execmem_write()  <- text_pokes() to the mapping
> > execmem_finish() <- does nothing
> >
> > And non-text_poke():
> > execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> > execmem_write()  <- Writes normally to it
> > execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >
> > Non-text_poke() only gets the benefits of centralized logic, but the
> > interface works for both. This is pretty much what the perm_alloc() RFC
> > did to make it work with other arch's and modules. But to fit with the
> > existing modules code (which is actually spread all over) and also
> > handle RO sections, it also needed some additional bells and whistles.
>
> I'm less concerned about non-text_poke() part, but rather about
> restrictions where code and data can live on different architectures and
> whether these restrictions won't lead to inability to use the centralized
> logic on, say, arm64 and powerpc.
>
> For instance, if we use execmem_alloc() for modules, it means that data
> sections should be allocated separately with plain vmalloc(). Will this
> work universally? Or this will require special care with additional
> complexity in the modules code?
>
> > So the question I'm trying to ask is, how much should we target for the
> > next step? I first thought that this functionality was so intertwined,
> > it would be too hard to do iteratively. So if we want to try
> > iteratively, I'm ok if it doesn't solve everything.
>
> With execmem_alloc() as the first step I'm failing to see the large
> picture. If we want to use it for modules, how will we allocate RO data?
> with similar rodata_alloc() that uses yet another tree in vmalloc?
> How the caching of large pages in vmalloc can be made useful for use cases
> like secretmem and PKS?

If RO data causes problems with direct map fragmentation, we can use
similar logic. I think we will need another tree in vmalloc for this case.
Since the logic will be mostly identical, I personally don't think adding
another tree is a big overhead.
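The carve-out logic itself is indeed small either way; here is a toy
user-space model of sharing one 2MB area among many small allocations (the
64-byte granularity and first-fit scan are assumptions for this sketch, not
necessarily what bpf_prog_pack does):

```c
#include <assert.h>
#include <stdbool.h>

#define PACK_SIZE (2UL * 1024 * 1024)   /* one shared 2MB area */
#define CHUNK     64UL                  /* allocation granularity */
#define NCHUNKS   (PACK_SIZE / CHUNK)

static bool used[NCHUNKS];

/* first-fit carve-out; returns the starting chunk index, or -1 if full */
static long pack_alloc(unsigned long size)
{
	unsigned long need = (size + CHUNK - 1) / CHUNK;

	for (unsigned long i = 0; i + need <= NCHUNKS; i++) {
		unsigned long j;

		for (j = 0; j < need && !used[i + j]; j++)
			;
		if (j == need) {
			for (j = 0; j < need; j++)
				used[i + j] = true;
			return (long)i;
		}
	}
	return -1;
}

static void pack_free(long idx, unsigned long size)
{
	unsigned long need = (size + CHUNK - 1) / CHUNK;

	for (unsigned long j = 0; j < need; j++)
		used[idx + j] = false;
}
```

An RO-data variant would reuse essentially this bookkeeping over a different
backing tree, which is why the extra tree looks like small overhead to me.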

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 17:04       ` Edgecombe, Rick P
@ 2022-11-09 17:53         ` Song Liu
  2022-11-13 10:34         ` Mike Rapoport
  1 sibling, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-09 17:53 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Wed, Nov 9, 2022 at 9:04 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
[...]
>
> Similar to the perm_alloc() hacks?
>
> > with similar rodata_alloc() that uses yet another tree in vmalloc?
>
> It would have to group them together at least. Not sure if it needs a
> separate tree or not. I would think permission flags would be better
> than a new function for each memory type.
>
> > How the caching of large pages in vmalloc can be made useful for use
> > cases
> > like secretmem and PKS?
>
> This part is easy I think. If we had an unmapped page allocator it
> could just feed this. Do you have any idea when you might pick up that
> stuff again?
>
> To answer my own question, I think a good first step would be to make
> the interface also work for non-text_poke() so it could really be cross
> arch, then use it for everything except modules. The benefit to the
> other arch's at that point is centralized handling of loading text.

AFAICT, most major archs are introducing text_poke or similar support.
What would be a good non-text_poke major arch to look into?

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 17:43       ` Song Liu
@ 2022-11-09 21:23           ` Christophe Leroy
  2022-11-13 10:42         ` Mike Rapoport
  1 sibling, 0 replies; 93+ messages in thread
From: Christophe Leroy @ 2022-11-09 21:23 UTC (permalink / raw)
  To: Song Liu, Mike Rapoport
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron, linuxppc-dev

+ linuxppc-dev list as we start mentioning powerpc.

Le 09/11/2022 à 18:43, Song Liu a écrit :
> On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
>>
> [...]
> 
>>>>
>>>> The proposed execmem_alloc() looks to me very much tailored for x86
>>>> to be
>>>> used as a replacement for module_alloc(). Some architectures have
>>>> module_alloc() that is quite different from the default or x86
>>>> version, so
>>>> I'd expect at least some explanation how modules etc can use execmem_
>>>> APIs
>>>> without breaking !x86 architectures.
>>>
>>> I think this is fair, but I think we should ask ask ourselves - how
>>> much should we do in one step?
>>
>> I think that at least we need an evidence that execmem_alloc() etc can be
>> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
>> for him at all, so having a core MM API for code allocation that only works
>> with BPF on x86 seems not right to me.
> 
> While using execmem_alloc() et al. in module support is difficult, folks are
> making progress with it. For example, the prototype would have been more
> difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> (introduced by Christophe).

By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC 
was completely different: on powerpc book3s/32, no-exec flagging is per 
256 Mbyte segment, so to provide STRICT_MODULES_RWX it was necessary to 
put data outside of the segment that holds module text, in order to be 
able to flag RW data as no-exec.

But I'm happy if it can also serve other purposes.

> 
> We also have other users that we can onboard soon: BPF trampoline on
> x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> s390.
> 
>>
>>> For non-text_poke() architectures, the way you can make it work is have
>>> the API look like:
>>> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
>>> execmem_write()  <- Loads the mapping, doesn't work after finish()
>>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
>>>
>>> So for text_poke():
>>> execmem_alloc()  <- reserves the mapping
>>> execmem_write()  <- text_pokes() to the mapping
>>> execmem_finish() <- does nothing
>>>
>>> And non-text_poke():
>>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
>>> execmem_write()  <- Writes normally to it
>>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
>>>
>>> Non-text_poke() only gets the benefits of centralized logic, but the
>>> interface works for both. This is pretty much what the perm_alloc() RFC
>>> did to make it work with other arch's and modules. But to fit with the
>>> existing modules code (which is actually spread all over) and also
>>> handle RO sections, it also needed some additional bells and whistles.
>>
>> I'm less concerned about non-text_poke() part, but rather about
>> restrictions where code and data can live on different architectures and
>> whether these restrictions won't lead to inability to use the centralized
>> logic on, say, arm64 and powerpc.

Until recently, powerpc CPUs didn't implement PC-relative data access. 
Only very recent powerpc CPUs (power10 only?) have the capability to do 
PC-relative accesses, but the kernel doesn't use it yet.
constraint about distance between text and data. What matters is the 
distance between core kernel text and module text to avoid trampolines.

>>
>> For instance, if we use execmem_alloc() for modules, it means that data
>> sections should be allocated separately with plain vmalloc(). Will this
>> work universally? Or this will require special care with additional
>> complexity in the modules code?
>>
>>> So the question I'm trying to ask is, how much should we target for the
>>> next step? I first thought that this functionality was so intertwined,
>>> it would be too hard to do iteratively. So if we want to try
>>> iteratively, I'm ok if it doesn't solve everything.
>>
>> With execmem_alloc() as the first step I'm failing to see the large
>> picture. If we want to use it for modules, how will we allocate RO data?
>> with similar rodata_alloc() that uses yet another tree in vmalloc?
>> How the caching of large pages in vmalloc can be made useful for use cases
>> like secretmem and PKS?
> 
> If RO data causes problems with direct map fragmentation, we can use
> similar logic. I think we will need another tree in vmalloc for this case.
> Since the logic will be mostly identical, I personally don't think adding
> another tree is a big overhead.

On powerpc, kernel core RAM is not mapped by pages but is mapped by 
blocks. There are only two blocks: One ROX block which contains both 
text and rodata, and one RW block that contains everything else. Maybe 
the same can be done for modules. What matters is to be sure you never 
have WX memory. Having ROX rodata is not an issue.

Christophe

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 21:23           ` Christophe Leroy
@ 2022-11-10  1:50             ` Song Liu
  -1 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-10  1:50 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Mike Rapoport, Edgecombe, Rick P, peterz, bpf, linux-mm, hch,
	x86, akpm, mcgrof, Lu, Aaron, linuxppc-dev

On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> + linuxppc-dev list as we start mentioning powerpc.
>
> Le 09/11/2022 à 18:43, Song Liu a écrit :
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
> >>
> > [...]
> >
> >>>>
> >>>> The proposed execmem_alloc() looks to me very much tailored for x86
> >>>> to be
> >>>> used as a replacement for module_alloc(). Some architectures have
> >>>> module_alloc() that is quite different from the default or x86
> >>>> version, so
> >>>> I'd expect at least some explanation how modules etc can use execmem_
> >>>> APIs
> >>>> without breaking !x86 architectures.
> >>>
> >>> I think this is fair, but I think we should ask ourselves - how
> >>> much should we do in one step?
> >>
> >> I think that at least we need evidence that execmem_alloc() etc can be
> >> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> >> for him at all, so having a core MM API for code allocation that only works
> >> with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et al. in module support is difficult, folks are
> > making progress with it. For example, the prototype would have been more
> > difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
>
> By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> was completely different: on powerpc book3s/32, the no-exec flag applies
> per 256 MB segment, so in order to provide STRICT_MODULES_RWX it was
> necessary to put data outside of the segment that holds module text, to
> be able to flag RW data as no-exec.

Yeah, I only noticed the actual motivation of this work earlier today. :)

>
> But I'm happy if it can also serve other purposes.
>
> >
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> > s390.
> >
> >>
> >>> For non-text_poke() architectures, the way you can make it work is have
> >>> the API look like:
> >>> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> >>> execmem_write()  <- Loads the mapping, doesn't work after finish()
> >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >>>
> >>> So for text_poke():
> >>> execmem_alloc()  <- reserves the mapping
> >>> execmem_write()  <- text_pokes() to the mapping
> >>> execmem_finish() <- does nothing
> >>>
> >>> And non-text_poke():
> >>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> >>> execmem_write()  <- Writes normally to it
> >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >>>
> >>> Non-text_poke() only gets the benefits of centralized logic, but the
> >>> interface works for both. This is pretty much what the perm_alloc() RFC
> >>> did to make it work with other arch's and modules. But to fit with the
> >>> existing modules code (which is actually spread all over) and also
> >>> handle RO sections, it also needed some additional bells and whistles.
> >>
> >> I'm less concerned about non-text_poke() part, but rather about
> >> restrictions where code and data can live on different architectures and
> >> whether these restrictions won't lead to inability to use the centralized
> >> logic on, say, arm64 and powerpc.
>
> Until recently, powerpc CPU didn't implement PC-relative data access.
> Only very recent powerpc CPUs (power10 only ?) have capability to do
> PC-relative accesses, but the kernel doesn't use it yet. So there's no
> constraint about distance between text and data. What matters is the
> distance between core kernel text and module text to avoid trampolines.

Ah, this is great. I guess this means powerpc can benefit from this work
with much less effort than x86_64.

>
> >>
> >> For instance, if we use execmem_alloc() for modules, it means that data
> >> sections should be allocated separately with plain vmalloc(). Will this
> >> work universally? Or this will require special care with additional
> >> complexity in the modules code?
> >>
> >>> So the question I'm trying to ask is, how much should we target for the
> >>> next step? I first thought that this functionality was so intertwined,
> >>> it would be too hard to do iteratively. So if we want to try
> >>> iteratively, I'm ok if it doesn't solve everything.
> >>
> >> With execmem_alloc() as the first step I'm failing to see the large
> >> picture. If we want to use it for modules, how will we allocate RO data?
> >> with similar rodata_alloc() that uses yet another tree in vmalloc?
> >> How the caching of large pages in vmalloc can be made useful for use cases
> >> like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> On powerpc, kernel core RAM is not mapped by pages but is mapped by
> blocks. There are only two blocks: One ROX block which contains both
> text and rodata, and one RW block that contains everything else. Maybe
> the same can be done for modules. What matters is to be sure you never
> have WX memory. Having ROX rodata is not an issue.

Got it. Thanks!

Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-08 18:41   ` Song Liu
  2022-11-08 19:43     ` Christophe Leroy
@ 2022-11-13  9:58     ` Mike Rapoport
  2022-11-14 20:13       ` Song Liu
  1 sibling, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-11-13  9:58 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, mcgrof

On Tue, Nov 08, 2022 at 10:41:53AM -0800, Song Liu wrote:
> On Tue, Nov 8, 2022 at 3:27 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > Hi Song,
> >
> > On Mon, Nov 07, 2022 at 02:39:16PM -0800, Song Liu wrote:
> > > This patchset tries to address the following issues:
> > >
> > > 1. Direct map fragmentation
> > >
> > > On x86, STRICT_*_RWX requires the direct map of any RO+X memory to be also
> > > RO+X. These set_memory_* calls cause 1GB page table entries to be split
> > > into 2MB and 4kB ones. This fragmentation of the direct map results in bigger
> > > and slower page tables, and pressure on both the instruction and data TLB.
> > >
> > > Our previous work in bpf_prog_pack tries to address this issue from BPF
> > > program side. Based on the experiments by Aaron Lu [4], bpf_prog_pack has
> > > greatly reduced direct map fragmentation from BPF programs.
> >
> > Usage of set_memory_* APIs with memory allocated from vmalloc/modules
> > virtual range does not change the direct map, but only updates the
> > permissions in vmalloc range. The direct map splits occur in
> > vm_remove_mappings() when the memory is *freed*.
> >
> > That said, both bpf_prog_pack and these patches do reduce the
> > fragmentation, but this happens because the memory is freed to the system
> > in 2M chunks and there are no splits of 2M pages. Besides, since the same
> > 2M page used for many BPF programs there should be way less vfree() calls.
> >
> > > 2. iTLB pressure from BPF program
> > >
> > > Dynamic kernel text such as modules and BPF programs (even with current
> > > bpf_prog_pack) uses 4kB pages on x86; when the total size of modules and
> > > BPF programs is large, we can see a visible performance drop caused by a
> > > high iTLB miss rate.
> >
> > Like Luis mentioned several times already, it would be nice to see numbers.
> >
> > > 3. TLB shootdown for short-living BPF programs
> > >
> > > Before bpf_prog_pack, loading and unloading BPF programs required a global
> > > TLB shootdown. This patchset (and bpf_prog_pack) replaces it with a local
> > > TLB flush.
> > >
> > > 4. Reduce memory usage by BPF programs (in some cases)
> > >
> > > Most BPF programs and various trampolines are small, and they often
> > > occupy a whole page. From a random server in our fleet, 50% of the
> > > loaded BPF programs are less than 500 bytes in size, and 75% of them are
> > > less than 2kB in size. Allowing these BPF programs to share 2MB pages
> > > would yield some memory savings for systems with many BPF programs. For
> > > systems with only a small number of BPF programs, this patch may waste a
> > > little memory by allocating one 2MB page, but using only part of it.
> >
> > I'm not convinced there are memory savings here. Unless you have hundreds
> > of BPF programs, most of the 2M page will be wasted, won't it?
> > So for systems that make moderate use of BPF, most of the 2M page will be
> > unused, right?
> 
> There will be some memory waste in such cases. But it will get better with:
> 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
> section (_stext to _etext);
> 2) modules, ftrace, kprobe will also share this 2MB page;

Unless I'm missing something, what will be shared is the virtual space; the
actual physical pages will still be allocated the same way as for any
vmalloc() allocation.

> 3) There are bigger BPF programs in many use cases.
 
With the statistics you provided above, one would need hundreds if not
thousands of BPF programs to fill a 2M page. I didn't do the math, but it
seems that to see memory savings there should be several hundred BPF programs.

> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 17:04       ` Edgecombe, Rick P
  2022-11-09 17:53         ` Song Liu
@ 2022-11-13 10:34         ` Mike Rapoport
  2022-11-14 20:30           ` Song Liu
  1 sibling, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-11-13 10:34 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, bpf, linux-mm, song, hch, x86, akpm, mcgrof, Lu, Aaron

On Wed, Nov 09, 2022 at 05:04:25PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote:
> > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
>
> > How the caching of large pages in vmalloc can be made useful for use
> > cases like secretmem and PKS?
> 
> This part is easy I think. If we had an unmapped page allocator it
> could just feed this. 

The unmapped page allocator could be used by anything that needs
non-default permissions in the direct map and knows how to map the pages
elsewhere. E.g. it would have been a one-liner to switch x86::module_alloc()
to use unmapped allocations. But ...

> Do you have any idea when you might pick up that stuff again?

... unfortunately I don't see it happening anytime soon.
 
> To answer my own question, I think a good first step would be to make
> the interface also work for non-text_poke() so it could really be cross
> arch, then use it for everything except modules. The benefit to the
> other arch's at that point is centralized handling of loading text. 

My concern is that the proposed execmem_alloc() cannot be used for
centralized handling of loading text. I'm not familiar enough with
modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
my gut feeling is that the proposed execmem_alloc() won't be an improvement
but rather a hindrance for moving to centralized handling of loading text.

It feels to me that a lot of ground work is needed to get to the point
where we can use centralized handling of loading text.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 17:43       ` Song Liu
  2022-11-09 21:23           ` Christophe Leroy
@ 2022-11-13 10:42         ` Mike Rapoport
  2022-11-14 20:45           ` Song Liu
  1 sibling, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-11-13 10:42 UTC (permalink / raw)
  To: Song Liu
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Wed, Nov 09, 2022 at 09:43:50AM -0800, Song Liu wrote:
> On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> [...]
> 
> > > >
> > > > The proposed execmem_alloc() looks to me very much tailored for x86
> > > > to be
> > > > used as a replacement for module_alloc(). Some architectures have
> > > > module_alloc() that is quite different from the default or x86
> > > > version, so
> > > > I'd expect at least some explanation how modules etc can use execmem_
> > > > APIs
> > > > without breaking !x86 architectures.
> > >
> > > I think this is fair, but I think we should ask ourselves - how
> > > much should we do in one step?
> >
> > I think that at least we need evidence that execmem_alloc() etc can be
> > actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> > for him at all, so having a core MM API for code allocation that only works
> > with BPF on x86 seems not right to me.
> 
> While using execmem_alloc() et al. in module support is difficult, folks are
> making progress with it. For example, the prototype would have been more
> difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> (introduced by Christophe).
> 
> We also have other users that we can onboard soon: BPF trampoline on
> x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> s390.

Caching of large pages won't make any difference on arm64 and powerpc
because they do not support splitting of the direct map, so the only
potential benefit there is a centralized handling of text loading and I'm
not convinced execmem_alloc() will get us there.

> > With execmem_alloc() as the first step I'm failing to see the large
> > picture. If we want to use it for modules, how will we allocate RO data?
> > with similar rodata_alloc() that uses yet another tree in vmalloc?
> > How the caching of large pages in vmalloc can be made useful for use cases
> > like secretmem and PKS?
> 
> If RO data causes problems with direct map fragmentation, we can use
> similar logic. I think we will need another tree in vmalloc for this case.
> Since the logic will be mostly identical, I personally don't think adding
> another tree is a big overhead.

Actually, it would be interesting to quantify the memory savings/waste that
would result from using execmem_alloc().
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-13  9:58     ` Mike Rapoport
@ 2022-11-14 20:13       ` Song Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-14 20:13 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
	aaron.lu, mcgrof

On Sun, Nov 13, 2022 at 1:59 AM Mike Rapoport <rppt@kernel.org> wrote:
[...]
> >
> > There will be some memory waste in such cases. But it will get better with:
> > 1) With 4/5 and 5/5, BPF programs will share this 2MB page with kernel .text
> > section (_stext to _etext);
> > 2) modules, ftrace, kprobe will also share this 2MB page;
>
> Unless I'm missing something, what will be shared is the virtual space, the
> actual physical pages will be still allocated the same way as any vmalloc()
> allocation.

What do you mean by the virtual space being shared while the actual physical
pages are still allocated the same way? This is a 2MB page shared by BPF
programs, modules, etc., so it is 2MB of virtual address space backed by one
2MB physical huge page.
For example, we will allocate one 2MB page, and put 1MB of module text,
512kB of BPF programs, some ftrace trampolines in this page.

>
> > 3) There are bigger BPF programs in many use cases.
>
> With statistics you provided above one will need hundreds if not thousands
> of BPF programs to fill a 2M page. I didn't do the math, but it seems that
> to see memory savings there should be several hundreds of BPF programs.

powerpc is trying to use bpf_prog_pack [1]. IIUC, execmem_alloc() should
allocate 512kB pages for powerpc. This is already yielding memory savings:
on a random system in our fleet (x86_64), BPF programs use 140x 4kB
pages (or 560kB) without execmem_alloc(). They will fit in 200kB with
execmem_alloc(), and we can use the other 300kB+ for modules, ftrace,
etc.

OTOH, 512kB or even 2MB is really small for modern systems, but the iTLB
is always a limited resource.

Thanks,
Song

[1] https://lore.kernel.org/bpf/20221110184303.393179-1-hbathini@linux.ibm.com/

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-13 10:34         ` Mike Rapoport
@ 2022-11-14 20:30           ` Song Liu
  2022-11-15 21:18             ` Luis Chamberlain
  2022-11-17  8:50             ` Mike Rapoport
  0 siblings, 2 replies; 93+ messages in thread
From: Song Liu @ 2022-11-14 20:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Nov 09, 2022 at 05:04:25PM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote:
> > > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
> >
> > > How the caching of large pages in vmalloc can be made useful for use
> > > cases like secretmem and PKS?
> >
> > This part is easy I think. If we had an unmapped page allocator it
> > could just feed this.
>
> The unmapped page allocator could be used by anything that needs
> non-default permissions in the direct map and knows how to map the pages
> elsewhere. E.g it would have been a oneliner to switch x86::module_alloc()
> to use unmapped allocations. But ...
>
> > Do you have any idea when you might pick up that stuff again?
>
> ... unfortunately I don't see it happening anytime soon.
>
> > To answer my own question, I think a good first step would be to make
> > the interface also work for non-text_poke() so it could really be cross
> > arch, then use it for everything except modules. The benefit to the
> > other arch's at that point is centralized handling of loading text.
>
> My concern is that the proposed execmem_alloc() cannot be used for
> centralized handling of loading text. I'm not familiar enough with
> modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
> my gut feeling is that the proposed execmem_alloc() won't be an improvement
> but rather a hindrance for moving to centralized handling of loading text.

I don't follow why this could ever be a hindrance. Luis is very excited about
this, and I am very sure it works for ftrace, kprobes, and BPF.

If there is a better API, it shouldn't be too hard to do the migration. See the
example in 3/5 of the set, where we move x86 BPF jit AND BPF dispatcher
from bpf_prog_pack to execmem_alloc():

  5 files changed, 21 insertions(+), 201 deletions(-)

It is not a one-liner, but it is definitely very easy.

>
> It feels to me that a lot of ground work is needed to get to the point
> where we can use centralized handling of loading text.

Could you please be more specific on what is needed?

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-13 10:42         ` Mike Rapoport
@ 2022-11-14 20:45           ` Song Liu
  2022-11-15 20:51             ` Luis Chamberlain
  2022-11-20 10:44             ` Mike Rapoport
  0 siblings, 2 replies; 93+ messages in thread
From: Song Liu @ 2022-11-14 20:45 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Sun, Nov 13, 2022 at 2:43 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Nov 09, 2022 at 09:43:50AM -0800, Song Liu wrote:
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > [...]
> >
> > > > >
> > > > > The proposed execmem_alloc() looks to me very much tailored for x86
> > > > > to be
> > > > > used as a replacement for module_alloc(). Some architectures have
> > > > > module_alloc() that is quite different from the default or x86
> > > > > version, so
> > > > > I'd expect at least some explanation how modules etc can use execmem_
> > > > > APIs
> > > > > without breaking !x86 architectures.
> > > >
> > > > I think this is fair, but I think we should ask ask ourselves - how
> > > > much should we do in one step?
> > >
> > > I think that at least we need an evidence that execmem_alloc() etc can be
> > > actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> > > for him at all, so having a core MM API for code allocation that only works
> > > with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et. al. in module support is difficult, folks are
> > making progress with it. For example, the prototype would be more difficult
> > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
> >
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> > s390.
>
> Caching of large pages won't make any difference on arm64 and powerpc
> because they do not support splitting of the direct map, so the only
> potential benefit there is a centralized handling of text loading and I'm
> not convinced execmem_alloc() will get us there.

Sharing large pages helps reduce iTLB pressure, which is the second
motivation here (after reducing direct map fragmentation).

>
> > > With execmem_alloc() as the first step I'm failing to see the large
> > > picture. If we want to use it for modules, how will we allocate RO data?
> > > with similar rodata_alloc() that uses yet another tree in vmalloc?
> > > How the caching of large pages in vmalloc can be made useful for use cases
> > > like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> Actually, it would be interesting to quantify memory savings/waste as the
> result of using execmem_alloc()

From a random system in our fleet, execmem_alloc() saves:

139 iTLB entries (1x 2MB entry vs. 140x 4kB entries), which is more than
100% of L1 iTLB and about 10% of L2 TLB.

It wastes 1.5MB memory, which is 0.0023% of system memory (64GB).
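For what it's worth, the arithmetic above checks out; a quick sketch using the quoted figures (140x 4kB pages, one 2MB mapping, 64GB of system memory):

```c
#include <assert.h>

/* Back-of-the-envelope check of the numbers quoted above. */

#define KB 1024UL
#define MB (1024UL * KB)
#define GB (1024UL * MB)

static unsigned long itlb_entries_saved(void)
{
	return 140 - 1;			/* 140 4kB entries become one 2MB entry */
}

static unsigned long waste_bytes(void)
{
	return 2 * MB - 140 * 4 * KB;	/* allocated minus actually used */
}
```

That comes to roughly 1.5MB of waste, or about 0.002% of a 64GB machine.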

I believe this is clearly a good trade-off.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
                   ` (8 preceding siblings ...)
  2022-11-08 11:48 ` Mike Rapoport
@ 2022-11-15  1:30 ` Song Liu
  2022-11-15 17:34   ` Edgecombe, Rick P
                     ` (2 more replies)
  9 siblings, 3 replies; 93+ messages in thread
From: Song Liu @ 2022-11-15  1:30 UTC (permalink / raw)
  To: bpf, linux-mm, peterz, akpm
  Cc: x86, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof

On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
>

[...]

>
>
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
>   execmem_alloc()
>   execmem_free()
>   execmem_fill()
>
> The idea is similar to Peter's suggestion in [1].
>
> execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> memory to its users. execmem_free() is used to free memory allocated by
> execmem_alloc(). execmem_fill() is used to update memory allocated by
> execmem_alloc().
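The call pattern implied above can be modeled in userspace roughly like this. The model_* names and bodies are mine, simplified stand-ins for illustration only; the real functions live in mm/vmalloc.c per patch 1/5:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of the execmem_* semantics: memory from execmem_alloc()
 * is RO+X, so the caller updates it via execmem_fill(), which rejects
 * writes that cross the boundary of one allocated chunk. */

struct chunk { char buf[4096]; size_t size; };

static void *model_execmem_alloc(struct chunk *c, size_t size)
{
	if (size > sizeof(c->buf))
		return NULL;
	c->size = size;
	return c->buf;
}

static int model_execmem_fill(struct chunk *c, void *dst,
			      const void *src, size_t len)
{
	char *p = dst;

	/* refuse updates outside the single chunk returned by alloc */
	if (p < c->buf || p + len > c->buf + c->size)
		return -1;
	memcpy(p, src, len);	/* the kernel would use a text_poke-like copy */
	return 0;
}
```

A fill that stays within the chunk succeeds; one that would run past the chunk's end is rejected, which is the W^X-preserving contract described above.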

Sigh, I just realized this thread made through linux-mm@kvack.org, but got
dropped by bpf@vger.kernel.org, so I guess I will have to resend v3.

Currently, I have got the following action items for v3:
1. Add a unified API to allocate text memory to the motivation;
2. Update Documentation/x86/x86_64/mm.rst;
3. Allow non-PMD_SIZE allocations for powerpc.

1 and 2 are relatively simple. We can probably do 3 in a follow up patch
(as I don't have powerpc environments for testing). Did I miss anything?

Besides these, does this set look reasonable? Andrew and Peter, could
you please share your comments on this?

Thanks,
Song

>
> Memory allocated by execmem_alloc() is RO+X, so this does not violate W^X.
> The caller has to update the content with text_poke like mechanism.
> Specifically, execmem_fill() is provided to update memory allocated by
> execmem_alloc(). execmem_fill() also makes sure the update stays in the
> boundary of one chunk allocated by execmem_alloc(). Please refer to patch
> 1/5 for more details of
>
> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
>
> Patch 4/5 and 5/5 allows static kernel text (_stext to _etext) to share
> PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
> allocating PMD_SIZE pages to roundup(_etext, PMD_SIZE), and then use
> _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
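The tail computation described in that last paragraph is just a PMD-size roundup; a minimal sketch (the example _etext offsets in the test are made up, not taken from a real kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Illustration of patches 4/5 and 5/5: static text ends at _etext, the
 * mapping is extended to the next PMD boundary, and the tail bytes are
 * handed to dynamic kernel text. */

#define PMD_SIZE (2ULL << 20)	/* 2MB */

static uint64_t text_tail_size(uint64_t etext)
{
	uint64_t end = (etext + PMD_SIZE - 1) & ~(PMD_SIZE - 1);

	return end - etext;	/* bytes left for dynamic kernel text */
}
```

An already PMD-aligned _etext leaves no tail; anything else leaves up to just under 2MB for dynamic text.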
>
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
> [3] v1: https://lore.kernel.org/bpf/20221031222541.1773452-1-song@kernel.org/
> [4] https://lore.kernel.org/bpf/Y2ioTodn+mBXdIqp@ziqianlu-desk2/
> [5] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
>
> Changes PATCH v1 => v2:
> 1. Rename the APIs as execmem_* (Christoph Hellwig)
> 2. Add more information about the motivation of this work (and follow up
>    works in for kernel modules, various trampolines, etc).
>    (Luis Chamberlain, Rick Edgecombe, Mike Rapoport, Aaron Lu)
> 3. Include experimental results from previous bpf_prog_pack and the
>    community. (Aaron Lu, Luis Chamberlain, Rick Edgecombe)
>
> Changes RFC v2 => PATCH v1:
> 1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
>    also ensures vcopy_exec() is only used to update memory from one single
>    vmalloc_exec() call. (Christoph Hellwig)
> 2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrapper for the
>    text_poke() like logic.
> 3. Drop changes for kernel modules and focus on BPF side changes.
>
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>    work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>    still need some work.
>
> Song Liu (5):
>   vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill
>   x86/alternative: support execmem_alloc() and execmem_free()
>   bpf: use execmem_alloc for bpf program and bpf dispatcher
>   vmalloc: introduce register_text_tail_vm()
>   x86: use register_text_tail_vm
>
>  arch/x86/include/asm/pgtable_64_types.h |   1 +
>  arch/x86/kernel/alternative.c           |  12 +
>  arch/x86/mm/init_64.c                   |   4 +-
>  arch/x86/net/bpf_jit_comp.c             |  23 +-
>  include/linux/bpf.h                     |   3 -
>  include/linux/filter.h                  |   5 -
>  include/linux/vmalloc.h                 |   9 +
>  kernel/bpf/core.c                       | 180 +-----------
>  kernel/bpf/dispatcher.c                 |  11 +-
>  mm/nommu.c                              |  12 +
>  mm/vmalloc.c                            | 354 ++++++++++++++++++++++++
>  11 files changed, 412 insertions(+), 202 deletions(-)
>
> --
> 2.30.2

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm
  2022-11-08 22:15     ` Song Liu
@ 2022-11-15 17:28       ` Edgecombe, Rick P
  0 siblings, 0 replies; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-15 17:28 UTC (permalink / raw)
  To: song; +Cc: peterz, bpf, rppt, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, 2022-11-08 at 14:15 -0800, Song Liu wrote:
> diff --git i/Documentation/x86/x86_64/mm.rst
> w/Documentation/x86/x86_64/mm.rst
> index 9798676bb0bf..ac041b7d3965 100644
> --- i/Documentation/x86/x86_64/mm.rst
> +++ w/Documentation/x86/x86_64/mm.rst
> @@ -62,7 +62,7 @@ Complete virtual memory map with 4-level page
> tables
>     ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ...
> unused hole
>     ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI
> region mapping space
>     ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ...
> unused hole
> -   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB |
> kernel text mapping, mapped to physical address 0
> +   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB |
> kernel and module text mapping, mapped to physical address 0

It's not really "module text mapping" yet, right? Because it doesn't get
used by modules. I might just call it execmem or whatever you call the
component. Otherwise it is outdated when the next user starts using
the API. Other than that, looks ok, thanks.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15  1:30 ` Song Liu
@ 2022-11-15 17:34   ` Edgecombe, Rick P
  2022-11-15 21:54     ` Song Liu
  2022-11-15 21:09   ` Luis Chamberlain
  2022-11-29 10:23   ` Thomas Gleixner
  2 siblings, 1 reply; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-15 17:34 UTC (permalink / raw)
  To: peterz, linux-mm, song, bpf, akpm; +Cc: hch, Lu, Aaron, x86, rppt, mcgrof

On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote:
> Currently, I have got the following action items for v3:
> 1. Add a unified API to allocate text memory to the motivation;
> 2. Update Documentation/x86/x86_64/mm.rst;
> 3. Allow non-PMD_SIZE allocations for powerpc.

So what do we think about supporting the fallback mechanism for the
first version, like:

https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/

It helps the (1) story by actually being usable by non-text_poke()
architectures.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-14 20:45           ` Song Liu
@ 2022-11-15 20:51             ` Luis Chamberlain
  2022-11-20 10:44             ` Mike Rapoport
  1 sibling, 0 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-15 20:51 UTC (permalink / raw)
  To: Song Liu
  Cc: Mike Rapoport, Edgecombe, Rick P, peterz, bpf, linux-mm, hch,
	x86, akpm, Lu, Aaron

On Mon, Nov 14, 2022 at 12:45:16PM -0800, Song Liu wrote:
> On Sun, Nov 13, 2022 at 2:43 AM Mike Rapoport <rppt@kernel.org> wrote:
> > Actually, it would be interesting to quantify memory savings/waste as the
> > result of using execmem_alloc()
> 
> From a random system in our fleet, execmem_alloc() saves:
> 
> 139 iTLB entries (1x 2MB entry vs, 140x 4kB entries), which is more than
> 100% of L1 iTLB and about 10% of L2 TLB.

That should be easily reflected using perf with a real benchmark;
however, we don't have such data yet. I have hinted I suspect the reason
might be that the actual loading of eBPF JIT / whatever may need to be
done in parallel a la lib/test_kmod.c, or having the existing ebpf
selftests run JIT tests in parallel, something along those lines,
but I see no effort to showcase that. That's a missed opportunity.

  Luis

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15  1:30 ` Song Liu
  2022-11-15 17:34   ` Edgecombe, Rick P
@ 2022-11-15 21:09   ` Luis Chamberlain
  2022-11-15 21:32     ` Luis Chamberlain
  2022-11-15 22:48     ` Song Liu
  2022-11-29 10:23   ` Thomas Gleixner
  2 siblings, 2 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-15 21:09 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe, aaron.lu, rppt

On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote:
> On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> >
> 
> [...]
> 
> >
> >
> > This set enables bpf programs and bpf dispatchers to share huge pages with
> > new API:
> >   execmem_alloc()
> >   execmem_free()
> >   execmem_fill()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> > memory to its users. execmem_free() is used to free memory allocated by
> > execmem_alloc(). execmem_fill() is used to update memory allocated by
> > execmem_alloc().
> 
> Sigh, I just realized this thread made through linux-mm@kvack.org, but got
> dropped by bpf@vger.kernel.org, so I guess I will have to resend v3.

I don't know what is going on with the bpf list but whatever it is, is silly.
You should Cc the right folks to ensure proper review if the bpf list is
the issue.

> Currently, I have got the following action items for v3:
> 1. Add a unified API to allocate text memory to the motivation;
> 2. Update Documentation/x86/x86_64/mm.rst;
> 3. Allow non-PMD_SIZE allocations for powerpc.

- I am really exhausted of asking again for real performance tests,
  you keep saying you can't and I keep saying you can, you are not
  trying hard enough. Stop thinking about your internal benchmark which
  you cannot publish. There should be enough crap out there which you can use.

- A new selftest or set of selftests which demonstrates gain in
  performance

- Extensions maybe of lib/test_vmalloc.c or whatever is appropriate to
  test correctness

- Enhance commit logs to justify the *why*, one of which to highlight is
  providing an API for memory semantics for special memory pages

  Luis

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-14 20:30           ` Song Liu
@ 2022-11-15 21:18             ` Luis Chamberlain
  2022-11-15 21:39               ` Edgecombe, Rick P
  2022-11-17  8:50             ` Mike Rapoport
  1 sibling, 1 reply; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-15 21:18 UTC (permalink / raw)
  To: Song Liu
  Cc: Mike Rapoport, Edgecombe, Rick P, peterz, bpf, linux-mm, hch,
	x86, akpm, Lu, Aaron

On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote:
> On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Wed, Nov 09, 2022 at 05:04:25PM +0000, Edgecombe, Rick P wrote:
> > > On Wed, 2022-11-09 at 13:17 +0200, Mike Rapoport wrote:
> > > > On Tue, Nov 08, 2022 at 04:51:12PM +0000, Edgecombe, Rick P wrote:
> > >
> > > > How the caching of large pages in vmalloc can be made useful for use
> > > > cases like secretmem and PKS?
> > >
> > > This part is easy I think. If we had an unmapped page allocator it
> > > could just feed this.
> >
> > The unmapped page allocator could be used by anything that needs
> > non-default permissions in the direct map and knows how to map the pages
> > elsewhere. E.g it would have been a oneliner to switch x86::module_alloc()
> > to use unmapped allocations. But ...
> >
> > > Do you have any idea when you might pick up that stuff again?
> >
> > ... unfortunately I don't see it happening anytime soon.
> >
> > > To answer my own question, I think a good first step would be to make
> > > the interface also work for non-text_poke() so it could really be cross
> > > arch, then use it for everything except modules. The benefit to the
> > > other arch's at that point is centralized handling of loading text.
> >
> > My concern is that the proposed execmem_alloc() cannot be used for
> > centralized handling of loading text. I'm not familiar enough with
> > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
> > my gut feeling is that the proposed execmem_alloc() won't be an improvement
> > but rather a hindrance for moving to centralized handling of loading text.
> 
> I don't follow why this could ever be a hindrance. Luis is very excited about
> this, and I am very sure it works for ftrace, kprobe, and BPF.

The main hurdles for modules are:

  * x86 needs support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
    to use this properly
  * in light of lack of support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
    we need a fallback
  * today module_alloc() follows special hanky panky open coded semantics for
    special page permissions, a unified way to handle this would be
    ideal instead of expecting everyone to get it right.

Other than this there are probably odd corner cases which would likely
only come up during testing. I see Song's efforts striving towards these
objectives, and because of the new ARCH_WANTS_MODULES_DATA_IN_VMALLOC, it
should be possible to get us there.

  Luis


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 21:09   ` Luis Chamberlain
@ 2022-11-15 21:32     ` Luis Chamberlain
  2022-11-15 22:48     ` Song Liu
  1 sibling, 0 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-15 21:32 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe, aaron.lu, rppt

On Tue, Nov 15, 2022 at 01:09:08PM -0800, Luis Chamberlain wrote:
> - I am really exhausted of asking again for real performance tests,
>   you keep saying you can't and I keep saying you can, you are not
>   trying hard enough. Stop thinking about your internal benchmark which
>   you cannot publish. There should be enough crap out there which you can use.

and if results defy expectations, please don't be coy about it, just
be direct and mention it.

  Luis

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 21:18             ` Luis Chamberlain
@ 2022-11-15 21:39               ` Edgecombe, Rick P
  2022-11-16 22:34                 ` Luis Chamberlain
  0 siblings, 1 reply; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-15 21:39 UTC (permalink / raw)
  To: song, mcgrof; +Cc: peterz, rppt, bpf, linux-mm, hch, x86, akpm, Lu, Aaron

On Tue, 2022-11-15 at 13:18 -0800, Luis Chamberlain wrote:
> The main hurdles for modules are:
> 
>   * x86 needs support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>     to use this properly
>   * in light of lack of support for
> CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>     we need a fallback
>   * today module_alloc() follows special hanky panky open coded
> semantics for
>     special page permissions, a unified way to handle this would be
>     ideal instead of expecting everyone to get it right.

How were you thinking non-text_poke() architectures load their text
into the text region without the fallback method? Split logic in cross
arch modules.c? Sorry if this is changed, but last time I was in there
I remember there was quite a bit of cross-arch code loading text.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 17:34   ` Edgecombe, Rick P
@ 2022-11-15 21:54     ` Song Liu
  2022-11-15 22:14       ` Edgecombe, Rick P
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-15 21:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, linux-mm, bpf, akpm, hch, Lu, Aaron, x86, rppt, mcgrof

On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote:
> > Currently, I have got the following action items for v3:
> > 1. Add a unified API to allocate text memory to the motivation;
> > 2. Update Documentation/x86/x86_64/mm.rst;
> > 3. Allow non-PMD_SIZE allocations for powerpc.
>
> So what do we think about supporting the fallback mechanism for the
> first version, like:
>
> https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/
>
> It helps the (1) story by actually being usable by non-text_poke()
> architectures.

I personally think this might be a good idea. We will need this when we use
execmem_alloc for modules. But I haven't had a chance to look at module code in
great detail. I was thinking of adding this logic along with changes in module code.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 21:54     ` Song Liu
@ 2022-11-15 22:14       ` Edgecombe, Rick P
  2022-11-15 22:32         ` Song Liu
  2022-11-16  1:20         ` Song Liu
  0 siblings, 2 replies; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-15 22:14 UTC (permalink / raw)
  To: song; +Cc: peterz, rppt, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, 2022-11-15 at 13:54 -0800, Song Liu wrote:
> On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote:
> > > Currently, I have got the following action items for v3:
> > > 1. Add a unified API to allocate text memory to the motivation;
> > > 2. Update Documentation/x86/x86_64/mm.rst;
> > > 3. Allow non-PMD_SIZE allocations for powerpc.
> > 
> > So what do we think about supporting the fallback mechanism for the
> > first version, like:
> > 
> > 
https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/
> > 
> > It helps the (1) story by actually being usable by non-text_poke()
> > architectures.
> 
> I personally think this might be a good idea. We will need this when
> we use
> execmem_alloc for modules. But I haven't got a chance to look at
> module code in
> great detail. I was thinking of adding this logic with changes in
> module code.

BPF used to have a generic allocator that just called module_alloc().
If you had a fallback method could you unify all of BPF to use
execmem()?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 22:14       ` Edgecombe, Rick P
@ 2022-11-15 22:32         ` Song Liu
  2022-11-16  1:20         ` Song Liu
  1 sibling, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-15 22:32 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, rppt, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, Nov 15, 2022 at 2:14 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2022-11-15 at 13:54 -0800, Song Liu wrote:
> > On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote:
> > > > Currently, I have got the following action items for v3:
> > > > 1. Add a unified API to allocate text memory to the motivation;
> > > > 2. Update Documentation/x86/x86_64/mm.rst;
> > > > 3. Allow non-PMD_SIZE allocations for powerpc.
> > >
> > > So what do we think about supporting the fallback mechanism for the
> > > first version, like:
> > >
> > >
> https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/
> > >
> > > It helps the (1) story by actually being usable by non-text_poke()
> > > architectures.
> >
> > I personally think this might be a good idea. We will need this when
> > we use
> > execmem_alloc for modules. But I haven't got a chance to look at
> > module code in
> > great detail. I was thinking of adding this logic with changes in
> > module code.
>
> BPF used to have a generic allocator that just called module_alloc().
> If you had a fallback method could you unify all of BPF to use
> execmem()?

Yes, we can use execmem as the unified API for all BPF JIT engines.

Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 21:09   ` Luis Chamberlain
  2022-11-15 21:32     ` Luis Chamberlain
@ 2022-11-15 22:48     ` Song Liu
  2022-11-16 22:33       ` Luis Chamberlain
  1 sibling, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-15 22:48 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe, aaron.lu, rppt

On Tue, Nov 15, 2022 at 1:09 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote:
> > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> > >
> >
> > [...]
> >
> > >
> > >
> > > This set enables bpf programs and bpf dispatchers to share huge pages with
> > > new API:
> > >   execmem_alloc()
> > >   execmem_free()
> > >   execmem_fill()
> > >
> > > The idea is similar to Peter's suggestion in [1].
> > >
> > > execmem_alloc() manages a set of PMD_SIZE RO+X memory, and allocates this
> > > memory to its users. execmem_free() is used to free memory allocated by
> > > execmem_alloc(). execmem_fill() is used to update memory allocated by
> > > execmem_alloc().
> >
> > Sigh, I just realized this thread made through linux-mm@kvack.org, but got
> > dropped by bpf@vger.kernel.org, so I guess I will have to resend v3.
>
> I don't know what is going on with the bpf list but whatever it is, is silly.
> You should Cc the right folks to ensure proper review if the bpf list is
> the issue.
>
> > Currently, I have got the following action items for v3:
> > 1. Add a unified API to allocate text memory to the motivation;
> > 2. Update Documentation/x86/x86_64/mm.rst;
> > 3. Allow non-PMD_SIZE allocations for powerpc.
>
> - I am really exhausted of asking again for real performance tests,
>   you keep saying you can't and I keep saying you can, you are not
>   trying hard enough. Stop thinking about your internal benchmark which
>   you cannot publish. There should be enough crap out there which you can use.
>
> - A new selftest or set of selftests which demonstrates gain in
>   performance

I didn't mean to avoid showing results with publicly available benchmarks. I
just thought the actual benchmark was better (and we do use it to
demonstrate the benefit of a lot of kernel improvements).

For something publicly available, how about the following:

Run 100 instances of the following benchmark from bpf selftests:
  tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe
which loads 7 BPF programs, and triggers one of them.

Then use perf to monitor TLB related counters:
   perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \
        itlb_misses.walk_completed_2m_4m -a

The following results are from a qemu VM with 32 cores.

Before bpf_prog_pack:
  iTLB-load-misses: 350k/s
  itlb_misses.walk_completed_4k: 90k/s
  itlb_misses.walk_completed_2m_4m: 0.1/s

With bpf_prog_pack (current upstream):
  iTLB-load-misses: 220k/s
  itlb_misses.walk_completed_4k: 68k/s
  itlb_misses.walk_completed_2m_4m: 0.2/s

With execmem_alloc (with this set):
  iTLB-load-misses: 185k/s
  itlb_misses.walk_completed_4k: 58k/s
  itlb_misses.walk_completed_2m_4m: 1/s

Do these numbers address your questions?

>
> - Extensions maybe of lib/test_vmalloc.c or whatever is appropriate to
>   test correctness

I will look into this.

>
> - Enhance commit logs to justify the *why*, one of which to highlight is
>   providing an API for memory semantics for special memory pages

And this.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 22:14       ` Edgecombe, Rick P
  2022-11-15 22:32         ` Song Liu
@ 2022-11-16  1:20         ` Song Liu
  2022-11-16 21:22           ` Edgecombe, Rick P
  1 sibling, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-16  1:20 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, rppt, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, Nov 15, 2022 at 2:14 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2022-11-15 at 13:54 -0800, Song Liu wrote:
> > On Tue, Nov 15, 2022 at 9:34 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Mon, 2022-11-14 at 17:30 -0800, Song Liu wrote:
> > > > Currently, I have got the following action items for v3:
> > > > 1. Add a unified API to allocate text memory to the motivation;
> > > > 2. Update Documentation/x86/x86_64/mm.rst;
> > > > 3. Allow non-PMD_SIZE allocations for powerpc.
> > >
> > > So what do we think about supporting the fallback mechanism for the
> > > first version, like:
> > >
> > >
> https://lore.kernel.org/all/9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com/
> > >
> > > It helps the (1) story by actually being usable by non-text_poke()
> > > architectures.
> >
> > I personally think this might be a good idea. We will need this when
> > we use
> > execmem_alloc for modules. But I haven't got a chance to look at
> > module code in
> > great detail. I was thinking of adding this logic with changes in
> > module code.
>
> BPF used to have a generic allocator that just called module_alloc().
> If you had a fallback method could you unify all of BPF to use
> execmem()?

To clarify, are you suggesting we need this logic in this set? I would
rather wait until we handle module code. This is because BPF JIT uses
module_alloc() for archs other than x86_64, so the fallback of
execmem_alloc() for these archs would be module_alloc() or
something similar. I think it is really weird to do something like

void *execmem_alloc(size_t size)
{
#ifdef CONFIG_SUPPORT_EXECMEM_ALLOC
    ...
#else
    return module_alloc(size);
#endif
}

WDYT?

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-16  1:20         ` Song Liu
@ 2022-11-16 21:22           ` Edgecombe, Rick P
  2022-11-16 22:03             ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-16 21:22 UTC (permalink / raw)
  To: song; +Cc: peterz, rppt, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Tue, 2022-11-15 at 17:20 -0800, Song Liu wrote:
> To clarify, are you suggesting we need this logic in this set? I
> would
> rather wait until we handle module code. This is because BPF JIT uses
> module_alloc() for archs other than x86_64. So the fallback of
> execmem_alloc() for these archs would be module_alloc() or
> something similar. I think it is really weird to do something like
> 
> void *execmem_alloc(size_t size)
> {
> #ifdef CONFIG_SUPPORT_EXECMEM_ALLOC
>     ...
> #else
>     return module_alloc(size);
> #endif
> }
> 
> WDYT?

Hmm, that is a good point. It looks like it's plugged in backwards.

Several people in the past have expressed that all the text users
calling into *module*_alloc() also is a little wrong. So I think in
some finished future, each architecture would have an execmem_alloc()
arch breakout of some sort that modules could use instead of its
module_alloc() logic. So basically all the module_alloc() arch-specific
details of location and PAGE_FOO would move to execmem.

I guess the question is how to get there. Calling into module_alloc()
does the job but looks wrong. There are a lot of module_alloc()s, but
what about implementing an execmem_alloc() for each BPF JIT
architecture that doesn't match the existing default version? It
shouldn't be too much code. I think some of them will work with just
the EXEC_MEM_START/END heuristic and won't need a breakout.

But if this thing just works for x86 BPF JITs, I'm not sure we can say
we have unified anything...


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-16 21:22           ` Edgecombe, Rick P
@ 2022-11-16 22:03             ` Song Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-16 22:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: peterz, rppt, bpf, linux-mm, hch, x86, akpm, mcgrof, Lu, Aaron

On Wed, Nov 16, 2022 at 1:22 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2022-11-15 at 17:20 -0800, Song Liu wrote:
> > To clarify, are you suggesting we need this logic in this set? I
> > would
> > rather wait until we handle module code. This is because BPF JIT uses
> > module_alloc() for archs other than x86_64. So the fallback of
> > execmem_alloc() for these archs would be module_alloc() or
> > something similar. I think it is really weird to do something like
> >
> > void *execmem_alloc(size_t size)
> > {
> > #ifdef CONFIG_SUPPORT_EXECMEM_ALLOC
> >     ...
> > #else
> >     return module_alloc(size);
> > #endif
> > }
> >
> > WDYT?
>
> Hmm, that is a good point. It looks like it's plugged in backwards.
>
> Several people in the past have expressed that all the text users
> calling into *module*_alloc() also is a little wrong. So I think in
> some finished future, each architecture would have an execmem_alloc()
> arch breakout of some sort that modules could use instead of its
> module_alloc() logic. So basically all the module_alloc() arch-specific
> details of location and PAGE_FOO would move to execmem.
>
> I guess the question is how to get there. Calling into module_alloc()
> does the job but looks wrong. There are a lot of module_alloc()s, but
> what about implementing an execmem_alloc() for each BPF JIT
> architecture that doesn't match the existing default version? It
> shouldn't be too much code. I think some of them will work with just
> the EXEC_MEM_START/END heuristic and won't need a breakout.
>
> But if this thing just works for x86 BPF JITs, I'm not sure we can say
> we have unified anything...

powerpc BPF JIT is getting bpf_prog_pack soon. [1] And we should
be able to make ftrace and BPF trampoline to use execmem_alloc
soon after this set is merged. AFAICT, we don't have to finalize the
API until we handle module text. I personally think current API is
good enough for ftrace and BPF trampoline, which already use
something similar to JIT.

Thanks,
Song

[1] https://lore.kernel.org/bpf/20221110184303.393179-1-hbathini@linux.ibm.com/

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 22:48     ` Song Liu
@ 2022-11-16 22:33       ` Luis Chamberlain
  2022-11-16 22:47         ` Edgecombe, Rick P
  2022-11-17  9:37         ` Mike Rapoport
  0 siblings, 2 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-16 22:33 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe, aaron.lu, rppt

On Tue, Nov 15, 2022 at 02:48:05PM -0800, Song Liu wrote:
> On Tue, Nov 15, 2022 at 1:09 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote:
> > > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> > > >
> > >
> > > [...]
> > >
> > > >
> > > >
> > > > This set enables bpf programs and bpf dispatchers to share huge pages with
> > > > new API:
> > > >   execmem_alloc()
> > > >   execmem_free()
> > > >   execmem_fill()
> > > >
> > > > The idea is similar to Peter's suggestion in [1].
> > > >
> > > > execmem_alloc() manages a set of PMD_SIZE RO+X memory regions, and allocates
> > > > this memory to its users. execmem_free() is used to free memory allocated by
> > > > execmem_alloc(). execmem_fill() is used to update memory allocated by
> > > > execmem_alloc().
> > >
> > > Sigh, I just realized this thread made it through linux-mm@kvack.org, but got
> > > dropped by bpf@vger.kernel.org, so I guess I will have to resend v3.
> >
> > I don't know what is going on with the bpf list but whatever it is, is silly.
> > You should Cc the right folks to ensure proper review if the bpf list is
> > the issue.
> >
> > > Currently, I have got the following action items for v3:
> > > 1. Add a unified API to allocate text memory to the motivation;
> > > 2. Update Documentation/x86/x86_64/mm.rst;
> > > 3. Allow non-PMD_SIZE allocations for powerpc.
> >
> > - I am really exhausted of asking again for real performance tests,
> >   you keep saying you can't and I keep saying you can, you are not
> >   trying hard enough. Stop thinking about your internal benchmark which
> >   you cannot publish. There should be enough crap out there which you can use.
> >
> > - A new selftest or set of selftests which demonstrates gain in
> >   performance
> 
> I didn't mean to not show results with publicly available benchmarks. I
> just thought the actual benchmark was better (and we do use it to
> demonstrate the benefit of a lot of kernel improvements).
> 
> For something publically available, how about the following:
> 
> Run 100 instances of the following benchmark from bpf selftests:
>   tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe
> which loads 7 BPF programs, and triggers one of them.
> 
> Then use perf to monitor TLB related counters:
>    perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \
>         itlb_misses.walk_completed_2m_4m -a
> 
> The following results are from a qemu VM with 32 cores.
> 
> Before bpf_prog_pack:
>   iTLB-load-misses: 350k/s
>   itlb_misses.walk_completed_4k: 90k/s
>   itlb_misses.walk_completed_2m_4m: 0.1/s
> 
> With bpf_prog_pack (current upstream):
>   iTLB-load-misses: 220k/s
>   itlb_misses.walk_completed_4k: 68k/s
>   itlb_misses.walk_completed_2m_4m: 0.2/s
> 
> With execmem_alloc (with this set):
>   iTLB-load-misses: 185k/s
>   itlb_misses.walk_completed_4k: 58k/s
>   itlb_misses.walk_completed_2m_4m: 1/s
> 
> Do these results address your questions?

More in line with what I was hoping for. Can something just do
the parallelization for you in one shot? Can bench alone do it for you?
Is there no interest in having something which generically showcases
multithreading / hammering a system with tons of eBPF JITs? It may
prove useful.

And also, it begs the question: what if you had another generic iTLB
benchmark or general memory pressure workload running *as* you run the
above? I ask, as it was my understanding that one of the issues was the
long term slowdown caused by the direct map fragmentation without
bpf_prog_pack, and so such an application should crawl to its knees
over time, and there should be numbers you could show to prove that
too, before and after.

  Luis

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15 21:39               ` Edgecombe, Rick P
@ 2022-11-16 22:34                 ` Luis Chamberlain
  0 siblings, 0 replies; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-16 22:34 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: song, peterz, rppt, bpf, linux-mm, hch, x86, akpm, Lu, Aaron

On Tue, Nov 15, 2022 at 09:39:08PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-11-15 at 13:18 -0800, Luis Chamberlain wrote:
> > The main hurdles for modules are:
> > 
> >   * x86 needs support for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> >     to use this properly
> >   * in light of the lack of support for
> >     CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, we need a fallback
> >   * today module_alloc() follows special hanky panky open coded
> >     semantics for special page permissions; a unified way to handle
> >     this would be ideal instead of expecting everyone to get it right.
> 
> How were you thinking non-text_poke() architectures would load their
> text into the text region without the fallback method?

Fallbacks are needed for sure. I think you spelled out well what would
be needed.

  Luis

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-16 22:33       ` Luis Chamberlain
@ 2022-11-16 22:47         ` Edgecombe, Rick P
  2022-11-16 23:53           ` Luis Chamberlain
  2022-11-17  9:37         ` Mike Rapoport
  1 sibling, 1 reply; 93+ messages in thread
From: Edgecombe, Rick P @ 2022-11-16 22:47 UTC (permalink / raw)
  To: song, mcgrof; +Cc: peterz, bpf, rppt, linux-mm, hch, x86, akpm, Lu, Aaron

On Wed, 2022-11-16 at 14:33 -0800, Luis Chamberlain wrote:
> More in line with what I was hoping for. Can something just do
> the parallelization for you in one shot? Can bench alone do it for you?
> Is there no interest in having something which generically showcases
> multithreading / hammering a system with tons of eBPF JITs? It may
> prove useful.
> 
> And also, it begs the question: what if you had another generic iTLB
> benchmark or general memory pressure workload running *as* you run the
> above? I ask, as it was my understanding that one of the issues was the
> long term slowdown caused by the direct map fragmentation without
> bpf_prog_pack, and so such an application should crawl to its knees
> over time, and there should be numbers you could show to prove that
> too, before and after.

We did have some benchmarks that showed what the regression was if your
direct map was totally fragmented (starting from boot at 4kB page size):


https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-16 22:47         ` Edgecombe, Rick P
@ 2022-11-16 23:53           ` Luis Chamberlain
  2022-11-17  1:17             ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Luis Chamberlain @ 2022-11-16 23:53 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: song, peterz, bpf, rppt, linux-mm, hch, x86, akpm, Lu, Aaron

On Wed, Nov 16, 2022 at 10:47:04PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2022-11-16 at 14:33 -0800, Luis Chamberlain wrote:
> > More in line with what I was hoping for. Can something just do
> > the parallelization for you in one shot? Can bench alone do it for you?
> > Is there no interest in having something which generically showcases
> > multithreading / hammering a system with tons of eBPF JITs? It may
> > prove useful.
> > 
> > And also, it begs the question: what if you had another generic iTLB
> > benchmark or general memory pressure workload running *as* you run the
> > above? I ask, as it was my understanding that one of the issues was the
> > long term slowdown caused by the direct map fragmentation without
> > bpf_prog_pack, and so such an application should crawl to its knees
> > over time, and there should be numbers you could show to prove that
> > too, before and after.
> 
> We did have some benchmarks that showed what the regression was if your
> direct map was totally fragmented (starting from boot at 4kB page size):
> 
> 
> https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Oh yes, that is a good example of effort, but I'm suggesting taking, for
instance, will-it-scale, running it in tandem with bpf_prog_pack, and
measuring the iTLB differences *both* before / after *and* again after
a period of expected deterioration of the direct map fragmentation (say
after non-bpf_prog_pack shows high direct map fragmentation).

This is the sort of thing which could easily go into a commit log.

  Luis

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-16 23:53           ` Luis Chamberlain
@ 2022-11-17  1:17             ` Song Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-11-17  1:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Edgecombe, Rick P, peterz, bpf, rppt, linux-mm, hch, x86, akpm,
	Lu, Aaron

On Wed, Nov 16, 2022 at 3:53 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Wed, Nov 16, 2022 at 10:47:04PM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2022-11-16 at 14:33 -0800, Luis Chamberlain wrote:
> > > More in line with what I was hoping for. Can something just do
> > > the parallelization for you in one shot? Can bench alone do it for you?
> > > Is there no interest in having something which generically showcases
> > > multithreading / hammering a system with tons of eBPF JITs? It may
> > > prove useful.
> > >
> > > And also, it begs the question: what if you had another generic iTLB
> > > benchmark or general memory pressure workload running *as* you run the
> > > above? I ask, as it was my understanding that one of the issues was the
> > > long term slowdown caused by the direct map fragmentation without
> > > bpf_prog_pack, and so such an application should crawl to its knees
> > > over time, and there should be numbers you could show to prove that
> > > too, before and after.
> >
> > We did have some benchmarks that showed what the regression was if your
> > direct map was totally fragmented (starting from boot at 4kB page size):
> >
> >
> > https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
>
> Oh yes, that is a good example of effort, but I'm suggesting taking, for
> instance, will-it-scale, running it in tandem with bpf_prog_pack, and
> measuring the iTLB differences *both* before / after *and* again after
> a period of expected deterioration of the direct map fragmentation (say
> after non-bpf_prog_pack shows high direct map fragmentation).
>
> This is the sort of thing which could easily go into a commit log.

To be honest, I don't see how experimental results with artificial
benchmarks would help this set. I don't think a real workload would see
a 10% speed up from this set (though we can see large % improvements in
TLB miss rate). However, a 1% or even 0.5% improvement matters a lot
for large scale workloads.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-14 20:30           ` Song Liu
  2022-11-15 21:18             ` Luis Chamberlain
@ 2022-11-17  8:50             ` Mike Rapoport
  2022-11-17 18:36               ` Song Liu
  1 sibling, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-11-17  8:50 UTC (permalink / raw)
  To: Song Liu
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote:
> On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > My concern is that the proposed execmem_alloc() cannot be used for
> > centralized handling of loading text. I'm not familiar enough with
> > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
> > my gut feeling is that the proposed execmem_alloc() won't be an improvement
> > but rather a hindrance for moving to centralized handling of loading text.
> 
> I don't follow why this could ever be a hindrance. Luis is very excited about
> this, and I am very sure it works for ftrace, kprobe, and BPF.

Again, it's a gut feeling. But for execmem_alloc() to be a unified place of
code allocation, it has to work for all architectures. If architectures
have to override it, then where is the unification?

The implementation you propose is great for x86, but to see it as a unified
solution it should be good at least for the major architectures.

> > It feels to me that a lot of ground work is needed to get to the point
> > where we can use centralized handling of loading text.
> 
> Could you please be more specific on what is needed?

The most obvious one is to implement Peter's suggestion with VM_TOPDOWN_VMAP
so that execmem_alloc() can actually be used by modules.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-16 22:33       ` Luis Chamberlain
  2022-11-16 22:47         ` Edgecombe, Rick P
@ 2022-11-17  9:37         ` Mike Rapoport
  1 sibling, 0 replies; 93+ messages in thread
From: Mike Rapoport @ 2022-11-17  9:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Song Liu, bpf, linux-mm, peterz, akpm, x86, hch,
	rick.p.edgecombe, aaron.lu

On Wed, Nov 16, 2022 at 02:33:37PM -0800, Luis Chamberlain wrote:
> On Tue, Nov 15, 2022 at 02:48:05PM -0800, Song Liu wrote:
> > On Tue, Nov 15, 2022 at 1:09 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > >
> > > On Mon, Nov 14, 2022 at 05:30:39PM -0800, Song Liu wrote:
> > > > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> > >
> > > > Currently, I have got the following action items for v3:
> > > > 1. Add a unified API to allocate text memory to the motivation;
> > > > 2. Update Documentation/x86/x86_64/mm.rst;
> > > > 3. Allow non-PMD_SIZE allocations for powerpc.
> > >
> > > - I am really exhausted of asking again for real performance tests,
> > >   you keep saying you can't and I keep saying you can, you are not
> > >   trying hard enough. Stop thinking about your internal benchmark which
> > >   you cannot publish. There should be enough crap out there which you can use.
> > >
> > > - A new selftest or set of selftests which demonstrates gain in
> > >   performance
> > 
> > I didn't mean to not show results with publicly available benchmarks. I
> > just thought the actual benchmark was better (and we do use it to
> > demonstrate the benefit of a lot of kernel improvements).
> > 
> > For something publically available, how about the following:
> > 
> > Run 100 instances of the following benchmark from bpf selftests:
> >   tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe
> > which loads 7 BPF programs, and triggers one of them.
> > 
> > Then use perf to monitor TLB related counters:
> >    perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \
> >         itlb_misses.walk_completed_2m_4m -a
> > 
> > The following results are from a qemu VM with 32 cores.
> > 
> > Before bpf_prog_pack:
> >   iTLB-load-misses: 350k/s
> >   itlb_misses.walk_completed_4k: 90k/s
> >   itlb_misses.walk_completed_2m_4m: 0.1/s
> > 
> > With bpf_prog_pack (current upstream):
> >   iTLB-load-misses: 220k/s
> >   itlb_misses.walk_completed_4k: 68k/s
> >   itlb_misses.walk_completed_2m_4m: 0.2/s
> > 
> > With execmem_alloc (with this set):
> >   iTLB-load-misses: 185k/s
> >   itlb_misses.walk_completed_4k: 58k/s
> >   itlb_misses.walk_completed_2m_4m: 1/s
> > 
> > Do these results address your questions?
> 
> More in line with what I was hoping for. Can something just do
> the parallelization for you in one shot? Can bench alone do it for you?
> Is there no interest in having something which generically showcases
> multithreading / hammering a system with tons of eBPF JITs? It may
> prove useful.
> 
> And also, it begs the question: what if you had another generic iTLB
> benchmark or general memory pressure workload running *as* you run the
> above? I ask, as it was my understanding that one of the issues was the
> long term slowdown caused by the direct map fragmentation without
> bpf_prog_pack, and so such an application should crawl to its knees
> over time, and there should be numbers you could show to prove that
> too, before and after.

I'd add to that that benchmarking iTLB performance on an idle system is not
very representative. TLB is a scarce resource, so it'd be interesting to
see this benchmark on a loaded system.
 
>   Luis

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-17  8:50             ` Mike Rapoport
@ 2022-11-17 18:36               ` Song Liu
  2022-11-20 10:41                 ` Mike Rapoport
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-17 18:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Thu, Nov 17, 2022 at 12:50 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote:
> > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > My concern is that the proposed execmem_alloc() cannot be used for
> > > centralized handling of loading text. I'm not familiar enough with
> > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
> > > my gut feeling is that the proposed execmem_alloc() won't be an improvement
> > > but rather a hindrance for moving to centralized handling of loading text.
> >
> > I don't follow why this could ever be a hindrance. Luis is very excited about
> > this, and I am very sure it works for ftrace, kprobe, and BPF.
>
> Again, it's a gut feeling. But for execmem_alloc() to be a unified place of
> code allocation, it has to work for all architectures. If architectures
> have to override it, then where is the unification?
>
> The implementation you propose is great for x86, but to see it as a unified
> solution it should be good at least for the major architectures.

As I mentioned earlier, folks are working on using bpf_prog_pack for BPF
JIT on powerpc. We will also work on something similar for ARM. I guess
these are good enough for major architectures?

>
> > > It feels to me that a lot of ground work is needed to get to the point
> > > where we can use centralized handling of loading text.
> >
> > Could you please be more specific on what is needed?
>
> The most obvious one is to implement Peter's suggestion with VM_TOPDOWN_VMAP
> so that execmem_alloc() can actually be used by modules.

The current implementation is an alternative to VM_TOPDOWN_VMAP. I am
very sure it works for modules just like the VM_TOPDOWN_VMAP solution.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-17 18:36               ` Song Liu
@ 2022-11-20 10:41                 ` Mike Rapoport
  2022-11-21 14:52                   ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-11-20 10:41 UTC (permalink / raw)
  To: Song Liu
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Thu, Nov 17, 2022 at 10:36:43AM -0800, Song Liu wrote:
> On Thu, Nov 17, 2022 at 12:50 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote:
> > > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > My concern is that the proposed execmem_alloc() cannot be used for
> > > > centralized handling of loading text. I'm not familiar enough with
> > > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
> > > > my gut feeling is that the proposed execmem_alloc() won't be an improvement
> > > > but rather a hindrance for moving to centralized handling of loading text.
> > >
> > > I don't follow why this could ever be a hindrance. Luis is very excited about
> > > this, and I am very sure it works for ftrace, kprobe, and BPF.
> >
> > Again, it's a gut feeling. But for execmem_alloc() to be a unified place of
> > code allocation, it has to work for all architectures. If architectures
> > have to override it, then where is the unification?
> >
> > The implementation you propose is great for x86, but to see it as a unified
> > solution it should be good at least for the major architectures.
> 
> As I mentioned earlier, folks are working on using bpf_prog_pack for BPF
> JIT on powerpc. We will also work on something similar for ARM.

Does "something similar" mean that it won't use execmem_alloc() as is?

> I guess these are good enough for major architectures?

Sorry if I wasn't clear, I referred to a unified solution for all code
allocations, not only BPF, so that execmem_alloc() will eventually replace
module_alloc(). And that means it has to be able to deal with
architecture-specific requirements at least on ARM and powerpc, and
probably others as well.

> > > > It feels to me that a lot of ground work is needed to get to the point
> > > > where we can use centralized handling of loading text.
> > >
> > > Could you please be more specific on what is needed?
> >
> > The most obvious one is to implement Peter's suggestion with VM_TOPDOWN_VMAP
> > so that execmem_alloc() can actually be used by modules.
> 
> The current implementation is an alternative to VM_TOPDOWN_VMAP. I am
> very sure it works for modules just like the VM_TOPDOWN_VMAP solution.

It might, but it still does not. And until they do, I consider these
patches an optimization for BPF rather than a unification of code
allocations.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-14 20:45           ` Song Liu
  2022-11-15 20:51             ` Luis Chamberlain
@ 2022-11-20 10:44             ` Mike Rapoport
  1 sibling, 0 replies; 93+ messages in thread
From: Mike Rapoport @ 2022-11-20 10:44 UTC (permalink / raw)
  To: Song Liu
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Mon, Nov 14, 2022 at 12:45:16PM -0800, Song Liu wrote:
> On Sun, Nov 13, 2022 at 2:43 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > If RO data causes problems with direct map fragmentation, we can use
> > > similar logic. I think we will need another tree in vmalloc for this case.
> > > Since the logic will be mostly identical, I personally don't think adding
> > > another tree is a big overhead.
> >
> > Actually, it would be interesting to quantify memory savings/waste as the
> > result of using execmem_alloc()
> 
> From a random system in our fleet, execmem_alloc() saves:
> 
> 139 iTLB entries (1x 2MB entry vs. 140x 4kB entries), which is more than
> 100% of L1 iTLB and about 10% of L2 TLB.

Using 2M pages saves page table entries. They might or might not be
cached in the iTLB, because on a loaded system a large part of the iTLB
will cache userspace mappings.
 
> It wastes 1.5MB memory, which is 0.0023% of system memory (64GB).
> 
> I believe this is clearly a good trade-off.

The actual trade-off would be 1.5MB of memory for a yet unknown, or at
least unpublished, performance improvement on a loaded system.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-20 10:41                 ` Mike Rapoport
@ 2022-11-21 14:52                   ` Song Liu
  2022-11-30  9:39                     ` Mike Rapoport
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-21 14:52 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Sun, Nov 20, 2022 at 3:41 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, Nov 17, 2022 at 10:36:43AM -0800, Song Liu wrote:
> > On Thu, Nov 17, 2022 at 12:50 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Mon, Nov 14, 2022 at 12:30:49PM -0800, Song Liu wrote:
> > > > On Sun, Nov 13, 2022 at 2:35 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > >
> > > > > My concern is that the proposed execmem_alloc() cannot be used for
> > > > > centralized handling of loading text. I'm not familiar enough with
> > > > > modules/ftrace/kprobes/BPF to clearly identify the potential caveats, but
> > > > > my gut feeling is that the proposed execmem_alloc() won't be an improvement
> > > > > but rather a hindrance for moving to centralized handling of loading text.
> > > >
> > > > I don't follow why this could ever be a hindrance. Luis is very excited about
> > > > this, and I am very sure it works for ftrace, kprobe, and BPF.
> > >
> > > Again, it's a gut feeling. But for execmem_alloc() to be a unified place of
> > > code allocation, it has to work for all architectures. If architectures
> > > have to override it, then where is the unification?
> > >
> > > The implementation you propose is great for x86, but to see it as a unified
> > > solution it should be good at least for the major architectures.
> >
> > As I mentioned earlier, folks are working on using bpf_prog_pack for BPF
> > JIT on powerpc. We will also work on something similar for ARM.
>
> Does "something similar" mean that it won't use execmem_alloc() as is?

"Something similar" means it will use execmem_alloc as is. We still need
changes to the ARM JIT code, just like we need them for powerpc and x86.

>
> > I guess these are good enough for major architectures?
>
> Sorry if I wasn't clear, I referred to a unified solution for all code
> allocations, not only BPF, so that execmem_alloc() will eventually replace
> module_alloc(). And that means it has to be able to deal with
> architecture-specific requirements at least on ARM and powerpc, and
> probably others as well.
>
> > > > > It feels to me that a lot of ground work is needed to get to the point
> > > > > where we can use centralized handling of loading text.
> > > >
> > > > Could you please be more specific on what is needed?
> > >
> > > The most obvious one is to implement Peter's suggestion with VM_TOPDOWN_VMAP
> > > so that execmem_alloc() can actually be used by modules.
> >
> > The current implementation is an alternative to VM_TOPDOWN_VMAP. I am
> > very sure it works for modules just like the VM_TOPDOWN_VMAP solution.
>
> It might, but it still does not. And until they do, I consider these
> patches an optimization for BPF rather than a unification of code
> allocations.

We haven't got modules to use execmem_alloc yet, that's true. But
this has nothing to do with VM_TOPDOWN_VMAP at all.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-15  1:30 ` Song Liu
  2022-11-15 17:34   ` Edgecombe, Rick P
  2022-11-15 21:09   ` Luis Chamberlain
@ 2022-11-29 10:23   ` Thomas Gleixner
  2022-11-29 17:26     ` Song Liu
  2 siblings, 1 reply; 93+ messages in thread
From: Thomas Gleixner @ 2022-11-29 10:23 UTC (permalink / raw)
  To: Song Liu, bpf, linux-mm, peterz, akpm
  Cc: x86, hch, rick.p.edgecombe, aaron.lu, rppt, mcgrof

On Mon, Nov 14 2022 at 17:30, Song Liu wrote:
> On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> Currently, I have got the following action items for v3:
> 1. Add the unified API to allocate text memory to the motivation;
> 2. Update Documentation/x86/x86_64/mm.rst;
> 3. Allow non-PMD_SIZE allocations for powerpc.
>
> 1 and 2 are relatively simple. We can probably do 3 in a follow up patch
> (as I don't have powerpc environments for testing). Did I miss anything?
>
> Besides these, does this set look reasonable? Andrew and Peter, could
> you please share your comments on this?

This is a step into the right direction, but is it a solution which has
a broader benefit? I don't think so.

While you are so concerned about (i)TLB usage for BPF, I'm way more
concerned about modules. Just from a random server class workstation:

Total module memory:	12.4609 MB
Number of 4k PTEs:         3190

The above would nicely fit into 7 or 8 2M mappings.

Guess how much memory is occupied by BPF on that machine and how much
BPF contributes to iTLB pressure? In comparison to the above very close
to zero.

Modules have exactly the same problem as BPF, just an order of magnitude
larger.

So we don't need a "works" for BPF solution which comes with the
handwaving assumption that it could be used for modules too. We need
something which demonstrates that it solves the entire problem class.

Modules are the obvious starting point. Once that is solved pretty much
everything else falls into place including BPF.

Without modules support this whole exercise is pointless and not going
anywhere near x86.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-29 10:23   ` Thomas Gleixner
@ 2022-11-29 17:26     ` Song Liu
  2022-11-29 23:56       ` Thomas Gleixner
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-29 17:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Hi Thomas,

Thanks for your comments.

On Tue, Nov 29, 2022 at 2:23 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Nov 14 2022 at 17:30, Song Liu wrote:
> > On Mon, Nov 7, 2022 at 2:41 PM Song Liu <song@kernel.org> wrote:
> > Currently, I have got the following action items for v3:
> > 1. Add the unified API to allocate text memory to the motivation;
> > 2. Update Documentation/x86/x86_64/mm.rst;
> > 3. Allow non-PMD_SIZE allocations for powerpc.
> >
> > 1 and 2 are relatively simple. We can probably do 3 in a follow up patch
> > (as I don't have powerpc environments for testing). Did I miss anything?
> >
> > Besides these, does this set look reasonable? Andrew and Peter, could
> > you please share your comments on this?
>
> This is a step into the right direction, but is it a solution which has
> a broader benefit? I don't think so.
>
> While you are so concerned about (i)TLB usage for BPF, I'm way more
> concerned about modules. Just from a random server class workstation:
>
> Total module memory:    12.4609 MB
> Number of 4k PTEs:         3190
>
> The above would nicely fit into 7 or 8 2M mappings.
>
> Guess how much memory is occupied by BPF on that machine and how much
> BPF contributes to iTLB pressure? In comparison to the above very close
> to zero.
>
> Modules have exactly the same problem as BPF, just an order of magnitude
> larger.
>
> So we don't need a "works" for BPF solution which comes with the
> handwaving assumption that it could be used for modules too. We need
> something which demonstrates that it solves the entire problem class.
>
> Modules are the obvious starting point. Once that is solved pretty much
> everything else falls into place including BPF.
>
> Without modules support this whole exercise is pointless and not going
> anywhere near x86.

I am not sure I fully understand your point here. Do you mean

1) There is something wrong with this solution, that makes it not suitable
for modules;
   or
2) The solution is in the right direction and will very likely work for
modules, but we haven't finished module support yet?

If it is 1), I would like to understand what are the issues that make it not
suitable for modules. If it is 2), I think a solid, most likely working small
step toward the right direction is the better way as it makes code reviews
a lot easier and has much lower risks. Does this make sense?

I would also highlight that part of the benefit of this work comes from
reducing direct map fragmentation. While BPF programs consume less
memory, they are more dynamic and can cause more direct map
fragmentation. bpf_prog_pack in the upstream kernel already covers this
part, but this set is a better solution than bpf_prog_pack.

Finally, I would like to point out that 5/6 and 6/6 of (v5) the set let BPF
programs share a 2MB page with static kernel text. Therefore, even
for systems without many BPF programs, we should already see some
reduction in iTLB misses.

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-29 17:26     ` Song Liu
@ 2022-11-29 23:56       ` Thomas Gleixner
  2022-11-30 16:18         ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Thomas Gleixner @ 2022-11-29 23:56 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Song,

On Tue, Nov 29 2022 at 09:26, Song Liu wrote:
> On Tue, Nov 29, 2022 at 2:23 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Modules are the obvious starting point. Once that is solved pretty much
>> everything else falls into place including BPF.
>>
>> Without modules support this whole exercise is pointless and not going
>> anywhere near x86.
>
> I am not sure I fully understand your point here. Do you mean
>
> 1) There is something wrong with this solution, that makes it not suitable
> for modules;
>    or
> 2) The solution is in the right direction and will very likely work for
> modules, but we haven't finished module support yet?

As I'm obviously unable to express myself coherently, let me try again:

 A solution which solves the BPF problem, but does not solve the
 underlying problem of module_alloc() is not acceptable.

Is that clear enough?

> If it is 1), I would like to understand what are the issues that make it not
> suitable for modules. If it is 2), I think a solid, most likely working small
> step toward the right direction is the better way as it makes code reviews
> a lot easier and has much lower risks. Does this make sense?

No. Because all you are interested in is to get your BPF itch scratched
instead of actually sitting down and solving the underlying problem and
thereby creating a benefit for everyone.

You are not making anything easier. You are violating the basic
engineering principle of "Fix the root cause, not the symptom".

By doing that you are actually creating more problems than you
solve. Why?

  Clearly your "solution" does not cover the full requirements of the
  module space because you solely focus on executable memory allocations
  which somehow magically go into the module address space.

  Can you coherently explain how this results in a consistent solution
  for the rest of the module requirements?

  Can you coherently explain why this wont create problems down the road
  for anyone who actually would be willing to solve the root cause?

No, you can't answer any of these questions simply because you never
explored the problem space sufficiently.

I'm not the first one to point this out. Quite some people in the
various threads regarding this issue have been pointing that out to you
before. They even provided you hints on how this can be solved properly
once and forever and for everyones benefits.

> I would also highlight that part of the benefit of this work comes from
> reducing direct map fragmentation. While BPF programs consume less
> memory, they are more dynamic and can cause more direct map
> fragmentation. bpf_prog_pack in the upstream kernel already covers this
> part, but this set is a better solution than bpf_prog_pack.
>
> Finally, I would like to point out that 5/6 and 6/6 of (v5) the set let BPF
> programs share a 2MB page with static kernel text. Therefore, even
> for systems without many BPF programs, we should already see some
> reduction in iTLB misses.

Can you please stop this marketing nonsense? As I pointed out to you in
the very mail which you are replying to, the influence of BPF on the
system I picked randomly out of the pool is pretty close to ZERO.

Ergo, the reduction of iTLB misses is going to be equally close to
ZERO. What is the benefit you are trying to sell me?

I'm happy to run perf on this machine and provide the numbers which put
your 'we should already see some reduction' handwaving into perspective.

But the above is just a distraction. The real point is:

You can highlight and point out the benefits of your BPF specific
solution as much as you want, it does not make the fact that you are
"fixing" the symptom instead of the root cause magically go away.

Again for the record:

  The iTLB pressure problem, which affects modules, kprobes, tracing and
  BPF, is caused by the way module_alloc() is implemented.

That's the root cause and this needs to be solved for _ALL_ of the users
of this infrastructure and not worked around by adding something which
makes BPF shiny and handwaves about that it solves the underlying
problem.

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-21 14:52                   ` Song Liu
@ 2022-11-30  9:39                     ` Mike Rapoport
  0 siblings, 0 replies; 93+ messages in thread
From: Mike Rapoport @ 2022-11-30  9:39 UTC (permalink / raw)
  To: Song Liu
  Cc: Edgecombe, Rick P, peterz, bpf, linux-mm, hch, x86, akpm, mcgrof,
	Lu, Aaron

On Mon, Nov 21, 2022 at 07:52:12AM -0700, Song Liu wrote:
> On Sun, Nov 20, 2022 at 3:41 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > The most obvious one to implement Peter's suggestion with VM_TOPDOWN_VMAP
> > > > so that execmem_alloc() can be actually used by modules.
> > >
> > > Current implementation is an alternative to VM_TOPDOWN_VMAP. I am
> > > very sure it works for modules just like VM_TOPDOWN_VMAP solution.
> >
> > It might, but it still does not. And until they do I consider these
> > patches as an optimization for BPF rather than unification of code
> > allocations.
> 
> We haven't gotten modules to use execmem_alloc yet, that's true. But
> this has nothing to do with VM_TOPDOWN_VMAP at all.

The point is that Peter suggested a way to make module_alloc() use 2M
pages, so that all code allocations could benefit from it. Your
execmem_alloc() isn't anywhere near this.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-29 23:56       ` Thomas Gleixner
@ 2022-11-30 16:18         ` Song Liu
  2022-12-01  9:08           ` Thomas Gleixner
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-11-30 16:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Hi Thomas,

On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song,
>
> On Tue, Nov 29 2022 at 09:26, Song Liu wrote:
> > On Tue, Nov 29, 2022 at 2:23 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Modules are the obvious starting point. Once that is solved pretty much
> >> everything else falls into place including BPF.
> >>
> >> Without modules support this whole exercise is pointless and not going
> >> anywhere near x86.
> >
> > I am not sure I fully understand your point here. Do you mean
> >
> > 1) There is something wrong with this solution, that makes it not suitable
> > for modules;
> >    or
> > 2) The solution is in the right direction and will very likely work for
> > modules, but we haven't finished module support yet?
>
> As I'm obviously unable to express myself coherently, let me try again:
>
>  A solution which solves the BPF problem, but does not solve the
>  underlying problem of module_alloc() is not acceptable.
>
> Is that clear enough?

While I sincerely want to provide a solution not just for BPF but also
for modules and others, I don't think I fully understand the underlying
problem of module_alloc(). I would like to learn more about it.

>
> > If it is 1), I would like to understand what are the issues that make it not
> > suitable for modules. If it is 2), I think a solid, mostly like working small
> > step toward the right direction is the better way as it makes code reviews
> > a lot easier and has much lower risks. Does this make sense?
>
> No. Because all you are interested in is to get your BPF itch scratched
> instead of actually sitting down and solving the underlying problem and
> thereby creating a benefit for everyone.

TBH, until your reply, I thought I was working on something that would
benefit everyone. It is indeed not just a BPF itch, as bpf_prog_pack
already scratched that for BPF.

>
> You are not making anything easier. You are violating the basic
> engineering principle of "Fix the root cause, not the symptom".
>

I am not sure what is the root cause and the symptom here. I
understand the ideas referred to in this LWN article:

   https://lwn.net/Articles/894557/

But I don't know which one of them (if any) would fix the root cause.

> By doing that you are actually creating more problems than you
> solve. Why?
>
>   Clearly your "solution" does not cover the full requirements of the
>   module space because you solely focus on executable memory allocations
>   which somehow magically go into the module address space.
>
>   Can you coherently explain how this results in a consistent solution
>   for the rest of the module requirements?
>
>   Can you coherently explain why this wont create problems down the road
>   for anyone who actually would be willing to solve the root cause?
>
> No, you can't answer any of these questions simply because you never
> explored the problem space sufficiently.

I was thinking, for modules, we only need something new for module text,
and module data will just use vmalloc(). I guess this is probably not the
right solution?

>
> I'm not the first one to point this out. Quite some people in the
> various threads regarding this issue have been pointing that out to you
> before. They even provided you hints on how this can be solved properly
> once and forever and for everyones benefits.

I tried to review various threads. Unfortunately, I am not able to identify
the proper hints and construct a solution.

>
> > I would also highlight that part of the benefit of this work comes from
> > reducing direct map fragmentation. While BPF programs consume less
> > memory, they are more dynamic and can cause more direct map
> > fragmentation. bpf_prog_pack in the upstream kernel already covers this
> > part, but this set is a better solution than bpf_prog_pack.
> >
> > Finally, I would like to point out that 5/6 and 6/6 of (v5) the set let BPF
> > programs share a 2MB page with static kernel text. Therefore, even
> > for systems without many BPF programs, we should already see some
> > reduction in iTLB misses.
>
> Can you please stop this marketing nonsense? As I pointed out to you in
> the very mail which you are replying to, the influence of BPF on the
> system I picked randomly out of the pool is pretty close to ZERO.
>
> Ergo, the reduction of iTLB misses is going to be equally close to
> ZERO. What is the benefit you are trying to sell me?
>
> I'm happy to run perf on this machine and provide the numbers which put
> your 'we should already see some reduction' handwaving into perspective.
>
> But the above is just a distraction. The real point is:
>
> You can highlight and point out the benefits of your BPF specific
> solution as much as you want, it does not make the fact that you are
> "fixing" the symptom instead of the root cause magically go away.
>
> Again for the record:
>
>   The iTLB pressure problem, which affects modules, kprobes, tracing and
>   BPF, is caused by the way module_alloc() is implemented.

TBH, I don't think I understand this...

Do you mean the problem with module_alloc() is that it is not aware of
the desired permissions (W or X or neither)? If so, is the permission
vmalloc series [1] the right direction for this?

[1] https://lwn.net/ml/linux-mm/20201120202426.18009-1-rick.p.edgecombe@intel.com/

>
> That's the root cause and this needs to be solved for _ALL_ of the users
> of this infrastructure and not worked around by adding something which
> makes BPF shiny and handwaves about that it solves the underlying
> problem.

While I did plan to enable 2MB pages for module text, I didn't plan to
solve it in the first set. However, since you think it is possible and would
like to provide directions, I am up for the challenge and will give it a try.
Please share more details about the right direction. Otherwise, I am
still lost...

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-30 16:18         ` Song Liu
@ 2022-12-01  9:08           ` Thomas Gleixner
  2022-12-01 19:31             ` Song Liu
  2022-12-01 20:23             ` Mike Rapoport
  0 siblings, 2 replies; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-01  9:08 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Song!

On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> You are not making anything easier. You are violating the basic
>> engineering principle of "Fix the root cause, not the symptom".
>>
>
> I am not sure what is the root cause and the symptom here.

The symptom is iTLB pressure. The root cause is the way module
memory is allocated, which in turn causes the fragmentation into
4k PTEs. That's the same problem for anything which uses module_alloc()
to get space for text, e.g. kprobes, tracing....

A module consists of:

  - text sections
  - data sections

Except for PPC32, which has the module data in vmalloc space, all others
allocate text and data sections in one lump.

This en-bloc allocation is one reason for the 4k splits:

   - text is RX
   - data is RW or RO

Truly vmalloc'ed module data is not an option for 64-bit architectures
which use PC-relative addressing, as vmalloc does not guarantee that the
data ends up within the limited displacement range (s32 on x86-64).

This made me look at your allocator again:

> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define EXEC_MEM_START MODULES_VADDR
> +#define EXEC_MEM_END MODULES_END
> +#else
> +#define EXEC_MEM_START VMALLOC_START
> +#define EXEC_MEM_END VMALLOC_END
> +#endif

The #else part is completely broken on x86/64 and any other
architecture, which has PC relative restricted displacement.

Even if modules are disabled in Kconfig the only safe place to allocate
executable kernel text from (on these architectures) is the modules
address space. The ISA restrictions do not go magically away when
modules are disabled.

In the early version of the SKX retbleed mitigation work I had

  https://lore.kernel.org/all/20220716230953.442937066@linutronix.de

exactly to handle this correctly for the !MODULE case. It went nowhere
as we did not need the trampolines in the final version.

This is why Peter suggested to 'split' the module address range into a
top down and bottom up part:

  https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
  
That obviously separates text and data, but keeps everything within the
defined working range.

It immediately solves the text problem for _all_ module_alloc() users
and still leaves the data split into 4k pages due to RO/RW sections.

But after staring at it for a while I think this top down and bottom up
dance is too much effort for not much gain. The module address space is
sized generously, so the straightforward solution is to split that
space into two blocks and use them to allocate text and data separately.

The rest of Peter's suggestions how to migrate there still apply.

The init sections of a module are obviously separate as they are freed
after the module is initialized, but they are not really special either.
Today they leave holes in the address range. With the new scheme these
holes will be in the memory backed large mapping, but I don't see a real
issue with that, especially as those holes at least in text can be
reused for small allocations (kprobes, trace, bpf).

As a logical next step we make that three blocks and allocate text,
data and rodata separately, which will preserve the large mappings for
text and data. rodata still needs to be split because we need a space to
accommodate ro_after_init data.

Alternatively, instead of splitting the module address space, the
allocation mechanism can keep track of the types (text, data, rodata)
and manage large mapping blocks per type. There are pros and cons for
both approaches, so that needs some thought.

But at the end we want an allocation mechanism which:

  - preserves large mappings
  - handles a distinct address range
  - is mapping type aware

That solves _all_ the issues of modules, kprobes, tracing, bpf in one
go. See?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-01  9:08           ` Thomas Gleixner
@ 2022-12-01 19:31             ` Song Liu
  2022-12-02  1:38               ` Thomas Gleixner
  2022-12-01 20:23             ` Mike Rapoport
  1 sibling, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-12-01 19:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Hi Thomas,

Thanks for these insights! They are really helpful!


On Thu, Dec 1, 2022 at 1:08 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song!
>
> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> You are not making anything easier. You are violating the basic
> >> engineering principle of "Fix the root cause, not the symptom".
> >>
> >
> > I am not sure what is the root cause and the symptom here.
>

[...]

>
> This made me look at your allocator again:
>
> > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> > +#define EXEC_MEM_START MODULES_VADDR
> > +#define EXEC_MEM_END MODULES_END
> > +#else
> > +#define EXEC_MEM_START VMALLOC_START
> > +#define EXEC_MEM_END VMALLOC_END
> > +#endif
>
> The #else part is completely broken on x86/64 and any other
> architecture, which has PC relative restricted displacement.

Yeah, the #else part is just to make it build. It is not really usable.

>
> Even if modules are disabled in Kconfig the only safe place to allocate
> executable kernel text from (on these architectures) is the modules
> address space. The ISA restrictions do not go magically away when
> modules are disabled.
>
> In the early version of the SKX retbleed mitigation work I had
>
>   https://lore.kernel.org/all/20220716230953.442937066@linutronix.de
>
> exactly to handle this correctly for the !MODULE case. It went nowhere
> as we did not need the trampolines in the final version.

I remember there was some other work to use module_alloc for ftrace, etc.
without CONFIG_MODULES. One of these versions would work here.

>
> This is why Peter suggested to 'split' the module address range into a
> top down and bottom up part:
>
>   https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>
> That obviously separates text and data, but keeps everything within the
> defined working range.
>
> It immediately solves the text problem for _all_ module_alloc() users
> and still leaves the data split into 4k pages due to RO/RW sections.
>
> But after staring at it for a while I think this top down and bottom up
> dance is too much effort for not much gain. The module address space is
> sized generously, so the straightforward solution is to split that
> space into two blocks and use them to allocate text and data separately.
>
> The rest of Peter's suggestions how to migrate there still apply.
>
> The init sections of a module are obviously separate as they are freed
> after the module is initialized, but they are not really special either.
> Today they leave holes in the address range. With the new scheme these
> holes will be in the memory backed large mapping, but I don't see a real
> issue with that, especially as those holes at least in text can be
> reused for small allocations (kprobes, trace, bpf).
>
> As a logical next step we make that three blocks and allocate text,
> data and rodata separately, which will preserve the large mappings for
> text and data. rodata still needs to be split because we need a space to
> accommodate ro_after_init data.
>
> Alternatively, instead of splitting the module address space, the
> allocation mechanism can keep track of the types (text, data, rodata)
> and manage large mapping blocks per type. There are pros and cons for
> both approaches, so that needs some thought.

AFAICT, the new allocator (let's call it module_alloc_new here)
requires quite different logic from the existing vmalloc logic (or
module_alloc logic):

1. vmalloc works at PAGE_SIZE granularity at minimum, while ftrace, BPF,
    etc. would benefit from a much smaller granularity.
2. vmalloc maintains a 1-to-1 mapping between a virtual address range
    (vmap_area in vmap_area_root) and physical pages (vm_struct), while
    module_alloc_new allocates physical pages in 2MB chunks and
    maintains multiple vmap_areas within a single 2MB chunk.

To solve this, I introduced a new tree, free_text_area_root; address ranges
in this tree are backed with ROX physical pages, but not yet used by any user.
I think some logic like this is always needed.

With this logic in place, I think we don't really need to split the module
address space. Instead, we can have 3 trees:
  free_module_text_area_root;
  free_module_data_area_root;
  free_module_ro_data_area_root;

Similar to free_text_area_root, we add virtual addresses and physical pages
to these trees in 2MB chunks, and hand virtual address ranges out to users
at a smaller granularity.

What do you think about this idea?

>
> But at the end we want an allocation mechanism which:
>
>   - preserves large mappings
>   - handles a distinct address range
>   - is mapping type aware
>
> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
> go. See?

I think the user still needs to use module_alloc_new() differently. At the
moment, the user does something like:

my_text = module_alloc(size);
set_vm_flush_reset_perms(my_text);
update_my_text(my_text);
set_memory_ro(my_text);
set_memory_x(my_text);
/* use my_text */

With module_alloc_new(), my_text buffer is RX right out of the
allocator, so some text_poke mechanism is needed. In some cases,
the user also needs some logic to handle relative call/jump. It is
something like:

my_text = module_alloc_new(size, MODULE_MEM_TEXT);
my_tmp_buf = vmalloc(size);
update_my_text(my_tmp_buf);
adjust_rela_calls(my_tmp_buf, my_text);
text_poke_copy(my_text, my_tmp_buf, size);
vfree(my_tmp_buf);
/* use my_text */

There are also archs that do not support text_poke, so we need some
logic, especially for modules, to handle them properly. For example,
Rick suggested something like:

For non-text_poke() architectures, the way you can make it work is have
the API look like:
execmem_alloc()  <- Does the allocation, but not necessarily usable yet
execmem_write()  <- Loads the mapping, doesn't work after finish()
execmem_finish() <- Makes the mapping live (loaded, executable, ready)

So for text_poke():
execmem_alloc()  <- reserves the mapping
execmem_write()  <- text_pokes() to the mapping
execmem_finish() <- does nothing

And non-text_poke():
execmem_alloc()  <- Allocates a regular RW vmalloc allocation
execmem_write()  <- Writes normally to it
execmem_finish() <- does set_memory_ro()/set_memory_x() on it

Does this sound like the best path forward to you?

Also, do you have suggestions on the name of the API? Maybe
something like:

enum module_mem_type {
    MODULE_MEM_TEXT,
    MODULE_MEM_DATA,
    MODULE_MEM_RODATA,
};
module_alloc_type(size_t len, enum module_mem_type type);
module_free_type(ptr);   /* I guess we may or may not need the type here */

Thanks,
Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-01  9:08           ` Thomas Gleixner
  2022-12-01 19:31             ` Song Liu
@ 2022-12-01 20:23             ` Mike Rapoport
  2022-12-01 22:34               ` Thomas Gleixner
  1 sibling, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-12-01 20:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, bpf, linux-mm, peterz, akpm, x86, hch,
	rick.p.edgecombe, aaron.lu, mcgrof

On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
> Song!
> 
> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> You are not making anything easier. You are violating the basic
> >> engineering principle of "Fix the root cause, not the symptom".
> >>
> >
> > I am not sure what is the root cause and the symptom here.
> 
> The symptom is iTLB pressure. The root cause is the way module
> memory is allocated, which in turn causes the fragmentation into
> 4k PTEs. That's the same problem for anything which uses module_alloc()
> to get space for text, e.g. kprobes, tracing....

There's also dTLB pressure caused by the fragmentation of the direct map.
The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
but setting RO in the vmalloc address space also updates the direct map
alias and this causes splits of large pages.

It's not clear which yields the bigger performance improvement: avoiding
splits of large pages in the direct map or reducing iTLB pressure by
backing text memory with 2M pages. If the major improvement comes from
keeping the direct map intact, it might be possible to mix data and text
in the same 2M page.
 
> A module consists of:
> 
>   - text sections
>   - data sections
> 
> Except for PPC32, which has the module data in vmalloc space, all others
> allocate text and data sections in one lump.
> 
> This en-bloc allocation is one reason for the 4k splits:
> 
>    - text is RX
>    - data is RW or RO
> 
> Truly vmalloc'ed module data is not an option for 64-bit architectures
> which use PC-relative addressing, as vmalloc does not guarantee that the
> data ends up within the limited displacement range (s32 on x86-64).
> 
> This made me look at your allocator again:
> 
> > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> > +#define EXEC_MEM_START MODULES_VADDR
> > +#define EXEC_MEM_END MODULES_END
> > +#else
> > +#define EXEC_MEM_START VMALLOC_START
> > +#define EXEC_MEM_END VMALLOC_END
> > +#endif
> 
> The #else part is completely broken on x86/64 and any other
> architecture, which has PC relative restricted displacement.
> 
> Even if modules are disabled in Kconfig the only safe place to allocate
> executable kernel text from (on these architectures) is the modules
> address space. The ISA restrictions do not go magically away when
> modules are disabled.
> 
> In the early version of the SKX retbleed mitigation work I had
> 
>   https://lore.kernel.org/all/20220716230953.442937066@linutronix.de
> 
> exactly to handle this correctly for the !MODULE case. It went nowhere
> as we did not need the trampolines in the final version.
> 
> This is why Peter suggested to 'split' the module address range into a
> top down and bottom up part:
> 
>   https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>   
> That obviously separates text and data, but keeps everything within the
> defined working range.
> 
> It immediately solves the text problem for _all_ module_alloc() users
> and still leaves the data split into 4k pages due to RO/RW sections.
> 
> But after staring at it for a while I think this top down and bottom up
> dance is too much effort for not much gain. The module address space is
> sized generously, so the straightforward solution is to split that
> space into two blocks and use them to allocate text and data separately.
> 
> The rest of Peter's suggestions how to migrate there still apply.
> 
> The init sections of a module are obviously separate as they are freed
> after the module is initialized, but they are not really special either.
> Today they leave holes in the address range. With the new scheme these
> holes will be in the memory backed large mapping, but I don't see a real
> issue with that, especially as those holes at least in text can be
> reused for small allocations (kprobes, trace, bpf).
> 
> As a logical next step we make that three blocks and allocate text,
> data and rodata separately, which will preserve the large mappings for
> text and data. rodata still needs to be split because we need a space to
> accomodate ro_after_init data.
> accommodate ro_after_init data.
> 
> Alternatively, instead of splitting the module address space, the
> allocation mechanism can keep track of the types (text, data, rodata)
> and manage large mapping blocks per type. There are pros and cons for
> both approaches, so that needs some thought.
> 
> But at the end we want an allocation mechanism which:
> 
>   - preserves large mappings
>   - handles a distinct address range
>   - is mapping type aware
> 
> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
> go. See?

There is also

    - handles kaslr

and at least for arm and powerpc we'd also need 

    - handles architecture specific range restrictions and fallbacks
 
> Thanks,
> 
>         tglx

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-01 20:23             ` Mike Rapoport
@ 2022-12-01 22:34               ` Thomas Gleixner
  2022-12-03 14:46                 ` Mike Rapoport
  0 siblings, 1 reply; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-01 22:34 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, bpf, linux-mm, peterz, akpm, x86, hch,
	rick.p.edgecombe, aaron.lu, mcgrof

Mike!

On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote:
> On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
>> The symptom is iTLB pressure. The root cause is the way how module
>> memory is allocated, which in turn causes the fragmentation into
>> 4k PTEs. That's the same problem for anything which uses module_alloc()
>> to get space for text allocated, e.g. kprobes, tracing....
>
> There's also dTLB pressure caused by the fragmentation of the direct map.
> The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
> but setting RO in the vmalloc address space also updates the direct map
> alias and this causes splits of large pages.
>
> It's not clear what causes more performance improvement: avoiding splits of
> large pages in the direct map or reducing iTLB pressure by backing text
> memory with 2M pages.

From our experiments when doing the first version of the SKX retbleed
mitigation, the main improvement came from reducing iTLB pressure simply
because the iTLB cache is really small.

The kernel text placement is way beyond suboptimal. If you really do a
hotpath analysis and (manually) place all hot code into one or two 2M
pages, then you can achieve massive performance improvements way above
the 10% range.

We currently have a master student investigating this, but it will take
some time until usable results materialize.

> If the major improvement comes from keeping the direct map intact, it
> might be possible to mix data and text in the same 2M page.

No. That can't work.

    text = RX
    data = RW or RO

If you mix this, then you end up with RWX for the whole 2M page. Not an
option really as you lose _all_ protections in one go.

That's why I said:

>>      As a logical next step we make that three blocks and allocate text,
>>      data and rodata separately, which will preserve the large mappings for
>>      text and data. rodata still needs to be split because we need a space to
>>      accommodate ro_after_init data.

The point is, that rodata and ro_after_init_data is a pretty small
portion of modules as far as my limited analysis of a distro build
shows.

The bulk is in text and data. So if we preserve 2M pages for text and
for RW data and bite the bullet to split one 2M page for
ro[_after_init_]data, we get the maximum benefit for the least
complexity.

>> But at the end we want an allocation mechanism which:
>> 
>>   - preserves large mappings
>>   - handles a distinct address range
>>   - is mapping type aware
>> 
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> There is also
>
>     - handles kaslr
>
> and at least for arm and powerpc we'd also need 
>
>     - handles architecture specific range restrictions and fallbacks

Good points.

kaslr should be fairly trivial.

The architecture specific restrictions and fallbacks are not really hard
to solve either. If done right then the allocator just falls back to 4k
maps during initialization in early boot which brings it back to the
status quo. But we can provide consistent semantics for the three types
which are required for modules and the text only usage for kprobes,
tracing, bpf...

Thanks,

        tglx


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-01 19:31             ` Song Liu
@ 2022-12-02  1:38               ` Thomas Gleixner
  2022-12-02  8:38                 ` Song Liu
  2022-12-02 10:46                 ` Christophe Leroy
  0 siblings, 2 replies; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-02  1:38 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Song!

On Thu, Dec 01 2022 at 11:31, Song Liu wrote:
> Thanks for these insights! They are really helpful!

I'm glad it helped and did not create more confusion :)

> On Thu, Dec 1, 2022 at 1:08 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> This made me look at your allocator again:
>>
>> > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
>> > +#define EXEC_MEM_START MODULES_VADDR
>> > +#define EXEC_MEM_END MODULES_END
>> > +#else
>> > +#define EXEC_MEM_START VMALLOC_START
>> > +#define EXEC_MEM_END VMALLOC_END
>> > +#endif
>>
>> The #else part is completely broken on x86/64 and any other
>> architecture, which has PC relative restricted displacement.
>
> Yeah, the #else part is just to make it build. It is not really
> usable.

That's not an option for code which is submitted for inclusion. It can
be used and will be used. You know that already, right? So why are you
posting such broken crap?

>> Even if modules are disabled in Kconfig the only safe place to allocate
>> executable kernel text from (on these architectures) is the modules
>> address space. The ISA restrictions do not go magically away when
>> modules are disabled.
>>
>> In the early version of the SKX retbleed mitigation work I had
>>
>>   https://lore.kernel.org/all/20220716230953.442937066@linutronix.de
>>
>> exactly to handle this correctly for the !MODULE case. It went nowhere
>> as we did not need the trampolines in the final version.
>
> I remember there was some other work to use module_alloc for ftrace, etc.
> without CONFIG_MODULES. One of these versions would work here.

We don't want any of these. The new infrastructure which is set out to
replace module_alloc() has to be designed upfront to handle the
CONFIG_MODULES=n case properly.

>> Alternatively, instead of splitting the module address space, the
>> allocation mechanism can keep track of the types (text, data, rodata)
>> and manage large mapping blocks per type. There are pros and cons for
>> both approaches, so that needs some thought.
>
> AFAICT, the new allocator (let's call it module_alloc_new here)
> requires quite some different logic than the existing vmalloc logic (or
> module_alloc logic):

Obviously.

> 1. vmalloc is at least PAGE_SIZE granularity; while ftrace, bpf etc would
>     benefit from a much smaller granularity.

Even modules can benefit from that. The fact that modules have all
sections (text, data, rodata) page aligned and page granular is not due
to a requirement of modules; it's because that's how module_alloc()
works and the module layout has been adapted to it.

text has an architecture/compile time specific alignment requirement for
function entries, data obviously has its own alignment requirements,
but none of those are page granular.

> 2. vmalloc maintains 1-to-1 mapping between virtual address range (vmap
>     in vmap_area_root) and physical pages (vm_struct); while
>     module_alloc_new allocates physical pages in 2MB chunks, and
>     maintains multiple vmap within a single 2MB chunk.
>
> To solve this, I introduced a new tree, free_text_area_root; address ranges
> in this tree are backed with ROX physical pages, but not yet used by any
> user.

Can you please refrain from discussing implementation details _before_
the conceptual details are sorted?

Implementation details are completely irrelevant at this point,
really. Quite the contrary, they just add confusion to the discussion.

The basic concept is:

module_alloc_type(type, size)
{
        if (!size || invalid_type(type))
        	return -ENOPONIES;

	mutex_lock(&m);
        if (!space_available(type, size))
        	alloc_new_blocks(type, size);
        
        p = alloc_space(type, size);
        mutex_unlock(&m);
        return p;
}

and the counterpart

module_free_type(p)
{
        if (!p)
        	return;

	mutex_lock(&m);
        x = lookup(p);
        type = x->type;
        free_space(x) {
        	...
                if (unused_blocks(type))
        		free_unused_blocks(type);
        }
        mutex_unlock(&m);
        return p;
}

> With this logic in place, I think we don't really need to split the module
> address space. Instead, we can have 3 trees:
>   free_module_text_area_root;
>   free_module_data_area_root;
>   free_module_ro_data_area_root;
>
> Similar to free_text_area_root, we add virtual addresses and physical pages
> to these trees in 2MB chunks, and hand virtual address ranges out to users
> at a smaller granularity.

Whatever the actual tracking mechanism is, does not matter. This is
pretty much at the conceptual level the same what I said before:

>> Alternatively, instead of splitting the module address space, the
>> allocation mechanism can keep track of the types (text, data, rodata)
>> and manage large mapping blocks per type. There are pros and cons for
>> both approaches, so that needs some thought.

Right?

You have to be aware, that the rodata space needs to be page granular
while text and data can really aggregate below the page alignment, but
again might have different alignment requirements.

So you need a configuration mechanism which allows to specify per type:

   - Initial mapping type (RX, RWX, RW)
   - Alignment
   - Granularity
   - Address space restrictions

With those 4 parameters you can map all architecture requirements
including kaslr to each type.

>> But at the end we want an allocation mechanism which:
>>
>>   - preserves large mappings
>>   - handles a distinct address range
>>   - is mapping type aware
>>
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> I think the user still needs to use module_alloc_new() differently. At
> the moment, the user does something like.

Neither me nor Peter said, that this ends up as a 1:1 replacement for
the actual module_alloc() users, but that's obvious and not the point.

The point is that the new infrastructure provides a solution which
solves _all_ issues of those users in one go.

The fact that we need to adjust the call sites is more than obvious, but
this can be done with the well known methods of code refactoring.

Step 1:

module_alloc_text(size)
{
        if (!size)
        	return -ENOPONIES;

	mutex_lock(&m);
        p = module_alloc(size);
	mutex_unlock(&m);
        if (p)
       		set_memory_ro+x(p, size);
        return p;
}

and

module_write_text(p, src, size)
{
        // Fill in the architecture specific magic depending on
        // the mapping type for text:
        if (ARCH_TEXT_MAP_TYPE == RX) {
        	text_poke();
        } else {
        	memcpy();
                set_memory_ro+x();
        }
}

Step 2:

Convert the usage sites outside of the core module code which really
only care about text allocations one by one:

-	p = module_alloc(size);
-       memcpy(p, src, size);
+	p = module_alloc_text(size);
+	module_write_text(p, src, size);

After that the only user of module_alloc() left is the module core
code.

Step 3:

Provide functions to prepare for converting the module core. You need to
have module_alloc_data() and module_alloc_rodata() where both implement
in the first step:

module_alloc_[ro]data(size)
{
        if (!size)
        	return -ENOPONIES;

	mutex_lock(&m);
        p = module_alloc(size);
	mutex_unlock(&m);
        if (p)
       		set_memory_rw-nox(p, size);
        return p;
}

Step 4:

Fixup the module code. Break up the en-bloc allocation and allocate
text, data, and rodata separately and adjust the methods to write into
them.

For text that's obviously module_write_text(), but for the [ro]data
mappings memcpy() is still fine. For the rodata mapping you need
set_memory_ro() right in the module prepare stage and for the
ro_after_init_data() you do that after the module init function returns
success, which is pretty much what the code does today.

Step 5:

Now once this is sorted, you add the new allocation magic.

Step 6:

Then you do:

module_alloc_text(size)
{
        if (!size)
        	return -ENOPONIES;

	mutex_lock(&m);
-       p = module_alloc(size);
+	if (IS_ENABLED(CONFIG_MODULE_ALLOC_NEWFANGLED))
+		p = module_alloc_type(TYPE_TEXT, size);
+	else
+		p = module_alloc(size);
	mutex_unlock(&m);
-       if (p)
+	if (!IS_ENABLED(CONFIG_MODULE_ALLOC_NEWFANGLED) && p)
      		set_memory_ro+x(p, size);
        return p;
}

and the corresponding changes to module_alloc_[ro]data()

Step 7:

Then you go and enable it per architecture:

+     select CONFIG_MODULE_ALLOC_NEWFANGLED

+     module_alloc_newfangled_init(&type_parameters);

Step 8:

Once everything is converted over you can remove
CONFIG_MODULE_ALLOC_NEWFANGLED and the associated conversion cruft.

> my_text = module_alloc(size);

<SNIP .... />

> Also, do you have suggestions on the name of the API?

Again, API and implementation details are pointless bikeshedding
material at this point.

I really have a hard time understanding why you are so focussed on
implementation details instead of establishing a common understanding of
the problem, the goals and the concepts in the first place.

Linus once said:

  "Bad programmers worry about the code. Good programmers worry about
   data structures and their relationships."

He's absolutely right. Here is my version of it:

  The order of things to worry about:

      1) Problem analysis
      2) Concepts
      3) Data structures and their relationships
      4) Code

      #1 You need to understand the problem fully to come up with
         concepts

      #2 Once you understand the problem fully you can talk about
         concepts to solve it

      #3 Maps the concept to data structures and forms relationships

      #4 Is the logical consequence of #1 + #2 + #3 and because your
         concept makes sense, the data structures and their
         relationships are understandable, the code becomes
         understandable too.

      If any of the steps finds a gap in the previous ones, then you
      have to go back and solve those first.

Any attempt to reorder the above is putting the cart before the horse
and a guarantee for failure.

Now go back and carefully read up on what I wrote above and in my
previous mail.

The previous mail was mostly about #1 to explain the problem as broad as
possible and an initial stab at #2 suggesting concepts to solve it.

This one is still covering some aspects of #1, but it is mostly about #2
and more focussed on particular aspects of the concept. If you look at
it carefully then you find some bits which map to #3 but still at the
conceptual level.

Did I talk about code or implementation details?

Not at all and I'm not going to do so before #1 and #2 are agreed
on. The above pseudo code snippets are just for illustration and I used
them because I was too lazy to write a novel, but they all are still at
the conceptual level.

Now you can rightfully argue that if you stitch those snippets together
then they form a picture which outlines the implementation, but that's
the whole purpose of this exercise, right?

They still do not worry about any of the implementation details of root
trees, APIs or whatever simply because they are completely irrelevant at
this point of the discussion.

See?

Thanks,

        tglx


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-02  1:38               ` Thomas Gleixner
@ 2022-12-02  8:38                 ` Song Liu
  2022-12-02  9:22                   ` Thomas Gleixner
  2022-12-02 10:46                 ` Christophe Leroy
  1 sibling, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-12-02  8:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Hi Thomas,

Thanks for all these suggestions!

On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>

[...] (everything snipped here makes perfect sense).

>
> You have to be aware, that the rodata space needs to be page granular
> while text and data can really aggregate below the page alignment, but
> again might have different alignment requirements.

I don't quite follow why rodata space needs to be page granular. If text can
go below page granular, rodata should also do that, no?

>
> So you need a configuration mechanism which allows to specify per type:
>
>    - Initial mapping type (RX, RWX, RW)
>    - Alignment
>    - Granularity
>    - Address space restrictions

[...]

>
> Step 1:
>

These steps are really helpful. Thanks!
[...]
>
> For text that's obviously module_write_text(), but for the [ro]data
> mappings memcpy() is still fine. For the rodata mapping you need
> set_memory_ro() right in the module prepare stage and for the
> ro_after_init_data() you do that after the module init function returns
> success, which is pretty much what the code does today.

I guess this is related to rodata needing to be page granular? But I
don't think I got the idea. Do we allow rodata and rwdata to share the
same 2MB page? ro_after_init_data seems trickier.

>
> Step 5:
>
[...]

>
> Linus once said:
>
>   "Bad programmers worry about the code. Good programmers worry about
>    data structures and their relationships."
>
> He's absolutely right. Here is my version of it:
>
>   The order of things to worry about:
>
>       1) Problem analysis
>       2) Concepts
>       3) Data structures and their relationships
>       4) Code
>
>       #1 You need to understand the problem fully to come up with
>          concepts
>
>       #2 Once you understand the problem fully you can talk about
>          concepts to solve it
>
>       #3 Maps the concept to data structures and forms relationships
>
>       #4 Is the logical consequence of #1 + #2 + #3 and because your
>          concept makes sense, the data structures and their
>          relationships are understandable, the code becomes
>          understandable too.
>
>       If any of the steps finds a gap in the previous ones, then you
>       have to go back and solve those first.
>
> Any attempt to reorder the above is putting the cart before the horse
> and a guarantee for failure.

Thanks for this advice! It will help me for many years.

>
> Now go back and carefully read up on what I wrote above and in my
> previous mail.
>
> The previous mail was mostly about #1 to explain the problem as broad as
> possible and an initial stab at #2 suggesting concepts to solve it.
>
> This one is still covering some aspects of #1, but it is mostly about #2
> and more focussed on particular aspects of the concept. If you look at
> it carefully then you find some bits which map to #3 but still at the
> conceptual level.
>
> Did I talk about code or implementation details?
>
> Not at all and I'm not going to do so before #1 and #2 are agreed
> on. The above pseudo code snippets are just for illustration and I used
> them because I was too lazy to write a novel, but they all are still at
> the conceptual level.
>
> Now you can rightfully argue that if you stich those snippets together
> then they form a picture which outlines the implementation, but that's
> the whole purpose of this exercise, right?

I guess I will do my homework, and come back with as much information
as possible for #1 + #2 + #3. Then, we can discuss whether it makes
sense at all.

Does this sound like the right approach?

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-02  8:38                 ` Song Liu
@ 2022-12-02  9:22                   ` Thomas Gleixner
  2022-12-06 20:25                     ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-02  9:22 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Song!

On Fri, Dec 02 2022 at 00:38, Song Liu wrote:
> Thanks for all these suggestions!

Welcome.

> On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> You have to be aware, that the rodata space needs to be page granular
>> while text and data can really aggregate below the page alignment, but
>> again might have different alignment requirements.
>
> I don't quite follow why rodata space needs to be page granular. If text can
> go below page granular, rodata should also do that, no?

Of course it can, except for the case of ro_after_init_data, because
that needs to be RW during module_init() and is then switched to RO when
module_init() returns success. So for that you need page granular maps
per module, right?

Sure you can have a separate space for rodata and ro_after_init_data,
but as I said to Mike:

  "The point is, that rodata and ro_after_init_data is a pretty small
   portion of modules as far as my limited analysis of a distro build
   shows.

   The bulk is in text and data. So if we preserve 2M pages for text and
   for RW data and bite the bullet to split one 2M page for
   ro[_after_init_]data, we get the maximum benefit for the least
   complexity."

So under the assumption that rodata is small, it's questionable whether
the split of rodata and ro_after_init_data makes a lot of difference. It
might, but that needs to be investigated.

That's not a fundamental conceptual problem because adding a 4th type to
the concept we outlined so far is straightforward, right?

> I guess I will do my homework, and come back with as much information
> as possible for #1 + #2 + #3. Then, we can discuss whether it makes
> sense at all.

Correct. Please have a close look at the 11 architecture specific
module_alloc() variants so you can see what kind of tweaks and magic
they need, which lets you better specify the needs for the
initialization parameter set required.

> Does this sound like the right approach?

Very much so.

Thanks,

        tglx


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-02  1:38               ` Thomas Gleixner
  2022-12-02  8:38                 ` Song Liu
@ 2022-12-02 10:46                 ` Christophe Leroy
  2022-12-02 17:43                   ` Thomas Gleixner
  1 sibling, 1 reply; 93+ messages in thread
From: Christophe Leroy @ 2022-12-02 10:46 UTC (permalink / raw)
  To: Thomas Gleixner, Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof



Le 02/12/2022 à 02:38, Thomas Gleixner a écrit :
> 
>>> Alternatively, instead of splitting the module address space, the
>>> allocation mechanism can keep track of the types (text, data, rodata)
>>> and manage large mapping blocks per type. There are pros and cons for
>>> both approaches, so that needs some thought.
>>
>> AFAICT, the new allocator (let's call it module_alloc_new here)
>> requires quite some different logic than the existing vmalloc logic (or
>> module_alloc logic):
> 
> Obviously.
> 
>> 1. vmalloc is at least PAGE_SIZE granularity; while ftrace, bpf etc would
>>      benefit from a much smaller granularity.
> 
> Even modules can benefit from that. The fact that modules have all
> sections (text, data, rodata) page aligned and page granular is not due
> to a requirement of modules; it's because that's how module_alloc()
> works and the module layout has been adapted to it.


Sections are page aligned only when STRICT_MODULE_RWX is selected.

See commit 3b5be16c7e90 ("modules: page-align module section allocations 
only for arches supporting strict module rwx")

Today it is implemented as follows:

	static inline unsigned int strict_align(unsigned int size)
	{
		if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
			return PAGE_ALIGN(size);
		else
			return size;
	}


Christophe


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-02 10:46                 ` Christophe Leroy
@ 2022-12-02 17:43                   ` Thomas Gleixner
  0 siblings, 0 replies; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-02 17:43 UTC (permalink / raw)
  To: Christophe Leroy, Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

On Fri, Dec 02 2022 at 10:46, Christophe Leroy wrote:
> Le 02/12/2022 à 02:38, Thomas Gleixner a écrit :
>> Even modules can benefit from that. The fact that modules have all
>> sections (text, data, rodata) page aligned and page granular is not due
>> to a requirement of modules; it's because that's how module_alloc()
>> works and the module layout has been adapted to it.
>
> Sections are page aligned only when STRICT_MODULE_RWX is selected.

Correct, but without strict permission separation we would not debate
this at all. Everything would be RWX and fine.

For separation my point still stands that the problem is that
module_alloc() is just doing an en-bloc allocation, which needs to be
split into RX, RW, RO afterwards and that consequently splits the large
mappings apart. Which in turn means text, data, rodata have to be page
aligned and page granular.

The typed approach and having a mechanism to preserve the underlying
large page mappings is the broadest scope we have to cover.

An RWX only architecture is just the most trivial case of such an
infrastructure. The types become all the same, the underlying page size
does not matter, but it's just a configuration variant.

Architectures which support strict separation, but only small pages are
a configuration variant too.

All of them can use the same infrastructure and the same API to
alloc/free and write/update the allocated memory. The configuration will
take care to pick the appropriate mechanisms.

No?

Thanks,

        tglx

 


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-01 22:34               ` Thomas Gleixner
@ 2022-12-03 14:46                 ` Mike Rapoport
  2022-12-03 20:58                   ` Thomas Gleixner
  0 siblings, 1 reply; 93+ messages in thread
From: Mike Rapoport @ 2022-12-03 14:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, bpf, linux-mm, peterz, akpm, x86, hch,
	rick.p.edgecombe, aaron.lu, mcgrof

Hi Thomas,

On Thu, Dec 01, 2022 at 11:34:57PM +0100, Thomas Gleixner wrote:
> Mike!
> 
> On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote:
> > On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
> >> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> >> The symptom is iTLB pressure. The root cause is the way how module
> >> memory is allocated, which in turn causes the fragmentation into
> >> 4k PTEs. That's the same problem for anything which uses module_alloc()
> >> to get space for text allocated, e.g. kprobes, tracing....
> >
> > There's also dTLB pressure caused by the fragmentation of the direct map.
> > The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
> > but setting RO in the vmalloc address space also updates the direct map
> > alias and this causes splits of large pages.
> >
> > It's not clear what causes more performance improvement: avoiding splits of
> > large pages in the direct map or reducing iTLB pressure by backing text
> > memory with 2M pages.
> 
> From our experiments when doing the first version of the SKX retbleed
> mitigation, the main improvement came from reducing iTLB pressure simply
> because the iTLB cache is really small.
> 
> The kernel text placement is way beyond suboptimal. If you really do a
> hotpath analysis and (manually) place all hot code into one or two 2M
> pages, then you can achieve massive performance improvements way above
> the 10% range.
> 
> We currently have a master student investigating this, but it will take
> some time until usable results materialize.
> 
> > If the major improvement comes from keeping the direct map intact, it
> > might be possible to mix data and text in the same 2M page.
> 
> No. That can't work.
> 
>     text = RX
>     data = RW or RO
> 
> If you mix this, then you end up with RWX for the whole 2M page. Not an
> option really as you lose _all_ protections in one go.

I meant to take one 2M page from the direct map and split it to 4K in the
module address space. Then the protection could be done at PTE level after
relocations etc and it would save the dance with text poking. But if
mapping the code with 2M pages gives massive performance improvements,
it's surely better to keep 2M pages in the modules space.
 
> Thanks,
> 
>         tglx

-- 
Sincerely yours,
Mike.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-03 14:46                 ` Mike Rapoport
@ 2022-12-03 20:58                   ` Thomas Gleixner
  0 siblings, 0 replies; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-03 20:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, bpf, linux-mm, peterz, akpm, x86, hch,
	rick.p.edgecombe, aaron.lu, mcgrof

On Sat, Dec 03 2022 at 16:46, Mike Rapoport wrote:
> On Thu, Dec 01, 2022 at 11:34:57PM +0100, Thomas Gleixner wrote:
>> If you mix this, then you end up with RWX for the whole 2M page. Not an
>> option really as you lose _all_ protections in one go.
>
> I meant to take one 2M page from the direct map and split it to 4K in the
> module address space. Then the protection could be done at PTE level after
> relocations etc and it would save the dance with text poking.

I see what you meant.

> But if mapping the code with 2M pages gives massive performance
> improvements, it's surely better to keep 2M pages in the modules
> space.

Yes.


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-02  9:22                   ` Thomas Gleixner
@ 2022-12-06 20:25                     ` Song Liu
  2022-12-07 15:36                       ` Thomas Gleixner
  0 siblings, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-12-06 20:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof

Hi Thomas,

Thanks again for your suggestions. Here is my homework so far.

On Fri, Dec 2, 2022 at 1:22 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song!
>
> On Fri, Dec 02 2022 at 00:38, Song Liu wrote:
> > Thanks for all these suggestions!
>
> Welcome.
>
> > On Thu, Dec 1, 2022 at 5:38 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> You have to be aware, that the rodata space needs to be page granular
> >> while text and data can really aggregate below the page alignment, but
> >> again might have different alignment requirements.
> >
> > I don't quite follow why rodata space needs to be page granular. If text can
> > go below page granular, rodata should also do that, no?
>
> Of course it can, except for the case of ro_after_init_data, because
> that needs to be RW during module_init() and is then switched to RO when
> module_init() returns success. So for that you need page granular maps
> per module, right?
>
> Sure you can have a separate space for rodata and ro_after_init_data,
> but as I said to Mike:
>
>   "The point is, that rodata and ro_after_init_data is a pretty small
>    portion of modules as far as my limited analysis of a distro build
>    shows.
>
>    The bulk is in text and data. So if we preserve 2M pages for text and
>    for RW data and bite the bullet to split one 2M page for
>    ro[_after_init_]data, we get the maximum benefit for the least
>    complexity."
>
> So under the assumption that rodata is small, it's questionable whether
> the split of rodata and ro_after_init_data makes a lot of difference. It
> might, but that needs to be investigated.
>
> That's not a fundamental conceptual problem because adding a 4th type to
> the concept we outlined so far is straight forward, right?
>
> > I guess I will do my homework, and come back with as much information
> > as possible for #1 + #2 + #3. Then, we can discuss whether it makes
> > sense at all.
>
> Correct. Please have a close look at the 11 architecture specific
> module_alloc() variants so you can see what kind of tweaks and magic
> they need, which lets you better specify the needs for the
> initialization parameter set required.

Survey of the 11 architecture specific module_alloc() variants. They
basically do the following magic:

1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
  reasons behind this: some archs do this for KASLR, some archs
  have different MODULES_[VADDR|END] for different processors (32b vs.
  64b for example), and some archs use part of the module address space
  for other things (e.g. _exiprom on arm).

Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch, sparc

2. Use kasan_alloc_module_shadow()

Archs need 2: x86, arm64, s390

3. A secondary module address space. There is a smaller preferred
  address space for modules. Once the preferred space runs out, allocate
  memory from a secondary address space.

Archs need 3: some ppc, arm, arm64 (PLTS on arm and arm64)

4. Use different pgprot_t (PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.)

5. sparc does memset(ptr, 0, size) in module_alloc()

6. nios2 uses kmalloc() for modules. Based on the comment, this is
  probably only because it needs different MODULES_[VADDR|END].

I think we can handle all these with a single module_alloc() and a few
module_arch_*() functions.

unsigned long module_arch_vaddr(void);
unsigned long module_arch_end(void);
unsigned long module_arch_secondary_vaddr(void);
unsigned long module_arch_secondary_end(void);
pgprot_t module_arch_pgprot(alloc_type type);
void *module_arch_initialize(void *s, size_t n);
bool module_arch_do_kasan_shadow(void);

So module_alloc() would look like:

void *module_alloc(unsigned long size, pgprot_t prot, unsigned long align,
       unsigned long granularity, alloc_type type)
{
    unsigned long vm_flags = VM_FLUSH_RESET_PERMS |
        (module_arch_do_kasan_shadow() ? VM_DEFER_KMEMLEAK : 0);
    void *ptr;

    ptr = __vmalloc_node_range(size, align, module_arch_vaddr(),
        module_arch_end(), GFP_KERNEL, module_arch_pgprot(type), vm_flags,
        NUMA_NO_NODE, __builtin_return_address(0));

    if (!ptr && module_arch_secondary_vaddr() != module_arch_secondary_end())
        ptr = __vmalloc_node_range(size, align, module_arch_secondary_vaddr(),
            module_arch_secondary_end(), GFP_KERNEL, module_arch_pgprot(type),
            vm_flags, NUMA_NO_NODE, __builtin_return_address(0));

    if (!ptr)
        return NULL;

    if (module_arch_do_kasan_shadow() &&
        kasan_alloc_module_shadow(ptr, size, GFP_KERNEL) < 0) {
        vfree(ptr);
        return NULL;
    }
    module_arch_initialize(ptr, size);
    return ptr;
}

This is not really pretty, but I don't have a better idea at the moment.


For the allocation type, there are technically 5 of them:

ALLOC_TYPE_RX,   /* text */
ALLOC_TYPE_RW,    /* rw data */
ALLOC_TYPE_RO,    /* ro data */
ALLOC_TYPE_RO_AFTER_INIT,
ALLOC_TYPE_RWX,   /* legacy, existing module_alloc behavior */

Given that RO and RO_AFTER_INIT require page alignment and are
relatively small, I think we can merge them with RWX.

For RX and RW, we can allocate huge pages, and cut subpage
chunks out for users (something similar to 1/6 of the set).

For RWX, we have 2 options:
1. Use similar logic to RX and RW, but with PAGE granularity, and
    do set_memory_ro() on it.
2. Keep the current module_alloc() behavior.

1 is better at protecting the direct map (less fragmentation), while
2 is probably a little simpler. Given that module load/unload is a
rare event on most systems, I personally think we can start
with option 2.


We also need to redesign module_layout. Right now, we have
up to 3 layouts: core, init, and data. We will need 6 allocations:
  core text,
  core rw data,
  core ro and ro_after_init data (one allocation)
  init text,
  init rw data,
  init ro data.

PS: how much do we benefit from separate core and init layouts?
Maybe it is time to merge the two (keep the init part around until
the module unloads)?

The above is my Problem analysis and Concepts.

For data structures, I propose we use two extra trees for RX
and RW allocation (similar to 1/6 of current version, but 2x
trees). For RWX, we keep current module_alloc() behavior,
so no new data structure is needed.

The new module_layout will be something like:

struct module_layout {
    void *ptr;
    unsigned int size;  /* text size, rw data size, ro + ro_after_init size */
    unsigned int ro_size;    /* ro_size for ro + ro_after_init allocation */
};

So that's all I have so far. Please share your comments and
suggestions on it.

One more question: shall we make module sections page
aligned even without STRICT_MODULE_RWX? It appears to be
a good way to simplify the logic, but it may cause too much
memory waste on smaller processors.

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-06 20:25                     ` Song Liu
@ 2022-12-07 15:36                       ` Thomas Gleixner
  2022-12-07 16:53                         ` Christophe Leroy
  2022-12-07 19:26                         ` Song Liu
  0 siblings, 2 replies; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-07 15:36 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen

Song!

On Tue, Dec 06 2022 at 12:25, Song Liu wrote:
> On Fri, Dec 2, 2022 at 1:22 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Correct. Please have a close look at the 11 architecture specific
>> module_alloc() variants so you can see what kind of tweaks and magic
>> they need, which lets you better specify the needs for the
>> initialization parameter set required.
>
> Survey of the 11 architecture specific module_alloc(). They basically do
> the following magic:
>
> 1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
>   reasons behind this, some arch does this for KASLR, some other archs
>   have different MODULES_[VADDR|END] for different processors (32b vs.
>   64b for example), some archs use some module address space for other
>   things (i.e. _exiprom on arm).
>
> Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch,
> sparc

All of this is pretty much a boot time init decision, right?

> 2. Use kasan_alloc_module_shadow()
>
> Archs need 2: x86, arm64, s390

There is nothing really architecture specific, so that can be part of
the core code, right?

> 3. A secondary module address space. There is a smaller preferred
>   address space for modules. Once the preferred space runs out, allocate
>   memory from a secondary address space.
>
> Archs need 3: some ppc, arm, arm64 (PLTS on arm and arm64)

Right.

> 4. Use different pgprot_t (PAGE_KERNEL, PAGE_KERNEL_EXEC, etc.)
>
> 5. sparc does memset(ptr, 0, size) in module_alloc()

which is pointless if you do a __GFP_ZERO allocation, but sure.

> 6. nios2 uses kmalloc() for modules. Based on the comment, this is
>   probably only because it needs different MODULES_[VADDR|END].

It's a horrible hack because they decided to have their layout:

     VMALLOC_SPACE   0x80000000
     KERNEL_SPACE    0xC0000000

and they use kmalloc because CALL26/PCREL26 cannot reach from 0x80000000
to 0xC0000000. That's true, but broken beyond repair.

Making the layout:

     VMALLOC_SPACE   0x80000000
     MODULE_SPACE    0xBE000000         == 0xC0000000 - (1 << 25) (32M)
or     
     MODULE_SPACE    0xBF000000         == 0xC0000000 - (1 << 24) (16M)
     KERNEL_SPACE    0xC0000000

would have been too obvious...

> I think we can handle all these with a single module_alloc() and a few
> module_arch_*() functions.
>
> unsigned long module_arch_vaddr(void);
> unsigned long module_arch_end(void);
> unsigned long module_arch_secondary_vaddr(void);
> unsigned long module_arch_secondary_end(void);
> pgprot_t module_arch_pgprot(alloc_type type);
> void *module_arch_initialize(void *s, size_t n);
> bool module_arch_do_kasan_shadow(void);

Why? None of these functions is required at all.

Go back to one of my previous replies:

>> +     select CONFIG_MODULE_ALLOC_NEWFANGLED
>> 
>> +     module_alloc_newfangled_init(&type_parameters);

/**
 * struct mod_alloc_type - Parameters for module allocation type
 * @mapto_type:		The type to merge this type into, if different
 *			from the actual type which is configured here.
 * @flags:		Properties
 * @granularity:	The allocation granularity (PTE/PMD)
 * @alignment:		The allocation alignment requirement
 * @start:		Array of address space range start (inclusive)
 * @end:		Array of address space range end (inclusive)
 * @pgprot:		The page protection for this type
 * @fill:		Function to fill allocated space. If NULL, use memcpy()
 * @invalidate:		Function to invalidate allocated space. If NULL, use memset()
 *
 * If @granularity > @alignment the allocation can reuse free space in
 * previously allocated pages. If they are the same, then fresh pages
 * have to be allocated.
 */
struct mod_alloc_type {
	unsigned int	mapto_type;
        unsigned int	flags;
	unsigned int	granularity;
        unsigned int    alignment;
	unsigned long	start[MOD_MAX_ADDR_SPACES];
	unsigned long	end[MOD_MAX_ADDR_SPACES];
        pgprot_t	pgprot;
        void		(*fill)(void *dst, void *src, unsigned int size);
        void		(*invalidate)(void *dst, unsigned int size);
};

struct mod_alloc_type_params {
       struct mod_alloc_type	types[MOD_MAX_TYPES];
};

or something like that.

> So module_alloc() would look like:

<SNIP>horror</SNIP>

No. It would not contain a single arch_foo() function at all. Everything
can be expressed with the type descriptors above.

> For the allocation type, there are technically 5 of them:
>
> ALLOC_TYPE_RX,   /* text */
> ALLOC_TYPE_RW,    /* rw data */
> ALLOC_TYPE_RO,    /* ro data */
> ALLOC_TYPE_RO_AFTER_INIT,
> ALLOC_TYPE_RWX,   /* legacy, existing module_alloc behavior */

You are mixing page protections and section types as seen from the
module core code. We need exactly four abstract allocation types:

   MOD_ALLOC_TYPE_TEXT
   MOD_ALLOC_TYPE_DATA
   MOD_ALLOC_TYPE_RODATA
   MOD_ALLOC_TYPE_RODATA_AFTER_INIT

These allocation types represent the section types and are mapped to
module_alloc_type->pgprot by the allocator. Those protections can be
different or identical, e.g. all RWX.

The module core does neither care about the resulting page protection
nor about the question whether the allocation can reuse free space or
needs to allocate fresh pages nor about the question whether an
allocation type maps to some other type.

> Given RO and RO_AFTER_INIT require PAGE alignment and are
> relatively small. I think we can merge them with RWX.

There is no RWX, really. See above.

> We also need to redesign module_layout. Right now, we have
> up to 3 layouts: core, init, and data. We will need 6 allocations:
>   core text,
>   core rw data,
>   core ro and ro_after_init data (one allocation)
>   init text,
>   init rw data,
>   init ro data.
>
> PS: how much do we benefit with separate core and init.
> Maybe it is time to merge the two? (keep init part around until
> the module unloads).

That needs some investigation into how much memory is really made
available.

OTOH, if you look at the above then we can just have:

   MOD_ALLOC_TYPE_INITTEXT
   MOD_ALLOC_TYPE_INITDATA
   MOD_ALLOC_TYPE_INITRODATA

as extra allocation types. All it does is add some initconst memory and
a slightly larger static datastructure in the allocator itself. See
below.

> For data structures, I propose we use two extra trees for RX
> and RW allocation (similar to 1/6 of current version, but 2x
> trees).

The number of trees does not matter. You can make them part of the type
scheme and then the number of active trees depends on the number of
types and the properties of the types. Which is the right thing to do
because then you can e.g. trivially split RODATA and RODATA_AFTER_INIT.

> For RWX, we keep current module_alloc() behavior, so no new data
> structure is needed.

Seriously no. We switch the whole logic over to the new scheme for
consistency, simplicity and mental sanity reasons.

Again: Module code cares about section types, not about the resulting
page protections. The page protections are a matter of the underlying
allocator and architecture specific properties.

> The new module_layout will be something like:
>
> struct module_layout {
>     void *ptr;
>     unsigned int size;  /* text size, rw data size, ro + ro_after_init size */
>     unsigned int ro_size;    /* ro_size for ro + ro_after_init allocation */
> };

*SHUDDER*

struct module_layout {
        unsigned int	type;
        unsigned int	size;
	void 		*ptr;
};

or

struct module_layout {
        unsigned int	size;
	void 		*ptr;
};

struct module {
	...
	struct module_layout	layouts[MOD_ALLOC_TYPE_MAX];
        ...
};

> One more question: shall we make module sections page
> aligned without STRICT_MODULE_RWX? It appears to be
> a good way to simplify the logic. But it may cause too much
> memory waste for smaller processors?

Yes, we want that again for simplicity. That wastes some space on
architectures which do not support large pages, but that's not the end
of the world. Deep embedded is not really module heavy.

So lets look at some examples under the assumption that we have the init
sections separate:

Legacy default (All RWX):

struct mod_alloc_type_params {
   	[MOD_ALLOC_TYPE_TEXT ... MOD_ALLOC_TYPE_RODATA_AFTER_INIT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
                .granularity	= PAGE_SIZE,
                .alignment	= PAGE_SIZE,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL_EXEC,
	},
   	[MOD_ALLOC_TYPE_INITTEXT ... MOD_ALLOC_TYPE_INITRODATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_INITTEXT,
                .granularity	= PAGE_SIZE,
                .alignment	= PAGE_SIZE,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL_EXEC,
	},
   };

So this is the waste space version, but we can also do:

struct mod_alloc_type_params {
   	[MOD_ALLOC_TYPE_TEXT ... MOD_ALLOC_TYPE_INITRODATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
                .granularity	= PAGE_SIZE,
                .alignment	= MOD_ARCH_ALIGNMENT,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL_EXEC,
	},
   };

The "use free space in existing mappings" mechanism is not required to
be PMD_SIZE based, right?

Large page size, strict separation:

struct mod_alloc_type_params {
   	[MOD_ALLOC_TYPE_TEXT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
                .flags		= FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
                .granularity	= PMD_SIZE,
                .alignment	= MOD_ARCH_ALIGNMENT,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .start[1]	= MODULES_VADDR_2ND,
                .end[1]		= MODULES_END_2ND,
                .pgprot		= PAGE_KERNEL_EXEC,
                .fill		= text_poke,
                .invalidate	= text_poke_invalidate,
	},
   	[MOD_ALLOC_TYPE_DATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_DATA,
                .flags		= FLAG_SHARED_PMD,
                .granularity	= PMD_SIZE,
                .alignment	= MOD_ARCH_ALIGNMENT,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL,
	},
   	[MOD_ALLOC_TYPE_RODATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_RODATA,
                .granularity	= PAGE_SIZE,
                .alignment	= PAGE_SIZE,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL_RO,
	},
   	[MOD_ALLOC_TYPE_RODATA_AFTER_INIT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_RODATA_AFTER_INIT,
		.flags		= FLAG_SET_RO_AFTER_INIT,                                                          
                .granularity	= PAGE_SIZE,
                .alignment	= PAGE_SIZE,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL,
	},
   	[MOD_ALLOC_TYPE_INITTEXT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
        },
   	[MOD_ALLOC_TYPE_INITDATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_DATA,
        },
   	[MOD_ALLOC_TYPE_INITRODATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_RODATA,
        },
   };

Large page size, strict separation and RODATA uses PMD:

struct mod_alloc_type_params {
   	[MOD_ALLOC_TYPE_TEXT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
                .flags		= FLAG_SHARED_PMD,
                .granularity	= PMD_SIZE,
                .alignment	= MOD_ARCH_ALIGNMENT,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL_EXEC,
                .fill		= text_poke,
                .invalidate	= text_poke_invalidate,
	},
   	[MOD_ALLOC_TYPE_DATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_DATA,
                .flags		= FLAG_SHARED_PMD,
                .granularity	= PMD_SIZE,
                .alignment	= MOD_ARCH_ALIGNMENT,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL,
	},
   	[MOD_ALLOC_TYPE_RODATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_RODATA,
                .flags		= FLAG_SHARED_PMD,
                .granularity	= PMD_SIZE,
                .alignment	= MOD_ARCH_ALIGNMENT,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL_RO,
                .fill		= rodata_poke,
                .invalidate	= rodata_poke_invalidate,
	},
   	[MOD_ALLOC_TYPE_RODATA_AFTER_INIT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_RODATA_AFTER_INIT,
		.flags		= FLAG_SET_RO_AFTER_INIT,                                                          
                .granularity	= PAGE_SIZE,
                .alignment	= PAGE_SIZE,
                .start[0]	= MODULES_VADDR,
                .end[0]		= MODULES_END,
                .pgprot		= PAGE_KERNEL,
	},
   	[MOD_ALLOC_TYPE_INITTEXT] = {
        	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
        },
   	[MOD_ALLOC_TYPE_INITDATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_DATA,
        },
   	[MOD_ALLOC_TYPE_INITRODATA] = {
        	.mapto_type	= MOD_ALLOC_TYPE_RODATA,
        },
   };

Then this data structure is handed to the module allocation init
function, which in turn sets up the resulting type magic.

struct mod_type_allocator {
	struct mutex		mutex;
	struct mod_alloc_type	params;
	struct rb_root		free_area;
        struct list_head	list;
        struct vm_struct	*vm;
};

struct mod_allocator {
	struct mod_type_allocator	base_types[MOD_MAX_TYPES];
	struct mod_type_allocator	*real_types[MOD_MAX_TYPES];
};

static struct mod_allocator mod_allocator;

So the init function does:

void mod_alloc_init(struct mod_alloc_type_params *params)
{
	int i;

	for (i = 0; i < MOD_MAX_TYPES; i++)
		mod_alloc_init_type(i, &params->types[i]);
}

void mod_alloc_init_type(int i, struct mod_alloc_type *params)
{
	struct mod_type_allocator *ta = &mod_allocator.base_types[i];

	mutex_init(&ta->mutex);
	memcpy(&ta->params, params, sizeof(*params));
	ta->free_area = RB_ROOT;
	INIT_LIST_HEAD(&ta->list);

	mod_allocator.real_types[i] = &mod_allocator.base_types[params->mapto_type];
}

and then you have:

void *module_alloc_type(unsigned long size, unsigned int type)
{
	return __module_alloc_type(mod_allocator.real_types[type], size);
}

and everything just works and operates from the relevant allocator.

See: No question about how many trees are required, no question about
     legacy RWX, ...

Hmm?

Thanks,

        tglx


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 15:36                       ` Thomas Gleixner
@ 2022-12-07 16:53                         ` Christophe Leroy
  2022-12-07 19:29                           ` Song Liu
  2022-12-07 21:04                           ` Thomas Gleixner
  2022-12-07 19:26                         ` Song Liu
  1 sibling, 2 replies; 93+ messages in thread
From: Christophe Leroy @ 2022-12-07 16:53 UTC (permalink / raw)
  To: Thomas Gleixner, Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen



Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
> 
> The "use free space in existing mappings" mechanism is not required to
> be PMD_SIZE based, right?
> 
> Large page size, strict separation:
> 
> struct mod_alloc_type_params {
>     	[MOD_ALLOC_TYPE_TEXT] = {
>          	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
>                  .flags		= FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
>                  .granularity	= PMD_SIZE,
>                  .alignment	= MOD_ARCH_ALIGNMENT,
>                  .start[0]	= MODULES_VADDR,
>                  .end[0]		= MODULES_END,
>                  .start[1]	= MODULES_VADDR_2ND,
>                  .end[1]		= MODULES_END_2ND,
>                  .pgprot		= PAGE_KERNEL_EXEC,
>                  .fill		= text_poke,
>                  .invalidate	= text_poke_invalidate,
> 	},

Don't restrict implementation to PMD_SIZE only.

On powerpc 8xx:
- PMD_SIZE is 4 Mbytes
- Large pages are 512 kbytes and 8 Mbytes.

It even has large pages of size 16 kbytes when build for 4k normal page 
size.

Christophe


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 15:36                       ` Thomas Gleixner
  2022-12-07 16:53                         ` Christophe Leroy
@ 2022-12-07 19:26                         ` Song Liu
  2022-12-07 20:57                           ` Thomas Gleixner
  1 sibling, 1 reply; 93+ messages in thread
From: Song Liu @ 2022-12-07 19:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen

Hi Thomas,

On Wed, Dec 7, 2022 at 7:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
[...]
> > Survey of the 11 architecture specific module_alloc(). They basically do
> > the following magic:
> >
> > 1. Modify MODULES_VADDR and/or MODULES_END. There are multiple
> >   reasons behind this, some arch does this for KASLR, some other archs
> >   have different MODULES_[VADDR|END] for different processors (32b vs.
> >   64b for example), some archs use some module address space for other
> >   things (i.e. _exiprom on arm).
> >
> > Archs need 1: x86, arm64, arm, mips, ppc, riscv, s390, loongarch,
> > sparc
>
> All of this is pretty much a boot time init decision, right?

Yeah, all of these are boot time or compile time decisions.

>
> > 2. Use kasan_alloc_module_shadow()
> >
> > Archs need 2: x86, arm64, s390
>
> There is nothing really architecture specific, so that can be part of
> the core code, right?

Right, kasan_free_module_shadow() is called from vmalloc.c, so the
alloc one can do the same.

>
> > 3. A secondary module address space. There is a smaller preferred
> >   address space for modules. Once the preferred space runs out, allocate
> >   memory from a secondary address space.
[...]
>
> > 6. nios2 uses kmalloc() for modules. Based on the comment, this is
> >   probably only because it needs different MODULES_[VADDR|END].
>
> It's a horrible hack because they decided to have their layout:
>
>      VMALLOC_SPACE   0x80000000
>      KERNEL_SPACE    0xC0000000
>
> and they use kmalloc because CALL26/PCREL26 cannot reach from 0x80000000
> to 0xC0000000. That's true, but broken beyond repair.
>
> Making the layout:
>
>      VMALLOC_SPACE   0x80000000
>      MODULE_SPACE    0xBE000000         == 0xC0000000 - (1 << 25) (32M)
> or
>      MODULE_SPACE    0xBF000000         == 0xC0000000 - (1 << 24) (16M)
>      KERNEL_SPACE    0xC0000000
>
> would have been too obvious...

Yeah, I was thinking about something like this.

>
> > I think we can handle all these with a single module_alloc() and a few
> > module_arch_* functions().

[...]

>
> /**
>  * struct mod_alloc_type - Parameters for module allocation type
>  * @mapto_type:         The type to merge this type into, if different
>  *                      from the actual type which is configured here.
>  * @flags:              Properties
>  * @granularity:        The allocation granularity (PTE/PMD)
>  * @alignment:          The allocation alignment requirement
>  * @start:              Array of address space range start (inclusive)
>  * @end:                Array of address space range end (inclusive)
>  * @pgprot:             The page protection for this type
>  * @fill:               Function to fill allocated space. If NULL, use memcpy()
>  * @invalidate:         Function to invalidate allocated space. If NULL, use memset()
>  *
>  * If @granularity > @alignment the allocation can reuse free space in
>  * previously allocated pages. If they are the same, then fresh pages
>  * have to be allocated.
>  */
> struct mod_alloc_type {
>         unsigned int    mapto_type;
>         unsigned int    flags;
>         unsigned int    granularity;
>         unsigned int    alignment;
>         unsigned long   start[MOD_MAX_ADDR_SPACES];
>         unsigned long   end[MOD_MAX_ADDR_SPACES];
>         pgprot_t        pgprot;
>         void            (*fill)(void *dst, void *src, unsigned int size);
>         void            (*invalidate)(void *dst, unsigned int size);
> };

Yeah, this is a lot better than arch_ functions.

We probably want two more function pointers here:

int (*protect)(unsigned long addr, int numpages);
int (*unprotect)(unsigned long addr, int numpages);

These two functions would be NULL for archs that support text_poke(),
while legacy archs use them for set_memory_[ro|x|rw|nx]. Then, I
think we can get rid of VM_FLUSH_RESET_PERMS.

[...]

Everything else makes perfect sense. Thanks!

I think I am ready to dive into the code and prepare the first RFC/PATCH.
Please let me know if there is anything we should discuss/clarify before that.

Best,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 16:53                         ` Christophe Leroy
@ 2022-12-07 19:29                           ` Song Liu
  2022-12-07 21:04                           ` Thomas Gleixner
  1 sibling, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-12-07 19:29 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Thomas Gleixner, bpf, linux-mm, peterz, akpm, x86, hch,
	rick.p.edgecombe, aaron.lu, rppt, mcgrof, Dinh Nguyen

On Wed, Dec 7, 2022 at 8:54 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
> >
> > The "use free space in existing mappings" mechanism is not required to
> > be PMD_SIZE based, right?
> >
> > Large page size, strict separation:
> >
> > struct mod_alloc_type_params {
> >       [MOD_ALLOC_TYPE_TEXT] = {
> >               .mapto_type     = MOD_ALLOC_TYPE_TEXT,
> >                  .flags               = FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
> >                  .granularity = PMD_SIZE,
> >                  .alignment   = MOD_ARCH_ALIGNMENT,
> >                  .start[0]    = MODULES_VADDR,
> >                  .end[0]              = MODULES_END,
> >                  .start[1]    = MODULES_VADDR_2ND,
> >                  .end[1]              = MODULES_END_2ND,
> >                  .pgprot              = PAGE_KERNEL_EXEC,
> >                  .fill                = text_poke,
> >                  .invalidate  = text_poke_invalidate,
> >       },
>
> Don't restrict implementation to PMD_SIZE only.
>
> On powerpc 8xx:
> - PMD_SIZE is 4 Mbytes
> - Large pages are 512 kbytes and 8 Mbytes.
>
> It even has large pages of size 16 kbytes when build for 4k normal page
> size.

IIUC, these customizations would fit nicely with this design. I will see how
much I can prototype for powerpc without the actual hardware.

Thanks,
Song


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 19:26                         ` Song Liu
@ 2022-12-07 20:57                           ` Thomas Gleixner
  2022-12-07 23:17                             ` Song Liu
  0 siblings, 1 reply; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-07 20:57 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen

Song!

On Wed, Dec 07 2022 at 11:26, Song Liu wrote:
> On Wed, Dec 7, 2022 at 7:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > I think we can handle all these with a single module_alloc() and a few
>> > module_arch_* functions().
>> struct mod_alloc_type {
>>         unsigned int    mapto_type;
>>         unsigned int    flags;
>>         unsigned int    granularity;
>>         unsigned int    alignment;
>>         unsigned long   start[MOD_MAX_ADDR_SPACES];
>>         unsigned long   end[MOD_MAX_ADDR_SPACES];
>>         pgprot_t        pgprot;
>>         void            (*fill)(void *dst, void *src, unsigned int size);
>>         void            (*invalidate)(void *dst, unsigned int size);
>> };
>
> Yeah, this is a lot better than arch_ functions.

Remember the order of things to worry about: #3 :)

> We probably want two more function pointers here:
>
> int (*protect)(unsigned long addr, int numpages);
> int (*unprotect)(unsigned long addr, int numpages);
>
> These two functions will be NULL for archs that support text_poke;
> while legacy archs use them for set_memory_[ro|x|rw|nx]. Then, I
> think we can get rid of VM_FLUSH_RESET_PERMS.

Depends. You can implement

fill()
	memcpy(...);
        set_memory_ro();

and

invalidate()
	set_memory_rw();
        memset();

as global helpers which can be used by the architecture for the init
struct or used as default for certain types.

> I think I am ready to dive into the code and prepare the first RFC/PATCH.
> Please let me know if there is anything we should discuss/clarify before that.

I think we covered most of it by now, so sure a POC is probably due, but
please yell when you find a gap in step #1 - #3 which we did not cover
yet.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 16:53                         ` Christophe Leroy
  2022-12-07 19:29                           ` Song Liu
@ 2022-12-07 21:04                           ` Thomas Gleixner
  2022-12-07 21:48                             ` Christophe Leroy
  1 sibling, 1 reply; 93+ messages in thread
From: Thomas Gleixner @ 2022-12-07 21:04 UTC (permalink / raw)
  To: Christophe Leroy, Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen

Christophe,

On Wed, Dec 07 2022 at 16:53, Christophe Leroy wrote:
> Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
>> The "use free space in existing mappings" mechanism is not required to
>> be PMD_SIZE based, right?

Correct. I just used it for the example.

>> Large page size, strict separation:
>> 
>> struct mod_alloc_type_params {
>>     	[MOD_ALLOC_TYPE_TEXT] = {
>>          	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
>>                  .flags		= FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
>>                  .granularity	= PMD_SIZE,
>>                  .alignment	= MOD_ARCH_ALIGNMENT,
>>                  .start[0]	= MODULES_VADDR,
>>                  .end[0]		= MODULES_END,
>>                  .start[1]	= MODULES_VADDR_2ND,
>>                  .end[1]		= MODULES_END_2ND,
>>                  .pgprot		= PAGE_KERNEL_EXEC,
>>                  .fill		= text_poke,
>>                  .invalidate	= text_poke_invalidate,
>> 	},
>
> Don't restrict implementation to PMD_SIZE only.
>
> On powerpc 8xx:
> - PMD_SIZE is 4 Mbytes
> - Large pages are 512 kbytes and 8 Mbytes.
>
> It even has large pages of size 16 kbytes when built for a 4k normal page
> size.

@granularity takes any size which is valid from the architecture side
and can handle the @pgprot distinctions.

That's why I separated functionality and configuration.

Note, it's not strict compile time configuration. You can either build
it completely dynamic at boot or have a static configuration structure
as compile time default.

That static default might be __initconst for architectures where there
is no boot time detection and change required, but can be __initdata for
those which need to adjust it to the needs of the detected CPU/platform
before feeding it into the module allocator init function.
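As a hedged illustration of that split between compile-time default and
boot-time adjustment, a default table for a platform like the 8xx might
look as follows (a sketch only: struct mod_alloc_type and the field names
come from the discussion above; SZ_512K is the usual kernel size macro,
and large_pages_available() is a hypothetical detection hook, not a real
kernel API):

```c
/*
 * Sketch, not merged kernel code: a compile-time default that the
 * platform may patch at boot.  __initdata rather than __initconst,
 * so boot code can rewrite @granularity after CPU/platform detection.
 */
static struct mod_alloc_type mod_alloc_defaults[] __initdata = {
	[MOD_ALLOC_TYPE_TEXT] = {
		.mapto_type	= MOD_ALLOC_TYPE_TEXT,
		.granularity	= SZ_512K,	/* 8xx large page */
		.alignment	= MOD_ARCH_ALIGNMENT,
		.start[0]	= MODULES_VADDR,
		.end[0]		= MODULES_END,
		.pgprot		= PAGE_KERNEL_EXEC,
		.fill		= text_poke,
		.invalidate	= text_poke_invalidate,
	},
};

static void __init mod_alloc_platform_fixup(void)
{
	/* Hypothetical hook: fall back to base pages when the
	 * detected platform has no large pages. */
	if (!large_pages_available())
		mod_alloc_defaults[MOD_ALLOC_TYPE_TEXT].granularity = PAGE_SIZE;
}
```

mod_alloc_platform_fixup() would run before the table is fed into the
module allocator init function.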

Does that answer your question?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 21:04                           ` Thomas Gleixner
@ 2022-12-07 21:48                             ` Christophe Leroy
  0 siblings, 0 replies; 93+ messages in thread
From: Christophe Leroy @ 2022-12-07 21:48 UTC (permalink / raw)
  To: Thomas Gleixner, Song Liu
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen



Le 07/12/2022 à 22:04, Thomas Gleixner a écrit :
> Christophe,
> 
> On Wed, Dec 07 2022 at 16:53, Christophe Leroy wrote:
>> Le 07/12/2022 à 16:36, Thomas Gleixner a écrit :
>>> The "use free space in existing mappings" mechanism is not required to
>>> be PMD_SIZE based, right?
> 
> Correct. I just used it for the example.
> 
>>> Large page size, strict separation:
>>>
>>> struct mod_alloc_type_params {
>>>      	[MOD_ALLOC_TYPE_TEXT] = {
>>>           	.mapto_type	= MOD_ALLOC_TYPE_TEXT,
>>>                   .flags		= FLAG_SHARED_PMD | FLAG_SECOND_ADDRESS_SPACE,
>>>                   .granularity	= PMD_SIZE,
>>>                   .alignment	= MOD_ARCH_ALIGNMENT,
>>>                   .start[0]	= MODULES_VADDR,
>>>                   .end[0]		= MODULES_END,
>>>                   .start[1]	= MODULES_VADDR_2ND,
>>>                   .end[1]		= MODULES_END_2ND,
>>>                   .pgprot		= PAGE_KERNEL_EXEC,
>>>                   .fill		= text_poke,
>>>                   .invalidate	= text_poke_invalidate,
>>> 	},
>>
>> Don't restrict implementation to PMD_SIZE only.
>>
>> On powerpc 8xx:
>> - PMD_SIZE is 4 Mbytes
>> - Large pages are 512 kbytes and 8 Mbytes.
>>
>> It even has large pages of size 16 kbytes when built for a 4k normal page
>> size.
> 
> @granularity takes any size which is valid from the architecture side
> and can handle the @pgprot distinctions.
> 
> That's why I separated functionality and configuration.
> 
> Note, it's not strict compile time configuration. You can either build
> it completely dynamic at boot or have a static configuration structure
> as compile time default.
> 
> That static default might be __initconst for architectures where there
> is no boot time detection and change required, but can be __initdata for
> those which need to adjust it to the needs of the detected CPU/platform
> before feeding it into the module allocator init function.
> 
> Does that answer your question?
> 

Yes it does, thanks.

Christophe

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-12-07 20:57                           ` Thomas Gleixner
@ 2022-12-07 23:17                             ` Song Liu
  0 siblings, 0 replies; 93+ messages in thread
From: Song Liu @ 2022-12-07 23:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bpf, linux-mm, peterz, akpm, x86, hch, rick.p.edgecombe,
	aaron.lu, rppt, mcgrof, Dinh Nguyen

On Wed, Dec 7, 2022 at 12:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song!
>
> On Wed, Dec 07 2022 at 11:26, Song Liu wrote:
> > On Wed, Dec 7, 2022 at 7:36 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > I think we can handle all these with a single module_alloc() and a few
> >> > module_arch_*() functions.
> >> struct mod_alloc_type {
> >>         unsigned int    mapto_type;
> >>         unsigned int    flags;
> >>         unsigned int    granularity;
> >>         unsigned int    alignment;
> >>         unsigned long   start[MOD_MAX_ADDR_SPACES];
> >>         unsigned long   end[MOD_MAX_ADDR_SPACES];
> >>         pgprot_t        pgprot;
> >>         void            (*fill)(void *dst, void *src, unsigned int size);
> >>         void            (*invalidate)(void *dst, unsigned int size);
> >> };
> >
> > Yeah, this is a lot better than arch_ functions.
>
> Remember the order of things to worry about: #3 :)
>
> > We probably want two more function pointers here:
> >
> > int (*protect)(unsigned long addr, int numpages);
> > int (*unprotect)(unsigned long addr, int numpages);
> >
> > These two functions will be NULL for archs that support text_poke;
> > while legacy archs use them for set_memory_[ro|x|rw|nx]. Then, I
> > think we can get rid of VM_FLUSH_RESET_PERMS.
>
> Depends. You can implement
>
> fill()
>         memcpy(...);
>         set_memory_ro();
>
> and
>
> invalidate()
>         set_memory_rw();
>         memset();
>
> as global helpers which can be used by the architecture for the init
> struct or used as default for certain types.

Ah, that's right. Legacy archs should always use PAGE_SIZE granularity,
so fill() and invalidate() are sufficient.

>
> > I think I am ready to dive into the code and prepare the first RFC/PATCH.
> > Please let me know if there is anything we should discuss/clarify before that.
>
> I think we covered most of it by now, so sure a POC is probably due, but
> please yell when you find a gap in step #1 - #3 which we did not cover
> yet.

Thanks! Let me see what I can get.

Song

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2022-12-07 23:17 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-07 22:39 [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 1/5] vmalloc: introduce execmem_alloc, execmem_free, and execmem_fill Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 2/5] x86/alternative: support execmem_alloc() and execmem_free() Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 3/5] bpf: use execmem_alloc for bpf program and bpf dispatcher Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
2022-11-07 22:39 ` [PATCH bpf-next v2 5/5] x86: use register_text_tail_vm Song Liu
2022-11-08 19:04   ` Edgecombe, Rick P
2022-11-08 22:15     ` Song Liu
2022-11-15 17:28       ` Edgecombe, Rick P
2022-11-07 22:55 ` [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Luis Chamberlain
2022-11-07 23:13   ` Song Liu
2022-11-07 23:39     ` Luis Chamberlain
2022-11-08  0:13       ` Edgecombe, Rick P
2022-11-08  2:45         ` Luis Chamberlain
2022-11-08 18:20         ` Song Liu
2022-11-08 18:12       ` Song Liu
2022-11-08 11:27 ` Mike Rapoport
2022-11-08 12:38   ` Aaron Lu
2022-11-09  6:55     ` Christoph Hellwig
2022-11-09 11:05       ` Peter Zijlstra
2022-11-08 16:51   ` Edgecombe, Rick P
2022-11-08 18:50     ` Song Liu
2022-11-09 11:17     ` Mike Rapoport
2022-11-09 17:04       ` Edgecombe, Rick P
2022-11-09 17:53         ` Song Liu
2022-11-13 10:34         ` Mike Rapoport
2022-11-14 20:30           ` Song Liu
2022-11-15 21:18             ` Luis Chamberlain
2022-11-15 21:39               ` Edgecombe, Rick P
2022-11-16 22:34                 ` Luis Chamberlain
2022-11-17  8:50             ` Mike Rapoport
2022-11-17 18:36               ` Song Liu
2022-11-20 10:41                 ` Mike Rapoport
2022-11-21 14:52                   ` Song Liu
2022-11-30  9:39                     ` Mike Rapoport
2022-11-09 17:43       ` Song Liu
2022-11-09 21:23         ` Christophe Leroy
2022-11-09 21:23           ` Christophe Leroy
2022-11-10  1:50           ` Song Liu
2022-11-10  1:50             ` Song Liu
2022-11-13 10:42         ` Mike Rapoport
2022-11-14 20:45           ` Song Liu
2022-11-15 20:51             ` Luis Chamberlain
2022-11-20 10:44             ` Mike Rapoport
2022-11-08 18:41   ` Song Liu
2022-11-08 19:43     ` Christophe Leroy
2022-11-08 21:40       ` Song Liu
2022-11-13  9:58     ` Mike Rapoport
2022-11-14 20:13       ` Song Liu
2022-11-08 11:44 ` Christophe Leroy
2022-11-08 18:47   ` Song Liu
2022-11-08 19:32     ` Christophe Leroy
2022-11-08 11:48 ` Mike Rapoport
2022-11-15  1:30 ` Song Liu
2022-11-15 17:34   ` Edgecombe, Rick P
2022-11-15 21:54     ` Song Liu
2022-11-15 22:14       ` Edgecombe, Rick P
2022-11-15 22:32         ` Song Liu
2022-11-16  1:20         ` Song Liu
2022-11-16 21:22           ` Edgecombe, Rick P
2022-11-16 22:03             ` Song Liu
2022-11-15 21:09   ` Luis Chamberlain
2022-11-15 21:32     ` Luis Chamberlain
2022-11-15 22:48     ` Song Liu
2022-11-16 22:33       ` Luis Chamberlain
2022-11-16 22:47         ` Edgecombe, Rick P
2022-11-16 23:53           ` Luis Chamberlain
2022-11-17  1:17             ` Song Liu
2022-11-17  9:37         ` Mike Rapoport
2022-11-29 10:23   ` Thomas Gleixner
2022-11-29 17:26     ` Song Liu
2022-11-29 23:56       ` Thomas Gleixner
2022-11-30 16:18         ` Song Liu
2022-12-01  9:08           ` Thomas Gleixner
2022-12-01 19:31             ` Song Liu
2022-12-02  1:38               ` Thomas Gleixner
2022-12-02  8:38                 ` Song Liu
2022-12-02  9:22                   ` Thomas Gleixner
2022-12-06 20:25                     ` Song Liu
2022-12-07 15:36                       ` Thomas Gleixner
2022-12-07 16:53                         ` Christophe Leroy
2022-12-07 19:29                           ` Song Liu
2022-12-07 21:04                           ` Thomas Gleixner
2022-12-07 21:48                             ` Christophe Leroy
2022-12-07 19:26                         ` Song Liu
2022-12-07 20:57                           ` Thomas Gleixner
2022-12-07 23:17                             ` Song Liu
2022-12-02 10:46                 ` Christophe Leroy
2022-12-02 17:43                   ` Thomas Gleixner
2022-12-01 20:23             ` Mike Rapoport
2022-12-01 22:34               ` Thomas Gleixner
2022-12-03 14:46                 ` Mike Rapoport
2022-12-03 20:58                   ` Thomas Gleixner
