* [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs
@ 2022-10-31 21:58 Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
` (4 more replies)
0 siblings, 5 replies; 8+ messages in thread
From: Song Liu @ 2022-10-31 21:58 UTC (permalink / raw)
To: bpf, linux-mm
Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, urezki,
mcgrof, kernel-team, Song Liu
This set enables bpf programs and bpf dispatchers to share huge pages with
a new API:
vmalloc_exec()
vfree_exec()
vcopy_exec()
The idea is similar to Peter's suggestion in [1].
vmalloc_exec() manages a set of PMD_SIZE RO+X memory regions, and allocates
this memory to its users. vfree_exec() is used to free memory allocated by
vmalloc_exec(). vcopy_exec() is used to update memory allocated by
vmalloc_exec().
Memory allocated by vmalloc_exec() is RO+X, so this does not violate W^X.
The caller has to update the content with a text_poke-like mechanism.
Specifically, vcopy_exec() is provided to update memory allocated by
vmalloc_exec(). vcopy_exec() also makes sure the update stays within the
boundary of one chunk allocated by vmalloc_exec(). Please refer to patch
1/5 for more details.
Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
Patch 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
mapping kernel text with PMD_SIZE pages up to roundup(_etext, PMD_SIZE),
and then using the range from _etext to roundup(_etext, PMD_SIZE) for
dynamic kernel text.
[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
Changes RFC v2 => PATCH v1:
1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
also ensures vcopy_exec() is only used to update memory from one single
vmalloc_exec() call. (Christoph Hellwig)
2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrappers for the
text_poke() like logic.
3. Drop changes for kernel modules and focus on BPF side changes.
Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
work fine with BPF programs (patch 1, 2, 4), but the module side (patch 3)
still needs some work.
Song Liu (5):
vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
x86/alternative: support vmalloc_exec() and vfree_exec()
bpf: use vmalloc_exec for bpf program and bpf dispatcher
vmalloc: introduce register_text_tail_vm()
x86: use register_text_tail_vm
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/kernel/alternative.c | 12 +
arch/x86/mm/init_64.c | 4 +-
arch/x86/net/bpf_jit_comp.c | 23 +-
include/linux/bpf.h | 3 -
include/linux/filter.h | 5 -
include/linux/vmalloc.h | 9 +
kernel/bpf/core.c | 180 +-----------
kernel/bpf/dispatcher.c | 11 +-
mm/nommu.c | 7 +
mm/vmalloc.c | 351 ++++++++++++++++++++++++
11 files changed, 404 insertions(+), 202 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
@ 2022-10-31 21:58 ` Song Liu
2022-11-01 11:54 ` Uladzislau Rezki
2022-10-31 21:58 ` [PATCH bpf-next v1 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
` (3 subsequent siblings)
4 siblings, 1 reply; 8+ messages in thread
From: Song Liu @ 2022-10-31 21:58 UTC (permalink / raw)
To: bpf, linux-mm
Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, urezki,
mcgrof, kernel-team, Song Liu
vmalloc_exec is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].
A new tree of vmap_area, the free_text_area_* tree, is introduced in
addition to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates pages
from free_text_area_*. When there isn't enough space left in
free_text_area_*, new PMD_SIZE page(s) are allocated from free_vmap_area_*
and added to free_text_area_*. To be more accurate, the vmap_area is first
added to the vmap_area_* tree and then moved to free_text_area_*. This
extra move simplifies the logic of vmalloc_exec.
vmap_area objects in the free_text_area_* tree are backed with memory, but
we need subtree_max_size for tree operations. Therefore, the vm_struct for
these vmap_area objects are stored in a separate list, all_text_vm.
The new tree allows separate handling of < PAGE_SIZE allocations, as the
current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
version of vmalloc_exec can handle bpf programs, which use 64 byte aligned
allocations, and modules, which use PAGE_SIZE aligned allocations.
Memory allocated by vmalloc_exec() is set to RO+X before returning to the
caller. Therefore, the caller cannot write directly to the memory.
Instead, the caller is required to use vcopy_exec() to update the memory.
For the safety and security of X memory, vcopy_exec() checks that the data
being updated is always within memory allocated by a single vmalloc_exec()
call. vcopy_exec() uses a text_poke-like mechanism and requires arch
support. Specifically, the arch needs to implement arch_vcopy_exec().
In vfree_exec(), the memory is first erased with arch_invalidate_exec().
Then, the memory is added to free_text_area_*. If this free creates a
large enough contiguous free area (at least one aligned PMD_SIZE region),
vfree_exec() will try to free the backing vm_struct.
[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
Signed-off-by: Song Liu <song@kernel.org>
---
include/linux/vmalloc.h | 5 +
mm/nommu.c | 12 ++
mm/vmalloc.c | 318 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 335 insertions(+)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..9b2042313c12 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
int node, const void *caller) __alloc_size(1);
void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
+void *vcopy_exec(void *dst, void *src, size_t len);
+void vfree_exec(void *addr);
+void *arch_vcopy_exec(void *dst, void *src, size_t len);
+int arch_invalidate_exec(void *ptr, size_t len);
extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..8a1317247ef0 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
}
EXPORT_SYMBOL(vm_map_pages_zero);
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+ return NULL;
+}
+
+void *vcopy_exec(void *dst, void *src, size_t len)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+void vfree_exec(void *addr) { }
+
/*
* sys_brk() for the most part doesn't need the global kernel
* lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..6f4c73e67191 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc);
static const bool vmap_allow_huge = false;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -769,6 +772,38 @@ static LIST_HEAD(free_vmap_area_list);
*/
static struct rb_root free_vmap_area_root = RB_ROOT;
+/*
+ * free_text_area for vmalloc_exec()
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augmented red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains a maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find a lowest match of free area.
+ *
+ * vmap_area objects in this tree are backed by RO+X memory, but they
+ * do not have a valid vm pointer (because we need subtree_max_size).
+ * The vm_struct for these vmap_area objects are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
/*
* Preload a CPU with one object for "no edge" split case. The
* aim is to get rid of allocations from the atomic context, thus
@@ -3313,6 +3348,289 @@ void *vmalloc(unsigned long size)
}
EXPORT_SYMBOL(vmalloc);
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define VMALLOC_EXEC_START MODULES_VADDR
+#define VMALLOC_EXEC_END MODULES_END
+#else
+#define VMALLOC_EXEC_START VMALLOC_START
+#define VMALLOC_EXEC_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+ struct vmap_area *va;
+
+ /* remove from vmap_area_root */
+ spin_lock(&vmap_area_lock);
+ va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+ if (WARN_ON_ONCE(!va)) {
+ spin_unlock(&vmap_area_lock);
+ return;
+ }
+ unlink_va(va, &vmap_area_root);
+ spin_unlock(&vmap_area_lock);
+
+ /* make the memory RO+X */
+ memset(addr, 0, va->va_end - va->va_start);
+ set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+ set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+ /* add to all_text_vm */
+ va->vm->next = all_text_vm;
+ all_text_vm = va->vm;
+
+ /* add to free_text_area_root */
+ spin_lock(&free_text_area_lock);
+ merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+ spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * vmalloc_exec - allocate virtually contiguous RO+X memory
+ * @size: allocation size
+ * @align: desired alignment of the allocation
+ *
+ * This is used to allocate dynamic kernel text, such as module text and
+ * BPF programs. The caller needs to use vcopy_exec() to update the memory
+ * allocated by vmalloc_exec().
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+ struct vmap_area *va, *tmp;
+ unsigned long addr;
+ enum fit_type type;
+ int ret;
+
+ va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+ if (unlikely(!va))
+ return NULL;
+
+again:
+ preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+ tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+ if (!tmp) {
+ unsigned long alloc_size;
+ void *ptr;
+
+ spin_unlock(&free_text_area_lock);
+
+ /*
+ * Not enough contiguous space in free_text_area_root, so try to
+ * allocate more memory. The memory is first added to
+ * vmap_area_root, and then moved to free_text_area_root.
+ */
+ alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+ ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
+ VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
+ VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+ NUMA_NO_NODE, __builtin_return_address(0));
+ if (unlikely(!ptr))
+ goto err_out;
+
+ move_vmap_to_free_text_tree(ptr);
+ goto again;
+ }
+
+ addr = roundup(tmp->va_start, align);
+ type = classify_va_fit_type(tmp, addr, size);
+ if (WARN_ON_ONCE(type == NOTHING_FIT))
+ goto err_out;
+
+ ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+ tmp, addr, size);
+ if (ret)
+ goto err_out;
+
+ spin_unlock(&free_text_area_lock);
+
+ va->va_start = addr;
+ va->va_end = addr + size;
+ va->vm = NULL;
+
+ spin_lock(&vmap_area_lock);
+ insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+ spin_unlock(&vmap_area_lock);
+
+ return (void *)addr;
+
+err_out:
+ spin_unlock(&free_text_area_lock);
+ kmem_cache_free(vmap_area_cachep, va);
+ return NULL;
+}
+
+void __weak *arch_vcopy_exec(void *dst, void *src, size_t len)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+int __weak arch_invalidate_exec(void *ptr, size_t len)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * vcopy_exec - Copy text to RO+X memory allocated by vmalloc_exec()
+ * @dst: pointer to memory allocated by vmalloc_exec()
+ * @src: pointer to data being copied from
+ * @len: number of bytes to be copied
+ *
+ * vcopy_exec() will only update memory allocated by a single vmalloc_exec()
+ * call. If @dst + @len goes beyond the boundary of one allocation,
+ * vcopy_exec() is aborted and ERR_PTR(-EINVAL) is returned.
+ */
+void *vcopy_exec(void *dst, void *src, size_t len)
+{
+ struct vmap_area *va;
+
+ spin_lock(&vmap_area_lock);
+ va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
+
+ /*
+ * If no va, or va has a vm attached, this memory is not allocated
+ * by vmalloc_exec().
+ */
+ if (WARN_ON_ONCE(!va) || WARN_ON_ONCE(va->vm))
+ goto err_out;
+ if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
+ goto err_out;
+
+ spin_unlock(&vmap_area_lock);
+
+ return arch_vcopy_exec(dst, src, len);
+
+err_out:
+ spin_unlock(&vmap_area_lock);
+ return ERR_PTR(-EINVAL);
+}
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+ struct vm_struct *vm, *prev_vm;
+
+ lockdep_assert_held(&free_text_area_lock);
+
+ vm = all_text_vm;
+ while (vm) {
+ unsigned long vm_addr = (unsigned long)vm->addr;
+
+ /* vm is within this free space, we can free it */
+ if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+ goto unlink_vm;
+ vm = vm->next;
+ }
+ return NULL;
+
+unlink_vm:
+ if (all_text_vm == vm) {
+ all_text_vm = vm->next;
+ } else {
+ prev_vm = all_text_vm;
+ while (prev_vm->next != vm)
+ prev_vm = prev_vm->next;
+ prev_vm->next = vm->next;
+ }
+ return vm;
+}
+
+/**
+ * vfree_exec - Release memory allocated by vmalloc_exec()
+ * @addr: Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void vfree_exec(void *addr)
+{
+ unsigned long free_start, free_end, free_addr;
+ struct vm_struct *vm;
+ struct vmap_area *va;
+
+ might_sleep();
+
+ if (!addr)
+ return;
+
+ spin_lock(&vmap_area_lock);
+ va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+ if (WARN_ON_ONCE(!va)) {
+ spin_unlock(&vmap_area_lock);
+ return;
+ }
+ WARN_ON_ONCE(va->vm);
+
+ unlink_va(va, &vmap_area_root);
+ spin_unlock(&vmap_area_lock);
+
+ /* Invalidate text in the region */
+ arch_invalidate_exec(addr, va->va_end - va->va_start);
+
+ spin_lock(&free_text_area_lock);
+ va = merge_or_add_vmap_area_augment(va,
+ &free_text_area_root, &free_text_area_list);
+
+ if (WARN_ON_ONCE(!va))
+ goto out;
+
+ free_start = PMD_ALIGN(va->va_start);
+ free_end = PMD_ALIGN_DOWN(va->va_end);
+
+ /*
+ * Only try to free the backing vm_struct when the free area contains
+ * at least one aligned PMD_SIZE region of contiguous memory.
+ */
+ if (free_start >= free_end)
+ goto out;
+
+ /*
+ * TODO: It is possible that multiple vm_structs are ready to be
+ * freed after one vfree_exec(). But we free at most one for now.
+ */
+ vm = find_and_unlink_text_vm(free_start, free_end);
+ if (!vm)
+ goto out;
+
+ va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+ if (unlikely(!va))
+ goto out_save_vm;
+
+ free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+ vm->size, 1, (unsigned long)vm->addr,
+ (unsigned long)vm->addr + vm->size);
+
+ if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+ goto out_save_vm;
+
+ va->va_start = (unsigned long)vm->addr;
+ va->va_end = va->va_start + vm->size;
+ va->vm = vm;
+ spin_unlock(&free_text_area_lock);
+
+ set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+ set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+ /* put the va to vmap_area_root, and then free it with vfree */
+ spin_lock(&vmap_area_lock);
+ insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+ spin_unlock(&vmap_area_lock);
+
+ vfree(vm->addr);
+ return;
+
+out_save_vm:
+ /*
+ * vm is removed from all_text_vm, but not freed. Add it back,
+ * so that we can use or free it later.
+ */
+ vm->next = all_text_vm;
+ all_text_vm = vm;
+out:
+ spin_unlock(&free_text_area_lock);
+}
+
/**
* vmalloc_huge - allocate virtually contiguous memory, allow huge pages
* @size: allocation size
--
2.30.2
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH bpf-next v1 2/5] x86/alternative: support vmalloc_exec() and vfree_exec()
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
@ 2022-10-31 21:58 ` Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
` (2 subsequent siblings)
4 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-10-31 21:58 UTC (permalink / raw)
To: bpf, linux-mm
Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, urezki,
mcgrof, kernel-team, Song Liu
Implement arch_vcopy_exec() and arch_invalidate_exec() to support
vmalloc_exec.
arch_vcopy_exec() copies dynamic kernel text (such as BPF programs) to
RO+X memory region allocated by vmalloc_exec().
arch_invalidate_exec() fills memory with 0xcc after it is released by
vfree_exec().
Signed-off-by: Song Liu <song@kernel.org>
---
arch/x86/kernel/alternative.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cadcea035e0..73d89774ace3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1270,6 +1270,18 @@ void *text_poke_copy(void *addr, const void *opcode, size_t len)
return addr;
}
+void *arch_vcopy_exec(void *dst, void *src, size_t len)
+{
+ if (text_poke_copy(dst, src, len) == NULL)
+ return ERR_PTR(-EINVAL);
+ return dst;
+}
+
+int arch_invalidate_exec(void *ptr, size_t len)
+{
+ return IS_ERR_OR_NULL(text_poke_set(ptr, 0xcc, len));
+}
+
/**
* text_poke_set - memset into (an unused part of) RX memory
* @addr: address to modify
--
2.30.2
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH bpf-next v1 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
@ 2022-10-31 21:58 ` Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 5/5] x86: use register_text_tail_vm Song Liu
4 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-10-31 21:58 UTC (permalink / raw)
To: bpf, linux-mm
Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, urezki,
mcgrof, kernel-team, Song Liu
Use vmalloc_exec, vfree_exec, and vcopy_exec instead of
bpf_prog_pack_alloc, bpf_prog_pack_free, and bpf_arch_text_copy.
vfree_exec doesn't require extra size information. Therefore, the free
and error handling path can be simplified.
Signed-off-by: Song Liu <song@kernel.org>
---
arch/x86/net/bpf_jit_comp.c | 23 +----
include/linux/bpf.h | 3 -
include/linux/filter.h | 5 -
kernel/bpf/core.c | 180 +++---------------------------------
kernel/bpf/dispatcher.c | 11 +--
5 files changed, 21 insertions(+), 201 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36ffe67ad6e5..3db316b2f3a7 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -228,11 +228,6 @@ static void jit_fill_hole(void *area, unsigned int size)
memset(area, 0xcc, size);
}
-int bpf_arch_text_invalidate(void *dst, size_t len)
-{
- return IS_ERR_OR_NULL(text_poke_set(dst, 0xcc, len));
-}
-
struct jit_context {
int cleanup_addr; /* Epilogue code offset */
@@ -2496,11 +2491,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
if (proglen <= 0) {
out_image:
image = NULL;
- if (header) {
- bpf_arch_text_copy(&header->size, &rw_header->size,
- sizeof(rw_header->size));
+ if (header)
bpf_jit_binary_pack_free(header, rw_header);
- }
+
/* Fall back to interpreter mode */
prog = orig_prog;
if (extra_pass) {
@@ -2550,8 +2543,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
if (!prog->is_func || extra_pass) {
/*
* bpf_jit_binary_pack_finalize fails in two scenarios:
- * 1) header is not pointing to proper module memory;
- * 2) the arch doesn't support bpf_arch_text_copy().
+ * 1) header is not pointing to memory allocated by
+ * vmalloc_exec;
+ * 2) the arch doesn't support vcopy_exec().
*
* Both cases are serious bugs and justify WARN_ON.
*/
@@ -2597,13 +2591,6 @@ bool bpf_jit_supports_kfunc_call(void)
return true;
}
-void *bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
- if (text_poke_copy(dst, src, len) == NULL)
- return ERR_PTR(-EINVAL);
- return dst;
-}
-
/* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
bool bpf_jit_supports_subprog_tailcalls(void)
{
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9fd68b0b3e9c..1d42f58334d0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2654,9 +2654,6 @@ enum bpf_text_poke_type {
int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
void *addr1, void *addr2);
-void *bpf_arch_text_copy(void *dst, void *src, size_t len);
-int bpf_arch_text_invalidate(void *dst, size_t len);
-
struct btf_id_set;
bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index efc42a6e3aed..98e28126c24b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1023,8 +1023,6 @@ extern long bpf_jit_limit_max;
typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size);
-
struct bpf_binary_header *
bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
unsigned int alignment,
@@ -1037,9 +1035,6 @@ void bpf_jit_free(struct bpf_prog *fp);
struct bpf_binary_header *
bpf_jit_binary_pack_hdr(const struct bpf_prog *fp);
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns);
-void bpf_prog_pack_free(struct bpf_binary_header *hdr);
-
static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
{
return list_empty(&fp->aux->ksym.lnode) ||
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a0e762a2bf97..ca722078697b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -806,149 +806,6 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
return slot;
}
-/*
- * BPF program pack allocator.
- *
- * Most BPF programs are pretty small. Allocating a hole page for each
- * program is sometime a waste. Many small bpf program also adds pressure
- * to instruction TLB. To solve this issue, we introduce a BPF program pack
- * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
- * to host BPF programs.
- */
-#define BPF_PROG_CHUNK_SHIFT 6
-#define BPF_PROG_CHUNK_SIZE (1 << BPF_PROG_CHUNK_SHIFT)
-#define BPF_PROG_CHUNK_MASK (~(BPF_PROG_CHUNK_SIZE - 1))
-
-struct bpf_prog_pack {
- struct list_head list;
- void *ptr;
- unsigned long bitmap[];
-};
-
-void bpf_jit_fill_hole_with_zero(void *area, unsigned int size)
-{
- memset(area, 0, size);
-}
-
-#define BPF_PROG_SIZE_TO_NBITS(size) (round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
-
-static DEFINE_MUTEX(pack_mutex);
-static LIST_HEAD(pack_list);
-
-/* PMD_SIZE is not available in some special config, e.g. ARCH=arm with
- * CONFIG_MMU=n. Use PAGE_SIZE in these cases.
- */
-#ifdef PMD_SIZE
-#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())
-#else
-#define BPF_PROG_PACK_SIZE PAGE_SIZE
-#endif
-
-#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
-
-static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
- struct bpf_prog_pack *pack;
-
- pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)),
- GFP_KERNEL);
- if (!pack)
- return NULL;
- pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
- if (!pack->ptr) {
- kfree(pack);
- return NULL;
- }
- bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE);
- bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
- list_add_tail(&pack->list, &pack_list);
-
- set_vm_flush_reset_perms(pack->ptr);
- set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
- set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
- return pack;
-}
-
-void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
- unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
- struct bpf_prog_pack *pack;
- unsigned long pos;
- void *ptr = NULL;
-
- mutex_lock(&pack_mutex);
- if (size > BPF_PROG_PACK_SIZE) {
- size = round_up(size, PAGE_SIZE);
- ptr = module_alloc(size);
- if (ptr) {
- bpf_fill_ill_insns(ptr, size);
- set_vm_flush_reset_perms(ptr);
- set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
- set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
- }
- goto out;
- }
- list_for_each_entry(pack, &pack_list, list) {
- pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
- nbits, 0);
- if (pos < BPF_PROG_CHUNK_COUNT)
- goto found_free_area;
- }
-
- pack = alloc_new_pack(bpf_fill_ill_insns);
- if (!pack)
- goto out;
-
- pos = 0;
-
-found_free_area:
- bitmap_set(pack->bitmap, pos, nbits);
- ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
-
-out:
- mutex_unlock(&pack_mutex);
- return ptr;
-}
-
-void bpf_prog_pack_free(struct bpf_binary_header *hdr)
-{
- struct bpf_prog_pack *pack = NULL, *tmp;
- unsigned int nbits;
- unsigned long pos;
-
- mutex_lock(&pack_mutex);
- if (hdr->size > BPF_PROG_PACK_SIZE) {
- module_memfree(hdr);
- goto out;
- }
-
- list_for_each_entry(tmp, &pack_list, list) {
- if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) {
- pack = tmp;
- break;
- }
- }
-
- if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
- goto out;
-
- nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
- pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT;
-
- WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
- "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
-
- bitmap_clear(pack->bitmap, pos, nbits);
- if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
- BPF_PROG_CHUNK_COUNT, 0) == 0) {
- list_del(&pack->list);
- module_memfree(pack->ptr);
- kfree(pack);
- }
-out:
- mutex_unlock(&pack_mutex);
-}
-
static atomic_long_t bpf_jit_current;
/* Can be overridden by an arch's JIT compiler if it has a custom,
@@ -1048,6 +905,9 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
bpf_jit_uncharge_modmem(size);
}
+#define BPF_PROG_EXEC_ALIGN 64
+#define BPF_PROG_EXEC_MASK (~(BPF_PROG_EXEC_ALIGN - 1))
+
/* Allocate jit binary from bpf_prog_pack allocator.
* Since the allocated memory is RO+X, the JIT engine cannot write directly
* to the memory. To solve this problem, a RW buffer is also allocated at
@@ -1070,11 +930,11 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
alignment > BPF_IMAGE_ALIGNMENT);
/* add 16 bytes for a random section of illegal instructions */
- size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_CHUNK_SIZE);
+ size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_EXEC_ALIGN);
if (bpf_jit_charge_modmem(size))
return NULL;
- ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns);
+ ro_header = vmalloc_exec(size, BPF_PROG_EXEC_ALIGN);
if (!ro_header) {
bpf_jit_uncharge_modmem(size);
return NULL;
@@ -1082,8 +942,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
*rw_header = kvmalloc(size, GFP_KERNEL);
if (!*rw_header) {
- bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
- bpf_prog_pack_free(ro_header);
+ vfree_exec(ro_header);
bpf_jit_uncharge_modmem(size);
return NULL;
}
@@ -1093,7 +952,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
(*rw_header)->size = size;
hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),
- BPF_PROG_CHUNK_SIZE - sizeof(*ro_header));
+ BPF_PROG_EXEC_ALIGN - sizeof(*ro_header));
start = (get_random_int() % hole) & ~(alignment - 1);
*image_ptr = &ro_header->image[start];
@@ -1109,12 +968,12 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
{
void *ptr;
- ptr = bpf_arch_text_copy(ro_header, rw_header, rw_header->size);
+ ptr = vcopy_exec(ro_header, rw_header, rw_header->size);
kvfree(rw_header);
if (IS_ERR(ptr)) {
- bpf_prog_pack_free(ro_header);
+ vfree_exec(ro_header);
return PTR_ERR(ptr);
}
return 0;
@@ -1124,18 +983,13 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
* 1) when the program is freed after;
* 2) when the JIT engine fails (before bpf_jit_binary_pack_finalize).
* For case 2), we need to free both the RO memory and the RW buffer.
- *
- * bpf_jit_binary_pack_free requires proper ro_header->size. However,
- * bpf_jit_binary_pack_alloc does not set it. Therefore, ro_header->size
- * must be set with either bpf_jit_binary_pack_finalize (normal path) or
- * bpf_arch_text_copy (when jit fails).
*/
void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
struct bpf_binary_header *rw_header)
{
- u32 size = ro_header->size;
+ u32 size = rw_header ? rw_header->size : ro_header->size;
- bpf_prog_pack_free(ro_header);
+ vfree_exec(ro_header);
kvfree(rw_header);
bpf_jit_uncharge_modmem(size);
}
@@ -1146,7 +1000,7 @@ bpf_jit_binary_pack_hdr(const struct bpf_prog *fp)
unsigned long real_start = (unsigned long)fp->bpf_func;
unsigned long addr;
- addr = real_start & BPF_PROG_CHUNK_MASK;
+ addr = real_start & BPF_PROG_EXEC_MASK;
return (void *)addr;
}
@@ -2736,16 +2590,6 @@ int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
return -ENOTSUPP;
}
-void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
-{
- return ERR_PTR(-ENOTSUPP);
-}
-
-int __weak bpf_arch_text_invalidate(void *dst, size_t len)
-{
- return -ENOTSUPP;
-}
-
DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
EXPORT_SYMBOL(bpf_stats_enabled_key);
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
index fa64b80b8bca..7ca8138fb8fa 100644
--- a/kernel/bpf/dispatcher.c
+++ b/kernel/bpf/dispatcher.c
@@ -120,11 +120,11 @@ static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
tmp = d->num_progs ? d->rw_image + noff : NULL;
if (new) {
/* Prepare the dispatcher in d->rw_image. Then use
- * bpf_arch_text_copy to update d->image, which is RO+X.
+ * vcopy_exec to update d->image, which is RO+X.
*/
if (bpf_dispatcher_prepare(d, new, tmp))
return;
- if (IS_ERR(bpf_arch_text_copy(new, tmp, PAGE_SIZE / 2)))
+ if (IS_ERR(vcopy_exec(new, tmp, PAGE_SIZE / 2)))
return;
}
@@ -146,15 +146,12 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
mutex_lock(&d->mutex);
if (!d->image) {
- d->image = bpf_prog_pack_alloc(PAGE_SIZE, bpf_jit_fill_hole_with_zero);
+ d->image = vmalloc_exec(PAGE_SIZE, PAGE_SIZE /* align */);
if (!d->image)
goto out;
d->rw_image = bpf_jit_alloc_exec(PAGE_SIZE);
if (!d->rw_image) {
- u32 size = PAGE_SIZE;
-
- bpf_arch_text_copy(d->image, &size, sizeof(size));
- bpf_prog_pack_free((struct bpf_binary_header *)d->image);
+ vfree_exec((struct bpf_binary_header *)d->image);
d->image = NULL;
goto out;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH bpf-next v1 4/5] vmalloc: introduce register_text_tail_vm()
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
` (2 preceding siblings ...)
2022-10-31 21:58 ` [PATCH bpf-next v1 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
@ 2022-10-31 21:58 ` Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 5/5] x86: use register_text_tail_vm Song Liu
4 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-10-31 21:58 UTC (permalink / raw)
To: bpf, linux-mm
Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, urezki,
mcgrof, kernel-team, Song Liu
Allow arch code to register some memory to be used by vmalloc_exec().
One possible use case is to allocate PMD pages for kernel text up to
PMD_ALIGN(_etext), and use the range (_etext, PMD_ALIGN(_etext)) for
vmalloc_exec(). Currently, only one such region is supported.
Signed-off-by: Song Liu <song@kernel.org>
---
mm/vmalloc.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6f4c73e67191..46f2b7e56670 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+static struct vm_struct text_tail_vm;
+static struct vmap_area text_tail_va;
+
bool is_vmalloc_addr(const void *x)
{
unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -653,6 +656,8 @@ int is_vmalloc_or_module_addr(const void *x)
unsigned long addr = (unsigned long)kasan_reset_tag(x);
if (addr >= MODULES_VADDR && addr < MODULES_END)
return 1;
+ if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
+ return 1;
#endif
return is_vmalloc_addr(x);
}
@@ -2437,6 +2442,34 @@ static void vmap_init_free_space(void)
}
}
+/*
+ * register_text_tail_vm() allows arch code to register memory regions
+ * for vmalloc_exec. Unlike regular memory regions used by vmalloc_exec,
+ * this region is never freed by vfree_exec.
+ *
+ * One possible use case is to allocate PMD pages for kernel text up to
+ * PMD_ALIGN(_etext), and use (_etext, PMD_ALIGN(_etext)) for vmalloc_exec.
+ */
+void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+ struct vmap_area *va;
+
+ /* only support one region */
+ if (WARN_ON_ONCE(text_tail_vm.addr))
+ return;
+
+ va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+ if (WARN_ON_ONCE(!va))
+ return;
+ text_tail_vm.addr = (void *)start;
+ text_tail_vm.size = end - start;
+ text_tail_va.va_start = start;
+ text_tail_va.va_end = end;
+ text_tail_va.vm = &text_tail_vm;
+ memcpy(va, &text_tail_va, sizeof(*va));
+ insert_vmap_area_augment(va, NULL, &free_text_area_root, &free_text_area_list);
+}
+
void __init vmalloc_init(void)
{
struct vmap_area *va;
--
2.30.2
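register_text_tail_vm() above does two things that later address checks depend on: it records a single static region exactly once (WARN and bail on a second call), and the range it records is then consulted by is_vmalloc_or_module_addr(). A minimal userspace model of that bookkeeping, with invented names and a plain error return in place of WARN_ON_ONCE(), looks like this:

```c
#include <assert.h>

/* One static text-tail region, registered at most once (model only). */
static unsigned long tail_start, tail_end;

static int model_register_text_tail(unsigned long start, unsigned long end)
{
	if (tail_start || tail_end)	/* only one region is supported */
		return -1;
	tail_start = start;
	tail_end = end;
	return 0;
}

/* Mirrors the range check added to is_vmalloc_or_module_addr(). */
static int model_is_text_tail_addr(unsigned long addr)
{
	return addr >= tail_start && addr < tail_end;
}
```

The half-open [start, end) check matches the `addr >= text_tail_va.va_start && addr < text_tail_va.va_end` test in the hunk above.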
* [PATCH bpf-next v1 5/5] x86: use register_text_tail_vm
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
` (3 preceding siblings ...)
2022-10-31 21:58 ` [PATCH bpf-next v1 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
@ 2022-10-31 21:58 ` Song Liu
4 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-10-31 21:58 UTC (permalink / raw)
To: bpf, linux-mm
Cc: akpm, x86, peterz, hch, rick.p.edgecombe, dave.hansen, urezki,
mcgrof, kernel-team, Song Liu
Allocate 2MB pages up to round_up(_etext, 2MB), and register the memory
range [round_up(_etext, 4KB), round_up(_etext, 2MB)] with
register_text_tail_vm() so that this part of memory can be used for
dynamic kernel text (BPF programs, etc.).
Here is an example:
[root@eth50-1 ~]# grep _etext /proc/kallsyms
ffffffff82202a08 T _etext
[root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms | tail -n 3
ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup [bpf]
ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new [bpf]
ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch [bpf]
[root@eth50-1 ~]# grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
0xffffffff82200000-0xffffffff82400000 2M ro PSE x pmd
ffffffff82200000-ffffffff82400000 is a 2MB page, serving both kernel text
and bpf programs.
Signed-off-by: Song Liu <song@kernel.org>
---
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/mm/init_64.c | 4 +++-
include/linux/vmalloc.h | 4 ++++
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04f36063ad54..c0f9cceb109a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -101,6 +101,7 @@ extern unsigned int ptrs_per_p4d;
#define PUD_MASK (~(PUD_SIZE - 1))
#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE - 1))
+#define PMD_ALIGN(x) (((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
/*
* See Documentation/x86/x86_64/mm.rst for a description of the memory map.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..5b42fc0c6099 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1373,7 +1373,7 @@ void mark_rodata_ro(void)
unsigned long start = PFN_ALIGN(_text);
unsigned long rodata_start = PFN_ALIGN(__start_rodata);
unsigned long end = (unsigned long)__end_rodata_hpage_align;
- unsigned long text_end = PFN_ALIGN(_etext);
+ unsigned long text_end = PMD_ALIGN(_etext);
unsigned long rodata_end = PFN_ALIGN(__end_rodata);
unsigned long all_end;
@@ -1414,6 +1414,8 @@ void mark_rodata_ro(void)
(void *)rodata_end, (void *)_sdata);
debug_checkwx();
+ register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
+ PMD_ALIGN((unsigned long)_etext));
}
int kern_addr_valid(unsigned long addr)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 9b2042313c12..7365cf9c4e7f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -132,11 +132,15 @@ extern void vm_unmap_aliases(void);
#ifdef CONFIG_MMU
extern void __init vmalloc_init(void);
extern unsigned long vmalloc_nr_pages(void);
+void register_text_tail_vm(unsigned long start, unsigned long end);
#else
static inline void vmalloc_init(void)
{
}
static inline unsigned long vmalloc_nr_pages(void) { return 0; }
+static inline void register_text_tail_vm(unsigned long start, unsigned long end)
+{
+}
#endif
extern void *vmalloc(unsigned long size) __alloc_size(1);
--
2.30.2
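The arithmetic behind the commit message can be checked directly. With the example `_etext = ffffffff82202a08` from the kallsyms output above, rounding up to 4KB gives the start of the reusable tail and rounding up to 2MB gives the end of the PMD page. The sketch below is a userspace model; the kernel uses PFN_ALIGN() and the PMD_ALIGN() macro added in this patch, and the MODEL_* names are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE (4096ULL)
#define MODEL_PMD_SIZE  (2ULL * 1024 * 1024)

/* align must be a power of two, as 4KB and 2MB are */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * For the example _etext = 0xffffffff82202a08:
 *   round_up(_etext, 4KB) = 0xffffffff82203000  (tail start)
 *   round_up(_etext, 2MB) = 0xffffffff82400000  (tail end)
 * leaving 0x1fd000 bytes (2036 KiB) of the 2MB page for dynamic text.
 */
```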
* Re: [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
2022-10-31 21:58 ` [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
@ 2022-11-01 11:54 ` Uladzislau Rezki
2022-11-01 15:06 ` Song Liu
0 siblings, 1 reply; 8+ messages in thread
From: Uladzislau Rezki @ 2022-11-01 11:54 UTC (permalink / raw)
To: Song Liu
Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
dave.hansen, urezki, mcgrof, kernel-team
On Mon, Oct 31, 2022 at 02:58:30PM -0700, Song Liu wrote:
> vmalloc_exec is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].
>
> A new tree of vmap_area, free_text_area_* tree, is introduced in addition
> to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates pages from
> free_text_area_*. When there isn't enough space left in free_text_area_*,
> new PMD_SIZE page(s) are allocated from free_vmap_area_* and added to
> free_text_area_*. To be more accurate, the vmap_area is first added to
> vmap_area_* tree and then moved to free_text_area_*. This extra move
> simplifies the logic of vmalloc_exec.
>
> vmap_area in free_text_area_* tree are backed with memory, but we need
> subtree_max_size for tree operations. Therefore, vm_struct for these
> vmap_area are stored in a separate list, all_text_vm.
>
> The new tree allows separate handling of < PAGE_SIZE allocations, as
> current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
> version of vmalloc_exec can handle bpf programs, which use 64 byte
> aligned allocations, and modules, which use PAGE_SIZE aligned
> allocations.
>
> Memory allocated by vmalloc_exec() is set to RO+X before returning to the
> caller. Therefore, the caller cannot write directly to the memory.
> Instead, the caller is required to use vcopy_exec() to update the memory.
> For the safety and security of X memory, vcopy_exec() checks that the data
> being updated always falls within the memory allocated by one
> vmalloc_exec() call. vcopy_exec() uses a text_poke-like mechanism and
> requires arch support. Specifically, the arch needs to implement
> arch_vcopy_exec().
>
> In vfree_exec(), the memory is first erased with arch_invalidate_exec().
> Then, the memory is added to free_text_area_*. If this free creates a big
> enough contiguous free space (> PMD_SIZE), vfree_exec() will try to free
> the backing vm_struct.
>
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>
> Signed-off-by: Song Liu <song@kernel.org>
> ---
> include/linux/vmalloc.h | 5 +
> mm/nommu.c | 12 ++
> mm/vmalloc.c | 318 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 335 insertions(+)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 096d48aa3437..9b2042313c12 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -154,6 +154,11 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
> void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
> int node, const void *caller) __alloc_size(1);
> void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> +void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
> +void *vcopy_exec(void *dst, void *src, size_t len);
> +void vfree_exec(void *addr);
> +void *arch_vcopy_exec(void *dst, void *src, size_t len);
> +int arch_invalidate_exec(void *ptr, size_t len);
>
> extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
> extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 214c70e1d059..8a1317247ef0 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -371,6 +371,18 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> }
> EXPORT_SYMBOL(vm_map_pages_zero);
>
> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> + return NULL;
> +}
> +
> +void *vcopy_exec(void *dst, void *src, size_t len)
> +{
> + return ERR_PTR(-EOPNOTSUPP);
> +}
> +
> +void vfree_exec(void *addr) { }
> +
> /*
> * sys_brk() for the most part doesn't need the global kernel
> * lock, except when an application is doing something nasty
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ccaa461998f3..6f4c73e67191 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc);
> static const bool vmap_allow_huge = false;
> #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
>
> +#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
> +#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
> +
> bool is_vmalloc_addr(const void *x)
> {
> unsigned long addr = (unsigned long)kasan_reset_tag(x);
> @@ -769,6 +772,38 @@ static LIST_HEAD(free_vmap_area_list);
> */
> static struct rb_root free_vmap_area_root = RB_ROOT;
>
> +/*
> + * free_text_area for vmalloc_exec()
> + */
> +static DEFINE_SPINLOCK(free_text_area_lock);
> +/*
> + * This linked list is used in pair with free_text_area_root.
> + * It gives O(1) access to prev/next to perform fast coalescing.
> + */
> +static LIST_HEAD(free_text_area_list);
> +
> +/*
> + * This augment red-black tree represents the free text space.
> + * All vmap_area objects in this tree are sorted by va->va_start
> + * address. It is used for allocation and merging when a vmap
> + * object is released.
> + *
> + * Each vmap_area node contains a maximum available free block
> + * of its sub-tree, right or left. Therefore it is possible to
> + * find a lowest match of free area.
> + *
> + * vmap_area in this tree are backed by RO+X memory, but they do
> + * not have valid vm pointer (because we need subtree_max_size).
> + * The vm for these vmap_area are stored in all_text_vm.
> + */
> +static struct rb_root free_text_area_root = RB_ROOT;
> +
> +/*
> + * List of vm_struct for free_text_area_root. This list is rarely
> + * accessed, so the O(N) complexity is not likely a real issue.
> + */
> +struct vm_struct *all_text_vm;
> +
> /*
> * Preload a CPU with one object for "no edge" split case. The
> * aim is to get rid of allocations from the atomic context, thus
> @@ -3313,6 +3348,289 @@ void *vmalloc(unsigned long size)
> }
> EXPORT_SYMBOL(vmalloc);
>
> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define VMALLOC_EXEC_START MODULES_VADDR
> +#define VMALLOC_EXEC_END MODULES_END
> +#else
> +#define VMALLOC_EXEC_START VMALLOC_START
> +#define VMALLOC_EXEC_END VMALLOC_END
> +#endif
> +
> +static void move_vmap_to_free_text_tree(void *addr)
> +{
> + struct vmap_area *va;
> +
> + /* remove from vmap_area_root */
> + spin_lock(&vmap_area_lock);
> + va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> + if (WARN_ON_ONCE(!va)) {
> + spin_unlock(&vmap_area_lock);
> + return;
> + }
> + unlink_va(va, &vmap_area_root);
> + spin_unlock(&vmap_area_lock);
> +
> + /* make the memory RO+X */
> + memset(addr, 0, va->va_end - va->va_start);
> + set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
> + set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
> +
> + /* add to all_text_vm */
> + va->vm->next = all_text_vm;
> + all_text_vm = va->vm;
> +
> + /* add to free_text_area_root */
> + spin_lock(&free_text_area_lock);
> + merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
> + spin_unlock(&free_text_area_lock);
> +}
> +
> +/**
> + * vmalloc_exec - allocate virtually contiguous RO+X memory
> + * @size: allocation size
> + *
> + * This is used to allocate dynamic kernel text, such as module text, BPF
> + * programs, etc. User need to use text_poke to update the memory allocated
> + * by vmalloc_exec.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> + struct vmap_area *va, *tmp;
> + unsigned long addr;
> + enum fit_type type;
> + int ret;
> +
> + va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> + if (unlikely(!va))
> + return NULL;
> +
> +again:
> + preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> + tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> + if (!tmp) {
> + unsigned long alloc_size;
> + void *ptr;
> +
> + spin_unlock(&free_text_area_lock);
> +
> + /*
> + * Not enough contiguous space in free_text_area_root, so try to
> + * allocate more memory. The memory is first added to
> + * vmap_area_root, and then moved to free_text_area_root.
> + */
> + alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> + ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
> + VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
> + VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> + NUMA_NO_NODE, __builtin_return_address(0));
> + if (unlikely(!ptr))
> + goto err_out;
> +
> + move_vmap_to_free_text_tree(ptr);
> + goto again;
>
It is yet another allocator built on top of vmalloc. So there are 4 then.
Could you please avoid doing it? I do not find it as something that is
reasonable.
--
Uladzislau Rezki
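The allocation scheme being debated (first-fit from a free tree of already-mapped executable ranges, with a whole-PMD refill from the underlying vmalloc allocator when nothing fits, then a retry) can be reduced to a small userspace toy. This is an illustration only: the names are invented, a linked list stands in for the augmented rb-tree, and a fake address counter stands in for __vmalloc_node_range(); error handling and freeing are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PMD_SIZE (2ULL * 1024 * 1024)

struct free_range {
	unsigned long long start, size;
	struct free_range *next;
};

static struct free_range *free_list;
static unsigned long long next_chunk = 0x100000000ULL; /* fake VA space */

/* Stands in for __vmalloc_node_range() + move_vmap_to_free_text_tree(). */
static void model_refill(void)
{
	struct free_range *r = malloc(sizeof(*r));

	assert(r);
	r->start = next_chunk;
	r->size = MODEL_PMD_SIZE;
	r->next = free_list;
	free_list = r;
	next_chunk += MODEL_PMD_SIZE;
}

/* First-fit allocate; on failure refill one PMD_SIZE chunk and retry once. */
static unsigned long long model_alloc_exec(unsigned long long size)
{
	struct free_range *r;
	int refilled = 0;

again:
	for (r = free_list; r; r = r->next) {
		if (r->size >= size) {
			unsigned long long addr = r->start;

			r->start += size;
			r->size -= size;
			return addr;
		}
	}
	if (refilled++)
		return 0;
	model_refill();
	goto again;
}
```

The "fit or refill, then goto again" shape is the same control flow as the `again:` loop in vmalloc_exec() quoted above; the kernel version additionally handles alignment, locking, and permission changes.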
* Re: [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
2022-11-01 11:54 ` Uladzislau Rezki
@ 2022-11-01 15:06 ` Song Liu
0 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-11-01 15:06 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: bpf, linux-mm, akpm, x86, peterz, hch, rick.p.edgecombe,
dave.hansen, mcgrof, kernel-team
On Tue, Nov 1, 2022 at 4:54 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 31, 2022 at 02:58:30PM -0700, Song Liu wrote:
> > [... full patch quoted above, trimmed ...]
> It is yet another allocator built on top of vmalloc. So there are 4 then.
> Could you please avoid doing it? I do not find it as something that is
> reasonable.
Could you please elaborate why this is not reasonable? Or, what would
be a more reasonable alternative?
Thanks,
Song
Thread overview: 8+ messages
2022-10-31 21:58 [PATCH bpf-next v1 0/5] vmalloc_exec for modules and BPF programs Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec Song Liu
2022-11-01 11:54 ` Uladzislau Rezki
2022-11-01 15:06 ` Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 2/5] x86/alternative: support vmalloc_exec() and vfree_exec() Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 3/5] bpf: use vmalloc_exec for bpf program and bpf dispatcher Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 4/5] vmalloc: introduce register_text_tail_vm() Song Liu
2022-10-31 21:58 ` [PATCH bpf-next v1 5/5] x86: use register_text_tail_vm Song Liu