* [RFC v2 0/4] vmalloc_exec for modules and BPF programs
@ 2022-10-07 23:43 Song Liu
  2022-10-07 23:43 ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Song Liu @ 2022-10-07 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, x86, peterz, hch, kernel-team, rick.p.edgecombe,
	dave.hansen, urezki, Song Liu

Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
   work fine with BPF programs (patches 1, 2, and 4), but the module side
   (patch 3) still needs some work.

This set is a prototype that allows dynamic kernel text (modules, BPF
programs, various trampolines, etc.) to share huge pages. The idea is
similar to Peter's suggestion in [1]. Please refer to each patch for
more details.

The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
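
As a rough sketch of how the new API is meant to be used (simplified
from patch 1; text_poke_copy() is used here only for illustration,
actual callers go through bpf_arch_text_copy() or the module loader):

	/* Allocate virtually contiguous RO+X memory for dynamic text. */
	void *text = vmalloc_exec(size, 64 /* align */);

	if (text) {
		/* The memory is RO+X, so write instructions via text_poke. */
		text_poke_copy(text, insns, size);
		...
		/* Return the range to the shared huge pages. */
		vfree_exec(text);
	}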

Please share your comments on this.

Thanks!

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/

Song Liu (4):
  vmalloc: introduce vmalloc_exec and vfree_exec
  bpf: use vmalloc_exec
  modules, x86: use vmalloc_exec for module core
  vmalloc_exec: share a huge page with kernel text

 arch/x86/Kconfig              |   1 +
 arch/x86/kernel/alternative.c |  30 +++-
 arch/x86/kernel/module.c      |   1 +
 arch/x86/mm/init_64.c         |   3 +-
 include/linux/vmalloc.h       |   2 +
 kernel/bpf/core.c             | 155 ++----------------
 kernel/module/main.c          |  23 +--
 kernel/module/strict_rwx.c    |   3 -
 kernel/trace/ftrace.c         |   3 +-
 mm/nommu.c                    |   7 +
 mm/vmalloc.c                  | 296 ++++++++++++++++++++++++++++++++++
 11 files changed, 358 insertions(+), 166 deletions(-)

--
2.30.2



* [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
@ 2022-10-07 23:43 ` Song Liu
  2022-10-10 18:13   ` Edgecombe, Rick P
  2022-10-07 23:43 ` [RFC v2 2/4] bpf: use vmalloc_exec Song Liu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-07 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, x86, peterz, hch, kernel-team, rick.p.edgecombe,
	dave.hansen, urezki, Song Liu

vmalloc_exec is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new vmap_area tree, free_text_area_*, is introduced in addition to
free_vmap_area_* and vmap_area_*. vmalloc_exec allocates space from
free_text_area_*. When there isn't enough space left in free_text_area_*,
new PMD_SIZE page(s) are allocated from free_vmap_area_* and added to
free_text_area_*. To be more accurate, the vmap_area is first added to
the vmap_area_* tree and then moved to free_text_area_*. This extra move
simplifies the logic of vmalloc_exec.

vmap_areas in the free_text_area_* tree are backed with memory, but the
tree operations need subtree_max_size, which shares a union with the vm
pointer. Therefore, the vm_structs for these vmap_areas are stored in a
separate list, all_text_vm.

The new tree allows separate handling of allocations smaller than
PAGE_SIZE, as the current vmalloc code mostly assumes PAGE_SIZE-aligned
allocations. This version of vmalloc_exec can handle both BPF programs,
which use 64-byte-aligned allocations, and modules, which use
PAGE_SIZE-aligned allocations.

In vfree_exec(), the memory is first returned to free_text_area_*. If
this free creates a large enough (at least PMD_SIZE, PMD-aligned) free
region, vfree_exec() will try to free the backing vm_struct.
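
For example (hypothetical addresses), if the merge in vfree_exec()
produces a free range [0xffffffffa0001000, 0xffffffffa0401000):

	free_start = PMD_ALIGN(0xffffffffa0001000)      = 0xffffffffa0200000
	free_end   = PMD_ALIGN_DOWN(0xffffffffa0401000) = 0xffffffffa0400000

free_start < free_end, so the 2MB region in between is eligible, and a
vm_struct that lies entirely inside it can be unlinked and vfree()ed.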

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |   2 +
 mm/nommu.c              |   7 ++
 mm/vmalloc.c            | 269 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 278 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..16c0adc1daee 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,8 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller) __alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
+void vfree_exec(const void *addr);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index e819cbc21b39..c7dcd920ec26 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -372,6 +372,13 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 }
 EXPORT_SYMBOL(vm_map_pages_zero);
 
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+	return NULL;
+}
+
+void vfree_exec(const void *addr) { }
+
 /*
  *  sys_brk() for the most part doesn't need the global kernel
  *  lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 088b421601c4..9212ff96b871 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,9 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -753,6 +756,39 @@ static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * free_text_area_* structures used by vmalloc_exec() and
+ * vfree_exec().
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augmented red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains the maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find the lowest match of free area.
+ *
+ * vmap_areas in this tree are backed by RO+X memory, but they do
+ * not have a valid vm pointer (the field is unioned with
+ * subtree_max_size). The vm_structs are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
 /*
  * Preload a CPU with one object for "no edge" split case. The
  * aim is to get rid of allocations from the atomic context, thus
@@ -3297,6 +3333,239 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define VMALLOC_EXEC_START MODULES_VADDR
+#define VMALLOC_EXEC_END MODULES_END
+#else
+#define VMALLOC_EXEC_START VMALLOC_START
+#define VMALLOC_EXEC_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+	struct vmap_area *va;
+
+	/* remove from vmap_area_root */
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* make the memory RO+X */
+	memset(addr, 0, va->va_end - va->va_start);
+	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+	/* add to all_text_vm */
+	va->vm->next = all_text_vm;
+	all_text_vm = va->vm;
+
+	/* add to free_text_area_root */
+	spin_lock(&free_text_area_lock);
+	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+	spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * vmalloc_exec - allocate virtually contiguous RO+X memory
+ * @size:    allocation size
+ * @align:   desired alignment of the allocation
+ *
+ * This is used to allocate dynamic kernel text, such as module text, BPF
+ * programs, etc. Callers must use text_poke() (or similar) to update the
+ * memory allocated by vmalloc_exec().
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+	struct vmap_area *va, *tmp;
+	unsigned long addr;
+	enum fit_type type;
+	int ret;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!va))
+		return NULL;
+
+again:
+	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+	if (!tmp) {
+		unsigned long alloc_size;
+		void *ptr;
+
+		spin_unlock(&free_text_area_lock);
+
+		/*
+		 * Not enough contiguous space in free_text_area_root, so try
+		 * to allocate more memory. The memory is first added to
+		 * vmap_area_root, and then moved to free_text_area_root.
+		 */
+		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
+					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
+					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (unlikely(!ptr))
+			goto err_out;
+
+		move_vmap_to_free_text_tree(ptr);
+		goto again;
+	}
+
+	addr = roundup(tmp->va_start, align);
+	type = classify_va_fit_type(tmp, addr, size);
+	if (WARN_ON_ONCE(type == NOTHING_FIT))
+		goto err_out;
+
+	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+				    tmp, addr, size);
+	if (ret)
+		goto err_out;
+
+	spin_unlock(&free_text_area_lock);
+
+	va->va_start = addr;
+	va->va_end = addr + size;
+	va->vm = NULL;
+
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	return (void *)addr;
+
+err_out:
+	spin_unlock(&free_text_area_lock);
+	kmem_cache_free(vmap_area_cachep, va);
+	return NULL;
+}
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+	struct vm_struct *vm, *prev_vm;
+
+	lockdep_assert_held(&free_text_area_lock);
+
+	vm = all_text_vm;
+	while (vm) {
+		unsigned long vm_addr = (unsigned long)vm->addr;
+
+		/* vm is within this free space, we can free it */
+		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+			goto unlink_vm;
+		vm = vm->next;
+	}
+	return NULL;
+
+unlink_vm:
+	if (all_text_vm == vm) {
+		all_text_vm = vm->next;
+	} else {
+		prev_vm = all_text_vm;
+		while (prev_vm->next != vm)
+			prev_vm = prev_vm->next;
+		prev_vm->next = vm->next;
+	}
+	return vm;
+}
+
+/**
+ * vfree_exec - Release memory allocated by vmalloc_exec()
+ * @addr:  Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void vfree_exec(const void *addr)
+{
+	unsigned long free_start, free_end, free_addr;
+	struct vm_struct *vm;
+	struct vmap_area *va;
+
+	might_sleep();
+
+	if (!addr)
+		return;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	WARN_ON_ONCE(va->vm);
+
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	spin_lock(&free_text_area_lock);
+	va = merge_or_add_vmap_area_augment(va,
+		&free_text_area_root, &free_text_area_list);
+
+	if (WARN_ON_ONCE(!va))
+		goto out;
+
+	free_start = PMD_ALIGN(va->va_start);
+	free_end = PMD_ALIGN_DOWN(va->va_end);
+
+	/*
+	 * Only try to free the backing vm when there is at least PMD_SIZE
+	 * of contiguous, PMD-aligned free memory.
+	 */
+	if (free_start >= free_end)
+		goto out;
+
+	/*
+	 * TODO: It is possible that multiple vm_structs are ready to be
+	 * freed after one vfree_exec(). But we free at most one for now.
+	 */
+	vm = find_and_unlink_text_vm(free_start, free_end);
+	if (!vm)
+		goto out;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+	if (unlikely(!va))
+		goto out_save_vm;
+
+	free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+				      vm->size, 1, (unsigned long)vm->addr,
+				      (unsigned long)vm->addr + vm->size);
+
+	if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+		goto out_save_vm;
+
+	va->va_start = (unsigned long)vm->addr;
+	va->va_end = va->va_start + vm->size;
+	va->vm = vm;
+	spin_unlock(&free_text_area_lock);
+
+	set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+	set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+	/* put the va to vmap_area_root, and then free it with vfree */
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	vfree(vm->addr);
+	return;
+
+out_save_vm:
+	/*
+	 * vm is removed from all_text_vm, but not freed. Add it back,
+	 * so that we can use or free it later.
+	 */
+	vm->next = all_text_vm;
+	all_text_vm = vm;
+out:
+	spin_unlock(&free_text_area_lock);
+}
+
 /**
  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
  * @size:      allocation size
-- 
2.30.2




* [RFC v2 2/4] bpf: use vmalloc_exec
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
  2022-10-07 23:43 ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
@ 2022-10-07 23:43 ` Song Liu
  2022-10-07 23:43 ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Song Liu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-07 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, x86, peterz, hch, kernel-team, rick.p.edgecombe,
	dave.hansen, urezki, Song Liu

Use vmalloc_exec and vfree_exec instead of bpf_prog_pack_[alloc|free].
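
The 64-byte chunk granularity that bpf_prog_pack provided via its
bitmap is now expressed as an alignment argument; roughly (names from
this patch):

	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_EXEC_ALIGN);
	ro_header = vmalloc_exec(size, BPF_PROG_EXEC_ALIGN);
	...
	vfree_exec(ro_header);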

Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 155 +++-------------------------------------------
 1 file changed, 10 insertions(+), 145 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3d9eb3ae334c..f96e8acaa34c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -806,144 +806,6 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	return slot;
 }
 
-/*
- * BPF program pack allocator.
- *
- * Most BPF programs are pretty small. Allocating a hole page for each
- * program is sometime a waste. Many small bpf program also adds pressure
- * to instruction TLB. To solve this issue, we introduce a BPF program pack
- * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
- * to host BPF programs.
- */
-#define BPF_PROG_CHUNK_SHIFT	6
-#define BPF_PROG_CHUNK_SIZE	(1 << BPF_PROG_CHUNK_SHIFT)
-#define BPF_PROG_CHUNK_MASK	(~(BPF_PROG_CHUNK_SIZE - 1))
-
-struct bpf_prog_pack {
-	struct list_head list;
-	void *ptr;
-	unsigned long bitmap[];
-};
-
-#define BPF_PROG_SIZE_TO_NBITS(size)	(round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
-
-static DEFINE_MUTEX(pack_mutex);
-static LIST_HEAD(pack_list);
-
-/* PMD_SIZE is not available in some special config, e.g. ARCH=arm with
- * CONFIG_MMU=n. Use PAGE_SIZE in these cases.
- */
-#ifdef PMD_SIZE
-#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes())
-#else
-#define BPF_PROG_PACK_SIZE PAGE_SIZE
-#endif
-
-#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
-
-static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	struct bpf_prog_pack *pack;
-
-	pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)),
-		       GFP_KERNEL);
-	if (!pack)
-		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
-	if (!pack->ptr) {
-		kfree(pack);
-		return NULL;
-	}
-	bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE);
-	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
-	list_add_tail(&pack->list, &pack_list);
-
-	set_vm_flush_reset_perms(pack->ptr);
-	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
-	return pack;
-}
-
-static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
-{
-	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
-	struct bpf_prog_pack *pack;
-	unsigned long pos;
-	void *ptr = NULL;
-
-	mutex_lock(&pack_mutex);
-	if (size > BPF_PROG_PACK_SIZE) {
-		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
-		if (ptr) {
-			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
-			set_memory_ro((unsigned long)ptr, size / PAGE_SIZE);
-			set_memory_x((unsigned long)ptr, size / PAGE_SIZE);
-		}
-		goto out;
-	}
-	list_for_each_entry(pack, &pack_list, list) {
-		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-						 nbits, 0);
-		if (pos < BPF_PROG_CHUNK_COUNT)
-			goto found_free_area;
-	}
-
-	pack = alloc_new_pack(bpf_fill_ill_insns);
-	if (!pack)
-		goto out;
-
-	pos = 0;
-
-found_free_area:
-	bitmap_set(pack->bitmap, pos, nbits);
-	ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
-
-out:
-	mutex_unlock(&pack_mutex);
-	return ptr;
-}
-
-static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
-{
-	struct bpf_prog_pack *pack = NULL, *tmp;
-	unsigned int nbits;
-	unsigned long pos;
-
-	mutex_lock(&pack_mutex);
-	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
-		goto out;
-	}
-
-	list_for_each_entry(tmp, &pack_list, list) {
-		if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) {
-			pack = tmp;
-			break;
-		}
-	}
-
-	if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
-		goto out;
-
-	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
-	pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT;
-
-	WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size),
-		  "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n");
-
-	bitmap_clear(pack->bitmap, pos, nbits);
-	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
-				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
-		list_del(&pack->list);
-		module_memfree(pack->ptr);
-		kfree(pack);
-	}
-out:
-	mutex_unlock(&pack_mutex);
-}
-
 static atomic_long_t bpf_jit_current;
 
 /* Can be overridden by an arch's JIT compiler if it has a custom,
@@ -1043,6 +905,9 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 	bpf_jit_uncharge_modmem(size);
 }
 
+#define BPF_PROG_EXEC_ALIGN	64
+#define BPF_PROG_EXEC_MASK	(~(BPF_PROG_EXEC_ALIGN - 1))
+
 /* Allocate jit binary from bpf_prog_pack allocator.
  * Since the allocated memory is RO+X, the JIT engine cannot write directly
  * to the memory. To solve this problem, a RW buffer is also allocated at
@@ -1065,11 +930,11 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 		     alignment > BPF_IMAGE_ALIGNMENT);
 
 	/* add 16 bytes for a random section of illegal instructions */
-	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_CHUNK_SIZE);
+	size = round_up(proglen + sizeof(*ro_header) + 16, BPF_PROG_EXEC_ALIGN);
 
 	if (bpf_jit_charge_modmem(size))
 		return NULL;
-	ro_header = bpf_prog_pack_alloc(size, bpf_fill_ill_insns);
+	ro_header = vmalloc_exec(size, BPF_PROG_EXEC_ALIGN);
 	if (!ro_header) {
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
@@ -1078,7 +943,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 	*rw_header = kvmalloc(size, GFP_KERNEL);
 	if (!*rw_header) {
 		bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
-		bpf_prog_pack_free(ro_header);
+		vfree_exec(ro_header);
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
 	}
@@ -1088,7 +953,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 	(*rw_header)->size = size;
 
 	hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),
-		     BPF_PROG_CHUNK_SIZE - sizeof(*ro_header));
+		     BPF_PROG_EXEC_ALIGN - sizeof(*ro_header));
 	start = (get_random_int() % hole) & ~(alignment - 1);
 
 	*image_ptr = &ro_header->image[start];
@@ -1109,7 +974,7 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
 	kvfree(rw_header);
 
 	if (IS_ERR(ptr)) {
-		bpf_prog_pack_free(ro_header);
+		vfree_exec(ro_header);
 		return PTR_ERR(ptr);
 	}
 	return 0;
@@ -1130,7 +995,7 @@ void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
 {
 	u32 size = ro_header->size;
 
-	bpf_prog_pack_free(ro_header);
+	vfree_exec(ro_header);
 	kvfree(rw_header);
 	bpf_jit_uncharge_modmem(size);
 }
@@ -1141,7 +1006,7 @@ bpf_jit_binary_pack_hdr(const struct bpf_prog *fp)
 	unsigned long real_start = (unsigned long)fp->bpf_func;
 	unsigned long addr;
 
-	addr = real_start & BPF_PROG_CHUNK_MASK;
+	addr = real_start & BPF_PROG_EXEC_MASK;
 	return (void *)addr;
 }
 
-- 
2.30.2




* [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
  2022-10-07 23:43 ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
  2022-10-07 23:43 ` [RFC v2 2/4] bpf: use vmalloc_exec Song Liu
@ 2022-10-07 23:43 ` Song Liu
  2022-10-14  3:48   ` Aaron Lu
  2022-10-14 15:42   ` Edgecombe, Rick P
  2022-10-07 23:43 ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Song Liu
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Song Liu @ 2022-10-07 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, x86, peterz, hch, kernel-team, rick.p.edgecombe,
	dave.hansen, urezki, Song Liu

This is a prototype that allows modules to share 2MB text pages with
other modules and BPF programs.

The current version only covers core_layout.
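
Because the module core is now allocated RO+X by vmalloc_exec(), every
write into it (section copy, relocations, alternatives, ftrace patching)
has to go through the text_poke() machinery instead of a plain memcpy.
As a sketch of the pattern the changes below follow (not new API, just
the shape of the conversion):

	/* before: direct write into writable module memory */
	memcpy(dest, src, len);

	/* after: module memory is RO+X from allocation */
	mutex_lock(&text_mutex);
	text_poke(dest, src, len);
	mutex_unlock(&text_mutex);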
---
 arch/x86/Kconfig              |  1 +
 arch/x86/kernel/alternative.c | 30 ++++++++++++++++++++++++------
 arch/x86/kernel/module.c      |  1 +
 kernel/module/main.c          | 23 +++++++++++++----------
 kernel/module/strict_rwx.c    |  3 ---
 kernel/trace/ftrace.c         |  3 ++-
 6 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..0b1ea05a1da6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
+	select ARCH_WANTS_MODULES_DATA_IN_VMALLOC	if X86_64
 	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4f3204364caa..0e47a558c5bc 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -332,7 +332,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 
 		DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
 
-		text_poke_early(instr, insn_buff, insn_buff_sz);
+		if (system_state < SYSTEM_RUNNING) {
+			text_poke_early(instr, insn_buff, insn_buff_sz);
+		} else {
+			mutex_lock(&text_mutex);
+			text_poke(instr, insn_buff, insn_buff_sz);
+			mutex_unlock(&text_mutex);
+		}
 
 next:
 		optimize_nops(instr, a->instrlen);
@@ -503,7 +509,13 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
 			optimize_nops(bytes, len);
 			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
 			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-			text_poke_early(addr, bytes, len);
+			if (system_state == SYSTEM_BOOTING) {
+				text_poke_early(addr, bytes, len);
+			} else {
+				mutex_lock(&text_mutex);
+				text_poke(addr, bytes, len);
+				mutex_unlock(&text_mutex);
+			}
 		}
 	}
 }
@@ -568,7 +580,13 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
 		if (len == insn.length) {
 			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
 			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-			text_poke_early(addr, bytes, len);
+			if (unlikely(system_state == SYSTEM_BOOTING)) {
+				text_poke_early(addr, bytes, len);
+			} else {
+				mutex_lock(&text_mutex);
+				text_poke(addr, bytes, len);
+				mutex_unlock(&text_mutex);
+			}
 		}
 	}
 }
@@ -609,7 +627,7 @@ void __init_or_module noinline apply_ibt_endbr(s32 *start, s32 *end)
 		 */
 		DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
 		DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
-		text_poke_early(addr, &poison, 4);
+		text_poke(addr, &poison, 4);
 	}
 }
 
@@ -791,7 +809,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 
 		/* Pad the rest with nops */
 		add_nops(insn_buff + used, p->len - used);
-		text_poke_early(p->instr, insn_buff, p->len);
+		text_poke(p->instr, insn_buff, p->len);
 	}
 }
 extern struct paravirt_patch_site __start_parainstructions[],
@@ -1699,7 +1717,7 @@ void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *
 	struct text_poke_loc tp;
 
 	if (unlikely(system_state == SYSTEM_BOOTING)) {
-		text_poke_early(addr, opcode, len);
+		text_poke(addr, opcode, len);
 		return;
 	}
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b1abf663417c..577e31647dc4 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -229,6 +229,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
 	bool early = me->state == MODULE_STATE_UNFORMED;
 	void *(*write)(void *, const void *, size_t) = memcpy;
 
+	early = false; /* module core is RO+X from allocation; always use text_poke */
 	if (!early) {
 		write = text_poke;
 		mutex_lock(&text_mutex);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a4e4d84b6f4e..b44806e31a56 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -53,6 +53,7 @@
 #include <linux/bsearch.h>
 #include <linux/dynamic_debug.h>
 #include <linux/audit.h>
+#include <linux/bpf.h>
 #include <uapi/linux/module.h>
 #include "internal.h"
 
@@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
 	lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size);
 
 	/* Finally, free the core (containing the module structure) */
-	module_memfree(mod->core_layout.base);
+	vfree_exec(mod->core_layout.base);
 #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
 	vfree(mod->data_layout.base);
 #endif
@@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
 			ksym = resolve_symbol_wait(mod, info, name);
 			/* Ok if resolved.  */
 			if (ksym && !IS_ERR(ksym)) {
-				sym[i].st_value = kernel_symbol_value(ksym);
+				unsigned long val = kernel_symbol_value(ksym);
+				bpf_arch_text_copy(&sym[i].st_value, &val, sizeof(val));
 				break;
 			}
 
@@ -1342,7 +1344,8 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
 				secbase = (unsigned long)mod_percpu(mod);
 			else
 				secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
-			sym[i].st_value += secbase;
+			secbase += sym[i].st_value;
+			bpf_arch_text_copy(&sym[i].st_value, &secbase, sizeof(secbase));
 			break;
 		}
 	}
@@ -2123,7 +2126,7 @@ static int move_module(struct module *mod, struct load_info *info)
 	void *ptr;
 
 	/* Do the allocs. */
-	ptr = module_alloc(mod->core_layout.size);
+	ptr = vmalloc_exec(mod->core_layout.size, PAGE_SIZE);
 	/*
 	 * The pointer to this block is stored in the module structure
 	 * which is inside the block. Just mark it as not being a
@@ -2133,7 +2136,7 @@ static int move_module(struct module *mod, struct load_info *info)
 	if (!ptr)
 		return -ENOMEM;
 
-	memset(ptr, 0, mod->core_layout.size);
+	/* ptr is RO+X, so it cannot be written with memset() directly */
 	mod->core_layout.base = ptr;
 
 	if (mod->init_layout.size) {
@@ -2146,7 +2149,7 @@ static int move_module(struct module *mod, struct load_info *info)
 		 */
 		kmemleak_ignore(ptr);
 		if (!ptr) {
-			module_memfree(mod->core_layout.base);
+			vfree_exec(mod->core_layout.base);
 			return -ENOMEM;
 		}
 		memset(ptr, 0, mod->init_layout.size);
@@ -2156,7 +2159,7 @@ static int move_module(struct module *mod, struct load_info *info)
 
 #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
 	/* Do the allocs. */
-	ptr = vzalloc(mod->data_layout.size);
+	ptr = module_alloc(mod->data_layout.size);
 	/*
 	 * The pointer to this block is stored in the module structure
 	 * which is inside the block. Just mark it as not being a
@@ -2164,7 +2167,7 @@ static int move_module(struct module *mod, struct load_info *info)
 	 */
 	kmemleak_not_leak(ptr);
 	if (!ptr) {
-		module_memfree(mod->core_layout.base);
+		vfree_exec(mod->core_layout.base);
 		module_memfree(mod->init_layout.base);
 		return -ENOMEM;
 	}
@@ -2189,7 +2192,7 @@ static int move_module(struct module *mod, struct load_info *info)
 			dest = mod->core_layout.base + shdr->sh_entsize;
 
 		if (shdr->sh_type != SHT_NOBITS)
-			memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size);
+			bpf_arch_text_copy(dest, (void *)shdr->sh_addr, shdr->sh_size);
 		/* Update sh_addr to point to copy in image. */
 		shdr->sh_addr = (unsigned long)dest;
 		pr_debug("\t0x%lx %s\n",
@@ -2345,7 +2348,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
 	percpu_modfree(mod);
 	module_arch_freeing_init(mod);
 	module_memfree(mod->init_layout.base);
-	module_memfree(mod->core_layout.base);
+	vfree_exec(mod->core_layout.base);
 #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
 	vfree(mod->data_layout.base);
 #endif
diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index 14fbea66f12f..d392eb7bf574 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -85,7 +85,6 @@ void module_enable_x(const struct module *mod)
 	    !PAGE_ALIGNED(mod->init_layout.base))
 		return;
 
-	frob_text(&mod->core_layout, set_memory_x);
 	frob_text(&mod->init_layout, set_memory_x);
 }
 
@@ -98,9 +97,7 @@ void module_enable_ro(const struct module *mod, bool after_init)
 		return;
 #endif
 
-	set_vm_flush_reset_perms(mod->core_layout.base);
 	set_vm_flush_reset_perms(mod->init_layout.base);
-	frob_text(&mod->core_layout, set_memory_ro);
 
 	frob_rodata(&mod->data_layout, set_memory_ro);
 	frob_text(&mod->init_layout, set_memory_ro);
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 439e2ab6905e..818418d5b853 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3142,6 +3142,7 @@ static int ftrace_update_code(struct module *mod, struct ftrace_page *new_pgs)
 	if (mod)
 		rec_flags |= FTRACE_FL_DISABLED;
 
+	ftrace_arch_code_modify_prepare();
 	for (pg = new_pgs; pg; pg = pg->next) {
 
 		for (i = 0; i < pg->index; i++) {
@@ -3163,7 +3164,7 @@ static int ftrace_update_code(struct module *mod, struct ftrace_page *new_pgs)
 			update_cnt++;
 		}
 	}
-
+	ftrace_arch_code_modify_post_process();
 	stop = ftrace_now(raw_smp_processor_id());
 	ftrace_update_time = stop - start;
 	ftrace_update_tot_cnt += update_cnt;
-- 
2.30.2




* [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
                   ` (2 preceding siblings ...)
  2022-10-07 23:43 ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Song Liu
@ 2022-10-07 23:43 ` Song Liu
  2022-10-10 18:32   ` Edgecombe, Rick P
  2022-10-08  0:17 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-07 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: akpm, x86, peterz, hch, kernel-team, rick.p.edgecombe,
	dave.hansen, urezki, Song Liu

On x86, the kernel allocates 2MB pages for kernel text up to
round_down(_etext, 2MB). Therefore, some of the kernel text is still
on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
round_up(_etext, 2MB), and use the rest of the last huge page for
modules and BPF programs.

Here is an example:

[root@eth50-1 ~]# grep _etext /proc/kallsyms
ffffffff82202a08 T _etext

[root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]

[root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd

[root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
ffffffff822bc580 t xfs_flush_inodes     [xfs]

ffffffff82200000-ffffffff82400000 is a single 2MB page serving kernel
text, the xfs module, and BPF programs.
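
Worked out with the numbers above (PMD_SIZE = 2MB on x86_64):

	_etext            = 0xffffffff82202a08
	PFN_ALIGN(_etext) = 0xffffffff82203000   /* tail start */
	PMD_ALIGN(_etext) = 0xffffffff82400000   /* tail end   */

so register_text_tail_vm() donates 0x1fd000 bytes (~2036 kB) of the
last kernel-text 2MB page to the free_text_area_ tree.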

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/mm/init_64.c |  3 ++-
 mm/vmalloc.c          | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0fe690ebc269..d94f196c541a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1367,12 +1367,13 @@ int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
 
 int kernel_set_to_readonly;
 
+#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
 void mark_rodata_ro(void)
 {
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
-	unsigned long text_end = PFN_ALIGN(_etext);
+	unsigned long text_end = PMD_ALIGN(_etext);
 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
 	unsigned long all_end;
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9212ff96b871..41509bbec583 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
 #define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
 #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
 
+static struct vm_struct text_tail_vm;
+static struct vmap_area text_tail_va;
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -637,6 +640,8 @@ int is_vmalloc_or_module_addr(const void *x)
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
 	if (addr >= MODULES_VADDR && addr < MODULES_END)
 		return 1;
+	if (addr >= text_tail_va.va_start && addr < text_tail_va.va_end)
+		return 1;
 #endif
 	return is_vmalloc_addr(x);
 }
@@ -2422,6 +2427,24 @@ static void vmap_init_free_space(void)
 	}
 }
 
+static void register_text_tail_vm(void)
+{
+	unsigned long start = PFN_ALIGN((unsigned long)_etext);
+	unsigned long end = PMD_ALIGN((unsigned long)_etext);
+	struct vmap_area *va;
+
+	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+	if (WARN_ON_ONCE(!va))
+		return;
+	text_tail_vm.addr = (void *)start;
+	text_tail_vm.size = end - start;
+	text_tail_va.va_start = start;
+	text_tail_va.va_end = end;
+	text_tail_va.vm = &text_tail_vm;
+	memcpy(va, &text_tail_va, sizeof(*va));
+	insert_vmap_area_augment(va, NULL, &free_text_area_root, &free_text_area_list);
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
@@ -2432,6 +2455,7 @@ void __init vmalloc_init(void)
 	 * Create the cache for vmap_area objects.
 	 */
 	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
+	register_text_tail_vm();
 
 	for_each_possible_cpu(i) {
 		struct vmap_block_queue *vbq;
-- 
2.30.2




* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
                   ` (3 preceding siblings ...)
  2022-10-07 23:43 ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Song Liu
@ 2022-10-08  0:17 ` Song Liu
  2022-10-12 19:03 ` Song Liu
  2022-10-17  7:26 ` Christoph Hellwig
  6 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-08  0:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linux-MM, lkml, Andrew Morton, X86 ML, Peter Zijlstra,
	Christoph Hellwig, Kernel Team, Edgecombe, Rick P, dave.hansen,
	urezki


+ Luis.
> On Oct 7, 2022, at 4:43 PM, Song Liu <song@kernel.org> wrote:
> 
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>   work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>   still need some work.
> 
> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.
> 
> The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
> 
> Please share your comments on this.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/
> 
> Song Liu (4):
>  vmalloc: introduce vmalloc_exec and vfree_exec
>  bpf: use vmalloc_exec
>  modules, x86: use vmalloc_exec for module core
>  vmalloc_exec: share a huge page with kernel text
> 
> arch/x86/Kconfig              |   1 +
> arch/x86/kernel/alternative.c |  30 +++-
> arch/x86/kernel/module.c      |   1 +
> arch/x86/mm/init_64.c         |   3 +-
> include/linux/vmalloc.h       |   2 +
> kernel/bpf/core.c             | 155 ++----------------
> kernel/module/main.c          |  23 +--
> kernel/module/strict_rwx.c    |   3 -
> kernel/trace/ftrace.c         |   3 +-
> mm/nommu.c                    |   7 +
> mm/vmalloc.c                  | 296 ++++++++++++++++++++++++++++++++++
> 11 files changed, 358 insertions(+), 166 deletions(-)
> 
> --
> 2.30.2



* Re: [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
  2022-10-07 23:43 ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
@ 2022-10-10 18:13   ` Edgecombe, Rick P
  2022-10-10 19:04     ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 18:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm, song
  Cc: hch, kernel-team, peterz, urezki, akpm, x86, Hansen, Dave

How does this work with kasan?


* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-07 23:43 ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Song Liu
@ 2022-10-10 18:32   ` Edgecombe, Rick P
  2022-10-10 19:08     ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 18:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm, song
  Cc: hch, kernel-team, peterz, urezki, akpm, x86, Hansen, Dave

On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> On x86 kernel, we allocate 2MB pages for kernel text up to
> round_down(_etext, 2MB). Therefore, some of the kernel text is still
> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> round_up(_etext, 2MB), and use the rest of the page for modules and
> BPF programs.
> 
> Here is an example:
> 
> [root@eth50-1 ~]# grep _etext /proc/kallsyms
> ffffffff82202a08 T _etext
> 
> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> ffffffff8220f920 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> ffffffff8220fa28 t
> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> ffffffff8220fad4 t
> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> 
> [root@eth50-1 ~]#  grep 0xffffffff82200000
> /sys/kernel/debug/page_tables/kernel
> 0xffffffff82200000-
> 0xffffffff82400000     2M     ro   PSE         x  pmd
> 
> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
> ffffffff822bc580 t xfs_flush_inodes     [xfs]
> 
> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
> xfs
> module, and bpf programs.

Can this memory range be freed as part of a vfree_exec() call then?
Does vmalloc actually try to unmap it? If so, it could get complicated
with PTI.

It probably should be a special case that never gets fully freed.

> 
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  arch/x86/mm/init_64.c |  3 ++-
>  mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0fe690ebc269..d94f196c541a 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1367,12 +1367,13 @@ int __init
> deferred_page_init_max_threads(const struct cpumask *node_cpumask)
>  
>  int kernel_set_to_readonly;
>  
> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) &
> PMD_MASK)
>  void mark_rodata_ro(void)
>  {
>  	unsigned long start = PFN_ALIGN(_text);
>  	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>  	unsigned long end = (unsigned long)__end_rodata_hpage_align;
> -	unsigned long text_end = PFN_ALIGN(_etext);
> +	unsigned long text_end = PMD_ALIGN(_etext);

This should probably have more logic and adjustments. If etext is PMD
aligned, some of the stuff outside the diff won't do anything.

Also, if a kernel doesn't have modules or BPF JIT it would be a waste
of memory.

>  	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>  	unsigned long all_end;
>  
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 9212ff96b871..41509bbec583 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
>  #define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
>  #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
>  
> +static struct vm_struct text_tail_vm;
> +static struct vmap_area text_tail_va;
> +
>  bool is_vmalloc_addr(const void *x)
>  {
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
> @@ -637,6 +640,8 @@ int is_vmalloc_or_module_addr(const void *x)
>  	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>  	if (addr >= MODULES_VADDR && addr < MODULES_END)
>  		return 1;
> +	if (addr >= text_tail_va.va_start && addr <
> text_tail_va.va_end)
> +		return 1;
>  #endif
>  	return is_vmalloc_addr(x);
>  }
> @@ -2422,6 +2427,24 @@ static void vmap_init_free_space(void)
>  	}
>  }
>  
> +static void register_text_tail_vm(void)
> +{
> +	unsigned long start = PFN_ALIGN((unsigned long)_etext);
> +	unsigned long end = PMD_ALIGN((unsigned long)_etext);
> +	struct vmap_area *va;
> +
> +	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> +	if (WARN_ON_ONCE(!va))
> +		return;
> +	text_tail_vm.addr = (void *)start;
> +	text_tail_vm.size = end - start;
> +	text_tail_va.va_start = start;
> +	text_tail_va.va_end = end;
> +	text_tail_va.vm = &text_tail_vm;
> +	memcpy(va, &text_tail_va, sizeof(*va));
> +	insert_vmap_area_augment(va, NULL, &free_text_area_root,
> &free_text_area_list);
> +}
> +
>  void __init vmalloc_init(void)
>  {
>  	struct vmap_area *va;
> @@ -2432,6 +2455,7 @@ void __init vmalloc_init(void)
>  	 * Create the cache for vmap_area objects.
>  	 */
>  	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
> +	register_text_tail_vm();
>  
>  	for_each_possible_cpu(i) {
>  		struct vmap_block_queue *vbq;


* Re: [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
  2022-10-10 18:13   ` Edgecombe, Rick P
@ 2022-10-10 19:04     ` Song Liu
  2022-10-10 19:59       ` Edgecombe, Rick P
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-10 19:04 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, linux-mm, song, hch, Kernel Team, peterz, urezki,
	akpm, x86, Hansen, Dave



> On Oct 10, 2022, at 11:13 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> How does this work with kasan?

With KASAN enabled, BPF works fine. But module load does hit some issues. 

I guess we are missing some kasan_*() calls?

Thanks,
Song



* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-10 18:32   ` Edgecombe, Rick P
@ 2022-10-10 19:08     ` Song Liu
  2022-10-10 20:09       ` Edgecombe, Rick P
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-10 19:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, linux-mm, song, hch, Kernel Team, peterz, urezki,
	akpm, x86, Hansen, Dave



> On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>> On x86 kernel, we allocate 2MB pages for kernel text up to
>> round_down(_etext, 2MB). Therefore, some of the kernel text is still
>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>> round_up(_etext, 2MB), and use the rest of the page for modules and
>> BPF programs.
>> 
>> Here is an example:
>> 
>> [root@eth50-1 ~]# grep _etext /proc/kallsyms
>> ffffffff82202a08 T _etext
>> 
>> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
>> ffffffff8220f920 t
>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
>> ffffffff8220fa28 t
>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
>> ffffffff8220fad4 t
>> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>> 
>> [root@eth50-1 ~]#  grep 0xffffffff82200000
>> /sys/kernel/debug/page_tables/kernel
>> 0xffffffff82200000-
>> 0xffffffff82400000     2M     ro   PSE         x  pmd
>> 
>> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
>> ffffffff822bc580 t xfs_flush_inodes     [xfs]
>> 
>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text,
>> xfs
>> module, and bpf programs.
> 
> Can this memory range be freed as part of a vfree_exec() call then?
> Does vmalloc actually try to unmap it? If so, it could get complicated
> with PTI.
> 
> It probably should be a special case that never gets fully freed.

Right, this is never freed. 

> 
>> 
>> Signed-off-by: Song Liu <song@kernel.org>
>> ---
>> arch/x86/mm/init_64.c |  3 ++-
>> mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>> 2 files changed, 26 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 0fe690ebc269..d94f196c541a 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1367,12 +1367,13 @@ int __init
>> deferred_page_init_max_threads(const struct cpumask *node_cpumask)
>> 
>> int kernel_set_to_readonly;
>> 
>> +#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) &
>> PMD_MASK)
>> void mark_rodata_ro(void)
>> {
>> 	unsigned long start = PFN_ALIGN(_text);
>> 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>> 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
>> -	unsigned long text_end = PFN_ALIGN(_etext);
>> +	unsigned long text_end = PMD_ALIGN(_etext);
> 
> This should probably have more logic and adjustments. If etext is PMD
> aligned, some of the stuff outside the diff won't do anything.

Hmm.. I don't quite follow this comment. If the etext is PMD aligned, 
we can still use vmalloc_exec to allocate memory. So it shouldn't 
matter, no?

> 
> Also, if a kernel doesn't have modules or BPF JIT it would be a waste
> of memory.

I guess we can add a command line argument for these corner cases? 

Thanks,
Song

> 
>> 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
>> 	unsigned long all_end;
>> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 9212ff96b871..41509bbec583 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -75,6 +75,9 @@ static const bool vmap_allow_huge = false;
>> #define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
>> #define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
>> 
>> +static struct vm_struct text_tail_vm;
>> +static struct vmap_area text_tail_va;
>> +
>> bool is_vmalloc_addr(const void *x)
>> {
>> 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>> @@ -637,6 +640,8 @@ int is_vmalloc_or_module_addr(const void *x)
>> 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
>> 	if (addr >= MODULES_VADDR && addr < MODULES_END)
>> 		return 1;
>> +	if (addr >= text_tail_va.va_start && addr <
>> text_tail_va.va_end)
>> +		return 1;
>> #endif
>> 	return is_vmalloc_addr(x);
>> }
>> @@ -2422,6 +2427,24 @@ static void vmap_init_free_space(void)
>> 	}
>> }
>> 
>> +static void register_text_tail_vm(void)
>> +{
>> +	unsigned long start = PFN_ALIGN((unsigned long)_etext);
>> +	unsigned long end = PMD_ALIGN((unsigned long)_etext);
>> +	struct vmap_area *va;
>> +
>> +	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
>> +	if (WARN_ON_ONCE(!va))
>> +		return;
>> +	text_tail_vm.addr = (void *)start;
>> +	text_tail_vm.size = end - start;
>> +	text_tail_va.va_start = start;
>> +	text_tail_va.va_end = end;
>> +	text_tail_va.vm = &text_tail_vm;
>> +	memcpy(va, &text_tail_va, sizeof(*va));
>> +	insert_vmap_area_augment(va, NULL, &free_text_area_root,
>> &free_text_area_list);
>> +}
>> +
>> void __init vmalloc_init(void)
>> {
>> 	struct vmap_area *va;
>> @@ -2432,6 +2455,7 @@ void __init vmalloc_init(void)
>> 	 * Create the cache for vmap_area objects.
>> 	 */
>> 	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
>> +	register_text_tail_vm();
>> 
>> 	for_each_possible_cpu(i) {
>> 		struct vmap_block_queue *vbq;



* Re: [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec
  2022-10-10 19:04     ` Song Liu
@ 2022-10-10 19:59       ` Edgecombe, Rick P
  0 siblings, 0 replies; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 19:59 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Mon, 2022-10-10 at 19:04 +0000, Song Liu wrote:
> > On Oct 10, 2022, at 11:13 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > How does this work with kasan?
> 
> With KASAN enabled, BPF works fine. But module load does hit some
> issues. 
> 
> I guess we are missing some kasan_*() calls?

I'm not sure what the desired behavior should be. Do you mark the
individual vmalloc_exec() areas as allocated/freed, or the bigger
allocations that contain them? Since this is text, it is going to be
getting mostly fetches which kasan is not going to check. Not sure
which is right, but there should probably be some specific kasan
behavior.


* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-10 19:08     ` Song Liu
@ 2022-10-10 20:09       ` Edgecombe, Rick P
  2022-10-11 16:25         ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-10 20:09 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
> > On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > On x86 kernel, we allocate 2MB pages for kernel text up to
> > > round_down(_etext, 2MB). Therefore, some of the kernel text is
> > > still
> > > on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> > > round_up(_etext, 2MB), and use the rest of the page for modules
> > > and
> > > BPF programs.
> > > 
> > > Here is an example:
> > > 
> > > [root@eth50-1 ~]# grep _etext /proc/kallsyms
> > > ffffffff82202a08 T _etext
> > > 
> > > [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> > > ffffffff8220f920 t
> > > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> > > ffffffff8220fa28 t
> > > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> > > ffffffff8220fad4 t
> > > bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> > > 
> > > [root@eth50-1 ~]#  grep 0xffffffff82200000
> > > /sys/kernel/debug/page_tables/kernel
> > > 0xffffffff82200000-
> > > 0xffffffff82400000     2M     ro   PSE         x  pmd
> > > 
> > > [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> > > ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
> > > ffffffff822bc580 t xfs_flush_inodes     [xfs]
> > > 
> > > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
> > > text,
> > > xfs
> > > module, and bpf programs.
> > 
> > Can this memory range be freed as part of a vfree_exec() call then?
> > Does vmalloc actually try to unmap it? If so, it could get
> > complicated
> > with PTI.
> > 
> > It probably should be a special case that never gets fully freed.
> 
> Right, this is never freed. 

Can we get a comment somewhere highlighting how this is avoided?

Maybe this is just me missing some vmalloc understanding, but this
pointer to an all zero vm_struct seems weird too. Are there other vmap
allocations like this? Which vmap APIs work with this and which don't?

> 
> > 
> > > 
> > > Signed-off-by: Song Liu <song@kernel.org>
> > > ---
> > > arch/x86/mm/init_64.c |  3 ++-
> > > mm/vmalloc.c          | 24 ++++++++++++++++++++++++
> > > 2 files changed, 26 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > index 0fe690ebc269..d94f196c541a 100644
> > > --- a/arch/x86/mm/init_64.c
> > > +++ b/arch/x86/mm/init_64.c
> > > @@ -1367,12 +1367,13 @@ int __init
> > > deferred_page_init_max_threads(const struct cpumask
> > > *node_cpumask)
> > > 
> > > int kernel_set_to_readonly;
> > > 
> > > +#define PMD_ALIGN(x)        (((unsigned long)(x) + (PMD_SIZE -
> > > 1)) &
> > > PMD_MASK)
> > > void mark_rodata_ro(void)
> > > {
> > >       unsigned long start = PFN_ALIGN(_text);
> > >       unsigned long rodata_start = PFN_ALIGN(__start_rodata);
> > >       unsigned long end = (unsigned
> > > long)__end_rodata_hpage_align;
> > > -    unsigned long text_end = PFN_ALIGN(_etext);
> > > +    unsigned long text_end = PMD_ALIGN(_etext);
> > 
> > This should probably have more logic and adjustments. If etext is
> > PMD
> > aligned, some of the stuff outside the diff won't do anything.
> 
> Hmm.. I don't quite follow this comment. If the etext is PMD
> aligned, 
> we can still use vmalloc_exec to allocate memory. So it shouldn't 
> matter, no?

Maybe this doesn't matter since PMD alignment must happen naturally
sometimes. I was just noticing the attempts to operate on this region
between etext and start_rodata (free_init_pages(), etc). If this were
always PMD aligned, they could be dropped. But if you are going to
adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.


* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-10 20:09       ` Edgecombe, Rick P
@ 2022-10-11 16:25         ` Song Liu
  2022-10-11 20:40           ` Edgecombe, Rick P
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-11 16:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, peterz, Kernel Team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki



> On Oct 10, 2022, at 1:09 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
>>> On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>>>> On x86 kernel, we allocate 2MB pages for kernel text up to
>>>> round_down(_etext, 2MB). Therefore, some of the kernel text is
>>>> still
>>>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>>>> round_up(_etext, 2MB), and use the rest of the page for modules
>>>> and
>>>> BPF programs.
>>>> 
>>>> Here is an example:
>>>> 
>>>> [root@eth50-1 ~]# grep _etext /proc/kallsyms
>>>> ffffffff82202a08 T _etext
>>>> 
>>>> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
>>>> ffffffff8220f920 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
>>>> ffffffff8220fa28 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
>>>> ffffffff8220fad4 t
>>>> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>>>> 
>>>> [root@eth50-1 ~]#  grep 0xffffffff82200000
>>>> /sys/kernel/debug/page_tables/kernel
>>>> 0xffffffff82200000-
>>>> 0xffffffff82400000     2M     ro   PSE         x  pmd
>>>> 
>>>> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>>>> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
>>>> ffffffff822bc580 t xfs_flush_inodes     [xfs]
>>>> 
>>>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
>>>> text,
>>>> xfs
>>>> module, and bpf programs.
>>> 
>>> Can this memory range be freed as part of a vfree_exec() call then?
>>> Does vmalloc actually try to unmap it? If so, it could get
>>> complicated
>>> with PTI.
>>> 
>>> It probably should be a special case that never gets fully freed.
>> 
>> Right, this is never freed. 
> 
> Can we get a comment somewhere highlighting how this is avoided?

vfree_exec() only frees a range once it forms a PMD-aligned region of
at least PMD_SIZE. Since this range is smaller than PMD_SIZE, it is
never freed. I will add a comment for this in the next version.
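
A minimal sketch of that policy, with illustrative names (the actual
check lives in vfree_exec() in patch 1; ALIGN()/ALIGN_DOWN() are the
usual helpers from include/linux/align.h):

static bool range_can_be_freed(unsigned long start, unsigned long end)
{
	unsigned long pmd_start = ALIGN(start, PMD_SIZE);
	unsigned long pmd_end = ALIGN_DOWN(end, PMD_SIZE);

	/*
	 * Only a PMD-aligned span of at least PMD_SIZE is ever released;
	 * the sub-PMD tail shared with kernel text never satisfies this,
	 * so it stays mapped forever.
	 */
	return pmd_end >= pmd_start + PMD_SIZE;
}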
 
> 
> Maybe this is just me missing some vmalloc understanding, but this
> pointer to an all zero vm_struct seems weird too. Are there other vmap
> allocations like this? Which vmap APIs work with this and which don't?

There are two vmap trees at the moment: the free_vmap_area_ tree and
the vmap_area_ tree. The free_vmap_area_ tree uses
vmap_area->subtree_max_size, while the vmap_area_ tree contains
vmap_areas backed by a vm_struct, and thus uses vmap_area->vm.

This set adds a new tree, free_text_area_. This tree is different from
the other two, as it uses subtree_max_size and is also backed by
vm_struct. To handle this requirement without growing struct vmap_area,
we introduced all_text_vm to store the vm_struct for the
free_text_area_ tree.

The free_text_area_ tree also differs from the vmap_area_ tree in the
mapping: each vmap_area in the vmap_area_ tree has its own vm_struct
(a 1:1 mapping), while multiple vmap_areas in the free_text_area_ tree
map to a single vm_struct.

Also, free_text_area_ handles granularity < PAGE_SIZE, while the other
two trees only work with PAGE_SIZE-aligned memory.
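
For reference, the constraint comes from the union in struct vmap_area
(paraphrased from mm/vmalloc.c at this series' base; comments
abridged), which lets a vmap_area carry subtree_max_size *or* vm, but
not both at once:

struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;

	struct rb_node rb_node;		/* address sorted rbtree */
	struct list_head list;		/* address sorted list */

	union {
		unsigned long subtree_max_size;	/* in "free" tree */
		struct vm_struct *vm;		/* in "busy" tree */
	};
};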

Does this answer your questions? 

> 
>> 
>>> 
>>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> arch/x86/mm/init_64.c |  3 ++-
>>>> mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>>>> 2 files changed, 26 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>>> index 0fe690ebc269..d94f196c541a 100644
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -1367,12 +1367,13 @@ int __init
>>>> deferred_page_init_max_threads(const struct cpumask
>>>> *node_cpumask)
>>>> 
>>>> int kernel_set_to_readonly;
>>>> 
>>>> +#define PMD_ALIGN(x)        (((unsigned long)(x) + (PMD_SIZE -
>>>> 1)) &
>>>> PMD_MASK)
>>>> void mark_rodata_ro(void)
>>>> {
>>>>      unsigned long start = PFN_ALIGN(_text);
>>>>      unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>>>>      unsigned long end = (unsigned
>>>> long)__end_rodata_hpage_align;
>>>> -    unsigned long text_end = PFN_ALIGN(_etext);
>>>> +    unsigned long text_end = PMD_ALIGN(_etext);
>>> 
>>> This should probably have more logic and adjustments. If etext is
>>> PMD
>>> aligned, some of the stuff outside the diff won't do anything.
>> 
>> Hmm.. I don't quite follow this comment. If the etext is PMD
>> aligned, 
>> we can still use vmalloc_exec to allocate memory. So it shouldn't 
>> matter, no?
> 
> Maybe this doesn't matter since PMD alignment must happen naturally
> sometimes. I was just noticing the attempts to operate on this region
> between etext and start_rodata (free_init_pages(), etc). If this region
> were always PMD aligned, they could be dropped. But if you are going to
> adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.

I guess we can add special handling for the !CONFIG_MODULES &&
!CONFIG_BPF && !CONFIG_FTRACE case, where we would not allocate this
memory.

Thanks,
Song


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-11 16:25         ` Song Liu
@ 2022-10-11 20:40           ` Edgecombe, Rick P
  2022-10-12  5:37             ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-11 20:40 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Tue, 2022-10-11 at 16:25 +0000, Song Liu wrote:
> > Maybe this is just me missing some vmalloc understanding, but this
> > pointer to an all zero vm_struct seems weird too. Are there other
> > vmap
> > allocations like this? Which vmap APIs work with this and which
> > don't?
> 
> There are two vmap trees at the moment: the free_vmap_area_ tree and
> the vmap_area_ tree. The free_vmap_area_ tree uses
> vmap_area->subtree_max_size, while the vmap_area_ tree contains
> vmap_areas backed by a vm_struct, and thus uses vmap_area->vm.
> 
> This set adds a new tree, free_text_area_. This tree is different from
> the other two, as it uses subtree_max_size and is also backed by
> vm_struct. To handle this requirement without growing struct vmap_area,
> we introduced all_text_vm to store the vm_struct for the
> free_text_area_ tree.
> 
> The free_text_area_ tree also differs from the vmap_area_ tree in the
> mapping: each vmap_area in the vmap_area_ tree has its own vm_struct
> (a 1:1 mapping), while multiple vmap_areas in the free_text_area_ tree
> map to a single vm_struct.
> 
> Also, free_text_area_ handles granularity < PAGE_SIZE, while the other
> two trees only work with PAGE_SIZE-aligned memory.
> 
> Does this answer your questions? 

I mean from the perspective of someone trying to use this without
diving into the entire implementation.

The function is called vmalloc_exec() and its allocations are freed
with vfree_exec(). Makes sense. But with the other vmalloc_foo()s
(including previous vmalloc_exec() implementations) you can call
find_vm_area(), etc. on them. They show up in "vmallocinfo" and
generally behave similarly. That isn't true for these new allocations,
right?

Then you have code that operates on module text like:
if (is_vmalloc_or_module_addr(addr))
	pfn = vmalloc_to_pfn(addr);

It looks like it would work (on x86 at least). Should it be expected
to?

Especially after this patch, where there is memory that isn't even
tracked by the original vmap_area trees, it is pretty much a separate
allocator. So I think it might be nice to spell out which other vmalloc
APIs work with these new functions since they are named "vmalloc".
Maybe just say none of them do.


Separate from that, I guess you are planning to make this limited to
certain architectures? It might be better to put logic with assumptions
about x86 boot time page table details inside arch/x86 somewhere.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-11 20:40           ` Edgecombe, Rick P
@ 2022-10-12  5:37             ` Song Liu
  2022-10-12 18:38               ` Edgecombe, Rick P
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-12  5:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Song Liu, linux-kernel, peterz, Kernel Team, linux-mm, song, hch,
	x86, akpm, Hansen, Dave, urezki



> On Oct 11, 2022, at 1:40 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Tue, 2022-10-11 at 16:25 +0000, Song Liu wrote:
>>> Maybe this is just me missing some vmalloc understanding, but this
>>> pointer to an all zero vm_struct seems weird too. Are there other
>>> vmap
>>> allocations like this? Which vmap APIs work with this and which
>>> don't?
>> 
>> There are two vmap trees at the moment: the free_vmap_area_ tree and
>> the vmap_area_ tree. The free_vmap_area_ tree uses
>> vmap_area->subtree_max_size, while the vmap_area_ tree contains
>> vmap_areas backed by a vm_struct, and thus uses vmap_area->vm.
>> 
>> This set adds a new tree, free_text_area_. This tree is different from
>> the other two, as it uses subtree_max_size and is also backed by
>> vm_struct. To handle this requirement without growing struct vmap_area,
>> we introduced all_text_vm to store the vm_struct for the
>> free_text_area_ tree.
>> 
>> The free_text_area_ tree also differs from the vmap_area_ tree in the
>> mapping: each vmap_area in the vmap_area_ tree has its own vm_struct
>> (a 1:1 mapping), while multiple vmap_areas in the free_text_area_ tree
>> map to a single vm_struct.
>> 
>> Also, free_text_area_ handles granularity < PAGE_SIZE, while the other
>> two trees only work with PAGE_SIZE-aligned memory.
>> 
>> Does this answer your questions? 
> 
> I mean from the perspective of someone trying to use this without
> diving into the entire implementation.
> 
> The function is called vmalloc_exec() and its allocations are freed
> with vfree_exec(). Makes sense. But with the other vmalloc_foo()s
> (including previous vmalloc_exec() implementations) you can call
> find_vm_area(), etc. on them. They show up in "vmallocinfo" and
> generally behave similarly. That isn't true for these new allocations,
> right?

That's right. These operations are not supported (at least for now). 

> 
> Then you have code that operates on module text like:
> if (is_vmalloc_or_module_addr(addr))
> 	pfn = vmalloc_to_pfn(addr);
> 
> It looks like it would work (on x86 at least). Should it be expected
> to?
> 
> Especially after this patch, where there is memory that isn't even
> tracked by the original vmap_area trees, it is pretty much a separate
> allocator. So I think it might be nice to spell out which other vmalloc
> APIs work with these new functions since they are named "vmalloc".
> Maybe just say none of them do.

I guess it is fair to call this a separate allocator. Maybe 
vmalloc_exec is not the right name? I do think this is the best 
way to build an allocator with vmap tree logic. 

> 
> 
> Separate from that, I guess you are planning to make this limited to
> certain architectures? It might be better to put logic with assumptions
> about x86 boot time page table details inside arch/x86 somewhere.

Yes, the architecture needs some text_poke mechanism to use this.
On the BPF side, x86_64 calls this directly from arch code (the JIT
engine), so it is mostly covered. For modules, we need to handle this
better.
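
For context, the x86_64 hook is roughly the following (simplified from
arch/x86/net/bpf_jit_comp.c; see the tree for the exact code):

void *bpf_arch_text_copy(void *dst, void *src, size_t len)
{
	/*
	 * text_poke_copy() takes text_mutex internally and refuses
	 * core kernel text addresses.
	 */
	if (text_poke_copy(dst, src, len) == NULL)
		return ERR_PTR(-EINVAL);
	return dst;
}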

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-12  5:37             ` Song Liu
@ 2022-10-12 18:38               ` Edgecombe, Rick P
  2022-10-12 19:01                 ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-12 18:38 UTC (permalink / raw)
  To: songliubraving
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Wed, 2022-10-12 at 05:37 +0000, Song Liu wrote:
> > Then you have code that operates on module text like:
> > if (is_vmalloc_or_module_addr(addr))
> >        pfn = vmalloc_to_pfn(addr);
> > 
> > It looks like it would work (on x86 at least). Should it be
> > expected
> > to?
> > 
> > Especially after this patch, where there is memory that isn't even
> > tracked by the original vmap_area trees, it is pretty much a
> > separate
> > allocator. So I think it might be nice to spell out which other
> > vmalloc
> > APIs work with these new functions since they are named "vmalloc".
> > Maybe just say none of them do.
> 
> I guess it is fair to call this a separate allocator. Maybe 
> vmalloc_exec is not the right name? I do think this is the best 
> way to build an allocator with vmap tree logic. 

Yea, I don't know about the name. I think someone else suggested it
specifically, right?

I had called mine perm_alloc() so it could also handle read-only and
other permissions. If you keep vmalloc_exec() it needs some big
comments about which APIs can work with it, and an audit of the
existing code that works on module and JIT text.

> 
> > 
> > 
> > Separate from that, I guess you are planning to make this limited
> > to
> > certain architectures? It might be better to put logic with
> > assumptions
> > about x86 boot time page table details inside arch/x86 somewhere.
> 
> Yes, the architecture needs some text_poke mechanism to use this.

It also depends on the space between _etext and the PMD-aligned _etext
being present and not getting used by anything else. For other
architectures, there might be rodata or other things there.

> On BPF side, x86_64 calls this directly from arch code (jit engine), 
> so it is mostly covered. For modules, we need to handle this better. 

That old RFC has some ideas around this. I kind of like your
incremental approach though. To me it seems to be moving in the right
direction.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
  2022-10-12 18:38               ` Edgecombe, Rick P
@ 2022-10-12 19:01                 ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-12 19:01 UTC (permalink / raw)
  To: Edgecombe, Rick P, Luis Chamberlain
  Cc: Song Liu, linux-kernel, peterz, Kernel Team, linux-mm, song, hch,
	x86, akpm, Hansen, Dave, urezki



> On Oct 12, 2022, at 11:38 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-10-12 at 05:37 +0000, Song Liu wrote:
>>> Then you have code that operates on module text like:
>>> if (is_vmalloc_or_module_addr(addr))
>>>       pfn = vmalloc_to_pfn(addr);
>>> 
>>> It looks like it would work (on x86 at least). Should it be
>>> expected
>>> to?
>>> 
>>> Especially after this patch, where there is memory that isn't even
>>> tracked by the original vmap_area trees, it is pretty much a
>>> separate
>>> allocator. So I think it might be nice to spell out which other
>>> vmalloc
>>> APIs work with these new functions since they are named "vmalloc".
>>> Maybe just say none of them do.
>> 
>> I guess it is fair to call this a separate allocator. Maybe 
>> vmalloc_exec is not the right name? I do think this is the best 
>> way to build an allocator with vmap tree logic. 
> 
> Yea, I don't know about the name. I think someone else suggested it
> specifically, right?

I think Luis suggested renaming module_alloc to vmalloc_exec. But I
guess we still need module_alloc for module data allocations.

> 
> I had called mine perm_alloc() so it could also handle read-only and
> other permissions.

What other permissions do we use? We can probably duplicate the
free_text_area_ tree logic for other cases.


> If you keep vmalloc_exec() it needs some big
> comments about which APIs can work with it, and an audit of the
> existing code that works on module and JIT text.
> 
>> 
>>> 
>>> 
>>> Separate from that, I guess you are planning to make this limited
>>> to
>>> certain architectures? It might be better to put logic with
>>> assumptions
>>> about x86 boot time page table details inside arch/x86 somewhere.
>> 
>> Yes, the architecture needs some text_poke mechanism to use this.
> 
> It also depends on the space between _etext and the PMD-aligned _etext
> being present and not getting used by anything else. For other
> architectures, there might be rodata or other things there.

Good point! We need to make sure this part is not used by other things.

> 
>> On BPF side, x86_64 calls this directly from arch code (jit engine), 
>> so it is mostly covered. For modules, we need to handle this better. 
> 
> That old RFC has some ideas around this. I kind of like your
> incremental approach though. To me it seems to be moving in the right
> direction.

Thanks!
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
                   ` (4 preceding siblings ...)
  2022-10-08  0:17 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
@ 2022-10-12 19:03 ` Song Liu
  2022-10-17  7:26 ` Christoph Hellwig
  6 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-12 19:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, lkml, Andrew Morton, X86 ML, Christoph Hellwig,
	Kernel Team, Edgecombe, Rick P, Hansen, Dave, Uladzislau Rezki,
	Luis Chamberlain, Song Liu

Hi Peter,

> On Oct 7, 2022, at 4:43 PM, Song Liu <song@kernel.org> wrote:
> 
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>   work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>   still need some work.
> 
> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.
> 
> The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
> 
> Please share your comments on this.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
> [2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/

Could you please share your comments on this version? Is this the
right direction for this work?

Thanks,
Song

> 
> Song Liu (4):
>  vmalloc: introduce vmalloc_exec and vfree_exec
>  bpf: use vmalloc_exec
>  modules, x86: use vmalloc_exec for module core
>  vmalloc_exec: share a huge page with kernel text
> 
> arch/x86/Kconfig              |   1 +
> arch/x86/kernel/alternative.c |  30 +++-
> arch/x86/kernel/module.c      |   1 +
> arch/x86/mm/init_64.c         |   3 +-
> include/linux/vmalloc.h       |   2 +
> kernel/bpf/core.c             | 155 ++----------------
> kernel/module/main.c          |  23 +--
> kernel/module/strict_rwx.c    |   3 -
> kernel/trace/ftrace.c         |   3 +-
> mm/nommu.c                    |   7 +
> mm/vmalloc.c                  | 296 ++++++++++++++++++++++++++++++++++
> 11 files changed, 358 insertions(+), 166 deletions(-)
> 
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-07 23:43 ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Song Liu
@ 2022-10-14  3:48   ` Aaron Lu
  2022-10-14  6:07     ` Song Liu
  2022-10-14 15:42   ` Edgecombe, Rick P
  1 sibling, 1 reply; 29+ messages in thread
From: Aaron Lu @ 2022-10-14  3:48 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-mm, linux-kernel, akpm, x86, peterz, hch, kernel-team,
	rick.p.edgecombe, dave.hansen, urezki

On Fri, Oct 07, 2022 at 04:43:14PM -0700, Song Liu wrote:
> This is a prototype that allows modules to share 2MB text pages with other
> modules and BPF programs.
> 
> Current version only covers core_layout.
> ---
>  arch/x86/Kconfig              |  1 +
>  arch/x86/kernel/alternative.c | 30 ++++++++++++++++++++++++------
>  arch/x86/kernel/module.c      |  1 +
>  kernel/module/main.c          | 23 +++++++++++++----------
>  kernel/module/strict_rwx.c    |  3 ---
>  kernel/trace/ftrace.c         |  3 ++-
>  6 files changed, 41 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f9920f1341c8..0b1ea05a1da6 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -91,6 +91,7 @@ config X86
>  	select ARCH_HAS_SET_DIRECT_MAP
>  	select ARCH_HAS_STRICT_KERNEL_RWX
>  	select ARCH_HAS_STRICT_MODULE_RWX
> +	select ARCH_WANTS_MODULES_DATA_IN_VMALLOC	if X86_64
>  	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>  	select ARCH_HAS_SYSCALL_WRAPPER
>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 4f3204364caa..0e47a558c5bc 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -332,7 +332,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
>  
>  		DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
>  
> -		text_poke_early(instr, insn_buff, insn_buff_sz);
> +		if (system_state < SYSTEM_RUNNING) {
> +			text_poke_early(instr, insn_buff, insn_buff_sz);
> +		} else {
> +			mutex_lock(&text_mutex);
> +			text_poke(instr, insn_buff, insn_buff_sz);
> +			mutex_unlock(&text_mutex);
> +		}
>  
>  next:
>  		optimize_nops(instr, a->instrlen);
> @@ -503,7 +509,13 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
>  			optimize_nops(bytes, len);
>  			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
>  			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> -			text_poke_early(addr, bytes, len);
> +			if (system_state == SYSTEM_BOOTING) {
> +				text_poke_early(addr, bytes, len);
> +			} else {
> +				mutex_lock(&text_mutex);
> +				text_poke(addr, bytes, len);
> +				mutex_unlock(&text_mutex);
> +			}
>  		}
>  	}
>  }
> @@ -568,7 +580,13 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
>  		if (len == insn.length) {
>  			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
>  			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> -			text_poke_early(addr, bytes, len);
> +			if (unlikely(system_state == SYSTEM_BOOTING)) {
> +				text_poke_early(addr, bytes, len);
> +			} else {
> +				mutex_lock(&text_mutex);
> +				text_poke(addr, bytes, len);
> +				mutex_unlock(&text_mutex);
> +			}
>  		}
>  	}
>  }
> @@ -609,7 +627,7 @@ void __init_or_module noinline apply_ibt_endbr(s32 *start, s32 *end)
>  		 */
>  		DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
>  		DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
> -		text_poke_early(addr, &poison, 4);
> +		text_poke(addr, &poison, 4);
>  	}
>  }
>  
> @@ -791,7 +809,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
>  
>  		/* Pad the rest with nops */
>  		add_nops(insn_buff + used, p->len - used);
> -		text_poke_early(p->instr, insn_buff, p->len);
> +		text_poke(p->instr, insn_buff, p->len);

I got the warning below when booting a VM:

[    0.190098] ------------[ cut here ]------------
[    0.190377] WARNING: CPU: 0 PID: 0 at /home/aaron/linux/src/arch/x86/kernel/alternative.c:1224 text_poke+0x53/0x60
[    0.191083] Modules linked in:
[    0.191269] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0.0-00004-gc49d19177d78 #5
[    0.191721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[    0.192083] RIP: 0010:text_poke+0x53/0x60
[    0.192326] Code: c7 c7 20 e7 02 81 5b 5d e9 2a f8 ff ff be ff ff ff ff 48 c7 c7 b0 6d 06 83 48 89 14 24 e8 75 fd bf 00 85 c0 48 8b 14 24 75 c8 <0f> 0b eb c4 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
[    0.193083] RSP: 0000:ffffffff83003d60 EFLAGS: 00010246
[    0.194083] RAX: 0000000000000000 RBX: ffffffff810295b7 RCX: 0000000000000001
[    0.194506] RDX: 0000000000000006 RSI: ffffffff828b01c5 RDI: ffffffff8293898e
[    0.195083] RBP: ffffffff83003d82 R08: ffffffff82206520 R09: 0000000000000001
[    0.195506] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8a9949c0
[    0.195929] R13: ffffffff8a95f400 R14: 00000000ffffffff R15: 00000000ffffffff
[    0.196083] FS:  0000000000000000(0000) GS:ffff88842de00000(0000) knlGS:0000000000000000
[    0.196562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.197083] CR2: ffff88843ffff000 CR3: 0000000003012001 CR4: 0000000000770ef0
[    0.197508] PKRU: 55555554
[    0.197673] Call Trace:
[    0.197822]  <TASK>
[    0.198084]  apply_paravirt+0xaf/0x150
[    0.198313]  ? __might_resched+0x3f/0x280
[    0.198557]  ? synchronize_rcu+0xe0/0x1c0
[    0.198799]  ? lock_release+0x230/0x450
[    0.199030]  ? _raw_spin_unlock_irqrestore+0x30/0x60
[    0.199083]  ? lockdep_hardirqs_on+0x79/0x100
[    0.199345]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
[    0.199643]  ? atomic_notifier_chain_unregister+0x51/0x80
[    0.200084]  alternative_instructions+0x27/0xfa
[    0.200357]  check_bugs+0xe08/0xe82
[    0.200570]  start_kernel+0x692/0x6cc
[    0.200797]  secondary_startup_64_no_verify+0xe0/0xeb
[    0.201088]  </TASK>
[    0.201223] irq event stamp: 13575
[    0.201428] hardirqs last  enabled at (13583): [<ffffffff811193c2>] __up_console_sem+0x52/0x60
[    0.202083] hardirqs last disabled at (13592): [<ffffffff811193a7>] __up_console_sem+0x37/0x60
[    0.202594] softirqs last  enabled at (12762): [<ffffffff8117e169>] cgroup_idr_alloc.constprop.60+0x59/0x100
[    0.203083] softirqs last disabled at (12750): [<ffffffff8117e13d>] cgroup_idr_alloc.constprop.60+0x2d/0x100
[    0.203665] ---[ end trace 0000000000000000 ]---

Looks like it is also necessary to differentiate system_state in
apply_paravirt() like you did in the other apply_XXX() functions.
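
Something like the following, mirroring the other apply_*() hunks in
this patch, should avoid the warning (untested sketch):

		/* Pad the rest with nops */
		add_nops(insn_buff + used, p->len - used);
		if (unlikely(system_state == SYSTEM_BOOTING)) {
			text_poke_early(p->instr, insn_buff, p->len);
		} else {
			mutex_lock(&text_mutex);
			text_poke(p->instr, insn_buff, p->len);
			mutex_unlock(&text_mutex);
		}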

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-14  3:48   ` Aaron Lu
@ 2022-10-14  6:07     ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-10-14  6:07 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, akpm, x86, peterz, hch, kernel-team,
	rick.p.edgecombe, dave.hansen, urezki

On Thu, Oct 13, 2022 at 8:49 PM Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Fri, Oct 07, 2022 at 04:43:14PM -0700, Song Liu wrote:
> > This is a prototype that allows modules to share 2MB text pages with other
> > modules and BPF programs.
> >
> > Current version only covers core_layout.
> > ---
> >  arch/x86/Kconfig              |  1 +
> >  arch/x86/kernel/alternative.c | 30 ++++++++++++++++++++++++------
> >  arch/x86/kernel/module.c      |  1 +
> >  kernel/module/main.c          | 23 +++++++++++++----------
> >  kernel/module/strict_rwx.c    |  3 ---
> >  kernel/trace/ftrace.c         |  3 ++-
> >  6 files changed, 41 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index f9920f1341c8..0b1ea05a1da6 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -91,6 +91,7 @@ config X86
> >       select ARCH_HAS_SET_DIRECT_MAP
> >       select ARCH_HAS_STRICT_KERNEL_RWX
> >       select ARCH_HAS_STRICT_MODULE_RWX
> > +     select ARCH_WANTS_MODULES_DATA_IN_VMALLOC       if X86_64
> >       select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> >       select ARCH_HAS_SYSCALL_WRAPPER
> >       select ARCH_HAS_UBSAN_SANITIZE_ALL
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index 4f3204364caa..0e47a558c5bc 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -332,7 +332,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
> >
> >               DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
> >
> > -             text_poke_early(instr, insn_buff, insn_buff_sz);
> > +             if (system_state < SYSTEM_RUNNING) {
> > +                     text_poke_early(instr, insn_buff, insn_buff_sz);
> > +             } else {
> > +                     mutex_lock(&text_mutex);
> > +                     text_poke(instr, insn_buff, insn_buff_sz);
> > +                     mutex_unlock(&text_mutex);
> > +             }
> >
> >  next:
> >               optimize_nops(instr, a->instrlen);
> > @@ -503,7 +509,13 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
> >                       optimize_nops(bytes, len);
> >                       DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
> >                       DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> > -                     text_poke_early(addr, bytes, len);
> > +                     if (system_state == SYSTEM_BOOTING) {
> > +                             text_poke_early(addr, bytes, len);
> > +                     } else {
> > +                             mutex_lock(&text_mutex);
> > +                             text_poke(addr, bytes, len);
> > +                             mutex_unlock(&text_mutex);
> > +                     }
> >               }
> >       }
> >  }
> > @@ -568,7 +580,13 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
> >               if (len == insn.length) {
> >                       DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
> >                       DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
> > -                     text_poke_early(addr, bytes, len);
> > +                     if (unlikely(system_state == SYSTEM_BOOTING)) {
> > +                             text_poke_early(addr, bytes, len);
> > +                     } else {
> > +                             mutex_lock(&text_mutex);
> > +                             text_poke(addr, bytes, len);
> > +                             mutex_unlock(&text_mutex);
> > +                     }
> >               }
> >       }
> >  }
> > @@ -609,7 +627,7 @@ void __init_or_module noinline apply_ibt_endbr(s32 *start, s32 *end)
> >                */
> >               DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
> >               DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
> > -             text_poke_early(addr, &poison, 4);
> > +             text_poke(addr, &poison, 4);
> >       }
> >  }
> >
> > @@ -791,7 +809,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
> >
> >               /* Pad the rest with nops */
> >               add_nops(insn_buff + used, p->len - used);
> > -             text_poke_early(p->instr, insn_buff, p->len);
> > +             text_poke(p->instr, insn_buff, p->len);
>
> I got the warning below when booting a VM:
>
> [    0.190098] ------------[ cut here ]------------
> [    0.190377] WARNING: CPU: 0 PID: 0 at /home/aaron/linux/src/arch/x86/kernel/alternative.c:1224 text_poke+0x53/0x60
> [    0.191083] Modules linked in:
> [    0.191269] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0.0-00004-gc49d19177d78 #5
> [    0.191721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> [    0.192083] RIP: 0010:text_poke+0x53/0x60
> [    0.192326] Code: c7 c7 20 e7 02 81 5b 5d e9 2a f8 ff ff be ff ff ff ff 48 c7 c7 b0 6d 06 83 48 89 14 24 e8 75 fd bf 00 85 c0 48 8b 14 24 75 c8 <0f> 0b eb c4 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56
> [    0.193083] RSP: 0000:ffffffff83003d60 EFLAGS: 00010246
> [    0.194083] RAX: 0000000000000000 RBX: ffffffff810295b7 RCX: 0000000000000001
> [    0.194506] RDX: 0000000000000006 RSI: ffffffff828b01c5 RDI: ffffffff8293898e
> [    0.195083] RBP: ffffffff83003d82 R08: ffffffff82206520 R09: 0000000000000001
> [    0.195506] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8a9949c0
> [    0.195929] R13: ffffffff8a95f400 R14: 00000000ffffffff R15: 00000000ffffffff
> [    0.196083] FS:  0000000000000000(0000) GS:ffff88842de00000(0000) knlGS:0000000000000000
> [    0.196562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.197083] CR2: ffff88843ffff000 CR3: 0000000003012001 CR4: 0000000000770ef0
> [    0.197508] PKRU: 55555554
> [    0.197673] Call Trace:
> [    0.197822]  <TASK>
> [    0.198084]  apply_paravirt+0xaf/0x150
> [    0.198313]  ? __might_resched+0x3f/0x280
> [    0.198557]  ? synchronize_rcu+0xe0/0x1c0
> [    0.198799]  ? lock_release+0x230/0x450
> [    0.199030]  ? _raw_spin_unlock_irqrestore+0x30/0x60
> [    0.199083]  ? lockdep_hardirqs_on+0x79/0x100
> [    0.199345]  ? _raw_spin_unlock_irqrestore+0x3b/0x60
> [    0.199643]  ? atomic_notifier_chain_unregister+0x51/0x80
> [    0.200084]  alternative_instructions+0x27/0xfa
> [    0.200357]  check_bugs+0xe08/0xe82
> [    0.200570]  start_kernel+0x692/0x6cc
> [    0.200797]  secondary_startup_64_no_verify+0xe0/0xeb
> [    0.201088]  </TASK>
> [    0.201223] irq event stamp: 13575
> [    0.201428] hardirqs last  enabled at (13583): [<ffffffff811193c2>] __up_console_sem+0x52/0x60
> [    0.202083] hardirqs last disabled at (13592): [<ffffffff811193a7>] __up_console_sem+0x37/0x60
> [    0.202594] softirqs last  enabled at (12762): [<ffffffff8117e169>] cgroup_idr_alloc.constprop.60+0x59/0x100
> [    0.203083] softirqs last disabled at (12750): [<ffffffff8117e13d>] cgroup_idr_alloc.constprop.60+0x2d/0x100
> [    0.203665] ---[ end trace 0000000000000000 ]---
>
> Looks like it is also necessary to differentiate system_state in
> apply_paravirt() like you did in the other apply_XXX() functions.

Thanks for the report! Somehow I didn't see this in my QEMU VM.

Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-07 23:43 ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Song Liu
  2022-10-14  3:48   ` Aaron Lu
@ 2022-10-14 15:42   ` Edgecombe, Rick P
  2022-10-14 18:26     ` Song Liu
  1 sibling, 1 reply; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 15:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm, song
  Cc: hch, kernel-team, peterz, urezki, akpm, x86, Hansen, Dave

On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index a4e4d84b6f4e..b44806e31a56 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -53,6 +53,7 @@
>  #include <linux/bsearch.h>
>  #include <linux/dynamic_debug.h>
>  #include <linux/audit.h>
> +#include <linux/bpf.h>
>  #include <uapi/linux/module.h>
>  #include "internal.h"
>  
> @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
>         lockdep_free_key_range(mod->data_layout.base, mod-
> >data_layout.size);
>  
>         /* Finally, free the core (containing the module structure)
> */
> -       module_memfree(mod->core_layout.base);
> +       vfree_exec(mod->core_layout.base);
>  #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>         vfree(mod->data_layout.base);
>  #endif
> @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod,
> const struct load_info *info)
>                         ksym = resolve_symbol_wait(mod, info, name);
>                         /* Ok if resolved.  */
>                         if (ksym && !IS_ERR(ksym)) {
> -                               sym[i].st_value =
> kernel_symbol_value(ksym);
> +                               unsigned long val =
> kernel_symbol_value(ksym);
> +                               bpf_arch_text_copy(&sym[i].st_value,
> &val, sizeof(val));

Why bpf_arch_text_copy()? This of course won't work for other
architectures, so there needs to be a fallback method. That RFC broke
the operation into two stages: loading and finalized. When loading, on
non-x86 the writes would simply go to the allocation mapped as
writable. When it was finalized, it changed to its final permission
(RO, etc). For x86 it does text_poke()s for the writes and has it RO
from the beginning.
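
A hypothetical shape of that two-stage interface (all names are
invented here for illustration and do not match the perm_alloc() RFC
verbatim):

/* Stage 1: allocate; mapped writable on arches without text_poke(). */
void *xtext_alloc(size_t size);

/* While loading: plain writes on non-x86, text_poke() on x86. */
int xtext_write(void *dst, const void *src, size_t len);

/* Stage 2: finalize, flipping the mapping to its final permission. */
int xtext_finalize(void *addr);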

I ended up needing a staging buffer for modules too, so that the code
could operate on it directly. I can't remember why that was; it might
be unneeded now since you moved data out of the core allocation.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-14 15:42   ` Edgecombe, Rick P
@ 2022-10-14 18:26     ` Song Liu
  2022-10-14 18:41       ` Edgecombe, Rick P
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-14 18:26 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, linux-mm, song, hch, Kernel Team, peterz, urezki,
	akpm, x86, Hansen, Dave



> On Oct 14, 2022, at 8:42 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>> index a4e4d84b6f4e..b44806e31a56 100644
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -53,6 +53,7 @@
>> #include <linux/bsearch.h>
>> #include <linux/dynamic_debug.h>
>> #include <linux/audit.h>
>> +#include <linux/bpf.h>
>> #include <uapi/linux/module.h>
>> #include "internal.h"
>> 
>> @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
>>        lockdep_free_key_range(mod->data_layout.base, mod-
>>> data_layout.size);
>> 
>>        /* Finally, free the core (containing the module structure)
>> */
>> -       module_memfree(mod->core_layout.base);
>> +       vfree_exec(mod->core_layout.base);
>> #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
>>        vfree(mod->data_layout.base);
>> #endif
>> @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module *mod,
>> const struct load_info *info)
>>                        ksym = resolve_symbol_wait(mod, info, name);
>>                        /* Ok if resolved.  */
>>                        if (ksym && !IS_ERR(ksym)) {
>> -                               sym[i].st_value =
>> kernel_symbol_value(ksym);
>> +                               unsigned long val =
>> kernel_symbol_value(ksym);
>> +                               bpf_arch_text_copy(&sym[i].st_value,
>> &val, sizeof(val));
> 
> Why bpf_arch_text_copy()? This of course won't work for other
> architectures, so there needs to be a fallback method. That RFC broke
> the operation into two stages: loading and finalized. When loading, on
> non-x86 the writes would simply go to the allocation mapped as
> writable. When it was finalized, it changed to its final permission
> (RO, etc). For x86 it does text_poke()s for the writes and has it RO
> from the beginning.

Yeah, this one (3/4) is really a prototype to show vmalloc_exec could 
work for modules (with a lot more work of course). And something to
replace bpf_arch_text_copy() is one of the issues we need to address in
the future. 

> 
> I ended up needing a staging buffer for modules too, so that the code
> could operate on it directly. I can't remember why that was, it might
> be unneeded now since you moved data out of the core allocation.

Both bpf_jit and bpf_dispatcher use a staging buffer with
bpf_prog_pack. The benefit of this approach is that it minimizes the
number of text_poke/copy() calls. OTOH, it is quite a pain to make all
the relative calls correct, as the staging buffer has a different
address from the final allocation.
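
To illustrate the pain: an x86 rel32 displacement must be computed
against the instruction's *final* address, not its staging address
(illustrative helper, not the JIT's actual code):

static s32 call_rel32(u8 *staging_insn, u8 *staging_image,
		      u8 *final_image, void *target)
{
	/* Address this instruction will have after the final copy. */
	u8 *final_insn = final_image + (staging_insn - staging_image);

	/* rel32 is relative to the byte after the 5-byte call insn. */
	return (s32)((s64)target - (s64)(final_insn + 5));
}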

I think we may not need the staging buffer for modules, as module 
load/unload happens less often than BPF program JITs (so it is ok for 
it to be slightly slower). 

btw: I cannot take credit for splitting module data out of the core
allocation; Christophe Leroy did the work. :)

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 3/4] modules, x86: use vmalloc_exec for module core
  2022-10-14 18:26     ` Song Liu
@ 2022-10-14 18:41       ` Edgecombe, Rick P
  0 siblings, 0 replies; 29+ messages in thread
From: Edgecombe, Rick P @ 2022-10-14 18:41 UTC (permalink / raw)
  To: songliubraving, mcgrof
  Cc: linux-kernel, peterz, Kernel-team, linux-mm, song, hch, x86,
	akpm, Hansen, Dave, urezki

On Fri, 2022-10-14 at 18:26 +0000, Song Liu wrote:
> > On Oct 14, 2022, at 8:42 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > > index a4e4d84b6f4e..b44806e31a56 100644
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -53,6 +53,7 @@
> > > #include <linux/bsearch.h>
> > > #include <linux/dynamic_debug.h>
> > > #include <linux/audit.h>
> > > +#include <linux/bpf.h>
> > > #include <uapi/linux/module.h>
> > > #include "internal.h"
> > > 
> > > @@ -1203,7 +1204,7 @@ static void free_module(struct module *mod)
> > >         lockdep_free_key_range(mod->data_layout.base, mod-
> > > > data_layout.size);
> > > 
> > >         /* Finally, free the core (containing the module
> > > structure)
> > > */
> > > -       module_memfree(mod->core_layout.base);
> > > +       vfree_exec(mod->core_layout.base);
> > > #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > >         vfree(mod->data_layout.base);
> > > #endif
> > > @@ -1321,7 +1322,8 @@ static int simplify_symbols(struct module
> > > *mod,
> > > const struct load_info *info)
> > >                         ksym = resolve_symbol_wait(mod, info,
> > > name);
> > >                         /* Ok if resolved.  */
> > >                         if (ksym && !IS_ERR(ksym)) {
> > > -                               sym[i].st_value =
> > > kernel_symbol_value(ksym);
> > > +                               unsigned long val =
> > > kernel_symbol_value(ksym);
> > > +                              
> > > bpf_arch_text_copy(&sym[i].st_value,
> > > &val, sizeof(val));
> > 
> > Why bpf_arch_text_copy()? This of course won't work for other
> > architectures, so there needs to be a fallback method. That RFC
> > broke the operation into two stages: loading and finalized. When
> > loading, on non-x86 the writes would simply go to the allocation
> > mapped as writable. When it was finalized, it changed to its final
> > permission (RO, etc). For x86 it does text_poke()s for the writes
> > and has it RO from the beginning.
> 
> Yeah, this one (3/4) is really a prototype to show vmalloc_exec
> could 
> work for modules (with a lot more work of course). And something to
> replace bpf_arch_text_copy() is one of the issues we need to address
> in
> the future. 

Right, I think making it work both with and without text_poke() is
needed to ever get it to work with modules, since so much of it is
cross-arch. Oops, it looks like we lost Luis on most of these
responses.

If we don't have that, then the modules RFC is kind of a distraction.
But bpf, kprobes and ftrace are still nice.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
                   ` (5 preceding siblings ...)
  2022-10-12 19:03 ` Song Liu
@ 2022-10-17  7:26 ` Christoph Hellwig
  2022-10-17 16:23   ` Song Liu
  6 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2022-10-17  7:26 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-mm, linux-kernel, akpm, x86, peterz, hch, kernel-team,
	rick.p.edgecombe, dave.hansen, urezki

On Fri, Oct 07, 2022 at 04:43:11PM -0700, Song Liu wrote:
> Changes RFC v1 => RFC v2:
> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>    work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>    still need some work.

Can you please move the changelog under the description of WTF the
series actually does like the normal kernel process?  Explaining the
changes from a previous version before you even describe what the series
does is completely incoherent.

> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.

Well, nothing explains what the method is to avoid having memory
that is mapped writable and executable at the same time, which really
could use some explanation here (and in the main patch as well).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-17  7:26 ` Christoph Hellwig
@ 2022-10-17 16:23   ` Song Liu
  2022-10-18 14:50     ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-17 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, Linux-MM, lkml, Andrew Morton, X86 ML, Peter Zijlstra,
	Kernel Team, Edgecombe, Rick P, Hansen, Dave, urezki

Hi Christoph,

> On Oct 17, 2022, at 12:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Fri, Oct 07, 2022 at 04:43:11PM -0700, Song Liu wrote:
>> Changes RFC v1 => RFC v2:
>> 1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
>>   work fine with BPF programs (patch 1, 2, 4). But module side (patch 3)
>>   still need some work.
> 
> Can you please move the changelog under the description of WTF the
> series actually does like the normal kernel process?  Explaining the
> changes from a previous version before you even describe what the series
> does is completely incoherent.

Will fix in the next version. 

> 
>> This set is a prototype that allows dynamic kernel text (modules, bpf
>> programs, various trampolines, etc.) to share huge pages. The idea is
>> similar to Peter's suggestion in [1]. Please refer to each patch for
>> more detais.
> 
> Well, nothing explains what the method is to avoid having memory
> that is mapped writable and executable at the same time, which really
> could use some explanation here (and in the main patch as well).

Thanks for the feedback. I will add this. 

Does the code look good to you? I personally think patch 1, 2, 4 could
ship with a little more work. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-17 16:23   ` Song Liu
@ 2022-10-18 14:50     ` Christoph Hellwig
  2022-10-18 15:05       ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2022-10-18 14:50 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, Linux-MM, lkml, Andrew Morton,
	X86 ML, Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen,
	Dave, urezki

On Mon, Oct 17, 2022 at 04:23:52PM +0000, Song Liu wrote:
> > Well, nothing explains what the method is to avoid having memory
> > that is mapped writable and executable at the same time, which really
> > could use some explanation here (and in the main patch as well).
> 
> Thanks for the feedback. I will add this. 
> 
> Does the code look good to you? I personally think patch 1, 2, 4 could
> ship with a little more work. 

I only took a quick look and I'm not sure how the W^X actually works.
Yes, it calls into the text poke helpers, but how do these work on
less-than-page-sized allocations?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-18 14:50     ` Christoph Hellwig
@ 2022-10-18 15:05       ` Song Liu
  2022-10-18 15:40         ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-10-18 15:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Song Liu, Song Liu, Linux-MM, lkml, Andrew Morton, X86 ML,
	Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen, Dave,
	urezki



> On Oct 18, 2022, at 7:50 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Mon, Oct 17, 2022 at 04:23:52PM +0000, Song Liu wrote:
>>> Well, nothing explains what the method is to avoid having memory
>>> that is mapped writable and executable at the same time, which really
>>> could use some explanation here (and in the main patch as well).
>> 
>> Thanks for the feedback. I will add this. 
>> 
>> Does the code look good to you? I personally think patch 1, 2, 4 could
>> ship with a little more work. 
> 
> I only took a quick look and I'm not sure how the W^X actually works.
> Yes, it calls into the text poke helpers, but how do these work on
> less-than-page-sized allocations?

Aha, I guess I understand your point (and concern) now. 

It is the same as a text poke into static kernel text: we create a
local writable mapping to the memory we need to update. For a
less-than-page-sized allocation, this mapping does have access to X
memory that may belong to a different allocation, just like a text
poke into static kernel text.

Maybe we need something like vcopy_exec(x_mem, tmp_buf, size), where
we explicitly check that the allocation backing x_mem is at least size
bytes. And users of vmalloc_exec should only use vcopy_exec to update
memory from vmalloc_exec.
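
A rough sketch of what I have in mind (find_text_area() is a
hypothetical lookup into the new tree's bookkeeping, not an existing
function):

void *vcopy_exec(void *dst, const void *src, size_t size)
{
	struct vmap_area *va = find_text_area((unsigned long)dst);

	/*
	 * Refuse writes that would spill past this allocation into a
	 * neighbor sharing the same huge page.
	 */
	if (!va || (unsigned long)dst + size > va->va_end)
		return ERR_PTR(-EINVAL);

	return text_poke_copy(dst, src, size);
}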

Does this make sense? Did I understand your concern correctly? 

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-18 15:05       ` Song Liu
@ 2022-10-18 15:40         ` Christoph Hellwig
  2022-10-18 15:40           ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2022-10-18 15:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, Linux-MM, lkml, Andrew Morton,
	X86 ML, Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen,
	Dave, urezki

On Tue, Oct 18, 2022 at 03:05:24PM +0000, Song Liu wrote:
> Maybe we need something like vcopy_exec(x_mem, tmp_buf, size), where
> we explicitly check that the allocation backing x_mem is at least size
> bytes. And users of vmalloc_exec should only use vcopy_exec to update
> memory from vmalloc_exec.
> 
> Does this make sense? Did I understand your concern correctly? 

It sounds sensible.  Make sure it is properly documented and reviewed
by people that actually know this area.  I just know enough to ask
stupid questions, not to very something is actually correct :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC v2 0/4] vmalloc_exec for modules and BPF programs
  2022-10-18 15:40         ` Christoph Hellwig
@ 2022-10-18 15:40           ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2022-10-18 15:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Christoph Hellwig, Song Liu, Linux-MM, lkml, Andrew Morton,
	X86 ML, Peter Zijlstra, Kernel Team, Edgecombe, Rick P, Hansen,
	Dave, urezki

On Tue, Oct 18, 2022 at 05:40:20PM +0200, Christoph Hellwig wrote:
> It sounds sensible.  Make sure it is properly documented and reviewed
> by people that actually know this area.  I just know enough to ask
> stupid questions, not to very something is actually correct :)

s/very/verify/

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-10-18 15:40 UTC | newest]

Thread overview: 29+ messages
2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
2022-10-07 23:43 ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
2022-10-10 18:13   ` Edgecombe, Rick P
2022-10-10 19:04     ` Song Liu
2022-10-10 19:59       ` Edgecombe, Rick P
2022-10-07 23:43 ` [RFC v2 2/4] bpf: use vmalloc_exec Song Liu
2022-10-07 23:43 ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Song Liu
2022-10-14  3:48   ` Aaron Lu
2022-10-14  6:07     ` Song Liu
2022-10-14 15:42   ` Edgecombe, Rick P
2022-10-14 18:26     ` Song Liu
2022-10-14 18:41       ` Edgecombe, Rick P
2022-10-07 23:43 ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Song Liu
2022-10-10 18:32   ` Edgecombe, Rick P
2022-10-10 19:08     ` Song Liu
2022-10-10 20:09       ` Edgecombe, Rick P
2022-10-11 16:25         ` Song Liu
2022-10-11 20:40           ` Edgecombe, Rick P
2022-10-12  5:37             ` Song Liu
2022-10-12 18:38               ` Edgecombe, Rick P
2022-10-12 19:01                 ` Song Liu
2022-10-08  0:17 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
2022-10-12 19:03 ` Song Liu
2022-10-17  7:26 ` Christoph Hellwig
2022-10-17 16:23   ` Song Liu
2022-10-18 14:50     ` Christoph Hellwig
2022-10-18 15:05       ` Song Liu
2022-10-18 15:40         ` Christoph Hellwig
2022-10-18 15:40           ` Christoph Hellwig
