bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Type aware module allocator
@ 2023-05-26  5:15 Song Liu
  2023-05-26  5:15 ` [PATCH 1/3] module: Introduce module_alloc_type Song Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Song Liu @ 2023-05-26  5:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: bpf, mcgrof, peterz, tglx, x86, rppt, kent.overstreet, Song Liu

This set implements the second part of module type aware allocator
(module_alloc_type), which was discussed in [1]. This part contains the
interface of the new allocator, as well as changes in x86 code to use the
new allocator (modules, BPF, ftrace, kprobe).

The set does not contain a binpack allocator to enable sharing huge pages
among different allocations. But this set defines the interface used by
the binpack allocator. [2] has some discussion on different options of the
binpack allocator.

[1] https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/
[2] https://lore.kernel.org/all/20230308094106.227365-1-rppt@kernel.org/

Song Liu (3):
  module: Introduce module_alloc_type
  ftrace: Add swap_func to ftrace_process_locs()
  x86/module: Use module_alloc_type

 arch/x86/kernel/alternative.c  |  37 ++++--
 arch/x86/kernel/ftrace.c       |  44 ++++---
 arch/x86/kernel/kprobes/core.c |   8 +-
 arch/x86/kernel/module.c       | 114 +++++++++++------
 arch/x86/kernel/unwind_orc.c   |  13 +-
 arch/x86/net/bpf_jit_comp.c    |  22 +++-
 include/linux/ftrace.h         |   2 +
 include/linux/module.h         |   6 +
 include/linux/moduleloader.h   |  75 ++++++++++++
 init/main.c                    |   1 +
 kernel/bpf/bpf_struct_ops.c    |  10 +-
 kernel/bpf/core.c              |  26 ++--
 kernel/bpf/trampoline.c        |   6 +-
 kernel/kprobes.c               |   6 +-
 kernel/module/internal.h       |   3 +
 kernel/module/main.c           | 217 +++++++++++++++++++++++++++++++--
 kernel/module/strict_rwx.c     |   4 +
 kernel/trace/ftrace.c          |  13 +-
 18 files changed, 493 insertions(+), 114 deletions(-)

--
2.34.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-26  5:15 [PATCH 0/3] Type aware module allocator Song Liu
@ 2023-05-26  5:15 ` Song Liu
  2023-05-26  6:07   ` Randy Dunlap
  2023-05-26 22:29   ` Kent Overstreet
  2023-05-26  5:15 ` [PATCH 2/3] ftrace: Add swap_func to ftrace_process_locs() Song Liu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2023-05-26  5:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: bpf, mcgrof, peterz, tglx, x86, rppt, kent.overstreet, Song Liu

Introduce memory type aware module_alloc_type, which provides a unified
allocator for all different archs. This work was discussed in [1].

Each arch can configure the allocator to do the following:

   1. Specify module_vaddr and module_end
   2. Random module start address for KASLR
   3. kasan_alloc_module_shadow()
   4. kasan_reset_tag()
   5. Preferred and secondary module address ranges

enum mod_alloc_params_flags are used to control the behavior of
module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type
fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let
module_alloc_type to protect the memory before returning to the user.

module_allocator_init() call is added to start_kernel() to initialize
module_alloc_type.

Signed-off-by: Song Liu <song@kernel.org>

[1] https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/
---
 include/linux/module.h       |   6 +
 include/linux/moduleloader.h |  75 ++++++++++++
 init/main.c                  |   1 +
 kernel/bpf/bpf_struct_ops.c  |  10 +-
 kernel/bpf/core.c            |  20 ++--
 kernel/bpf/trampoline.c      |   6 +-
 kernel/kprobes.c             |   6 +-
 kernel/module/internal.h     |   3 +
 kernel/module/main.c         | 217 +++++++++++++++++++++++++++++++++--
 kernel/module/strict_rwx.c   |   4 +
 10 files changed, 319 insertions(+), 29 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 9e56763dff81..948b8132a742 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -752,6 +752,8 @@ static inline bool is_livepatch_module(struct module *mod)
 
 void set_module_sig_enforced(void);
 
+void __init module_allocator_init(void);
+
 #else /* !CONFIG_MODULES... */
 
 static inline struct module *__module_address(unsigned long addr)
@@ -855,6 +857,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
 	return ptr;
 }
 
+static inline void __init module_allocator_init(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 03be088fb439..59c7114a7b65 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -32,6 +32,81 @@ void *module_alloc(unsigned long size);
 /* Free memory returned from module_alloc. */
 void module_memfree(void *module_region);
 
+#ifdef CONFIG_MODULES
+
+/* For mod_alloc_params.flags */
+enum mod_alloc_params_flags {
+	MOD_ALLOC_FALLBACK		= (1 << 0),	/* Fallback to module_alloc() */
+	MOD_ALLOC_KASAN_MODULE_SHADOW	= (1 << 1),	/* Calls kasan_alloc_module_shadow() */
+	MOD_ALLOC_KASAN_RESET_TAG	= (1 << 2),	/* Calls kasan_reset_tag() */
+	MOD_ALLOC_SET_MEMORY		= (1 << 3),	/* The allocator calls set_memory_ on
+							 * memory before returning it to the
+							 * caller, so that the caller do not need
+							 * to call set_memory_* again. This does
+							 * not work for MOD_RO_AFTER_INIT.
+							 */
+};
+
+#define MOD_MAX_ADDR_SPACES 2
+
+/**
+ * struct vmalloc_params - Parameters to call __vmalloc_node_range()
+ * @start:          Address space range start
+ * @end:            Address space range end
+ * @gfp_mask:       The gfp_t mask used for this range
+ * @pgprot:         The page protection for this range
+ * @vm_flags        The vm_flag used for this range
+ */
+struct vmalloc_params {
+	unsigned long	start;
+	unsigned long	end;
+	gfp_t		gfp_mask;
+	pgprot_t	pgprot;
+	unsigned long	vm_flags;
+};
+
+/**
+ * struct mod_alloc_params - Parameters for module allocation type
+ * @flags:          Properties in mod_alloc_params_flags
+ * @granularity:    The allocation granularity (PAGE/PMD) in bytes
+ * @alignment:      The allocation alignment requirement
+ * @vmp:            Parameters used to call vmalloc
+ * @fill:           Function to fill allocated space. If NULL, use memcpy()
+ * @invalidate:     Function to invalidate memory space.
+ *
+ * If @granularity > @alignment the allocation can reuse free space in
+ * previously allocated pages. If they are the same, then fresh pages
+ * have to be allocated.
+ */
+struct mod_alloc_params {
+	unsigned int		flags;
+	unsigned int		granularity;
+	unsigned int		alignment;
+	struct vmalloc_params	vmp[MOD_MAX_ADDR_SPACES];
+	void *			(*fill)(void *dst, const void *src, size_t len);
+	void *			(*invalidate)(void *ptr, size_t len);
+};
+
+struct mod_type_allocator {
+	struct mod_alloc_params	params;
+};
+
+struct mod_allocators {
+	struct mod_type_allocator *types[MOD_MEM_NUM_TYPES];
+};
+
+void *module_alloc_type(size_t size, enum mod_mem_type type);
+void module_memfree_type(void *ptr, enum mod_mem_type type);
+void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type);
+void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type);
+void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type);
+void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type);
+void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type);
+void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type);
+void module_alloc_type_init(struct mod_allocators *allocators);
+
+#endif /* CONFIG_MODULES */
+
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
diff --git a/init/main.c b/init/main.c
index af50044deed5..e05228cabde8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -936,6 +936,7 @@ asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(v
 	sort_main_extable();
 	trap_init();
 	mm_core_init();
+	module_allocator_init();
 	poking_init();
 	ftrace_init();
 
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index d3f0a4825fa6..e4ec4be866cc 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -12,6 +12,7 @@
 #include <linux/mutex.h>
 #include <linux/btf_ids.h>
 #include <linux/rcupdate_wait.h>
+#include <linux/moduleloader.h>
 
 enum bpf_struct_ops_state {
 	BPF_STRUCT_OPS_STATE_INIT,
@@ -512,7 +513,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 		err = st_ops->validate(kdata);
 		if (err)
 			goto reset_unlock;
-		set_memory_rox((long)st_map->image, 1);
+		module_memory_protect(st_map->image, PAGE_SIZE, MOD_TEXT);
+
 		/* Let bpf_link handle registration & unregistration.
 		 *
 		 * Pair with smp_load_acquire() during lookup_elem().
@@ -521,7 +523,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 		goto unlock;
 	}
 
-	set_memory_rox((long)st_map->image, 1);
+	module_memory_protect(st_map->image, PAGE_SIZE, MOD_TEXT);
 	err = st_ops->reg(kdata);
 	if (likely(!err)) {
 		/* This refcnt increment on the map here after
@@ -544,8 +546,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 	 * there was a race in registering the struct_ops (under the same name) to
 	 * a sub-system through different struct_ops's maps.
 	 */
-	set_memory_nx((long)st_map->image, 1);
-	set_memory_rw((long)st_map->image, 1);
+	module_memory_unprotect(st_map->image, PAGE_SIZE, MOD_TEXT);
 
 reset_unlock:
 	bpf_struct_ops_map_put_progs(st_map);
@@ -907,4 +908,3 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 	kfree(link);
 	return err;
 }
-
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7421487422d4..4c989a8fe8b8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -860,7 +860,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
+	pack->ptr = module_alloc_type(BPF_PROG_PACK_SIZE, MOD_TEXT);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -869,8 +869,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
 	list_add_tail(&pack->list, &pack_list);
 
-	set_vm_flush_reset_perms(pack->ptr);
-	set_memory_rox((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+	module_memory_protect(pack->ptr, BPF_PROG_PACK_SIZE, MOD_TEXT);
 	return pack;
 }
 
@@ -884,11 +883,10 @@ void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
 	mutex_lock(&pack_mutex);
 	if (size > BPF_PROG_PACK_SIZE) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = module_alloc_type(size, MOD_TEXT);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
-			set_vm_flush_reset_perms(ptr);
-			set_memory_rox((unsigned long)ptr, size / PAGE_SIZE);
+			module_memory_protect(ptr, size, MOD_TEXT);
 		}
 		goto out;
 	}
@@ -922,7 +920,8 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
+		module_memfree_type(hdr, MOD_TEXT);
+
 		goto out;
 	}
 
@@ -946,7 +945,8 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
 				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
 		list_del(&pack->list);
-		module_memfree(pack->ptr);
+		module_memfree_type(pack->ptr, MOD_TEXT);
+
 		kfree(pack);
 	}
 out:
@@ -997,12 +997,12 @@ void bpf_jit_uncharge_modmem(u32 size)
 
 void *__weak bpf_jit_alloc_exec(unsigned long size)
 {
-	return module_alloc(size);
+	return module_alloc_type(size, MOD_TEXT);
 }
 
 void __weak bpf_jit_free_exec(void *addr)
 {
-	module_memfree(addr);
+	module_memfree_type(addr, MOD_TEXT);
 }
 
 struct bpf_binary_header *
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index ac021bc43a66..fd2d46c9a295 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -13,6 +13,7 @@
 #include <linux/bpf_verifier.h>
 #include <linux/bpf_lsm.h>
 #include <linux/delay.h>
+#include <linux/moduleloader.h>
 
 /* dummy _ops. The verifier will operate on target program's ops. */
 const struct bpf_verifier_ops bpf_extension_verifier_ops = {
@@ -440,7 +441,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut
 	if (err < 0)
 		goto out;
 
-	set_memory_rox((long)im->image, 1);
+	module_memory_protect(im->image, PAGE_SIZE, MOD_TEXT);
 
 	WARN_ON(tr->cur_image && tr->selector == 0);
 	WARN_ON(!tr->cur_image && tr->selector);
@@ -462,8 +463,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut
 		tr->fops->trampoline = 0;
 
 		/* reset im->image memory attr for arch_prepare_bpf_trampoline */
-		set_memory_nx((long)im->image, 1);
-		set_memory_rw((long)im->image, 1);
+		module_memory_unprotect(im->image, PAGE_SIZE, MOD_TEXT);
 		goto again;
 	}
 #endif
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 00e177de91cc..daf47da3c96e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -113,17 +113,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
 	/*
-	 * Use module_alloc() so this page is within +/- 2GB of where the
+	 * Use module_alloc_type() so this page is within +/- 2GB of where the
 	 * kernel image and loaded module images reside. This is required
 	 * for most of the architectures.
 	 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 	 */
-	return module_alloc(PAGE_SIZE);
+	return module_alloc_type(PAGE_SIZE, MOD_TEXT);
 }
 
 static void free_insn_page(void *page)
 {
-	module_memfree(page);
+	module_memfree_type(page, MOD_TEXT);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index dc7b0160c480..b2e136326c4c 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -12,6 +12,7 @@
 #include <linux/mutex.h>
 #include <linux/rculist.h>
 #include <linux/rcupdate.h>
+#include <linux/moduleloader.h>
 #include <linux/mm.h>
 
 #ifndef ARCH_SHF_SMALL
@@ -392,3 +393,5 @@ static inline int same_magic(const char *amagic, const char *bmagic, bool has_cr
 	return strcmp(amagic, bmagic) == 0;
 }
 #endif /* CONFIG_MODVERSIONS */
+
+extern struct mod_allocators module_allocators;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index ea7d0c7f3e60..0f9183f1ca9f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1203,11 +1203,11 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type)
 		mod_mem_type_is_core_data(type);
 }
 
-static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
+static void *module_memory_alloc(size_t size, enum mod_mem_type type)
 {
 	if (mod_mem_use_vmalloc(type))
 		return vzalloc(size);
-	return module_alloc(size);
+	return module_alloc_type(size, type);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
@@ -1215,7 +1215,7 @@ static void module_memory_free(void *ptr, enum mod_mem_type type)
 	if (mod_mem_use_vmalloc(type))
 		vfree(ptr);
 	else
-		module_memfree(ptr);
+		module_memfree_type(ptr, type);
 }
 
 static void free_mod_mem(struct module *mod)
@@ -1609,6 +1609,201 @@ void * __weak module_alloc(unsigned long size)
 			NUMA_NO_NODE, __builtin_return_address(0));
 }
 
+struct mod_allocators module_allocators;
+
+static struct mod_type_allocator default_mod_type_allocator = {
+	.params = {
+		.flags = MOD_ALLOC_FALLBACK,
+	},
+};
+
+void __init __weak module_alloc_type_init(struct mod_allocators *allocators)
+{
+	for_each_mod_mem_type(type)
+		allocators->types[type] = &default_mod_type_allocator;
+}
+
+static void module_memory_enable_protection(void *ptr, size_t len, enum mod_mem_type type)
+{
+	int npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	switch (type) {
+	case MOD_TEXT:
+	case MOD_INIT_TEXT:
+		set_memory_rox((unsigned long)ptr, npages);
+		break;
+	case MOD_DATA:
+	case MOD_INIT_DATA:
+		set_memory_nx((unsigned long)ptr, npages);
+		break;
+	case MOD_RODATA:
+		set_memory_nx((unsigned long)ptr, npages);
+		set_memory_ro((unsigned long)ptr, npages);
+		break;
+	case MOD_RO_AFTER_INIT:
+		set_memory_ro((unsigned long)ptr, npages);
+		break;
+	default:
+		WARN_ONCE(true, "Unknown mod_mem_type: %d\n", type);
+		break;
+	}
+}
+
+static void module_memory_disable_protection(void *ptr, size_t len, enum mod_mem_type type)
+{
+	int npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	switch (type) {
+	case MOD_TEXT:
+	case MOD_INIT_TEXT:
+		set_memory_nx((unsigned long)ptr, npages);
+		set_memory_rw((unsigned long)ptr, npages);
+		break;
+	case MOD_RODATA:
+	case MOD_RO_AFTER_INIT:
+		set_memory_rw((unsigned long)ptr, npages);
+		break;
+	case MOD_DATA:
+	case MOD_INIT_DATA:
+		break;
+	default:
+		WARN_ONCE(true, "Unknown mod_mem_type: %d\n", type);
+		break;
+	}
+}
+
+void *module_alloc_type(size_t size, enum mod_mem_type type)
+{
+	struct mod_type_allocator *allocator;
+	struct mod_alloc_params *params;
+	void *ptr = NULL;
+	int i;
+
+	if (WARN_ON_ONCE(type >= MOD_MEM_NUM_TYPES))
+		return NULL;
+
+	allocator = module_allocators.types[type];
+	params = &allocator->params;
+
+	if (params->flags & MOD_ALLOC_FALLBACK)
+		return module_alloc(size);
+
+	for (i = 0; i < MOD_MAX_ADDR_SPACES; i++) {
+		struct vmalloc_params *vmp = &params->vmp[i];
+
+		if (vmp->start == vmp->end)
+			continue;
+
+		ptr = __vmalloc_node_range(size, params->alignment, vmp->start, vmp->end,
+					   vmp->gfp_mask, vmp->pgprot, vmp->vm_flags,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (!ptr)
+			continue;
+
+		if (params->flags & MOD_ALLOC_KASAN_MODULE_SHADOW) {
+			if (ptr && kasan_alloc_module_shadow(ptr, size, vmp->gfp_mask)) {
+				vfree(ptr);
+				return NULL;
+			}
+		}
+
+		/*
+		 * VM_FLUSH_RESET_PERMS is still needed here. This is
+		 * because "size" is not available in module_memfree_type
+		 * at the moment, so we cannot undo set_memory_rox in
+		 * module_memfree_type. Once a better allocator is used,
+		 * we can manually undo set_memory_rox, and thus remove
+		 * VM_FLUSH_RESET_PERMS.
+		 */
+		set_vm_flush_reset_perms(ptr);
+
+		if (params->flags & MOD_ALLOC_SET_MEMORY)
+			module_memory_enable_protection(ptr, size, type);
+
+		if (params->flags & MOD_ALLOC_KASAN_RESET_TAG)
+			return kasan_reset_tag(ptr);
+		return ptr;
+	}
+	return NULL;
+}
+
+void module_memfree_type(void *ptr, enum mod_mem_type type)
+{
+	module_memfree(ptr);
+}
+
+void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type)
+{
+	struct mod_type_allocator *allocator;
+	struct mod_alloc_params *params;
+
+	allocator = module_allocators.types[type];
+	params = &allocator->params;
+
+	if (params->fill)
+		params->fill(dst, src, len);
+	else
+		memcpy(dst, src, len);
+}
+
+void module_memory_invalidate_type(void *dst, size_t len, enum mod_mem_type type)
+{
+	struct mod_type_allocator *allocator;
+	struct mod_alloc_params *params;
+
+	allocator = module_allocators.types[type];
+	params = &allocator->params;
+
+	if (params->invalidate)
+		params->invalidate(dst, len);
+	else
+		memset(dst, 0, len);
+}
+
+/*
+ * Protect memory allocated by module_alloc_type(). Called by users of
+ * module_alloc_type. This is a no-op with MOD_ALLOC_SET_MEMORY.
+ */
+void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type)
+{
+	struct mod_alloc_params *params = &module_allocators.types[type]->params;
+
+	if (params->flags & MOD_ALLOC_SET_MEMORY)
+		return;
+	module_memory_enable_protection(ptr, len, type);
+}
+
+/*
+ * Unprotect memory allocated by module_alloc_type(). Called by users of
+ * module_alloc_type. This is a no-op with MOD_ALLOC_SET_MEMORY.
+ */
+void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type)
+{
+	struct mod_alloc_params *params = &module_allocators.types[type]->params;
+
+	if (params->flags & MOD_ALLOC_SET_MEMORY)
+		return;
+	module_memory_disable_protection(ptr, len, type);
+}
+
+/*
+ * Should only be used by arch code in cases where text_poke like
+ * solution is not ready yet
+ */
+void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type)
+{
+	module_memory_enable_protection(ptr, len, type);
+}
+
+/*
+ * Should only be used by arch code in cases where text_poke like
+ * solution is not ready yet
+ */
+void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type)
+{
+	module_memory_disable_protection(ptr, len, type);
+}
+
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
@@ -2241,7 +2436,7 @@ static int move_module(struct module *mod, struct load_info *info)
 			t = type;
 			goto out_enomem;
 		}
-		memset(ptr, 0, mod->mem[type].size);
+		module_memory_invalidate_type(ptr, mod->mem[type].size, type);
 		mod->mem[type].base = ptr;
 	}
 
@@ -2269,7 +2464,8 @@ static int move_module(struct module *mod, struct load_info *info)
 				ret = -ENOEXEC;
 				goto out_enomem;
 			}
-			memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size);
+
+			module_memory_fill_type(dest, (void *)shdr->sh_addr, shdr->sh_size, type);
 		}
 		/*
 		 * Update the userspace copy's ELF section address to point to
@@ -2471,9 +2667,9 @@ static void do_free_init(struct work_struct *w)
 
 	llist_for_each_safe(pos, n, list) {
 		initfree = container_of(pos, struct mod_initfree, node);
-		module_memfree(initfree->init_text);
-		module_memfree(initfree->init_data);
-		module_memfree(initfree->init_rodata);
+		module_memfree_type(initfree->init_text, MOD_INIT_TEXT);
+		module_memfree_type(initfree->init_data, MOD_INIT_DATA);
+		module_memfree_type(initfree->init_rodata, MOD_INIT_RODATA);
 		kfree(initfree);
 	}
 }
@@ -3268,3 +3464,8 @@ static int module_debugfs_init(void)
 }
 module_init(module_debugfs_init);
 #endif
+
+void __init module_allocator_init(void)
+{
+	module_alloc_type_init(&module_allocators);
+}
diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index a2b656b4e3d2..65ff1b09dc84 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -16,6 +16,10 @@ static void module_set_memory(const struct module *mod, enum mod_mem_type type,
 {
 	const struct module_memory *mod_mem = &mod->mem[type];
 
+	/* The allocator already called set_memory_*, skip here. */
+	if (module_allocators.types[type]->params.flags & MOD_ALLOC_SET_MEMORY)
+		return;
+
 	set_vm_flush_reset_perms(mod_mem->base);
 	set_memory((unsigned long)mod_mem->base, mod_mem->size >> PAGE_SHIFT);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/3] ftrace: Add swap_func to ftrace_process_locs()
  2023-05-26  5:15 [PATCH 0/3] Type aware module allocator Song Liu
  2023-05-26  5:15 ` [PATCH 1/3] module: Introduce module_alloc_type Song Liu
@ 2023-05-26  5:15 ` Song Liu
  2023-05-26  5:15 ` [PATCH 3/3] x86/module: Use module_alloc_type Song Liu
  2023-05-27  7:04 ` [PATCH 0/3] Type aware module allocator Kent Overstreet
  3 siblings, 0 replies; 22+ messages in thread
From: Song Liu @ 2023-05-26  5:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: bpf, mcgrof, peterz, tglx, x86, rppt, kent.overstreet, Song Liu

ftrace_process_locs sorts module mcount, which is inside RO memory. Add a
ftrace_swap_func so that archs can use RO-memory-poke function to do the
sorting.

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/ftrace.h |  2 ++
 kernel/trace/ftrace.c  | 13 ++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index b23bdd414394..fe443b8ed32c 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1166,4 +1166,6 @@ unsigned long arch_syscall_addr(int nr);
 
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
+void ftrace_swap_func(void *a, void *b, int n);
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 764668467155..f5ddc9d4cfb6 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6430,6 +6430,17 @@ static void test_is_sorted(unsigned long *start, unsigned long count)
 }
 #endif
 
+void __weak ftrace_swap_func(void *a, void *b, int n)
+{
+	unsigned long t;
+
+	WARN_ON_ONCE(n != sizeof(t));
+
+	t = *((unsigned long *)a);
+	*(unsigned long *)a = *(unsigned long *)b;
+	*(unsigned long *)b = t;
+}
+
 static int ftrace_process_locs(struct module *mod,
 			       unsigned long *start,
 			       unsigned long *end)
@@ -6455,7 +6466,7 @@ static int ftrace_process_locs(struct module *mod,
 	 */
 	if (!IS_ENABLED(CONFIG_BUILDTIME_MCOUNT_SORT) || mod) {
 		sort(start, count, sizeof(*start),
-		     ftrace_cmp_ips, NULL);
+		     ftrace_cmp_ips, ftrace_swap_func);
 	} else {
 		test_is_sorted(start, count);
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/3] x86/module: Use module_alloc_type
  2023-05-26  5:15 [PATCH 0/3] Type aware module allocator Song Liu
  2023-05-26  5:15 ` [PATCH 1/3] module: Introduce module_alloc_type Song Liu
  2023-05-26  5:15 ` [PATCH 2/3] ftrace: Add swap_func to ftrace_process_locs() Song Liu
@ 2023-05-26  5:15 ` Song Liu
  2023-05-27  7:04 ` [PATCH 0/3] Type aware module allocator Kent Overstreet
  3 siblings, 0 replies; 22+ messages in thread
From: Song Liu @ 2023-05-26  5:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: bpf, mcgrof, peterz, tglx, x86, rppt, kent.overstreet, Song Liu

Enable module_alloc_type to

1. Allocate ROX data for MOD_TEXT and MOD_INIT_TEXT;
2. Allocate RO data for MOD_RODATA and MOD_INIT_RODATA;
3. Allocate RW data for other types.

Also, update users of module_alloc_type (BPF, ftrace, kprobe) to handle
these restrictions.

arch_prepare_bpf_trampoline() cannot jit directly into module memory yet,
so we have to use module_memory_force_[un]protect() in it.

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/kernel/alternative.c  |  37 +++++++----
 arch/x86/kernel/ftrace.c       |  44 +++++++------
 arch/x86/kernel/kprobes/core.c |   8 +--
 arch/x86/kernel/module.c       | 114 +++++++++++++++++++++++----------
 arch/x86/kernel/unwind_orc.c   |  13 ++--
 arch/x86/net/bpf_jit_comp.c    |  22 +++++--
 kernel/bpf/core.c              |   6 +-
 7 files changed, 160 insertions(+), 84 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d93..bb4e6c3225bf 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -122,6 +122,17 @@ extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern s32 __smp_locks[], __smp_locks_end[];
 void text_poke_early(void *addr, const void *opcode, size_t len);
 
+static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
+{
+	if (system_state < SYSTEM_RUNNING) {
+		text_poke_early(addr, opcode, len);
+	} else {
+		mutex_lock(&text_mutex);
+		text_poke(addr, opcode, len);
+		mutex_unlock(&text_mutex);
+	}
+}
+
 /*
  * Are we looking at a near JMP with a 1 or 4-byte displacement.
  */
@@ -331,7 +342,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 
 		DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
 
-		text_poke_early(instr, insn_buff, insn_buff_sz);
+		do_text_poke(instr, insn_buff, insn_buff_sz);
 
 next:
 		optimize_nops(instr, a->instrlen);
@@ -564,7 +575,7 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
 			optimize_nops(bytes, len);
 			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
 			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-			text_poke_early(addr, bytes, len);
+			do_text_poke(addr, bytes, len);
 		}
 	}
 }
@@ -638,7 +649,7 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
 		if (len == insn.length) {
 			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
 			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-			text_poke_early(addr, bytes, len);
+			do_text_poke(addr, bytes, len);
 		}
 	}
 }
@@ -674,7 +685,7 @@ static void poison_endbr(void *addr, bool warn)
 	 */
 	DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
 	DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
-	text_poke_early(addr, &poison, 4);
+	do_text_poke(addr, &poison, 4);
 }
 
 /*
@@ -869,7 +880,7 @@ static int cfi_disable_callers(s32 *start, s32 *end)
 		if (!hash) /* nocfi callers */
 			continue;
 
-		text_poke_early(addr, jmp, 2);
+		do_text_poke(addr, jmp, 2);
 	}
 
 	return 0;
@@ -892,7 +903,7 @@ static int cfi_enable_callers(s32 *start, s32 *end)
 		if (!hash) /* nocfi callers */
 			continue;
 
-		text_poke_early(addr, mov, 2);
+		do_text_poke(addr, mov, 2);
 	}
 
 	return 0;
@@ -913,7 +924,7 @@ static int cfi_rand_preamble(s32 *start, s32 *end)
 			return -EINVAL;
 
 		hash = cfi_rehash(hash);
-		text_poke_early(addr + 1, &hash, 4);
+		do_text_poke(addr + 1, &hash, 4);
 	}
 
 	return 0;
@@ -932,9 +943,9 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end)
 			 addr, addr, 5, addr))
 			return -EINVAL;
 
-		text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
+		do_text_poke(addr, fineibt_preamble_start, fineibt_preamble_size);
 		WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
-		text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
+		do_text_poke(addr + fineibt_preamble_hash, &hash, 4);
 	}
 
 	return 0;
@@ -953,7 +964,7 @@ static int cfi_rand_callers(s32 *start, s32 *end)
 		hash = decode_caller_hash(addr);
 		if (hash) {
 			hash = -cfi_rehash(hash);
-			text_poke_early(addr + 2, &hash, 4);
+			do_text_poke(addr + 2, &hash, 4);
 		}
 	}
 
@@ -971,9 +982,9 @@ static int cfi_rewrite_callers(s32 *start, s32 *end)
 		addr -= fineibt_caller_size;
 		hash = decode_caller_hash(addr);
 		if (hash) {
-			text_poke_early(addr, fineibt_caller_start, fineibt_caller_size);
+			do_text_poke(addr, fineibt_caller_start, fineibt_caller_size);
 			WARN_ON(*(u32 *)(addr + fineibt_caller_hash) != 0x12345678);
-			text_poke_early(addr + fineibt_caller_hash, &hash, 4);
+			do_text_poke(addr + fineibt_caller_hash, &hash, 4);
 		}
 		/* rely on apply_retpolines() */
 	}
@@ -1243,7 +1254,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 
 		/* Pad the rest with nops */
 		add_nops(insn_buff + used, p->len - used);
-		text_poke_early(p->instr, insn_buff, p->len);
+		do_text_poke(p->instr, insn_buff, p->len);
 	}
 }
 extern struct paravirt_patch_site __start_parainstructions[],
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5e7ead52cfdb..a41af9e49afb 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -119,8 +119,11 @@ ftrace_modify_code_direct(unsigned long ip, const char *old_code,
 	/* replace the text with the new text */
 	if (ftrace_poke_late)
 		text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
-	else
-		text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
+	else {
+		mutex_lock(&text_mutex);
+		text_poke((void *)ip, new_code, MCOUNT_INSN_SIZE);
+		mutex_unlock(&text_mutex);
+	}
 	return 0;
 }
 
@@ -265,11 +268,11 @@ void arch_ftrace_update_code(int command)
 /* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
-	return module_alloc(size);
+	return module_alloc_type(size, MOD_TEXT);
 }
 static inline void tramp_free(void *tramp)
 {
-	module_memfree(tramp);
+	module_memfree_type(tramp, MOD_TEXT);
 }
 #else
 /* Trampolines can only be created if modules are supported */
@@ -319,7 +322,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned long call_offset;
 	unsigned long jmp_offset;
 	unsigned long offset;
-	unsigned long npages;
 	unsigned long size;
 	unsigned long *ptr;
 	void *trampoline;
@@ -328,7 +330,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned const char op_ref[] = { 0x48, 0x8b, 0x15 };
 	unsigned const char retq[] = { RET_INSN_OPCODE, INT3_INSN_OPCODE };
 	union ftrace_op_code_union op_ptr;
-	int ret;
 
 	if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
 		start_offset = (unsigned long)ftrace_regs_caller;
@@ -356,18 +357,16 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		return 0;
 
 	*tramp_size = size + RET_SIZE + sizeof(void *);
-	npages = DIV_ROUND_UP(*tramp_size, PAGE_SIZE);
 
 	/* Copy ftrace_caller onto the trampoline memory */
-	ret = copy_from_kernel_nofault(trampoline, (void *)start_offset, size);
-	if (WARN_ON(ret < 0))
+	if (WARN_ON(text_poke_copy(trampoline, (void *)start_offset, size) == NULL))
 		goto fail;
 
 	ip = trampoline + size;
 	if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
 		__text_gen_insn(ip, JMP32_INSN_OPCODE, ip, x86_return_thunk, JMP32_INSN_SIZE);
 	else
-		memcpy(ip, retq, sizeof(retq));
+		text_poke_copy(ip, retq, sizeof(retq));
 
 	/* No need to test direct calls on created trampolines */
 	if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
@@ -375,8 +374,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		ip = trampoline + (jmp_offset - start_offset);
 		if (WARN_ON(*(char *)ip != 0x75))
 			goto fail;
-		ret = copy_from_kernel_nofault(ip, x86_nops[2], 2);
-		if (ret < 0)
+		if (text_poke_copy(ip, x86_nops[2], 2) == NULL)
 			goto fail;
 	}
 
@@ -389,7 +387,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	 */
 
 	ptr = (unsigned long *)(trampoline + size + RET_SIZE);
-	*ptr = (unsigned long)ops;
+	text_poke_copy(ptr, &ops, sizeof(unsigned long));
 
 	op_offset -= start_offset;
 	memcpy(&op_ptr, trampoline + op_offset, OP_REF_SIZE);
@@ -405,7 +403,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	op_ptr.offset = offset;
 
 	/* put in the new offset to the ftrace_ops */
-	memcpy(trampoline + op_offset, &op_ptr, OP_REF_SIZE);
+	text_poke_copy(trampoline + op_offset, &op_ptr, OP_REF_SIZE);
 
 	/* put in the call to the function */
 	mutex_lock(&text_mutex);
@@ -415,15 +413,14 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	 * the depth accounting before the call already.
 	 */
 	dest = ftrace_ops_get_func(ops);
-	memcpy(trampoline + call_offset,
-	       text_gen_insn(CALL_INSN_OPCODE, trampoline + call_offset, dest),
-	       CALL_INSN_SIZE);
+	text_poke_copy_locked(trampoline + call_offset,
+	      text_gen_insn(CALL_INSN_OPCODE, trampoline + call_offset, dest),
+	      CALL_INSN_SIZE, false);
 	mutex_unlock(&text_mutex);
 
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
-	set_memory_rox((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
 	tramp_free(trampoline);
@@ -667,4 +664,15 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
 }
 #endif
 
+void ftrace_swap_func(void *a, void *b, int n)
+{
+	unsigned long t;
+
+	WARN_ON_ONCE(n != sizeof(t));
+
+	t = *((unsigned long *)a);
+	text_poke_copy(a, b, sizeof(t));
+	text_poke_copy(b, &t, sizeof(t));
+}
+
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6..96f56e663cbe 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -414,16 +414,10 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = module_alloc_type(PAGE_SIZE, MOD_TEXT);
 	if (!page)
 		return NULL;
 
-	/*
-	 * TODO: Once additional kernel code protection mechanisms are set, ensure
-	 * that the page was not maliciously altered and it is still zeroed.
-	 */
-	set_memory_rox((unsigned long)page, 1);
-
 	return page;
 }
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344..80c2ee1a4f7f 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -67,24 +67,82 @@ static unsigned long int get_module_load_offset(void)
 
 void *module_alloc(unsigned long size)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
-
-	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				 MODULES_VADDR + get_module_load_offset(),
-				 MODULES_END, gfp_mask, PAGE_KERNEL,
-				 VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
-				 NUMA_NO_NODE, __builtin_return_address(0));
+	WARN(true, "x86 should not use module_alloc\n");
+	return NULL;
+}
 
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
-	}
+static void *x86_module_invalidate_text(void *ptr, size_t len)
+{
+	return text_poke_set(ptr, 0xcc, len);
+}
 
-	return p;
+static struct mod_type_allocator x86_mod_allocator_text = {
+	.params = {
+		.flags		= MOD_ALLOC_KASAN_MODULE_SHADOW | MOD_ALLOC_SET_MEMORY,
+		.granularity	= PAGE_SIZE,
+		.alignment	= MODULE_ALIGN,
+		.fill		= text_poke_copy,
+		.invalidate	= x86_module_invalidate_text,
+	},
+};
+
+static struct mod_type_allocator x86_mod_allocator_rw_data = {
+	.params = {
+		.flags		= MOD_ALLOC_KASAN_MODULE_SHADOW,
+		.granularity	= PAGE_SIZE,
+		.alignment	= MODULE_ALIGN,
+	},
+};
+
+static struct mod_type_allocator x86_mod_allocator_ro_data = {
+	.params = {
+		.flags		= MOD_ALLOC_KASAN_MODULE_SHADOW | MOD_ALLOC_SET_MEMORY,
+		.granularity	= PAGE_SIZE,
+		.alignment	= MODULE_ALIGN,
+		.fill		= text_poke_copy,
+		.invalidate	= x86_module_invalidate_text,
+	},
+};
+
+void __init module_alloc_type_init(struct mod_allocators *allocators)
+{
+	struct mod_alloc_params *params = &x86_mod_allocator_text.params;
+	struct vmalloc_params *vmp = &params->vmp[0];
+
+	vmp->start = MODULES_VADDR + get_module_load_offset();
+	vmp->end = MODULES_END;
+	vmp->gfp_mask = GFP_KERNEL;
+	vmp->pgprot = PAGE_KERNEL_EXEC;
+	vmp->vm_flags = VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK |
+		VM_ALLOW_HUGE_VMAP;
+
+	for_class_mod_mem_type(type, text)
+		allocators->types[type] = &x86_mod_allocator_text;
+
+	params = &x86_mod_allocator_rw_data.params;
+	vmp = &params->vmp[0];
+
+	vmp->start = MODULES_VADDR + get_module_load_offset();
+	vmp->end = MODULES_END;
+	vmp->gfp_mask = GFP_KERNEL;
+	vmp->pgprot = PAGE_KERNEL_EXEC;
+	vmp->vm_flags = VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK;
+
+	allocators->types[MOD_DATA] = &x86_mod_allocator_rw_data;
+	allocators->types[MOD_INIT_DATA] = &x86_mod_allocator_rw_data;
+	allocators->types[MOD_RO_AFTER_INIT] = &x86_mod_allocator_rw_data;
+
+	params = &x86_mod_allocator_ro_data.params;
+	vmp = &params->vmp[0];
+
+	vmp->start = MODULES_VADDR + get_module_load_offset();
+	vmp->end = MODULES_END;
+	vmp->gfp_mask = GFP_KERNEL;
+	vmp->pgprot = PAGE_KERNEL_EXEC;
+	vmp->vm_flags = VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK;
+
+	allocators->types[MOD_RODATA] = &x86_mod_allocator_ro_data;
+	allocators->types[MOD_INIT_RODATA] = &x86_mod_allocator_ro_data;
 }
 
 #ifdef CONFIG_X86_32
@@ -134,7 +192,6 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 		   unsigned int symindex,
 		   unsigned int relsec,
 		   struct module *me,
-		   void *(*write)(void *dest, const void *src, size_t len),
 		   bool apply)
 {
 	unsigned int i;
@@ -202,14 +259,14 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 				       (int)ELF64_R_TYPE(rel[i].r_info), loc, val);
 				return -ENOEXEC;
 			}
-			write(loc, &val, size);
+			text_poke(loc, &val, size);
 		} else {
 			if (memcmp(loc, &val, size)) {
 				pr_warn("x86/modules: Invalid relocation target, existing value does not match expected value for type %d, loc %p, val %Lx\n",
 					(int)ELF64_R_TYPE(rel[i].r_info), loc, val);
 				return -ENOEXEC;
 			}
-			write(loc, &zero, size);
+			text_poke(loc, &zero, size);
 		}
 	}
 	return 0;
@@ -230,22 +287,11 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 			      bool apply)
 {
 	int ret;
-	bool early = me->state == MODULE_STATE_UNFORMED;
-	void *(*write)(void *, const void *, size_t) = memcpy;
-
-	if (!early) {
-		write = text_poke;
-		mutex_lock(&text_mutex);
-	}
-
-	ret = __write_relocate_add(sechdrs, strtab, symindex, relsec, me,
-				   write, apply);
-
-	if (!early) {
-		text_poke_sync();
-		mutex_unlock(&text_mutex);
-	}
 
+	mutex_lock(&text_mutex);
+	ret = __write_relocate_add(sechdrs, strtab, symindex, relsec, me, apply);
+	text_poke_sync();
+	mutex_unlock(&text_mutex);
 	return ret;
 }
 
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 3ac50b7298d1..264188ec50c9 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -7,6 +7,7 @@
 #include <asm/unwind.h>
 #include <asm/orc_types.h>
 #include <asm/orc_lookup.h>
+#include <asm/text-patching.h>
 
 #define orc_warn(fmt, ...) \
 	printk_deferred_once(KERN_WARNING "WARNING: " fmt, ##__VA_ARGS__)
@@ -222,18 +223,22 @@ static void orc_sort_swap(void *_a, void *_b, int size)
 	struct orc_entry orc_tmp;
 	int *a = _a, *b = _b, tmp;
 	int delta = _b - _a;
+	int val;
 
 	/* Swap the .orc_unwind_ip entries: */
 	tmp = *a;
-	*a = *b + delta;
-	*b = tmp - delta;
+	val = *b + delta;
+	text_poke_copy(a, &val, sizeof(val));
+	val = tmp - delta;
+	text_poke_copy(b, &val, sizeof(val));
 
 	/* Swap the corresponding .orc_unwind entries: */
 	orc_a = cur_orc_table + (a - cur_orc_ip_table);
 	orc_b = cur_orc_table + (b - cur_orc_ip_table);
 	orc_tmp = *orc_a;
-	*orc_a = *orc_b;
-	*orc_b = orc_tmp;
+
+	text_poke_copy(orc_a, orc_b, sizeof(*orc_b));
+	text_poke_copy(orc_b, &orc_tmp, sizeof(orc_tmp));
 }
 
 static int orc_sort_cmp(const void *_a, const void *_b)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 1056bbf55b17..846228fb12f2 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,6 +11,7 @@
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/moduleloader.h>
 #include <asm/extable.h>
 #include <asm/ftrace.h>
 #include <asm/set_memory.h>
@@ -226,7 +227,7 @@ static u8 simple_alu_opcodes[] = {
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	/* Fill whole space with INT3 instructions */
-	memset(area, 0xcc, size);
+	text_poke_set(area, 0xcc, size);
 }
 
 int bpf_arch_text_invalidate(void *dst, size_t len)
@@ -2202,6 +2203,8 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 		orig_call += X86_PATCH_SIZE;
 	}
 
+	module_memory_force_unprotect((void *)((unsigned long)image & PAGE_MASK),
+				      PAGE_SIZE, MOD_TEXT);
 	prog = image;
 
 	EMIT_ENDBR();
@@ -2238,20 +2241,24 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
 		if (emit_rsb_call(&prog, __bpf_tramp_enter, prog)) {
 			ret = -EINVAL;
-			goto cleanup;
+			goto reprotect_memory;
 		}
 	}
 
 	if (fentry->nr_links)
 		if (invoke_bpf(m, &prog, fentry, regs_off, run_ctx_off,
-			       flags & BPF_TRAMP_F_RET_FENTRY_RET))
-			return -EINVAL;
+			       flags & BPF_TRAMP_F_RET_FENTRY_RET)) {
+			ret = -EINVAL;
+			goto reprotect_memory;
+		}
 
 	if (fmod_ret->nr_links) {
 		branches = kcalloc(fmod_ret->nr_links, sizeof(u8 *),
 				   GFP_KERNEL);
-		if (!branches)
-			return -ENOMEM;
+		if (!branches) {
+			ret =  -ENOMEM;
+			goto reprotect_memory;
+		}
 
 		if (invoke_bpf_mod_ret(m, &prog, fmod_ret, regs_off,
 				       run_ctx_off, branches)) {
@@ -2336,6 +2343,9 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 
 cleanup:
 	kfree(branches);
+reprotect_memory:
+	module_memory_force_protect((void *)((unsigned long)image & PAGE_MASK),
+				    PAGE_SIZE, MOD_TEXT);
 	return ret;
 }
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 4c989a8fe8b8..90f09218d30f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1092,8 +1092,10 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
 		return NULL;
 	}
 
-	/* Fill space with illegal/arch-dep instructions. */
-	bpf_fill_ill_insns(*rw_header, size);
+	/* bpf_fill_ill_insns is used to write to RO memory, so we cannot
+	 * use it on rw_header, use memset(0) instead.
+	 */
+	memset(*rw_header, 0, size);
 	(*rw_header)->size = size;
 
 	hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-26  5:15 ` [PATCH 1/3] module: Introduce module_alloc_type Song Liu
@ 2023-05-26  6:07   ` Randy Dunlap
  2023-05-26 22:29   ` Kent Overstreet
  1 sibling, 0 replies; 22+ messages in thread
From: Randy Dunlap @ 2023-05-26  6:07 UTC (permalink / raw)
  To: Song Liu, linux-kernel
  Cc: bpf, mcgrof, peterz, tglx, x86, rppt, kent.overstreet

Hi--

On 5/25/23 22:15, Song Liu wrote:
> +/**
> + * struct vmalloc_params - Parameters to call __vmalloc_node_range()
> + * @start:          Address space range start
> + * @end:            Address space range end
> + * @gfp_mask:       The gfp_t mask used for this range
> + * @pgprot:         The page protection for this range
> + * @vm_flags        The vm_flag used for this range

    * @vm_flags:

> + */
> +struct vmalloc_params {
> +	unsigned long	start;
> +	unsigned long	end;
> +	gfp_t		gfp_mask;
> +	pgprot_t	pgprot;
> +	unsigned long	vm_flags;
> +};

-- 
~Randy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-26  5:15 ` [PATCH 1/3] module: Introduce module_alloc_type Song Liu
  2023-05-26  6:07   ` Randy Dunlap
@ 2023-05-26 22:29   ` Kent Overstreet
  2023-05-26 23:09     ` Song Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2023-05-26 22:29 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote:
> Introduce memory type aware module_alloc_type, which provides a unified
> allocator for all different archs. This work was discussed in [1].
> 
> Each arch can configure the allocator to do the following:
> 
>    1. Specify module_vaddr and module_end
>    2. Random module start address for KASLR
>    3. kasan_alloc_module_shadow()
>    4. kasan_reset_tag()
>    5. Preferred and secondary module address ranges
> 
> enum mod_alloc_params_flags are used to control the behavior of
> module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type
> fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let
> module_alloc_type to protect the memory before returning to the user.
> 
> module_allocator_init() call is added to start_kernel() to initialize
> module_alloc_type.
> 

...

> +/**
> + * struct vmalloc_params - Parameters to call __vmalloc_node_range()
> + * @start:          Address space range start
> + * @end:            Address space range end
> + * @gfp_mask:       The gfp_t mask used for this range
> + * @pgprot:         The page protection for this range
> + * @vm_flags        The vm_flag used for this range
> + */
> +struct vmalloc_params {
> +	unsigned long	start;
> +	unsigned long	end;
> +	gfp_t		gfp_mask;
> +	pgprot_t	pgprot;
> +	unsigned long	vm_flags;
> +};
> +
> +/**
> + * struct mod_alloc_params - Parameters for module allocation type
> + * @flags:          Properties in mod_alloc_params_flags
> + * @granularity:    The allocation granularity (PAGE/PMD) in bytes
> + * @alignment:      The allocation alignment requirement
> + * @vmp:            Parameters used to call vmalloc
> + * @fill:           Function to fill allocated space. If NULL, use memcpy()
> + * @invalidate:     Function to invalidate memory space.
> + *
> + * If @granularity > @alignment the allocation can reuse free space in
> + * previously allocated pages. If they are the same, then fresh pages
> + * have to be allocated.
> + */
> +struct mod_alloc_params {
> +	unsigned int		flags;
> +	unsigned int		granularity;
> +	unsigned int		alignment;
> +	struct vmalloc_params	vmp[MOD_MAX_ADDR_SPACES];
> +	void *			(*fill)(void *dst, const void *src, size_t len);
> +	void *			(*invalidate)(void *ptr, size_t len);
> +};
> +
> +struct mod_type_allocator {
> +	struct mod_alloc_params	params;
> +};
> +
> +struct mod_allocators {
> +	struct mod_type_allocator *types[MOD_MEM_NUM_TYPES];
> +};
> +
> +void *module_alloc_type(size_t size, enum mod_mem_type type);
> +void module_memfree_type(void *ptr, enum mod_mem_type type);
> +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type);
> +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type);
> +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type);
> +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type);
> +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> +void module_alloc_type_init(struct mod_allocators *allocators);

This is a pretty big and complicated interface, and I haven't been able
to find any reasoning or justification for why it's needed.

Why is this going in module.h? Don't do that, this is supposed to be a
general purpose allocator for executable memory so start a new file.

What is vmalloc_params doing here? Why is it needed? Can we just get rid
of it?

module_memory_protect() and module_memory_unprotect() looks like a
completely broken interface for supporting sub-page allocations -
different threads with different allocations on the same page will race.

The text_poke() abstraction works; the exposed interface should look
like that.

Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is
read-only. You should be able to kill mod_alloc_params entirely.

Our other memory allocators don't expose kasan - why does this one?
Again, that looks wrong.

I would like to see something much simpler (that doesn't encode awkward
assumptions from module and bpf!): please look at the code I sketched
out below and tell me why this interface won't work - or if it can, go
with _that_.

commit 80ceffd6b1108afc7593bdf060954c73eaf0ffc4
Author: Kent Overstreet <kent.overstreet@linux.dev>
Date:   Wed May 17 01:22:06 2023 -0400

    mm: jit/text allocator
    
    This provides a new, very simple slab allocator for jit/text, i.e. bpf,
    ftrace trampolines, or bcachefs unpack functions.
    
    With this API we can avoid ever mapping pages both writeable and
    executable (not implemented in this patch: need to tweak
    module_alloc()), and it also supports sub-page sized allocations.
    
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
new file mode 100644
index 0000000000..f1549d60e8
--- /dev/null
+++ b/include/linux/jitalloc.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_JITALLOC_H
+#define _LINUX_JITALLOC_H
+
+void jit_update(void *buf, void *new_buf, size_t len);
+void jit_free(void *buf);
+void *jit_alloc(void *buf, size_t len);
+
+#endif /* _LINUX_JITALLOC_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f..ff26a4f0c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config JITALLOC
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index c03e1e5859..25e82db9e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_JITALLOC) += jitalloc.o
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
new file mode 100644
index 0000000000..7c4d621802
--- /dev/null
+++ b/mm/jitalloc.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/highmem.h>
+#include <linux/jitalloc.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/mutex.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
+#include <asm/text-patching.h>
+
+static DEFINE_MUTEX(jit_alloc_lock);
+
+struct jit_cache {
+	unsigned		obj_size_bits;
+	unsigned		objs_per_slab;
+	struct list_head	partial;
+};
+
+#define JITALLOC_MIN_SIZE	16
+#define NR_JIT_CACHES		ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
+
+static struct jit_cache jit_caches[NR_JIT_CACHES];
+
+struct jit_slab {
+	unsigned long		__page_flags;
+
+	struct jit_cache	*cache;
+	void			*executably_mapped;;
+	unsigned long		*objs_allocated; /* bitmap of free objects */
+	struct list_head	list;
+};
+
+#define folio_jit_slab(folio)		(_Generic((folio),			\
+	const struct folio *:		(const struct jit_slab *)(folio),	\
+	struct folio *:			(struct jit_slab *)(folio)))
+
+#define jit_slab_folio(s)		(_Generic((s),				\
+	const struct jit_slab *:	(const struct folio *)s,		\
+	struct jit_slab *:		(struct folio *)s))
+
+static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
+{
+	void *executably_mapped = module_alloc(PAGE_SIZE);
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *slab;
+	unsigned long *objs_allocated;
+
+	if (!executably_mapped)
+		return NULL;
+
+	objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
+	if (!objs_allocated ) {
+		vfree(executably_mapped);
+		return NULL;
+	}
+
+	set_vm_flush_reset_perms(executably_mapped);
+	set_memory_rox((unsigned long) executably_mapped, 1);
+
+	page = vmalloc_to_page(executably_mapped);
+	folio = page_folio(page);
+
+	__folio_set_slab(folio);
+	slab			= folio_jit_slab(folio);
+	slab->cache		= cache;
+	slab->executably_mapped	= executably_mapped;
+	slab->objs_allocated = objs_allocated;
+	INIT_LIST_HEAD(&slab->list);
+
+	return slab;
+}
+
+static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
+{
+	struct jit_slab *s =
+		list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
+		jit_slab_alloc(cache);
+	unsigned obj_idx, nr_allocated;
+
+	if (!s)
+		return NULL;
+
+	obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
+
+	BUG_ON(obj_idx >= cache->objs_per_slab);
+	__set_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == s->cache->objs_per_slab) {
+		list_del_init(&s->list);
+	} else if (nr_allocated == 1) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	return s->executably_mapped + (obj_idx << cache->obj_size_bits);
+}
+
+void jit_update(void *buf, void *new_buf, size_t len)
+{
+	text_poke_copy(buf, new_buf, len);
+}
+EXPORT_SYMBOL_GPL(jit_update);
+
+void jit_free(void *buf)
+{
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *s;
+	unsigned obj_idx, nr_allocated;
+	size_t offset;
+
+	if (!buf)
+		return;
+
+	page	= vmalloc_to_page(buf);
+	folio	= page_folio(page);
+	offset	= offset_in_folio(folio, buf);
+
+	if (!folio_test_slab(folio)) {
+		vfree(buf);
+		return;
+	}
+
+	s = folio_jit_slab(folio);
+
+	mutex_lock(&jit_alloc_lock);
+	obj_idx = offset >> s->cache->obj_size_bits;
+
+	__clear_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == 0) {
+		list_del(&s->list);
+		kfree(s->objs_allocated);
+		folio_put(folio);
+	} else if (nr_allocated + 1 == s->cache->objs_per_slab) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	mutex_unlock(&jit_alloc_lock);
+}
+EXPORT_SYMBOL_GPL(jit_free);
+
+void *jit_alloc(void *buf, size_t len)
+{
+	unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
+	void *p;
+
+	if (jit_cache_idx < NR_JIT_CACHES) {
+		mutex_lock(&jit_alloc_lock);
+		p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
+		mutex_unlock(&jit_alloc_lock);
+	} else {
+		p = module_alloc(len);
+		if (p) {
+			set_vm_flush_reset_perms(p);
+			set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
+		}
+	}
+
+	if (p && buf)
+		jit_update(p, buf, len);
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(jit_alloc);
+
+static int __init jit_alloc_init(void)
+{
+	for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
+		jit_caches[i].obj_size_bits	= ilog2(JITALLOC_MIN_SIZE) + i;
+		jit_caches[i].objs_per_slab	= PAGE_SIZE >> jit_caches[i].obj_size_bits;
+
+		INIT_LIST_HEAD(&jit_caches[i].partial);
+	}
+
+	return 0;
+}
+core_initcall(jit_alloc_init);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-26 22:29   ` Kent Overstreet
@ 2023-05-26 23:09     ` Song Liu
  2023-05-26 23:39       ` Kent Overstreet
  0 siblings, 1 reply; 22+ messages in thread
From: Song Liu @ 2023-05-26 23:09 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Fri, May 26, 2023 at 3:30 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote:
> > Introduce memory type aware module_alloc_type, which provides a unified
> > allocator for all different archs. This work was discussed in [1].
> >
> > Each arch can configure the allocator to do the following:
> >
> >    1. Specify module_vaddr and module_end
> >    2. Random module start address for KASLR
> >    3. kasan_alloc_module_shadow()
> >    4. kasan_reset_tag()
> >    5. Preferred and secondary module address ranges
> >
> > enum mod_alloc_params_flags are used to control the behavior of
> > module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type
> > fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let
> > module_alloc_type to protect the memory before returning to the user.
> >
> > module_allocator_init() call is added to start_kernel() to initialize
> > module_alloc_type.
> >
>
> ...
>
> > +/**
> > + * struct vmalloc_params - Parameters to call __vmalloc_node_range()
> > + * @start:          Address space range start
> > + * @end:            Address space range end
> > + * @gfp_mask:       The gfp_t mask used for this range
> > + * @pgprot:         The page protection for this range
> > + * @vm_flags        The vm_flag used for this range
> > + */
> > +struct vmalloc_params {
> > +     unsigned long   start;
> > +     unsigned long   end;
> > +     gfp_t           gfp_mask;
> > +     pgprot_t        pgprot;
> > +     unsigned long   vm_flags;
> > +};
> > +
> > +/**
> > + * struct mod_alloc_params - Parameters for module allocation type
> > + * @flags:          Properties in mod_alloc_params_flags
> > + * @granularity:    The allocation granularity (PAGE/PMD) in bytes
> > + * @alignment:      The allocation alignment requirement
> > + * @vmp:            Parameters used to call vmalloc
> > + * @fill:           Function to fill allocated space. If NULL, use memcpy()
> > + * @invalidate:     Function to invalidate memory space.
> > + *
> > + * If @granularity > @alignment the allocation can reuse free space in
> > + * previously allocated pages. If they are the same, then fresh pages
> > + * have to be allocated.
> > + */
> > +struct mod_alloc_params {
> > +     unsigned int            flags;
> > +     unsigned int            granularity;
> > +     unsigned int            alignment;
> > +     struct vmalloc_params   vmp[MOD_MAX_ADDR_SPACES];
> > +     void *                  (*fill)(void *dst, const void *src, size_t len);
> > +     void *                  (*invalidate)(void *ptr, size_t len);
> > +};
> > +
> > +struct mod_type_allocator {
> > +     struct mod_alloc_params params;
> > +};
> > +
> > +struct mod_allocators {
> > +     struct mod_type_allocator *types[MOD_MEM_NUM_TYPES];
> > +};
> > +
> > +void *module_alloc_type(size_t size, enum mod_mem_type type);
> > +void module_memfree_type(void *ptr, enum mod_mem_type type);
> > +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type);
> > +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type);
> > +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type);
> > +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> > +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type);
> > +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> > +void module_alloc_type_init(struct mod_allocators *allocators);
>
> This is a pretty big and complicated interface, and I haven't been able
> to find any reasoning or justification for why it's needed.
>
> Why is this going in module.h? Don't do that, this is supposed to be a
> general purpose allocator for executable memory so start a new file.

The goal of this work is to build a memory allocator for modules, text,
rw data, and ro data. So it is not the same as execmem_alloc or jitalloc.

>
> What is vmalloc_params doing here? Why is it needed? Can we just get rid
> of it?

We need it to have an interface for all archs. They are all using different
variations of vmalloc for module_alloc.

>
> module_memory_protect() and module_memory_unprotect() looks like a
> completely broken interface for supporting sub-page allocations -
> different threads with different allocations on the same page will race.

These two APIs allow the core code work with all archs. They won't break
sub-page allocations. (not all archs will have sub-page allocations)

OTOH, the "force" version of the two APIs should be removed later. In this
set, we only need them for arch_prepare_bpf_trampoline(). I plan to remove
this limitation for x86 soon.

>
> The text_poke() abstraction works; the exposed interface should look
> like that.
>
> Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is
> read-only. You should be able to kill mod_alloc_params entirely.
>
> Our other memory allocators don't expose kasan - why does this one?
> Again, that looks wrong.
>
> I would like to see something much simpler (that doesn't encode awkward
> assumptions from module and bpf!): please look at the code I sketched
> out below and tell me why this interface won't work - or if it can, go
> with _that_.

I think we need to align the goal here. PS: we did align the goal about
6 months ago when I proposed the execmem_alloc() set.

My current goal is to build an allocator for all the use cases of modules,
text, data, rodata, etc. Then the same allocator can be used by other
dynamic kernel text: bpf, ftrace, kprobe, bcachefs, etc.

OTOH, jit_alloc (as-is) solves a subset of the problem (only the text).
We can probably use it (with some updates) instead of the vmap_area
based allocator. But I guess we need to align the direction first.

Thanks,
Song

>
> commit 80ceffd6b1108afc7593bdf060954c73eaf0ffc4
> Author: Kent Overstreet <kent.overstreet@linux.dev>
> Date:   Wed May 17 01:22:06 2023 -0400
>
>     mm: jit/text allocator
>
>     This provides a new, very simple slab allocator for jit/text, i.e. bpf,
>     ftrace trampolines, or bcachefs unpack functions.
>
>     With this API we can avoid ever mapping pages both writeable and
>     executable (not implemented in this patch: need to tweak
>     module_alloc()), and it also supports sub-page sized allocations.
>
>     Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
>
> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
> new file mode 100644
> index 0000000000..f1549d60e8
> --- /dev/null
> +++ b/include/linux/jitalloc.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_JITALLOC_H
> +#define _LINUX_JITALLOC_H
> +
> +void jit_update(void *buf, void *new_buf, size_t len);
> +void jit_free(void *buf);
> +void *jit_alloc(void *buf, size_t len);
> +
> +#endif /* _LINUX_JITALLOC_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4751031f3f..ff26a4f0c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
>           This option has a per-memcg and per-node memory overhead.
>  # }
>
> +config JITALLOC
> +       bool
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index c03e1e5859..25e82db9e8 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
>  obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
>  obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
>  obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_JITALLOC) += jitalloc.o
> diff --git a/mm/jitalloc.c b/mm/jitalloc.c
> new file mode 100644
> index 0000000000..7c4d621802
> --- /dev/null
> +++ b/mm/jitalloc.c
> @@ -0,0 +1,187 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/gfp.h>
> +#include <linux/highmem.h>
> +#include <linux/jitalloc.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/mutex.h>
> +#include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/text-patching.h>
> +
> +static DEFINE_MUTEX(jit_alloc_lock);
> +
> +struct jit_cache {
> +       unsigned                obj_size_bits;
> +       unsigned                objs_per_slab;
> +       struct list_head        partial;
> +};
> +
> +#define JITALLOC_MIN_SIZE      16
> +#define NR_JIT_CACHES          ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
> +
> +static struct jit_cache jit_caches[NR_JIT_CACHES];
> +
> +struct jit_slab {
> +       unsigned long           __page_flags;
> +
> +       struct jit_cache        *cache;
> +       void                    *executably_mapped;;
> +       unsigned long           *objs_allocated; /* bitmap of free objects */
> +       struct list_head        list;
> +};
> +
> +#define folio_jit_slab(folio)          (_Generic((folio),                      \
> +       const struct folio *:           (const struct jit_slab *)(folio),       \
> +       struct folio *:                 (struct jit_slab *)(folio)))
> +
> +#define jit_slab_folio(s)              (_Generic((s),                          \
> +       const struct jit_slab *:        (const struct folio *)s,                \
> +       struct jit_slab *:              (struct folio *)s))
> +
> +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
> +{
> +       void *executably_mapped = module_alloc(PAGE_SIZE);
> +       struct page *page;
> +       struct folio *folio;
> +       struct jit_slab *slab;
> +       unsigned long *objs_allocated;
> +
> +       if (!executably_mapped)
> +               return NULL;
> +
> +       objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
> +       if (!objs_allocated ) {
> +               vfree(executably_mapped);
> +               return NULL;
> +       }
> +
> +       set_vm_flush_reset_perms(executably_mapped);
> +       set_memory_rox((unsigned long) executably_mapped, 1);
> +
> +       page = vmalloc_to_page(executably_mapped);
> +       folio = page_folio(page);
> +
> +       __folio_set_slab(folio);
> +       slab                    = folio_jit_slab(folio);
> +       slab->cache             = cache;
> +       slab->executably_mapped = executably_mapped;
> +       slab->objs_allocated = objs_allocated;
> +       INIT_LIST_HEAD(&slab->list);
> +
> +       return slab;
> +}
> +
> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> +       struct jit_slab *s =
> +               list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> +               jit_slab_alloc(cache);
> +       unsigned obj_idx, nr_allocated;
> +
> +       if (!s)
> +               return NULL;
> +
> +       obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> +       BUG_ON(obj_idx >= cache->objs_per_slab);
> +       __set_bit(obj_idx, s->objs_allocated);
> +
> +       nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +       if (nr_allocated == s->cache->objs_per_slab) {
> +               list_del_init(&s->list);
> +       } else if (nr_allocated == 1) {
> +               list_del(&s->list);
> +               list_add(&s->list, &s->cache->partial);
> +       }
> +
> +       return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}
> +
> +void jit_update(void *buf, void *new_buf, size_t len)
> +{
> +       text_poke_copy(buf, new_buf, len);
> +}
> +EXPORT_SYMBOL_GPL(jit_update);
> +
> +void jit_free(void *buf)
> +{
> +       struct page *page;
> +       struct folio *folio;
> +       struct jit_slab *s;
> +       unsigned obj_idx, nr_allocated;
> +       size_t offset;
> +
> +       if (!buf)
> +               return;
> +
> +       page    = vmalloc_to_page(buf);
> +       folio   = page_folio(page);
> +       offset  = offset_in_folio(folio, buf);
> +
> +       if (!folio_test_slab(folio)) {
> +               vfree(buf);
> +               return;
> +       }
> +
> +       s = folio_jit_slab(folio);
> +
> +       mutex_lock(&jit_alloc_lock);
> +       obj_idx = offset >> s->cache->obj_size_bits;
> +
> +       __clear_bit(obj_idx, s->objs_allocated);
> +
> +       nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +       if (nr_allocated == 0) {
> +               list_del(&s->list);
> +               kfree(s->objs_allocated);
> +               folio_put(folio);
> +       } else if (nr_allocated + 1 == s->cache->objs_per_slab) {
> +               list_del(&s->list);
> +               list_add(&s->list, &s->cache->partial);
> +       }
> +
> +       mutex_unlock(&jit_alloc_lock);
> +}
> +EXPORT_SYMBOL_GPL(jit_free);
> +
> +void *jit_alloc(void *buf, size_t len)
> +{
> +       unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
> +       void *p;
> +
> +       if (jit_cache_idx < NR_JIT_CACHES) {
> +               mutex_lock(&jit_alloc_lock);
> +               p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
> +               mutex_unlock(&jit_alloc_lock);
> +       } else {
> +               p = module_alloc(len);
> +               if (p) {
> +                       set_vm_flush_reset_perms(p);
> +                       set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
> +               }
> +       }
> +
> +       if (p && buf)
> +               jit_update(p, buf, len);
> +
> +       return p;
> +}
> +EXPORT_SYMBOL_GPL(jit_alloc);
> +
> +static int __init jit_alloc_init(void)
> +{
> +       for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
> +               jit_caches[i].obj_size_bits     = ilog2(JITALLOC_MIN_SIZE) + i;
> +               jit_caches[i].objs_per_slab     = PAGE_SIZE >> jit_caches[i].obj_size_bits;
> +
> +               INIT_LIST_HEAD(&jit_caches[i].partial);
> +       }
> +
> +       return 0;
> +}
> +core_initcall(jit_alloc_init);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-26 23:09     ` Song Liu
@ 2023-05-26 23:39       ` Kent Overstreet
  2023-05-27  0:03         ` Song Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2023-05-26 23:39 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Fri, May 26, 2023 at 04:09:01PM -0700, Song Liu wrote:
> On Fri, May 26, 2023 at 3:30 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote:
> > > Introduce memory type aware module_alloc_type, which provides a unified
> > > allocator for all different archs. This work was discussed in [1].
> > >
> > > Each arch can configure the allocator to do the following:
> > >
> > >    1. Specify module_vaddr and module_end
> > >    2. Random module start address for KASLR
> > >    3. kasan_alloc_module_shadow()
> > >    4. kasan_reset_tag()
> > >    5. Preferred and secondary module address ranges
> > >
> > > enum mod_alloc_params_flags are used to control the behavior of
> > > module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type
> > > fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let
> > > module_alloc_type to protect the memory before returning to the user.
> > >
> > > module_allocator_init() call is added to start_kernel() to initialize
> > > module_alloc_type.
> > >
> >
> > ...
> >
> > > +/**
> > > + * struct vmalloc_params - Parameters to call __vmalloc_node_range()
> > > + * @start:          Address space range start
> > > + * @end:            Address space range end
> > > + * @gfp_mask:       The gfp_t mask used for this range
> > > + * @pgprot:         The page protection for this range
> > > + * @vm_flags        The vm_flag used for this range
> > > + */
> > > +struct vmalloc_params {
> > > +     unsigned long   start;
> > > +     unsigned long   end;
> > > +     gfp_t           gfp_mask;
> > > +     pgprot_t        pgprot;
> > > +     unsigned long   vm_flags;
> > > +};
> > > +
> > > +/**
> > > + * struct mod_alloc_params - Parameters for module allocation type
> > > + * @flags:          Properties in mod_alloc_params_flags
> > > + * @granularity:    The allocation granularity (PAGE/PMD) in bytes
> > > + * @alignment:      The allocation alignment requirement
> > > + * @vmp:            Parameters used to call vmalloc
> > > + * @fill:           Function to fill allocated space. If NULL, use memcpy()
> > > + * @invalidate:     Function to invalidate memory space.
> > > + *
> > > + * If @granularity > @alignment the allocation can reuse free space in
> > > + * previously allocated pages. If they are the same, then fresh pages
> > > + * have to be allocated.
> > > + */
> > > +struct mod_alloc_params {
> > > +     unsigned int            flags;
> > > +     unsigned int            granularity;
> > > +     unsigned int            alignment;
> > > +     struct vmalloc_params   vmp[MOD_MAX_ADDR_SPACES];
> > > +     void *                  (*fill)(void *dst, const void *src, size_t len);
> > > +     void *                  (*invalidate)(void *ptr, size_t len);
> > > +};
> > > +
> > > +struct mod_type_allocator {
> > > +     struct mod_alloc_params params;
> > > +};
> > > +
> > > +struct mod_allocators {
> > > +     struct mod_type_allocator *types[MOD_MEM_NUM_TYPES];
> > > +};
> > > +
> > > +void *module_alloc_type(size_t size, enum mod_mem_type type);
> > > +void module_memfree_type(void *ptr, enum mod_mem_type type);
> > > +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type);
> > > +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_alloc_type_init(struct mod_allocators *allocators);
> >
> > This is a pretty big and complicated interface, and I haven't been able
> > to find any reasoning or justification for why it's needed.
> >
> > Why is this going in module.h? Don't do that, this is supposed to be a
> > general purpose allocator for executable memory so start a new file.
> 
> The goal of this work is to build a memory allocator for modules, text,
> rw data, and ro data. So it is not the same as execmem_alloc or jitalloc.
> 
> >
> > What is vmalloc_params doing here? Why is it needed? Can we just get rid
> > of it?
> 
> We need it to have an interface for all archs. They are all using different
> variations of vmalloc for module_alloc.

But it should be an internal implementation detail, I don't think we
want the external interface tied to vmalloc - 

> These two APIs allow the core code work with all archs. They won't break
> sub-page allocations. (not all archs will have sub-page allocations)

So yes, text_poke() doesn't work on all architectures, and we need a
fallback.

But if the fallback needs to go the unprotect/protect route, I think we
need to put that in the helper, and not expose it. Part of what people
are wanting is to limit or eliminate pages that are RWX, so we
definitely shouldn't be creating new interfaces to flipping page
permissions: that should be pushed down to as low a level as possible.

E.g., with my jitalloc_update(), a more complete version would look like

void jitalloc_update(void *dst, void *src, size_t len)
{
	if (text_poke_available) {
		text_poke(...);
	} else {
		unprotect();
		memcpy();
		protect();
	}
}

See? We provide a simpler, safer interface, and this also lets us handle
multiple threads racing to update/flip permissions on the same page in a
single place (e.g. with sticking a lock somewhere in the jitalloc
structures).

Or ideally, we'd have a kmap_local() variant that actually creates a new
mapping on all architectures, and we'd just use that to do the update
without any arch-dependent code - if we get that, having this helper
means we'll only have to change a single place later.

> 
> OTOH, the "force" version of the two APIs should be removed later. In this
> set, we only need them for arch_prepare_bpf_trampoline(). I plan to remove
> this limitation for x86 soon.
> 
> >
> > The text_poke() abstraction works; the exposed interface should look
> > like that.
> >
> > Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is
> > read-only. You should be able to kill mod_alloc_params entirely.
> >
> > Our other memory allocators don't expose kasan - why does this one?
> > Again, that looks wrong.
> >
> > I would like to see something much simpler (that doesn't encode awkward
> > assumptions from module and bpf!): please look at the code I sketched
> > out below and tell me why this interface won't work - or if it can, go
> > with _that_.
> 
> I think we need to align the goal here. PS: we did align the goal about
> 6 months ago when I proposed the execmem_alloc() set.

Yeah, I know what it feels like to get yanked around by different
reviewers who see and join in the discussion at different times. Sorry :)

Having a rational/design doc in the header file would help a _lot_. 

> My current goal is to build an allocator for all the use cases of modules,
> text, data, rodata, etc. Then the same allocator can be used by other
> dynamic kernel text: bpf, ftrace, kprobe, bcachefs, etc.

This must have been part of the previous discussion since you started
with execmem_alloc(), but I did scan through that and I'm still not
seeing the justification for why this needs to handle non-text
allocations.

If I had to hazard a guess it would be because of tglx wanting to solve
tlb fragmentation for modules, but to me that doesn't sound like a good
enough reason and it looks like we're ending up with a messy "try to be
all things to all people" interface as a result :/

Do we _really_ need that?

Mike was just saying at LSF that direct map fragmentation turned out to
be a non issue for performance for non-text, so maybe we don't.

Also - module_memory_fill_type(), module_memory_invalidate_type() - I
know these are for BPF, but could you explain why we need them in the
external interface here? Could they perhaps be small helpers in the bpf
code that use something like jitalloc_update()?

> OTOH, jit_alloc (as-is) solves a subset of the problem (only the text).
> We can probably use it (with some updates) instead of the vmap_area
> based allocator. But I guess we need to align the direction first.

Let's see if we can get tglx to chime in again, since he was pretty
vocal in the last discussion.

It's also good practice to try to summarize all the relevant "whys" from
the discussion that went into a feature and put that in at least the
commit message - or even better, header file comments.

Also, organization: the module code is a huge mess, let's _please_ do
split this out and think about organization a bit more, not add to the
mess :)

Cheers,
Kent

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-26 23:39       ` Kent Overstreet
@ 2023-05-27  0:03         ` Song Liu
  2023-05-27  3:19           ` Kent Overstreet
  0 siblings, 1 reply; 22+ messages in thread
From: Song Liu @ 2023-05-27  0:03 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Fri, May 26, 2023 at 4:39 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
[...]

>
> But it should be an internal implementation detail, I don't think we
> want the external interface tied to vmalloc -
>
> > These two APIs allow the core code work with all archs. They won't break
> > sub-page allocations. (not all archs will have sub-page allocations)
>
> So yes, text_poke() doesn't work on all architectures, and we need a
> fallback.
>
> But if the fallback needs to go the unprotect/protect route, I think we
> need to put that in the helper, and not expose it. Part of what people
> are wanting is to limit or eliminate pages that are RWX, so we
> definitely shouldn't be creating new interfaces to flipping page
> permissions: that should be pushed down to as low a level as possible.
>
> E.g., with my jitalloc_update(), a more complete version would look like
>
> void jitalloc_update(void *dst, void *src, size_t len)
> {
>         if (text_poke_available) {
>                 text_poke(...);
>         } else {
>                 unprotect();
>                 memcpy();
>                 protect();
>         }
> }

I think this doesn't work for sub page allocation?

>
> See? We provide a simpler, safer interface, and this also lets us handle
> multiple threads racing to update/flip permissions on the same page in a
> single place (e.g. with sticking a lock somewhere in the jitalloc
> structures).
[...]
>
> This must have been part of the previous discussion since you started
> with execmem_alloc(), but I did scan through that and I'm still not
> seeing the justification for why this needs to handle non-text
> allocations.
>
> If I had to hazard a guess it would be because of tglx wanting to solve
> tlb fragmentation for modules, but to me that doesn't sound like a good
> enough reason and it looks like we're ending up with a messy "try to be
> all things to all people" interface as a result :/
>
> Do we _really_ need that?
>
> Mike was just saying at LSF that direct map fragmentation turned out to
> be a non issue for performance for non-text, so maybe we don't.

At the end of all this, we will have modules running from huge pages, data
and text. It will give significant performance benefit when some key driver
cannot be compiled into the kernel.

>
> Also - module_memory_fill_type(), module_memory_invalidate_type() - I
> know these are for BPF, but could you explain why we need them in the
> external interface here? Could they perhaps be small helpers in the bpf
> code that use something like jitalloc_update()?

These are used by all users, not just BPF. 1/3 uses them in
kernel/module/main.c. I didn't use them in 3/3 as it is arch code, but I can
use them instead of text_poke_* (and that is probably better code style).
As I am writing the email, I think I should use it in ftrace.c (2/3) to avoid
the __weak function.

>
> > OTOH, jit_alloc (as-is) solves a subset of the problem (only the text).
> > We can probably use it (with some updates) instead of the vmap_area
> > based allocator. But I guess we need to align the direction first.
>
> Let's see if we can get tglx to chime in again, since he was pretty
> vocal in the last discussion.
>
> It's also good practice to try to summarize all the relevant "whys" from
> the discussion that went into a feature and put that in at least the
> commit message - or even better, header file comments.
>
> Also, organization: the module code is a huge mess, let's _please_ do
> split this out and think about organization a bit more, not add to the
> mess :)

I don't really think module code is very messy at the moment. If
necessary, we can put module_alloc_type related code in a
separate file.

Thanks,
Song

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-27  0:03         ` Song Liu
@ 2023-05-27  3:19           ` Kent Overstreet
  2023-05-27  6:00             ` Song Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2023-05-27  3:19 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Fri, May 26, 2023 at 05:03:18PM -0700, Song Liu wrote:
> On Fri, May 26, 2023 at 4:39 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> [...]
> 
> >
> > But it should be an internal implementation detail, I don't think we
> > want the external interface tied to vmalloc -
> >
> > > These two APIs allow the core code work with all archs. They won't break
> > > sub-page allocations. (not all archs will have sub-page allocations)
> >
> > So yes, text_poke() doesn't work on all architectures, and we need a
> > fallback.
> >
> > But if the fallback needs to go the unprotect/protect route, I think we
> > need to put that in the helper, and not expose it. Part of what people
> > are wanting is to limit or eliminate pages that are RWX, so we
> > definitely shouldn't be creating new interfaces to flipping page
> > permissions: that should be pushed down to as low a level as possible.
> >
> > E.g., with my jitalloc_update(), a more complete version would look like
> >
> > void jitalloc_update(void *dst, void *src, size_t len)
> > {
> >         if (text_poke_available) {
> >                 text_poke(...);
> >         } else {
> >                 unprotect();
> >                 memcpy();
> >                 protect();
> >         }
> > }
> 
> I think this doesn't work for sub page allocation?

Perhaps I elided too much - it does if you add a single lock. You can't
do that if it's not a common helper.

> At the end of all this, we will have modules running from huge pages, data
> and text. It will give significant performance benefit when some key driver
> cannot be compiled into the kernel.

Yeah, I've seen the numbers for the perf impact of running as a module
due to the extra TLB overhead - but Mike's recent data was showing that
this doesn't matter nearly as much as data as it does for text.

> > Also - module_memory_fill_type(), module_memory_invalidate_type() - I
> > know these are for BPF, but could you explain why we need them in the
> > external interface here? Could they perhaps be small helpers in the bpf
> > code that use something like jitalloc_update()?
> 
> These are used by all users, not just BPF. 1/3 uses them in
> kernel/module/main.c. I didn't use them in 3/3 as it is arch code, but I can
> use them instead of text_poke_* (and that is probably better code style).
> As I am writing the email, I think I should use it in ftrace.c (2/3) to avoid
> the __weak function.

Ok. Could we make it clearer why they're needed and what they're for?

I know bpf fills in invalid instructions initially; I didn't read
through enough code to find the "why", and let's document that to save
other people the same effort.

And do they really need to be callbacks in mod_alloc_params...?

Do we need the other things in mod_alloc_params/vmalloc_params?
 - your granularity field says it's for specifying PAGE/PMD size: we
   definitely do not want that. We've had way too much code along the
   lines of "implement hugepages for x", we need to stop doing that.
   It's an internal mm/ detail.

 - vmalloc_params: we need gfp_t and pgprot_t, but those should be
   normal arguments. start/end/vm_flags? They seem like historical
   module baggage, can we get rid of them?

> > > OTOH, jit_alloc (as-is) solves a subset of the problem (only the text).
> > > We can probably use it (with some updates) instead of the vmap_area
> > > based allocator. But I guess we need to align the direction first.
> >
> > Let's see if we can get tglx to chime in again, since he was pretty
> > vocal in the last discussion.
> >
> > It's also good practice to try to summarize all the relevant "whys" from
> > the discussion that went into a feature and put that in at least the
> > commit message - or even better, header file comments.
> >
> > Also, organization: the module code is a huge mess, let's _please_ do
> > split this out and think about organization a bit more, not add to the
> > mess :)
> 
> I don't really think module code is very messy at the moment. If
> necessary, we can put module_alloc_type related code in a
> separate file.

Hey, it's been organized since I last looked at it :) But I tihink this
belongs in mm/, not module code.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] module: Introduce module_alloc_type
  2023-05-27  3:19           ` Kent Overstreet
@ 2023-05-27  6:00             ` Song Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Song Liu @ 2023-05-27  6:00 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Fri, May 26, 2023 at 8:20 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
[...]
> > > void jitalloc_update(void *dst, void *src, size_t len)
> > > {
> > >         if (text_poke_available) {
> > >                 text_poke(...);
> > >         } else {
> > >                 unprotect();
> > >                 memcpy();
> > >                 protect();
> > >         }
> > > }
> >
> > I think this doesn't work for sub page allocation?
>
> Perhaps I elided too much - it does if you add a single lock. You can't
> do that if it's not a common helper.

Actually, sub page allocation is not the problem. The problem is
with unprotect() and protect(), as they require global TLB flush.

>
> > At the end of all this, we will have modules running from huge pages, data
> > and text. It will give significant performance benefit when some key driver
> > cannot be compiled into the kernel.
>
> Yeah, I've seen the numbers for the perf impact of running as a module
> due to the extra TLB overhead - but Mike's recent data was showing that
> this doesn't matter nearly as much as data as it does for text.
>
> > > Also - module_memory_fill_type(), module_memory_invalidate_type() - I
> > > know these are for BPF, but could you explain why we need them in the
> > > external interface here? Could they perhaps be small helpers in the bpf
> > > code that use something like jitalloc_update()?
> >
> > These are used by all users, not just BPF. 1/3 uses them in
> > kernel/module/main.c. I didn't use them in 3/3 as it is arch code, but I can
> > use them instead of text_poke_* (and that is probably better code style).
> > As I am writing the email, I think I should use it in ftrace.c (2/3) to avoid
> > the __weak function.
>
> Ok. Could we make it clearer why they're needed and what they're for?
>
> I know bpf fills in invalid instructions initially; I didn't read
> through enough code to find the "why", and let's document that to save
> other people the same effort.
>
> And do they really need to be callbacks in mod_alloc_params...?

If we want to call them from common code, they will either be callback
or some __weak functions.

>
> Do we need the other things in mod_alloc_params/vmalloc_params?
>  - your granularity field says it's for specifying PAGE/PMD size: we
>    definitely do not want that. We've had way too much code along the
>    lines of "implement hugepages for x", we need to stop doing that.
>    It's an internal mm/ detail.

We don't need "granularity" yet. We will need them with the binpack
allocator.

>  - vmalloc_params: we need gfp_t and pgprot_t, but those should be
>    normal arguments. start/end/vm_flags? They seem like historical
>    module baggage, can we get rid of them?

All these fields are needed by some of the module_alloc()s for different
archs. I am not an expert of all the archs, so I cannot say whether some
of them are not needed.
[...]
> > I don't really think module code is very messy at the moment. If
> > necessary, we can put module_alloc_type related code in a
> > separate file.
>
> Hey, it's been organized since I last looked at it :) But I tihink this
> belongs in mm/, not module code.

I don't have a strong preference for this (at the moment). If we agree
it belongs to mm/, we sure can move it.

Thanks,
Song

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-26  5:15 [PATCH 0/3] Type aware module allocator Song Liu
                   ` (2 preceding siblings ...)
  2023-05-26  5:15 ` [PATCH 3/3] x86/module: Use module_alloc_type Song Liu
@ 2023-05-27  7:04 ` Kent Overstreet
  2023-05-28  5:58   ` Song Liu
  3 siblings, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2023-05-27  7:04 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Thu, May 25, 2023 at 10:15:26PM -0700, Song Liu wrote:
> This set implements the second part of module type aware allocator
> (module_alloc_type), which was discussed in [1]. This part contains the
> interface of the new allocator, as well as changes in x86 code to use the
> new allocator (modules, BPF, ftrace, kprobe).
> 
> The set does not contain a binpack allocator to enable sharing huge pages
> among different allocations. But this set defines the interface used by
> the binpack allocator. [2] has some discussion on different options of the
> binpack allocator.

I'm afraid the more I look at this patchset the more it appears to be
complete nonsense.

The exposed interface appears to be both for specifying architecture
dependent options (which should be hidden inside the allocator
internals!) _and_ a general purpose interface for module/bpf/kprobes -
but it's not very clear, and the rational appears to be completely
missing.

I think this needs to back to the drawing board and we need something
simpler just targeted at executable memory; architecture specific
options should definitely _not_ be part of the exposed interface.

The memory protection interface also needs to go, we've got a better
interface to model after (text_poke(), although that code needs work
too!). And the instruction fill features need a thorough justification
if they're to be included.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-27  7:04 ` [PATCH 0/3] Type aware module allocator Kent Overstreet
@ 2023-05-28  5:58   ` Song Liu
  2023-05-29 10:45     ` Mike Rapoport
  2023-05-29 18:25     ` Kent Overstreet
  0 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2023-05-28  5:58 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Sat, May 27, 2023 at 12:04 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 25, 2023 at 10:15:26PM -0700, Song Liu wrote:
> > This set implements the second part of module type aware allocator
> > (module_alloc_type), which was discussed in [1]. This part contains the
> > interface of the new allocator, as well as changes in x86 code to use the
> > new allocator (modules, BPF, ftrace, kprobe).
> >
> > The set does not contain a binpack allocator to enable sharing huge pages
> > among different allocations. But this set defines the interface used by
> > the binpack allocator. [2] has some discussion on different options of the
> > binpack allocator.
>
> I'm afraid the more I look at this patchset the more it appears to be
> complete nonsense.

I don't think it is nonsense, as you clearly got most of the points here. :)

>
> The exposed interface appears to be both for specifying architecture
> dependent options (which should be hidden inside the allocator
> internals!) _and_ a general purpose interface for module/bpf/kprobes -
> but it's not very clear, and the rational appears to be completely
> missing.

The rationale is to have something to replace module_alloc(). Therefore,
it needs to handle architecture specific requirements, and provide
interface to all current users of module_alloc(). It may look a little weird
at the moment, because the actual allocator logic is very thin. But that's
where we will plug in the next step of the work.

>
> I think this needs to back to the drawing board and we need something
> simpler just targeted at executable memory; architecture specific
> options should definitely _not_ be part of the exposed interface.

I don't think we are exposing architecture specific options to users.
Some layer need to handle arch specifics. If the new allocator is
built on top of module_alloc, module_alloc is handling that. If the new
allocator is to replace module_alloc, it needs to handle arch specifics.

>
> The memory protection interface also needs to go, we've got a better
> interface to model after (text_poke(), although that code needs work
> too!). And the instruction fill features need a thorough justification
> if they're to be included.

I guess the first step to use text_poke() is to make it available on all
archs? That doesn't seem easy to me.

Thanks,
Song

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-28  5:58   ` Song Liu
@ 2023-05-29 10:45     ` Mike Rapoport
  2023-05-30 22:37       ` Song Liu
  2023-05-29 18:25     ` Kent Overstreet
  1 sibling, 1 reply; 22+ messages in thread
From: Mike Rapoport @ 2023-05-29 10:45 UTC (permalink / raw)
  To: Song Liu; +Cc: Kent Overstreet, linux-kernel, bpf, mcgrof, peterz, tglx, x86

On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> On Sat, May 27, 2023 at 12:04 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > I think this needs to back to the drawing board and we need something
> > simpler just targeted at executable memory; architecture specific
> > options should definitely _not_ be part of the exposed interface.
> 
> I don't think we are exposing architecture specific options to users.
> Some layer need to handle arch specifics. If the new allocator is
> built on top of module_alloc, module_alloc is handling that. If the new
> allocator is to replace module_alloc, it needs to handle arch specifics.
 
I'm for creating a new allocator that will replace module_alloc(). This
will give us a clean abstraction that modules and all the rest will use and
it will make easier to plug binpack or another allocator instead of
vmalloc.
Another point is with a new allocator we won't have weird dependencies on
CONFIG_MODULE in e.g. bpf and kprobes.

I'll have something ready to post as an RFC in a few days.

> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-28  5:58   ` Song Liu
  2023-05-29 10:45     ` Mike Rapoport
@ 2023-05-29 18:25     ` Kent Overstreet
  2023-05-30 22:48       ` Song Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Kent Overstreet @ 2023-05-29 18:25 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> I don't think we are exposing architecture specific options to users.
> Some layer need to handle arch specifics. If the new allocator is
> built on top of module_alloc, module_alloc is handling that. If the new
> allocator is to replace module_alloc, it needs to handle arch specifics.

Ok, I went back and read more thoroughly, I got this part wrong. The
actual interface is the mod_mem_type enum, not mod_alloc_params or
vmalloc_params.

So this was my main complaint, but this actually looks ok now.

It would be better to have those structs in a .c file, not the header
file - it looks like those are the public interface the way you have it.

> > The memory protection interface also needs to go, we've got a better
> > interface to model after (text_poke(), although that code needs work
> > too!). And the instruction fill features need a thorough justification
> > if they're to be included.
> 
> I guess the first step to use text_poke() is to make it available on all
> archs? That doesn't seem easy to me.

We just need a helper that either calls text_poke() or does the page
permission dance in a single place.

If we do the same thing for other mod_mem_types, we could potentially
allow them to be shared on hugepages too.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-29 10:45     ` Mike Rapoport
@ 2023-05-30 22:37       ` Song Liu
  2023-05-31 13:51         ` Mike Rapoport
  0 siblings, 1 reply; 22+ messages in thread
From: Song Liu @ 2023-05-30 22:37 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kent Overstreet, linux-kernel, bpf, mcgrof, peterz, tglx, x86

On Mon, May 29, 2023 at 3:45 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > On Sat, May 27, 2023 at 12:04 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > I think this needs to back to the drawing board and we need something
> > > simpler just targeted at executable memory; architecture specific
> > > options should definitely _not_ be part of the exposed interface.
> >
> > I don't think we are exposing architecture specific options to users.
> > Some layer need to handle arch specifics. If the new allocator is
> > built on top of module_alloc, module_alloc is handling that. If the new
> > allocator is to replace module_alloc, it needs to handle arch specifics.
>
> I'm for creating a new allocator that will replace module_alloc(). This
> will give us a clean abstraction that modules and all the rest will use and
> it will make easier to plug binpack or another allocator instead of
> vmalloc.
>
> Another point is with a new allocator we won't have weird dependencies on
> CONFIG_MODULE in e.g. bpf and kprobes.
>
> I'll have something ready to post as an RFC in a few days.

I guess this RFC is similar to unmapped_alloc()? If it replaces
vmalloc, we can probably trim this set down a bit (remove
mod_alloc_params and vmalloc_params, etc.).

Thanks,
Song

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-29 18:25     ` Kent Overstreet
@ 2023-05-30 22:48       ` Song Liu
  2023-06-01 17:53         ` Kent Overstreet
  2023-06-01 18:21         ` Kent Overstreet
  0 siblings, 2 replies; 22+ messages in thread
From: Song Liu @ 2023-05-30 22:48 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Mon, May 29, 2023 at 11:25 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > I don't think we are exposing architecture specific options to users.
> > Some layer need to handle arch specifics. If the new allocator is
> > built on top of module_alloc, module_alloc is handling that. If the new
> > allocator is to replace module_alloc, it needs to handle arch specifics.
>
> Ok, I went back and read more thoroughly, I got this part wrong. The
> actual interface is the mod_mem_type enum, not mod_alloc_params or
> vmalloc_params.
>
> So this was my main complaint, but this actually looks ok now.
>
> It would be better to have those structs in a .c file, not the header
> file - it looks like those are the public interface the way you have it.

Thanks for this suggestion. It makes a lot of sense. But I am not quite
sure how we can avoid putting it in the header yet. I will take a closer
look. OTOH, if we plan to use Mike's new allocator to replace vmalloc,
we probably don't need this part.

>
> > > The memory protection interface also needs to go, we've got a better
> > > interface to model after (text_poke(), although that code needs work
> > > too!). And the instruction fill features need a thorough justification
> > > if they're to be included.
> >
> > I guess the first step to use text_poke() is to make it available on all
> > archs? That doesn't seem easy to me.
>
> We just need a helper that either calls text_poke() or does the page
> permission dance in a single place.

AFAICT, we don't have a global text_poke() API yet. I can take a look
into it (if it makes sense).

>
> If we do the same thing for other mod_mem_types, we could potentially
> allow them to be shared on hugepages too.

Yeah, that's part of the goal to extend the scope from executable to all
types.

Thanks,
Song

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-30 22:37       ` Song Liu
@ 2023-05-31 13:51         ` Mike Rapoport
  2023-05-31 17:03           ` Song Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Rapoport @ 2023-05-31 13:51 UTC (permalink / raw)
  To: Song Liu; +Cc: Kent Overstreet, linux-kernel, bpf, mcgrof, peterz, tglx, x86

On Tue, May 30, 2023 at 03:37:24PM -0700, Song Liu wrote:
> On Mon, May 29, 2023 at 3:45 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > > On Sat, May 27, 2023 at 12:04 AM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > I think this needs to back to the drawing board and we need something
> > > > simpler just targeted at executable memory; architecture specific
> > > > options should definitely _not_ be part of the exposed interface.
> > >
> > > I don't think we are exposing architecture specific options to users.
> > > Some layer need to handle arch specifics. If the new allocator is
> > > built on top of module_alloc, module_alloc is handling that. If the new
> > > allocator is to replace module_alloc, it needs to handle arch specifics.
> >
> > I'm for creating a new allocator that will replace module_alloc(). This
> > will give us a clean abstraction that modules and all the rest will use and
> > it will make easier to plug binpack or another allocator instead of
> > vmalloc.
> >
> > Another point is with a new allocator we won't have weird dependencies on
> > CONFIG_MODULE in e.g. bpf and kprobes.
> >
> > I'll have something ready to post as an RFC in a few days.
> 
> I guess this RFC is similar to unmapped_alloc()? If it replaces
> vmalloc, we can probably trim this set down a bit (remove
> mod_alloc_params and vmalloc_params, etc.).

No, it's not a new allocator. I'm trying to create an API for code
allocations that can accommodate all the architectures and it won't be a
part of modules code. The modules will use the new API just like every
other subsystem that needs to allocate code.

I've got a core part of it here:

https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=jitalloc/v1

and I hope I'll get it ready to post this week.

> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-31 13:51         ` Mike Rapoport
@ 2023-05-31 17:03           ` Song Liu
  2023-05-31 19:54             ` Mike Rapoport
  0 siblings, 1 reply; 22+ messages in thread
From: Song Liu @ 2023-05-31 17:03 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kent Overstreet, linux-kernel, bpf, mcgrof, peterz, tglx, x86

On Wed, May 31, 2023 at 6:51 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, May 30, 2023 at 03:37:24PM -0700, Song Liu wrote:
> > On Mon, May 29, 2023 at 3:45 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > > > On Sat, May 27, 2023 at 12:04 AM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > I think this needs to back to the drawing board and we need something
> > > > > simpler just targeted at executable memory; architecture specific
> > > > > options should definitely _not_ be part of the exposed interface.
> > > >
> > > > I don't think we are exposing architecture specific options to users.
> > > > Some layer need to handle arch specifics. If the new allocator is
> > > > built on top of module_alloc, module_alloc is handling that. If the new
> > > > allocator is to replace module_alloc, it needs to handle arch specifics.
> > >
> > > I'm for creating a new allocator that will replace module_alloc(). This
> > > will give us a clean abstraction that modules and all the rest will use and
> > > it will make easier to plug binpack or another allocator instead of
> > > vmalloc.
> > >
> > > Another point is with a new allocator we won't have weird dependencies on
> > > CONFIG_MODULE in e.g. bpf and kprobes.
> > >
> > > I'll have something ready to post as an RFC in a few days.
> >
> > I guess this RFC is similar to unmapped_alloc()? If it replaces
> > vmalloc, we can probably trim this set down a bit (remove
> > mod_alloc_params and vmalloc_params, etc.).
>
> No, it's not a new allocator. I'm trying to create an API for code
> allocations that can accommodate all the architectures and it won't be a
> part of modules code. The modules will use the new API just like every
> other subsystem that needs to allocate code.
>
> I've got a core part of it here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=jitalloc/v1

This branch looks like the same scope as this set (but with different
implementation). So it will still use vmalloc, right?

Thanks,
Song

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-31 17:03           ` Song Liu
@ 2023-05-31 19:54             ` Mike Rapoport
  0 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2023-05-31 19:54 UTC (permalink / raw)
  To: Song Liu; +Cc: Kent Overstreet, linux-kernel, bpf, mcgrof, peterz, tglx, x86

On Wed, May 31, 2023 at 10:03:58AM -0700, Song Liu wrote:
> On Wed, May 31, 2023 at 6:51 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, May 30, 2023 at 03:37:24PM -0700, Song Liu wrote:
> > > On Mon, May 29, 2023 at 3:45 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > > > > On Sat, May 27, 2023 at 12:04 AM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > I think this needs to back to the drawing board and we need something
> > > > > > simpler just targeted at executable memory; architecture specific
> > > > > > options should definitely _not_ be part of the exposed interface.
> > > > >
> > > > > I don't think we are exposing architecture specific options to users.
> > > > > Some layer need to handle arch specifics. If the new allocator is
> > > > > built on top of module_alloc, module_alloc is handling that. If the new
> > > > > allocator is to replace module_alloc, it needs to handle arch specifics.
> > > >
> > > > I'm for creating a new allocator that will replace module_alloc(). This
> > > > will give us a clean abstraction that modules and all the rest will use and
> > > > it will make easier to plug binpack or another allocator instead of
> > > > vmalloc.
> > > >
> > > > Another point is with a new allocator we won't have weird dependencies on
> > > > CONFIG_MODULE in e.g. bpf and kprobes.
> > > >
> > > > I'll have something ready to post as an RFC in a few days.
> > >
> > > I guess this RFC is similar to unmapped_alloc()? If it replaces
> > > vmalloc, we can probably trim this set down a bit (remove
> > > mod_alloc_params and vmalloc_params, etc.).
> >
> > No, it's not a new allocator. I'm trying to create an API for code
> > allocations that can accommodate all the architectures and it won't be a
> > part of modules code. The modules will use the new API just like every
> > other subsystem that needs to allocate code.
> >
> > I've got a core part of it here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=jitalloc/v1
> 
> This branch looks like the same scope as this set (but with different
> implementation). So it will still use vmalloc, right?

Yes, it still uses vmalloc. The idea is to decouple code allocations from
modules from one side and make it handle all the variants expected by the
architectures based on a set of parameters each architecture provides.

The first few commits essentially shuffle the code around and replace
arch::module_alloc() with arch::jit_alloc_params.

The commits on top enable some bits that are not available today, like ROX
executable memory and DYNAMIC_FTRACE without modules for x86.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-30 22:48       ` Song Liu
@ 2023-06-01 17:53         ` Kent Overstreet
  2023-06-01 18:21         ` Kent Overstreet
  1 sibling, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2023-06-01 17:53 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Tue, May 30, 2023 at 03:48:51PM -0700, Song Liu wrote:
> On Mon, May 29, 2023 at 11:25 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > > I don't think we are exposing architecture specific options to users.
> > > Some layer need to handle arch specifics. If the new allocator is
> > > built on top of module_alloc, module_alloc is handling that. If the new
> > > allocator is to replace module_alloc, it needs to handle arch specifics.
> >
> > Ok, I went back and read more thoroughly, I got this part wrong. The
> > actual interface is the mod_mem_type enum, not mod_alloc_params or
> > vmalloc_params.
> >
> > So this was my main complaint, but this actually looks ok now.
> >
> > It would be better to have those structs in a .c file, not the header
> > file - it looks like those are the public interface the way you have it.
> 
> Thanks for this suggestion. It makes a lot of sense. But I am not quite
> sure how we can avoid putting it in the header yet. I will take a closer
> look. OTOH, if we plan to use Mike's new allocator to replace vmalloc,
> we probably don't need this part.

The architectures previously exported constants that were used by
module_alloc(), why not stick with that?

> AFAICT, we don't have a global text_poke() API yet. I can take a look
> into it (if it makes sense).

Great

> Yeah, that's part of the goal to extend the scope from executable to all
> types.

Yeah it took me a bit to wrap my head around how this all makes sense -
it started out as just a better module_alloc(), then there was lots of
talk about hugepages.

I like it more now, looking forward to see how it fits together with
Mike's work :)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Type aware module allocator
  2023-05-30 22:48       ` Song Liu
  2023-06-01 17:53         ` Kent Overstreet
@ 2023-06-01 18:21         ` Kent Overstreet
  1 sibling, 0 replies; 22+ messages in thread
From: Kent Overstreet @ 2023-06-01 18:21 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, bpf, mcgrof, peterz, tglx, x86, rppt

On Tue, May 30, 2023 at 03:48:51PM -0700, Song Liu wrote:
> On Mon, May 29, 2023 at 11:25 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Sat, May 27, 2023 at 10:58:37PM -0700, Song Liu wrote:
> > > I don't think we are exposing architecture specific options to users.
> > > Some layer need to handle arch specifics. If the new allocator is
> > > built on top of module_alloc, module_alloc is handling that. If the new
> > > allocator is to replace module_alloc, it needs to handle arch specifics.
> >
> > Ok, I went back and read more thoroughly, I got this part wrong. The
> > actual interface is the mod_mem_type enum, not mod_alloc_params or
> > vmalloc_params.
> >
> > So this was my main complaint, but this actually looks ok now.
> >
> > It would be better to have those structs in a .c file, not the header
> > file - it looks like those are the public interface the way you have it.
> 
> Thanks for this suggestion. It makes a lot of sense. But I am not quite
> sure how we can avoid putting it in the header yet. I will take a closer
> look. OTOH, if we plan to use Mike's new allocator to replace vmalloc,
> we probably don't need this part.

Well, right now module_alloc() uses arch-exported constants, we could
keep doing that if it makes sense.

Or the structs could go in a separate header file that better indicates
what they're for. The main point is just that - when we're writing new
code and creating a new interface, it's very helpful if we can have the
header file basically be the documentation for what the external
interface is. Put your big kernel doc "what is this thing about" comment
in that file, too :)

> > We just need a helper that either calls text_poke() or does the page
> > permission dance in a single place.
> 
> AFAICT, we don't have a global text_poke() API yet. I can take a look
> into it (if it makes sense).

I don't think anyone's working on this yet, so if you're interested I
think it might be helpful.

> > If we do the same thing for other mod_mem_types, we could potentially
> > allow them to be shared on hugepages too.
> 
> Yeah, that's part of the goal to extend the scope from executable to all
> types.

Yeah, I think the "enumerate types of allocations that want similar page
permission handling" is a good thing to focus on.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-06-01 18:22 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-26  5:15 [PATCH 0/3] Type aware module allocator Song Liu
2023-05-26  5:15 ` [PATCH 1/3] module: Introduce module_alloc_type Song Liu
2023-05-26  6:07   ` Randy Dunlap
2023-05-26 22:29   ` Kent Overstreet
2023-05-26 23:09     ` Song Liu
2023-05-26 23:39       ` Kent Overstreet
2023-05-27  0:03         ` Song Liu
2023-05-27  3:19           ` Kent Overstreet
2023-05-27  6:00             ` Song Liu
2023-05-26  5:15 ` [PATCH 2/3] ftrace: Add swap_func to ftrace_process_locs() Song Liu
2023-05-26  5:15 ` [PATCH 3/3] x86/module: Use module_alloc_type Song Liu
2023-05-27  7:04 ` [PATCH 0/3] Type aware module allocator Kent Overstreet
2023-05-28  5:58   ` Song Liu
2023-05-29 10:45     ` Mike Rapoport
2023-05-30 22:37       ` Song Liu
2023-05-31 13:51         ` Mike Rapoport
2023-05-31 17:03           ` Song Liu
2023-05-31 19:54             ` Mike Rapoport
2023-05-29 18:25     ` Kent Overstreet
2023-05-30 22:48       ` Song Liu
2023-06-01 17:53         ` Kent Overstreet
2023-06-01 18:21         ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).