linux-parisc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/12] mm: jit/text allocator
@ 2023-06-16  8:50 Mike Rapoport
  2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
                   ` (12 more replies)
  0 siblings, 13 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Hi,

module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystmes
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

A centralized infrastructure for code allocation allows allocations of
executable memory as ROX, and future optimizations such as caching large
pages for better iTLB performance and providing sub-page allocations for
users that only need small jit code snippets.

Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu
proposed execmem_alloc [2], but both these approaches were targeting BPF
allocations and lacked the ground work to abstract executable allocations
and split them from the modules core.

Thomas Gleixner's suggested to express module allocation restrictions and
requirements as struct mod_alloc_type_params [3] that would define ranges,
protections and other parameters for different types of allocations used by
modules and following that suggestion Song separated allocations of
different types in modules (commit ac3b43283923 ("module: replace
module_layout with module_memory")) and posted "Type aware module
allocator" set [4].

I liked the idea of parametrising code allocation requirements as a
structure, but I believe the original proposal and Song's module allocator
were too module centric, so I came up with these patches.

This set splits code allocation from modules by introducing
execmem_text_alloc(), execmem_data_alloc(), execmem_free(),
jit_text_alloc() and jit_free() APIs, replaces call sites of module_alloc()
and module_memfree() with the new APIs and implements core text and related
allocation in a central place.

Instead of architecture specific overrides for module_alloc(), the
architectures that require non-default behaviour for text allocation must
fill execmem_alloc_params structure and implement execmem_arch_params()
that returns a pointer to that structure. If an architecture does not
implement execmem_arch_params(), the defaults compatible with the current
modules::module_alloc() are used.

The intended semantics of the new APIs is that execmem APIs should be used
to allocate memory that must reside close to the kernel image because of
addressing mode restrictions, e.g modules on many architectures or dynamic
ftrace trampolines on x86.

The jit APIs are intended for users that can place code anywhere in vmalloc
area, like kprobes on most architectures and BPF on arm/arm64.

While two distinct API cover the major cases, there is still might be need
for arch-specific overrides for some of the usecases. For example, riscv
uses a dedicated range for BPF allocations in order to be able to use
relative addressing, but for kprobes riscv can use the entire vmalloc area.
For such overrides we might introduce jit_text_alloc variant that gets
start + end parameters to restrict the range like Mark Rutland suggested
and then use that variant in arch override.

The new infrastructure allows decoupling of kprobes and ftrace from
modules, and most importantly it paves the way for ROX allocations for
executable memory.

For now I've dropped patches that enable ROX allocations on x86 because
with them modprobe takes ten times more. To make modprobe fast with ROX
allocations more work is required to text poking infrastructure, but this
work is not a prerequisite for this series.

[1] https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
[2] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/
[3] https://lore.kernel.org/all/87v8mndy3y.ffs@tglx/
[4] https://lore.kernel.org/all/20230526051529.3387103-1-song@kernel.org

v2 changes:
* Separate "module" and "others" allocations with execmem_text_alloc()
and jit_text_alloc()
* Drop ROX entablement on x86
* Add ack for nios2 changes, thanks Dinh Nguyen

v1: https://lore.kernel.org/all/20230601101257.530867-1-rppt@kernel.org

Mike Rapoport (IBM) (12):
  nios2: define virtual address space for modules
  mm: introduce execmem_text_alloc() and jit_text_alloc()
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  modules, execmem: drop module_alloc
  mm/execmem: introduce execmem_data_alloc()
  arm64, execmem: extend execmem_params for generated code definitions
  riscv: extend execmem_params for kprobes allocations
  powerpc: extend execmem_params for kprobes allocations
  arch: make execmem setup available regardless of CONFIG_MODULES
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  kprobes: remove dependcy on CONFIG_MODULES

 arch/Kconfig                       |   2 +-
 arch/arm/kernel/module.c           |  32 ------
 arch/arm/mm/init.c                 |  36 ++++++
 arch/arm64/include/asm/memory.h    |   8 ++
 arch/arm64/include/asm/module.h    |   6 -
 arch/arm64/kernel/kaslr.c          |   3 +-
 arch/arm64/kernel/module.c         |  47 --------
 arch/arm64/kernel/probes/kprobes.c |   7 --
 arch/arm64/mm/init.c               |  56 +++++++++
 arch/loongarch/kernel/module.c     |   6 -
 arch/loongarch/mm/init.c           |  20 ++++
 arch/mips/kernel/module.c          |  10 +-
 arch/mips/mm/init.c                |  19 ++++
 arch/nios2/include/asm/pgtable.h   |   5 +-
 arch/nios2/kernel/module.c         |  28 +++--
 arch/parisc/kernel/module.c        |  12 +-
 arch/parisc/mm/init.c              |  22 +++-
 arch/powerpc/kernel/kprobes.c      |  16 +--
 arch/powerpc/kernel/module.c       |  37 ------
 arch/powerpc/mm/mem.c              |  59 ++++++++++
 arch/riscv/kernel/module.c         |  10 --
 arch/riscv/kernel/probes/kprobes.c |  10 --
 arch/riscv/mm/init.c               |  34 ++++++
 arch/s390/kernel/ftrace.c          |   4 +-
 arch/s390/kernel/kprobes.c         |   4 +-
 arch/s390/kernel/module.c          |  42 +------
 arch/s390/mm/init.c                |  41 +++++++
 arch/sparc/kernel/module.c         |  33 +-----
 arch/sparc/mm/Makefile             |   2 +
 arch/sparc/mm/execmem.c            |  25 ++++
 arch/sparc/net/bpf_jit_comp_32.c   |   8 +-
 arch/x86/Kconfig                   |   1 +
 arch/x86/kernel/ftrace.c           |  16 +--
 arch/x86/kernel/kprobes/core.c     |   4 +-
 arch/x86/kernel/module.c           |  51 ---------
 arch/x86/mm/init.c                 |  54 +++++++++
 include/linux/execmem.h            | 155 +++++++++++++++++++++++++
 include/linux/moduleloader.h       |  15 ---
 kernel/bpf/core.c                  |  14 +--
 kernel/kprobes.c                   |  51 +++++----
 kernel/module/Kconfig              |   1 +
 kernel/module/main.c               |  45 ++------
 kernel/trace/trace_kprobe.c        |  11 ++
 mm/Kconfig                         |   3 +
 mm/Makefile                        |   1 +
 mm/execmem.c                       | 177 +++++++++++++++++++++++++++++
 mm/mm_init.c                       |   2 +
 47 files changed, 813 insertions(+), 432 deletions(-)
 create mode 100644 arch/sparc/mm/execmem.c
 create mode 100644 include/linux/execmem.h
 create mode 100644 mm/execmem.c


base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
-- 
2.35.1


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v2 01/12] nios2: define virtual address space for modules
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 16:00   ` Edgecombe, Rick P
  2023-06-16 18:14   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
cannot reach all of vmalloc address space.

Define module space as 32MiB below the kernel base and switch nios2 to
use vmalloc for module allocations.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
---
 arch/nios2/include/asm/pgtable.h |  5 ++++-
 arch/nios2/kernel/module.c       | 19 ++++---------------
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 0f5c2564e9f5..0073b289c6a4 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
 #include <asm-generic/pgtable-nopmd.h>
 
 #define VMALLOC_START		CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
 
 struct mm_struct;
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@
 
 #include <asm/cacheflush.h>
 
-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
- * addresses in 0xc0000000)
- */
 void *module_alloc(unsigned long size)
 {
-	if (size == 0)
-		return NULL;
-	return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
-	kfree(module_region);
+	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+				    GFP_KERNEL, PAGE_KERNEL_EXEC,
+				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+				    __builtin_return_address(0));
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
  2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 16:48   ` Kent Overstreet
  2023-06-17 20:38   ` Andy Lutomirski
  2023-06-16  8:50 ` [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem Mike Rapoport
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules
and puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing
execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.

Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
module_alloc() and execmem_free() and jit_free() are replacements of
module_memfree() to allow updating all call sites to use the new APIs.

The intention semantics for new allocation APIs:

* execmem_text_alloc() should be used to allocate memory that must reside
  close to the kernel image, like loadable kernel modules and generated
  code that is restricted by relative addressing.

* jit_text_alloc() should be used to allocate memory for generated code
  when there are no restrictions for the code placement. For
  architectures that require that any code is within certain distance
  from the kernel image, jit_text_alloc() will be essentially aliased to
  execmem_text_alloc().

The names execmem_text_alloc() and jit_text_alloc() emphasize that the
allocated memory is for executable code, the allocations of the
associated data, like data sections of a module will use
execmem_data_alloc() interface that will be added later.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/powerpc/kernel/kprobes.c    |  4 +--
 arch/s390/kernel/ftrace.c        |  4 +--
 arch/s390/kernel/kprobes.c       |  4 +--
 arch/s390/kernel/module.c        |  5 +--
 arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
 arch/x86/kernel/ftrace.c         |  6 ++--
 arch/x86/kernel/kprobes/core.c   |  4 +--
 include/linux/execmem.h          | 52 ++++++++++++++++++++++++++++++++
 include/linux/moduleloader.h     |  3 --
 kernel/bpf/core.c                | 14 ++++-----
 kernel/kprobes.c                 |  8 ++---
 kernel/module/Kconfig            |  1 +
 kernel/module/main.c             | 25 +++++----------
 mm/Kconfig                       |  3 ++
 mm/Makefile                      |  1 +
 mm/execmem.c                     | 36 ++++++++++++++++++++++
 16 files changed, 130 insertions(+), 48 deletions(-)
 create mode 100644 include/linux/execmem.h
 create mode 100644 mm/execmem.c

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b20ee72e873a..5db8df5e3657 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
 #include <linux/extable.h>
 #include <linux/kdebug.h>
 #include <linux/slab.h>
-#include <linux/moduleloader.h>
 #include <linux/set_memory.h>
+#include <linux/execmem.h>
 #include <asm/code-patching.h>
 #include <asm/cacheflush.h>
 #include <asm/sstep.h>
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = jit_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..65343f944101 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
  *   Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
  */
 
-#include <linux/moduleloader.h>
 #include <linux/hardirq.h>
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/kprobes.h>
+#include <linux/execmem.h>
 #include <trace/syscall.h>
 #include <asm/asm-offsets.h>
 #include <asm/text-patching.h>
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
 {
 	const char *start, *end;
 
-	ftrace_plt = module_alloc(PAGE_SIZE);
+	ftrace_plt = execmem_text_alloc(PAGE_SIZE);
 	if (!ftrace_plt)
 		panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index d4b863ed0aa7..459cd5141346 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include <linux/moduleloader.h>
 #include <linux/kprobes.h>
 #include <linux/ptrace.h>
 #include <linux/preempt.h>
@@ -21,6 +20,7 @@
 #include <linux/slab.h>
 #include <linux/hardirq.h>
 #include <linux/ftrace.h>
+#include <linux/execmem.h>
 #include <asm/set_memory.h>
 #include <asm/sections.h>
 #include <asm/dis.h>
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = execmem_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 	set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index f1b35dcdf3eb..4a844683dc76 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
 #include <linux/moduleloader.h>
 #include <linux/bug.h>
 #include <linux/memory.h>
+#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/nospec-branch.h>
 #include <asm/facility.h>
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-	module_memfree(mod->arch.trampolines_start);
+	execmem_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -509,7 +510,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me,
 
 	size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
 	numpages = DIV_ROUND_UP(size, PAGE_SIZE);
-	start = module_alloc(numpages * PAGE_SIZE);
+	start = execmem_text_alloc(numpages * PAGE_SIZE);
 	if (!start)
 		return -ENOMEM;
 	set_memory_rox((unsigned long)start, numpages);
diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
index a74e5004c6c8..4261832a9882 100644
--- a/arch/sparc/net/bpf_jit_comp_32.c
+++ b/arch/sparc/net/bpf_jit_comp_32.c
@@ -1,10 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <linux/moduleloader.h>
 #include <linux/workqueue.h>
 #include <linux/netdevice.h>
 #include <linux/filter.h>
 #include <linux/cache.h>
 #include <linux/if_vlan.h>
+#include <linux/execmem.h>
 
 #include <asm/cacheflush.h>
 #include <asm/ptrace.h>
@@ -713,7 +713,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 				if (unlikely(proglen + ilen > oldproglen)) {
 					pr_err("bpb_jit_compile fatal error\n");
 					kfree(addrs);
-					module_memfree(image);
+					execmem_free(image);
 					return;
 				}
 				memcpy(image + proglen, temp, ilen);
@@ -736,7 +736,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 			break;
 		}
 		if (proglen == oldproglen) {
-			image = module_alloc(proglen);
+			image = execmem_text_alloc(proglen);
 			if (!image)
 				goto out;
 		}
@@ -758,7 +758,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 void bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited)
-		module_memfree(fp->bpf_func);
+		execmem_free(fp->bpf_func);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5e7ead52cfdb..f77c63bb3203 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -25,6 +25,7 @@
 #include <linux/memory.h>
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
+#include <linux/execmem.h>
 
 #include <trace/syscall.h>
 
@@ -261,15 +262,14 @@ void arch_ftrace_update_code(int command)
 #ifdef CONFIG_X86_64
 
 #ifdef CONFIG_MODULES
-#include <linux/moduleloader.h>
 /* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
-	return module_alloc(size);
+	return execmem_text_alloc(size);
 }
 static inline void tramp_free(void *tramp)
 {
-	module_memfree(tramp);
+	execmem_free(tramp);
 }
 #else
 /* Trampolines can only be created if modules are supported */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6..9294e11d0fb4 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -40,11 +40,11 @@
 #include <linux/kgdb.h>
 #include <linux/ftrace.h>
 #include <linux/kasan.h>
-#include <linux/moduleloader.h>
 #include <linux/objtool.h>
 #include <linux/vmalloc.h>
 #include <linux/pgtable.h>
 #include <linux/set_memory.h>
+#include <linux/execmem.h>
 
 #include <asm/text-patching.h>
 #include <asm/cacheflush.h>
@@ -414,7 +414,7 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = execmem_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index 000000000000..0d4e5a6985f8
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_ALLOC_H
+#define _LINUX_EXECMEM_ALLOC_H
+
+#include <linux/types.h>
+
+/**
+ * execmem_text_alloc - allocate executable memory
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain executable code, either generated or
+ * loaded from kernel modules.
+ *
+ * The memory will have protections defined by architecture for executable
+ * regions.
+ *
+ * The allocated memory will reside in an area that does not impose
+ * restrictions on the addressing modes.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_text_alloc(size_t size);
+
+/**
+ * execmem_free - free executable memory
+ * @ptr: pointer to the memory that should be freed
+ */
+void execmem_free(void *ptr);
+
+/**
+ * jit_text_alloc - allocate executable memory
+ * @size: how many bytes of memory are required.
+ *
+ * Allocates memory that will contain generated executable code.
+ *
+ * The memory will have protections defined by architecture for executable
+ * regions.
+ *
+ * The allocated memory will reside in an area that might impose
+ * restrictions on the addressing modes depending on the architecture
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *jit_text_alloc(size_t size);
+
+/**
+ * jit_free - free generated executable memory
+ * @ptr: pointer to the memory that should be freed
+ */
+void jit_free(void *ptr);
+
+#endif /* _LINUX_EXECMEM_ALLOC_H */
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 03be088fb439..b3374342f7af 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -29,9 +29,6 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
    sections.  Returns NULL on failure. */
 void *module_alloc(unsigned long size);
 
-/* Free memory returned from module_alloc. */
-void module_memfree(void *module_region);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7421487422d4..ecb58fa6696c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -22,7 +22,6 @@
 #include <linux/skbuff.h>
 #include <linux/vmalloc.h>
 #include <linux/random.h>
-#include <linux/moduleloader.h>
 #include <linux/bpf.h>
 #include <linux/btf.h>
 #include <linux/objtool.h>
@@ -37,6 +36,7 @@
 #include <linux/nospec.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
+#include <linux/execmem.h>
 
 #include <asm/barrier.h>
 #include <asm/unaligned.h>
@@ -860,7 +860,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
+	pack->ptr = jit_text_alloc(BPF_PROG_PACK_SIZE);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -884,7 +884,7 @@ void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
 	mutex_lock(&pack_mutex);
 	if (size > BPF_PROG_PACK_SIZE) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = jit_text_alloc(size);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
 			set_vm_flush_reset_perms(ptr);
@@ -922,7 +922,7 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
+		jit_free(hdr);
 		goto out;
 	}
 
@@ -946,7 +946,7 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
 				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
 		list_del(&pack->list);
-		module_memfree(pack->ptr);
+		jit_free(pack->ptr);
 		kfree(pack);
 	}
 out:
@@ -997,12 +997,12 @@ void bpf_jit_uncharge_modmem(u32 size)
 
 void *__weak bpf_jit_alloc_exec(unsigned long size)
 {
-	return module_alloc(size);
+	return jit_text_alloc(size);
 }
 
 void __weak bpf_jit_free_exec(void *addr)
 {
-	module_memfree(addr);
+	jit_free(addr);
 }
 
 struct bpf_binary_header *
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 00e177de91cc..37c928d5deaf 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -26,7 +26,6 @@
 #include <linux/slab.h>
 #include <linux/stddef.h>
 #include <linux/export.h>
-#include <linux/moduleloader.h>
 #include <linux/kallsyms.h>
 #include <linux/freezer.h>
 #include <linux/seq_file.h>
@@ -39,6 +38,7 @@
 #include <linux/jump_label.h>
 #include <linux/static_call.h>
 #include <linux/perf_event.h>
+#include <linux/execmem.h>
 
 #include <asm/sections.h>
 #include <asm/cacheflush.h>
@@ -113,17 +113,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
 	/*
-	 * Use module_alloc() so this page is within +/- 2GB of where the
+	 * Use jit_text_alloc() so this page is within +/- 2GB of where the
 	 * kernel image and loaded module images reside. This is required
 	 * for most of the architectures.
 	 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 	 */
-	return module_alloc(PAGE_SIZE);
+	return jit_text_alloc(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-	module_memfree(page);
+	jit_free(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 33a2e991f608..813e116bdee6 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -2,6 +2,7 @@
 menuconfig MODULES
 	bool "Enable loadable module support"
 	modules
+	select EXECMEM
 	help
 	  Kernel modules are small pieces of compiled code which can
 	  be inserted in the running kernel, rather than being
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..43810a3bdb81 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -57,6 +57,7 @@
 #include <linux/audit.h>
 #include <linux/cfi.h>
 #include <linux/debugfs.h>
+#include <linux/execmem.h>
 #include <uapi/linux/module.h>
 #include "internal.h"
 
@@ -1186,16 +1187,6 @@ resolve_symbol_wait(struct module *mod,
 	return ksym;
 }
 
-void __weak module_memfree(void *module_region)
-{
-	/*
-	 * This memory may be RO, and freeing RO memory in an interrupt is not
-	 * supported by vmalloc.
-	 */
-	WARN_ON(in_interrupt());
-	vfree(module_region);
-}
-
 void __weak module_arch_cleanup(struct module *mod)
 {
 }
@@ -1214,7 +1205,7 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
 	if (mod_mem_use_vmalloc(type))
 		return vzalloc(size);
-	return module_alloc(size);
+	return execmem_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
@@ -1222,7 +1213,7 @@ static void module_memory_free(void *ptr, enum mod_mem_type type)
 	if (mod_mem_use_vmalloc(type))
 		vfree(ptr);
 	else
-		module_memfree(ptr);
+		execmem_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
@@ -2478,9 +2469,9 @@ static void do_free_init(struct work_struct *w)
 
 	llist_for_each_safe(pos, n, list) {
 		initfree = container_of(pos, struct mod_initfree, node);
-		module_memfree(initfree->init_text);
-		module_memfree(initfree->init_data);
-		module_memfree(initfree->init_rodata);
+		execmem_free(initfree->init_text);
+		execmem_free(initfree->init_data);
+		execmem_free(initfree->init_rodata);
 		kfree(initfree);
 	}
 }
@@ -2583,10 +2574,10 @@ static noinline int do_init_module(struct module *mod)
 	 * We want to free module_init, but be aware that kallsyms may be
 	 * walking this with preempt disabled.  In all the failure paths, we
 	 * call synchronize_rcu(), but we don't want to slow down the success
-	 * path. module_memfree() cannot be called in an interrupt, so do the
+	 * path. execmem_free() cannot be called in an interrupt, so do the
 	 * work and call synchronize_rcu() in a work queue.
 	 *
-	 * Note that module_alloc() on most architectures creates W+X page
+	 * Note that execmem_text_alloc() on most architectures creates W+X page
 	 * mappings which won't be cleaned up until do_free_init() runs.  Any
 	 * code such as mark_rodata_ro() which depends on those mappings to
 	 * be cleaned up needs to sync with the queued work - ie
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..3d2826940c4a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1206,6 +1206,9 @@ config PER_VMA_LOCK
 	  This feature allows locking each virtual memory area separately when
 	  handling page faults instead of taking mmap_lock.
 
+config EXECMEM
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index e29afc890cde..1c25d1b5ffef 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_EXECMEM) += execmem.o
diff --git a/mm/execmem.c b/mm/execmem.c
new file mode 100644
index 000000000000..eac26234eb38
--- /dev/null
+++ b/mm/execmem.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/execmem.h>
+#include <linux/moduleloader.h>
+
+static void *execmem_alloc(size_t size)
+{
+	return module_alloc(size);
+}
+
+void *execmem_text_alloc(size_t size)
+{
+	return execmem_alloc(size);
+}
+
+void execmem_free(void *ptr)
+{
+	/*
+	 * This memory may be RO, and freeing RO memory in an interrupt is not
+	 * supported by vmalloc.
+	 */
+	WARN_ON(in_interrupt());
+	vfree(ptr);
+}
+
+void *jit_text_alloc(size_t size)
+{
+	return execmem_alloc(size);
+}
+
+void jit_free(void *ptr)
+{
+	execmem_free(ptr);
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
  2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
  2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 18:36   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Several architectures override module_alloc() only to define address
range for code allocations different than VMALLOC address space.

Provide a generic implementation in execmem that uses the parameters
for address space ranges, required alignment and page protections
provided by architectures.

The architecures must fill execmem_params structure and implement
execmem_arch_params() that returns a pointer to that structure. This
way the execmem initialization won't be called from every architecure,
but rather from a central place, namely initialization of the core
memory management.

The execmem provides execmem_text_alloc() API that wraps
__vmalloc_node_range() with the parameters defined by the architecures.
If an architeture does not implement execmem_arch_params(),
execmem_text_alloc() will fall back to module_alloc().

The name execmem_text_alloc() emphasizes that the allocated memory is
for executable code, the allocations of the associated data, like data
sections of a module will use execmem_data_alloc() interface that will
be added later.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/loongarch/kernel/module.c | 18 ++++++++++--
 arch/mips/kernel/module.c      | 18 +++++++++---
 arch/nios2/kernel/module.c     | 19 +++++++++----
 arch/parisc/kernel/module.c    | 23 +++++++++------
 arch/riscv/kernel/module.c     | 20 +++++++++----
 arch/sparc/kernel/module.c     | 44 +++++++++++++---------------
 include/linux/execmem.h        | 52 ++++++++++++++++++++++++++++++++++
 mm/execmem.c                   | 52 +++++++++++++++++++++++++++++++---
 mm/mm_init.c                   |  2 ++
 9 files changed, 195 insertions(+), 53 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index b8b86088b2dd..32b167722c2b 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include <linux/ftrace.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
+#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/inst.h>
 
@@ -469,10 +470,21 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.pgprot = PAGE_KERNEL,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-			GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0));
+	execmem_params.modules.text.start = MODULES_VADDR;
+	execmem_params.modules.text.end = MODULES_END;
+
+	return &execmem_params;
 }
 
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 0c936cbf20c5..2d370de67383 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include <linux/kernel.h>
 #include <linux/spinlock.h>
 #include <linux/jump_label.h>
+#include <linux/execmem.h>
 
 extern void jump_label_apply_nops(struct module *mod);
 
@@ -33,11 +34,20 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULE_START
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+			.pgprot = PAGE_KERNEL,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
-				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	return &execmem_params;
 }
 #endif
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..3cf5723e3c70 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,24 @@
 #include <linux/fs.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
+#include <linux/execmem.h>
 
 #include <asm/cacheflush.h>
 
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+			.pgprot = PAGE_KERNEL_EXEC,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				    GFP_KERNEL, PAGE_KERNEL_EXEC,
-				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	return &execmem_params;
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index f6e38c4d3904..569b8f52a24b 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -49,6 +49,7 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/execmem.h>
 
 #include <asm/unwind.h>
 #include <asm/sections.h>
@@ -173,15 +174,21 @@ static inline int reassemble_22(int as22)
 		((as22 & 0x0003ff) << 3));
 }
 
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.pgprot = PAGE_KERNEL_RWX,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	/* using RWX means less protection for modules, but it's
-	 * easier than trying to map the text, data, init_text and
-	 * init_data correctly */
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-				    GFP_KERNEL,
-				    PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
+
+	return &execmem_params;
 }
 
 #ifndef CONFIG_64BIT
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 7c651d55fcbd..ee5e04cd3f21 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -11,6 +11,7 @@
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
 #include <linux/pgtable.h>
+#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 
@@ -436,12 +437,21 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 }
 
 #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.pgprot = PAGE_KERNEL,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR,
-				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL, 0, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	execmem_params.modules.text.start = MODULES_VADDR;
+	execmem_params.modules.text.end = MODULES_END;
+
+	return &execmem_params;
 }
 #endif
 
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 66c45a2764bc..ab75e3e69834 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -14,6 +14,10 @@
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <linux/mm.h>
+#include <linux/execmem.h>
+#ifdef CONFIG_SPARC64
+#include <linux/jump_label.h>
+#endif
 
 #include <asm/processor.h>
 #include <asm/spitfire.h>
@@ -21,34 +25,26 @@
 
 #include "entry.h"
 
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
 #ifdef CONFIG_SPARC64
-
-#include <linux/jump_label.h>
-
-static void *module_map(unsigned long size)
-{
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-}
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
 #else
-static void *module_map(unsigned long size)
-{
-	return vmalloc(size);
-}
-#endif /* CONFIG_SPARC64 */
-
-void *module_alloc(unsigned long size)
+			.start = VMALLOC_START,
+			.end = VMALLOC_END,
+#endif
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	void *ret;
-
-	ret = module_map(size);
-	if (ret)
-		memset(ret, 0, size);
+	execmem_params.modules.text.pgprot = PAGE_KERNEL;
 
-	return ret;
+	return &execmem_params;
 }
 
 /* Make generic code ignore STT_REGISTER dummy undefined symbols.  */
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 0d4e5a6985f8..75946f23731e 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -4,6 +4,52 @@
 
 #include <linux/types.h>
 
+/**
+ * struct execmem_range - definition of a memory range suitable for code and
+ *			  related data allocations
+ * @start:	address space start
+ * @end:	address space end (inclusive)
+ * @pgprot:	permisssions for memory in this address space
+ * @alignment:	alignment required for text allocations
+ */
+struct execmem_range {
+	unsigned long   start;
+	unsigned long   end;
+	pgprot_t        pgprot;
+	unsigned int	alignment;
+};
+
+/**
+ * struct execmem_modules_range - architecure parameters for modules address
+ *				  space
+ * @text:	address range for text allocations
+ */
+struct execmem_modules_range {
+	struct execmem_range text;
+};
+
+/**
+ * struct execmem_params -	architecure parameters for code allocations
+ * @modules:	parameters for modules address space
+ */
+struct execmem_params {
+	struct execmem_modules_range	modules;
+};
+
+/**
+ * execmem_arch_params - supply parameters for allocations of executable memory
+ *
+ * A hook for architecures to define parameters for allocations of
+ * executable memory described by struct execmem_params
+ *
+ * For architectures that do not implement this method a default set of
+ * parameters will be used
+ *
+ * Return: a structure defining architecture parameters and restrictions
+ * for allocations of executable memory
+ */
+struct execmem_params *execmem_arch_params(void);
+
 /**
  * execmem_text_alloc - allocate executable memory
  * @size: how many bytes of memory are required
@@ -49,4 +95,10 @@ void *jit_text_alloc(size_t size);
  */
 void jit_free(void *ptr);
 
+#ifdef CONFIG_EXECMEM
+void execmem_init(void);
+#else
+static inline void execmem_init(void) {}
+#endif
+
 #endif /* _LINUX_EXECMEM_ALLOC_H */
diff --git a/mm/execmem.c b/mm/execmem.c
index eac26234eb38..c92878cf4d1a 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -5,14 +5,27 @@
 #include <linux/execmem.h>
 #include <linux/moduleloader.h>
 
-static void *execmem_alloc(size_t size)
+struct execmem_params execmem_params;
+
+static void *execmem_alloc(size_t size, unsigned long start, unsigned long end,
+			   unsigned int align, pgprot_t pgprot)
 {
-	return module_alloc(size);
+	return __vmalloc_node_range(size, align, start, end,
+				   GFP_KERNEL, pgprot, VM_FLUSH_RESET_PERMS,
+				   NUMA_NO_NODE, __builtin_return_address(0));
 }
 
 void *execmem_text_alloc(size_t size)
 {
-	return execmem_alloc(size);
+	unsigned long start = execmem_params.modules.text.start;
+	unsigned long end = execmem_params.modules.text.end;
+	pgprot_t pgprot = execmem_params.modules.text.pgprot;
+	unsigned int align = execmem_params.modules.text.alignment;
+
+	if (!execmem_params.modules.text.start)
+		return module_alloc(size);
+
+	return execmem_alloc(size, start, end, align, pgprot);
 }
 
 void execmem_free(void *ptr)
@@ -27,10 +40,41 @@ void execmem_free(void *ptr)
 
 void *jit_text_alloc(size_t size)
 {
-	return execmem_alloc(size);
+	return execmem_text_alloc(size);
 }
 
 void jit_free(void *ptr)
 {
 	execmem_free(ptr);
 }
+
+struct execmem_params * __weak execmem_arch_params(void)
+{
+	return NULL;
+}
+
+static bool execmem_validate_params(struct execmem_params *p)
+{
+	struct execmem_modules_range *m = &p->modules;
+	struct execmem_range *t = &m->text;
+
+	if (!t->alignment || !t->start || !t->end || !pgprot_val(t->pgprot)) {
+		pr_crit("Invalid parameters for execmem allocator, module loading will fail");
+		return false;
+	}
+
+	return true;
+}
+
+void __init execmem_init(void)
+{
+	struct execmem_params *p = execmem_arch_params();
+
+	if (!p)
+		return;
+
+	if (!execmem_validate_params(p))
+		return;
+
+	execmem_params = *p;
+}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..3b11450efe8a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -26,6 +26,7 @@
 #include <linux/pgtable.h>
 #include <linux/swap.h>
 #include <linux/cma.h>
+#include <linux/execmem.h>
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -2747,4 +2748,5 @@ void __init mm_core_init(void)
 	pti_init();
 	kmsan_init_runtime();
 	mm_cache_init();
+	execmem_init();
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (2 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 16:16   ` Edgecombe, Rick P
  2023-06-16 18:53   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 05/12] modules, execmem: drop module_alloc Mike Rapoport
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc and support for allocation of KASAN shadow required by
arm64, s390 and x86.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/arm/kernel/module.c     | 36 ++++++++++---------
 arch/arm64/kernel/module.c   | 68 ++++++++++++++++--------------------
 arch/powerpc/kernel/module.c | 52 +++++++++++++--------------
 arch/s390/kernel/module.c    | 33 ++++++++---------
 arch/x86/kernel/module.c     | 33 +++++++++--------
 include/linux/execmem.h      | 14 ++++++++
 mm/execmem.c                 | 50 ++++++++++++++++++++++----
 7 files changed, 168 insertions(+), 118 deletions(-)

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index d59c36dc0494..f66d479c1c7d 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
 #include <linux/fs.h>
 #include <linux/string.h>
 #include <linux/gfp.h>
+#include <linux/execmem.h>
 
 #include <asm/sections.h>
 #include <asm/smp_plat.h>
@@ -34,23 +35,26 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	/* Silence the initial allocation */
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-		gfp_mask |= __GFP_NOWARN;
-
-	p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-	if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
-		return p;
-	return __vmalloc_node_range(size, 1,  VMALLOC_START, VMALLOC_END,
-				GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
+
+	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+		execmem_params.modules.text.fallback_start = VMALLOC_START;
+		execmem_params.modules.text.fallback_end = VMALLOC_END;
+	}
+
+	return &execmem_params;
 }
 #endif
 
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 5af4975caeb5..c3d999f3a3dd 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -17,56 +17,50 @@
 #include <linux/moduleloader.h>
 #include <linux/scs.h>
 #include <linux/vmalloc.h>
+#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/insn.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
 
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.flags = EXECMEM_KASAN_SHADOW,
+		.text = {
+			.alignment = MODULE_ALIGN,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
 	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	/* Silence the initial allocation */
-	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
-		gfp_mask |= __GFP_NOWARN;
 
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
-		/* don't exceed the static module region - see below */
-		module_alloc_end = MODULES_END;
+	execmem_params.modules.text.pgprot = PAGE_KERNEL;
+	execmem_params.modules.text.start = module_alloc_base;
+	execmem_params.modules.text.end = module_alloc_end;
 
-	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-				module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
-				NUMA_NO_NODE, __builtin_return_address(0));
-
-	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
+	/*
+	 * KASAN without KASAN_VMALLOC can only deal with module
+	 * allocations being served from the reserved module region,
+	 * since the remainder of the vmalloc region is already
+	 * backed by zero shadow pages, and punching holes into it
+	 * is non-trivial. Since the module region is not randomized
+	 * when KASAN is enabled without KASAN_VMALLOC, it is even
+	 * less likely that the module region gets exhausted, so we
+	 * can simply omit this fallback in that case.
+	 */
+	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
 	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
 	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
-		/*
-		 * KASAN without KASAN_VMALLOC can only deal with module
-		 * allocations being served from the reserved module region,
-		 * since the remainder of the vmalloc region is already
-		 * backed by zero shadow pages, and punching holes into it
-		 * is non-trivial. Since the module region is not randomized
-		 * when KASAN is enabled without KASAN_VMALLOC, it is even
-		 * less likely that the module region gets exhausted, so we
-		 * can simply omit this fallback in that case.
-		 */
-		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-				module_alloc_base + SZ_2G, GFP_KERNEL,
-				PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
+	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
+		unsigned long end = module_alloc_base + SZ_2G;
+
+		execmem_params.modules.text.fallback_start = module_alloc_base;
+		execmem_params.modules.text.fallback_end = end;
 	}
 
-	/* Memory is intended to be executable, reset the pointer tag. */
-	return kasan_reset_tag(p);
+	return &execmem_params;
 }
 
 enum aarch64_reloc_op {
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index f6d6ae0a1692..ba7abff77d98 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -10,6 +10,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
 #include <linux/bug.h>
+#include <linux/execmem.h>
 #include <asm/module.h>
 #include <linux/uaccess.h>
 #include <asm/firmware.h>
@@ -89,39 +90,38 @@ int module_finalize(const Elf_Ehdr *hdr,
 	return 0;
 }
 
-static __always_inline void *
-__module_alloc(unsigned long size, unsigned long start, unsigned long end, bool nowarn)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.alignment = 1,
+		},
+	},
+};
+
+
+struct execmem_params __init *execmem_arch_params(void)
 {
 	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
-	gfp_t gfp = GFP_KERNEL | (nowarn ? __GFP_NOWARN : 0);
-
-	/*
-	 * Don't do huge page allocations for modules yet until more testing
-	 * is done. STRICT_MODULE_RWX may require extra work to support this
-	 * too.
-	 */
-	return __vmalloc_node_range(size, 1, start, end, gfp, prot,
-				    VM_FLUSH_RESET_PERMS,
-				    NUMA_NO_NODE, __builtin_return_address(0));
-}
 
-void *module_alloc(unsigned long size)
-{
 #ifdef MODULES_VADDR
 	unsigned long limit = (unsigned long)_etext - SZ_32M;
-	void *ptr = NULL;
-
-	BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
 
 	/* First try within 32M limit from _etext to avoid branch trampolines */
-	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit)
-		ptr = __module_alloc(size, limit, MODULES_END, true);
-
-	if (!ptr)
-		ptr = __module_alloc(size, MODULES_VADDR, MODULES_END, false);
-
-	return ptr;
+	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
+		execmem_params.modules.text.start = limit;
+		execmem_params.modules.text.end = MODULES_END;
+		execmem_params.modules.text.fallback_start = MODULES_VADDR;
+		execmem_params.modules.text.fallback_end = MODULES_END;
+	} else {
+		execmem_params.modules.text.start = MODULES_VADDR;
+		execmem_params.modules.text.end = MODULES_END;
+	}
 #else
-	return __module_alloc(size, VMALLOC_START, VMALLOC_END, false);
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
 #endif
+
+	execmem_params.modules.text.pgprot = prot;
+
+	return &execmem_params;
 }
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 4a844683dc76..7fff395d26ea 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -55,23 +55,24 @@ static unsigned long get_module_load_offset(void)
 	return module_load_offset;
 }
 
-void *module_alloc(unsigned long size)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.flags = EXECMEM_KASAN_SHADOW,
+		.text = {
+			.alignment = MODULE_ALIGN,
+			.pgprot = PAGE_KERNEL,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
-	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				 MODULES_VADDR + get_module_load_offset(),
-				 MODULES_END, gfp_mask, PAGE_KERNEL,
-				 VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
-				 NUMA_NO_NODE, __builtin_return_address(0));
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
-	}
-	return p;
+	unsigned long start = MODULES_VADDR + get_module_load_offset();
+
+	execmem_params.modules.text.start = start;
+	execmem_params.modules.text.end = MODULES_END;
+
+	return &execmem_params;
 }
 
 #ifdef CONFIG_FUNCTION_TRACER
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344..cf9a7d0a8b62 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,6 +19,7 @@
 #include <linux/jump_label.h>
 #include <linux/random.h>
 #include <linux/memory.h>
+#include <linux/execmem.h>
 
 #include <asm/text-patching.h>
 #include <asm/page.h>
@@ -65,26 +66,24 @@ static unsigned long int get_module_load_offset(void)
 }
 #endif
 
-void *module_alloc(unsigned long size)
-{
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
+static struct execmem_params execmem_params = {
+	.modules = {
+		.flags = EXECMEM_KASAN_SHADOW,
+		.text = {
+			.alignment = MODULE_ALIGN,
+		},
+	},
+};
 
-	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				 MODULES_VADDR + get_module_load_offset(),
-				 MODULES_END, gfp_mask, PAGE_KERNEL,
-				 VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
-				 NUMA_NO_NODE, __builtin_return_address(0));
+struct execmem_params __init *execmem_arch_params(void)
+{
+	unsigned long start = MODULES_VADDR + get_module_load_offset();
 
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
-	}
+	execmem_params.modules.text.start = start;
+	execmem_params.modules.text.end = MODULES_END;
+	execmem_params.modules.text.pgprot = PAGE_KERNEL;
 
-	return p;
+	return &execmem_params;
 }
 
 #ifdef CONFIG_X86_32
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 75946f23731e..68b2bfc79993 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -9,22 +9,36 @@
  *			  related data allocations
  * @start:	address space start
  * @end:	address space end (inclusive)
+ * @fallback_start:	start of the range for fallback allocations
+ * @fallback_end:	end of the range for fallback allocations (inclusive)
  * @pgprot:	permisssions for memory in this address space
  * @alignment:	alignment required for text allocations
  */
 struct execmem_range {
 	unsigned long   start;
 	unsigned long   end;
+	unsigned long   fallback_start;
+	unsigned long   fallback_end;
 	pgprot_t        pgprot;
 	unsigned int	alignment;
 };
 
+/**
+ * enum execmem_module_flags - options for executable memory allocations
+ * @EXECMEM_KASAN_SHADOW:	allocate kasan shadow
+ */
+enum execmem_module_flags {
+	EXECMEM_KASAN_SHADOW	= (1 << 0),
+};
+
 /**
  * struct execmem_modules_range - architecure parameters for modules address
  *				  space
+ * @flags:	options for module memory allocations
  * @text:	address range for text allocations
  */
 struct execmem_modules_range {
+	enum execmem_module_flags flags;
 	struct execmem_range text;
 };
 
diff --git a/mm/execmem.c b/mm/execmem.c
index c92878cf4d1a..2fe36dcc7bdf 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -7,12 +7,46 @@
 
 struct execmem_params execmem_params;
 
-static void *execmem_alloc(size_t size, unsigned long start, unsigned long end,
-			   unsigned int align, pgprot_t pgprot)
+static void *execmem_alloc(size_t len, unsigned long start, unsigned long end,
+			   unsigned int alignment, pgprot_t pgprot,
+			   unsigned long fallback_start,
+			   unsigned long fallback_end,
+			   bool kasan)
 {
-	return __vmalloc_node_range(size, align, start, end,
-				   GFP_KERNEL, pgprot, VM_FLUSH_RESET_PERMS,
-				   NUMA_NO_NODE, __builtin_return_address(0));
+	unsigned long vm_flags  = VM_FLUSH_RESET_PERMS;
+	bool fallback  = !!fallback_start;
+	gfp_t gfp_flags = GFP_KERNEL;
+	void *p;
+
+	if (PAGE_ALIGN(len) > (end - start))
+		return NULL;
+
+	if (kasan)
+		vm_flags |= VM_DEFER_KMEMLEAK;
+
+	if (fallback)
+		gfp_flags |= __GFP_NOWARN;
+
+	p = __vmalloc_node_range(len, alignment, start, end, gfp_flags,
+				 pgprot, vm_flags, NUMA_NO_NODE,
+				 __builtin_return_address(0));
+
+	if (!p && fallback) {
+		start = fallback_start;
+		end = fallback_end;
+		gfp_flags = GFP_KERNEL;
+
+		p = __vmalloc_node_range(len, alignment, start, end, gfp_flags,
+					 pgprot, vm_flags, NUMA_NO_NODE,
+					 __builtin_return_address(0));
+	}
+
+	if (p && kasan && (kasan_alloc_module_shadow(p, len, GFP_KERNEL) < 0)) {
+		vfree(p);
+		return NULL;
+	}
+
+	return kasan_reset_tag(p);
 }
 
 void *execmem_text_alloc(size_t size)
@@ -21,11 +55,15 @@ void *execmem_text_alloc(size_t size)
 	unsigned long end = execmem_params.modules.text.end;
 	pgprot_t pgprot = execmem_params.modules.text.pgprot;
 	unsigned int align = execmem_params.modules.text.alignment;
+	unsigned long fallback_start = execmem_params.modules.text.fallback_start;
+	unsigned long fallback_end = execmem_params.modules.text.fallback_end;
+	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
 
 	if (!execmem_params.modules.text.start)
 		return module_alloc(size);
 
-	return execmem_alloc(size, start, end, align, pgprot);
+	return execmem_alloc(size, start, end, align, pgprot,
+			     fallback_start, fallback_end, kasan);
 }
 
 void execmem_free(void *ptr)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 05/12] modules, execmem: drop module_alloc
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (3 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 18:56   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Define default parameters for address range for code allocations using
the current values in module_alloc() and make execmem_text_alloc() use
these defaults when an architecure does not supply its specific
parameters.

With this, execmem_text_alloc() implements memory allocation in a way
compatible with module_alloc() and can be used as a replacement for
module_alloc().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 include/linux/execmem.h      |  8 ++++++++
 include/linux/moduleloader.h | 12 ------------
 kernel/module/main.c         |  7 -------
 mm/execmem.c                 | 12 ++++++++----
 4 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 68b2bfc79993..b9a97fcdf3c5 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -4,6 +4,14 @@
 
 #include <linux/types.h>
 
+#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
+		!defined(CONFIG_KASAN_VMALLOC)
+#include <linux/kasan.h>
+#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define MODULE_ALIGN PAGE_SIZE
+#endif
+
 /**
  * struct execmem_range - definition of a memory range suitable for code and
  *			  related data allocations
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index b3374342f7af..4321682fe849 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
 /* Additional bytes needed by arch in front of individual sections */
 unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
 
-/* Allocator used for allocating struct module, core sections and init
-   sections.  Returns NULL on failure. */
-void *module_alloc(unsigned long size);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
@@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
 /* Any cleanup before freeing mod->module_init */
 void module_arch_freeing_init(struct module *mod);
 
-#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
-		!defined(CONFIG_KASAN_VMALLOC)
-#include <linux/kasan.h>
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-#else
-#define MODULE_ALIGN PAGE_SIZE
-#endif
-
 #endif
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 43810a3bdb81..b445c5ad863a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
 	}
 }
 
-void * __weak module_alloc(unsigned long size)
-{
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
diff --git a/mm/execmem.c b/mm/execmem.c
index 2fe36dcc7bdf..a67acd75ffef 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -59,9 +59,6 @@ void *execmem_text_alloc(size_t size)
 	unsigned long fallback_end = execmem_params.modules.text.fallback_end;
 	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
 
-	if (!execmem_params.modules.text.start)
-		return module_alloc(size);
-
 	return execmem_alloc(size, start, end, align, pgprot,
 			     fallback_start, fallback_end, kasan);
 }
@@ -108,8 +105,15 @@ void __init execmem_init(void)
 {
 	struct execmem_params *p = execmem_arch_params();
 
-	if (!p)
+	if (!p) {
+		p = &execmem_params;
+		p->modules.text.start = VMALLOC_START;
+		p->modules.text.end = VMALLOC_END;
+		p->modules.text.pgprot = PAGE_KERNEL_EXEC;
+		p->modules.text.alignment = 1;
+
 		return;
+	}
 
 	if (!execmem_validate_params(p))
 		return;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (4 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 05/12] modules, execmem: drop module_alloc Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 16:55   ` Edgecombe, Rick P
                     ` (2 more replies)
  2023-06-16  8:50 ` [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions Mike Rapoport
                   ` (6 subsequent siblings)
  12 siblings, 3 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Data related to code allocations, such as module data section, need to
comply with architecture constraints for its placement and its
allocation right now was done using execmem_text_alloc().

Create a dedicated API for allocating data related to code allocations
and allow architectures to define address ranges for data allocations.

Since currently this is only relevant for powerpc variants that use the
VMALLOC address space for module data allocations, automatically reuse
address ranges defined for text unless address range for data is
explicitly defined by an architecture.

With separation of code and data allocations, data sections of the
modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC which
was a default on many architectures.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/powerpc/kernel/module.c |  8 +++++++
 include/linux/execmem.h      | 18 +++++++++++++++
 kernel/module/main.c         | 15 +++----------
 mm/execmem.c                 | 43 ++++++++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index ba7abff77d98..4c6c15bf3947 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -103,6 +103,10 @@ struct execmem_params __init *execmem_arch_params(void)
 {
 	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
 
+	/*
+	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+	 * allow allocating data in the entire vmalloc space
+	 */
 #ifdef MODULES_VADDR
 	unsigned long limit = (unsigned long)_etext - SZ_32M;
 
@@ -116,6 +120,10 @@ struct execmem_params __init *execmem_arch_params(void)
 		execmem_params.modules.text.start = MODULES_VADDR;
 		execmem_params.modules.text.end = MODULES_END;
 	}
+	execmem_params.modules.data.start = VMALLOC_START;
+	execmem_params.modules.data.end = VMALLOC_END;
+	execmem_params.modules.data.pgprot = PAGE_KERNEL;
+	execmem_params.modules.data.alignment = 1;
 #else
 	execmem_params.modules.text.start = VMALLOC_START;
 	execmem_params.modules.text.end = VMALLOC_END;
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index b9a97fcdf3c5..2e1221310d13 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -44,10 +44,12 @@ enum execmem_module_flags {
  *				  space
  * @flags:	options for module memory allocations
  * @text:	address range for text allocations
+ * @data:	address range for data allocations
  */
 struct execmem_modules_range {
 	enum execmem_module_flags flags;
 	struct execmem_range text;
+	struct execmem_range data;
 };
 
 /**
@@ -89,6 +91,22 @@ struct execmem_params *execmem_arch_params(void);
  */
 void *execmem_text_alloc(size_t size);
 
+/**
+ * execmem_data_alloc - allocate memory for data coupled to code
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain data copupled with executable code,
+ * like data sections in kernel modules.
+ *
+ * The memory will have protections defined by architecture.
+ *
+ * The allocated memory will reside in an area that does not impose
+ * restrictions on the addressing modes.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_data_alloc(size_t size);
+
 /**
  * execmem_free - free executable memory
  * @ptr: pointer to the memory that should be freed
diff --git a/kernel/module/main.c b/kernel/module/main.c
index b445c5ad863a..d6582bfec1f6 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1195,25 +1195,16 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
-static bool mod_mem_use_vmalloc(enum mod_mem_type type)
-{
-	return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
-		mod_mem_type_is_core_data(type);
-}
-
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-	if (mod_mem_use_vmalloc(type))
-		return vzalloc(size);
+	if (mod_mem_type_is_data(type))
+		return execmem_data_alloc(size);
 	return execmem_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
 {
-	if (mod_mem_use_vmalloc(type))
-		vfree(ptr);
-	else
-		execmem_free(ptr);
+	execmem_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
diff --git a/mm/execmem.c b/mm/execmem.c
index a67acd75ffef..f7bf496ad4c3 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
 			     fallback_start, fallback_end, kasan);
 }
 
+void *execmem_data_alloc(size_t size)
+{
+	unsigned long start = execmem_params.modules.data.start;
+	unsigned long end = execmem_params.modules.data.end;
+	pgprot_t pgprot = execmem_params.modules.data.pgprot;
+	unsigned int align = execmem_params.modules.data.alignment;
+	unsigned long fallback_start = execmem_params.modules.data.fallback_start;
+	unsigned long fallback_end = execmem_params.modules.data.fallback_end;
+	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
+
+	return execmem_alloc(size, start, end, align, pgprot,
+			     fallback_start, fallback_end, kasan);
+}
+
 void execmem_free(void *ptr)
 {
 	/*
@@ -101,6 +115,28 @@ static bool execmem_validate_params(struct execmem_params *p)
 	return true;
 }
 
+static void execmem_init_missing(struct execmem_params *p)
+{
+	struct execmem_modules_range *m = &p->modules;
+
+	if (!pgprot_val(execmem_params.modules.data.pgprot))
+		execmem_params.modules.data.pgprot = PAGE_KERNEL;
+
+	if (!execmem_params.modules.data.alignment)
+		execmem_params.modules.data.alignment = m->text.alignment;
+
+	if (!execmem_params.modules.data.start) {
+		execmem_params.modules.data.start = m->text.start;
+		execmem_params.modules.data.end = m->text.end;
+	}
+
+	if (!execmem_params.modules.data.fallback_start &&
+	    execmem_params.modules.text.fallback_start) {
+		execmem_params.modules.data.fallback_start = m->text.fallback_start;
+		execmem_params.modules.data.fallback_end = m->text.fallback_end;
+	}
+}
+
 void __init execmem_init(void)
 {
 	struct execmem_params *p = execmem_arch_params();
@@ -112,6 +148,11 @@ void __init execmem_init(void)
 		p->modules.text.pgprot = PAGE_KERNEL_EXEC;
 		p->modules.text.alignment = 1;
 
+		p->modules.data.start = VMALLOC_START;
+		p->modules.data.end = VMALLOC_END;
+		p->modules.data.pgprot = PAGE_KERNEL;
+		p->modules.data.alignment = 1;
+
 		return;
 	}
 
@@ -119,4 +160,6 @@ void __init execmem_init(void)
 		return;
 
 	execmem_params = *p;
+
+	execmem_init_missing(p);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (5 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 20:05   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations Mike Rapoport
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

The memory allocations for kprobes on arm64 can be placed anywhere in
vmalloc address space and currently this is implemented with an override
of alloc_insn_page() in arm64.

Extend execmem_params with a range for generated code allocations and
make kprobes on arm64 use this extension rather than override
alloc_insn_page().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/arm64/kernel/module.c         |  9 +++++++++
 arch/arm64/kernel/probes/kprobes.c |  7 -------
 include/linux/execmem.h            | 11 +++++++++++
 mm/execmem.c                       | 14 +++++++++++++-
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index c3d999f3a3dd..52b09626bc0f 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
 			.alignment = MODULE_ALIGN,
 		},
 	},
+	.jit = {
+		.text = {
+			.start = VMALLOC_START,
+			.end = VMALLOC_END,
+			.alignment = 1,
+		},
+	},
 };
 
 struct execmem_params __init *execmem_arch_params(void)
@@ -40,6 +47,8 @@ struct execmem_params __init *execmem_arch_params(void)
 	execmem_params.modules.text.start = module_alloc_base;
 	execmem_params.modules.text.end = module_alloc_end;
 
+	execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+
 	/*
 	 * KASAN without KASAN_VMALLOC can only deal with module
 	 * allocations being served from the reserved module region,
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 70b91a8c6bb3..6fccedd02b2a 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 	return 0;
 }
 
-void *alloc_insn_page(void)
-{
-	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 /* arm kprobe: install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 2e1221310d13..dc7c9a446111 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -52,12 +52,23 @@ struct execmem_modules_range {
 	struct execmem_range data;
 };
 
+/**
+ * struct execmem_jit_range - architecure parameters for address space
+ *			      suitable for JIT code allocations
+ * @text:	address range for text allocations
+ */
+struct execmem_jit_range {
+	struct execmem_range text;
+};
+
 /**
  * struct execmem_params -	architecure parameters for code allocations
  * @modules:	parameters for modules address space
+ * @jit:	parameters for jit memory address space
  */
 struct execmem_params {
 	struct execmem_modules_range	modules;
+	struct execmem_jit_range	jit;
 };
 
 /**
diff --git a/mm/execmem.c b/mm/execmem.c
index f7bf496ad4c3..9730ecef9a30 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -89,7 +89,12 @@ void execmem_free(void *ptr)
 
 void *jit_text_alloc(size_t size)
 {
-	return execmem_text_alloc(size);
+	unsigned long start = execmem_params.jit.text.start;
+	unsigned long end = execmem_params.jit.text.end;
+	pgprot_t pgprot = execmem_params.jit.text.pgprot;
+	unsigned int align = execmem_params.jit.text.alignment;
+
+	return execmem_alloc(size, start, end, align, pgprot, 0, 0, false);
 }
 
 void jit_free(void *ptr)
@@ -135,6 +140,13 @@ static void execmem_init_missing(struct execmem_params *p)
 		execmem_params.modules.data.fallback_start = m->text.fallback_start;
 		execmem_params.modules.data.fallback_end = m->text.fallback_end;
 	}
+
+	if (!execmem_params.jit.text.start) {
+		execmem_params.jit.text.start = m->text.start;
+		execmem_params.jit.text.end = m->text.end;
+		execmem_params.jit.text.alignment = m->text.alignment;
+		execmem_params.jit.text.pgprot = m->text.pgprot;
+	}
 }
 
 void __init execmem_init(void)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (6 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 20:09   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 09/12] powerpc: " Mike Rapoport
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

RISC-V overrides kprobes::alloc_insn_range() to use the entire vmalloc area
rather than limit the allocations to the modules area.

Slightly reorder execmem_params initialization to support both 32 and 64
bit variantsi and add definition of jit area to execmem_params to support
generic kprobes::alloc_insn_page().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/riscv/kernel/module.c         | 16 +++++++++++++++-
 arch/riscv/kernel/probes/kprobes.c | 10 ----------
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index ee5e04cd3f21..cca6ed4e9340 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -436,7 +436,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+#ifdef CONFIG_MMU
 static struct execmem_params execmem_params = {
 	.modules = {
 		.text = {
@@ -444,12 +444,26 @@ static struct execmem_params execmem_params = {
 			.alignment = 1,
 		},
 	},
+	.jit = {
+		.text = {
+			.pgprot = PAGE_KERNEL_READ_EXEC,
+			.alignment = 1,
+		},
+	},
 };
 
 struct execmem_params __init *execmem_arch_params(void)
 {
+#ifdef CONFIG_64BIT
 	execmem_params.modules.text.start = MODULES_VADDR;
 	execmem_params.modules.text.end = MODULES_END;
+#else
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
+#endif
+
+	execmem_params.jit.text.start = VMALLOC_START;
+	execmem_params.jit.text.end = VMALLOC_END;
 
 	return &execmem_params;
 }
diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 	return 0;
 }
 
-#ifdef CONFIG_MMU
-void *alloc_insn_page(void)
-{
-	return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-				     GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
-				     VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-				     __builtin_return_address(0));
-}
-#endif
-
 /* install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 09/12] powerpc: extend execmem_params for kprobes allocations
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (7 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 20:09   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES Mike Rapoport
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add definition of jit area to execmem_params to allow using the generic
kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/powerpc/kernel/kprobes.c | 14 --------------
 arch/powerpc/kernel/module.c  | 13 +++++++++++++
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 5db8df5e3657..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
 	return (kprobe_opcode_t *)(addr + offset);
 }
 
-void *alloc_insn_page(void)
-{
-	void *page;
-
-	page = jit_text_alloc(PAGE_SIZE);
-	if (!page)
-		return NULL;
-
-	if (strict_module_rwx_enabled())
-		set_memory_rox((unsigned long)page, 1);
-
-	return page;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
 	int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 4c6c15bf3947..8e5b379d6da1 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -96,6 +96,11 @@ static struct execmem_params execmem_params = {
 			.alignment = 1,
 		},
 	},
+	.jit = {
+		.text = {
+			.alignment = 1,
+		},
+	},
 };
 
 
@@ -131,5 +136,13 @@ struct execmem_params __init *execmem_arch_params(void)
 
 	execmem_params.modules.text.pgprot = prot;
 
+	execmem_params.jit.text.start = VMALLOC_START;
+	execmem_params.jit.text.end = VMALLOC_END;
+
+	if (strict_module_rwx_enabled())
+		execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+	else
+		execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
+
 	return &execmem_params;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (8 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 09/12] powerpc: " Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 20:17   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

execmem does not depend on modules, on the contrary modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/kernel/module.c and compile it when CONFIG_EXECMEM=y

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/arm/kernel/module.c        | 36 --------------------
 arch/arm/mm/init.c              | 36 ++++++++++++++++++++
 arch/arm64/include/asm/memory.h |  8 +++++
 arch/arm64/include/asm/module.h |  6 ----
 arch/arm64/kernel/kaslr.c       |  3 +-
 arch/arm64/kernel/module.c      | 50 ----------------------------
 arch/arm64/mm/init.c            | 56 +++++++++++++++++++++++++++++++
 arch/loongarch/kernel/module.c  | 18 ----------
 arch/loongarch/mm/init.c        | 20 +++++++++++
 arch/mips/kernel/module.c       | 18 ----------
 arch/mips/mm/init.c             | 19 +++++++++++
 arch/parisc/kernel/module.c     | 17 ----------
 arch/parisc/mm/init.c           | 22 +++++++++++-
 arch/powerpc/kernel/module.c    | 58 --------------------------------
 arch/powerpc/mm/mem.c           | 59 +++++++++++++++++++++++++++++++++
 arch/riscv/kernel/module.c      | 34 -------------------
 arch/riscv/mm/init.c            | 34 +++++++++++++++++++
 arch/s390/kernel/module.c       | 38 ---------------------
 arch/s390/mm/init.c             | 41 +++++++++++++++++++++++
 arch/sparc/kernel/module.c      | 23 -------------
 arch/sparc/mm/Makefile          |  2 ++
 arch/sparc/mm/execmem.c         | 25 ++++++++++++++
 arch/x86/kernel/module.c        | 50 ----------------------------
 arch/x86/mm/init.c              | 54 ++++++++++++++++++++++++++++++
 24 files changed, 376 insertions(+), 351 deletions(-)
 create mode 100644 arch/sparc/mm/execmem.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index f66d479c1c7d..054e799e7091 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,48 +16,12 @@
 #include <linux/fs.h>
 #include <linux/string.h>
 #include <linux/gfp.h>
-#include <linux/execmem.h>
 
 #include <asm/sections.h>
 #include <asm/smp_plat.h>
 #include <asm/unwind.h>
 #include <asm/opcodes.h>
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR	(((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.start = MODULES_VADDR,
-			.end = MODULES_END,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
-
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-		execmem_params.modules.text.fallback_start = VMALLOC_START;
-		execmem_params.modules.text.fallback_end = VMALLOC_END;
-	}
-
-	return &execmem_params;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
 	return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index ce64bdb55a16..cffd29f15782 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include <linux/sizes.h>
 #include <linux/stop_machine.h>
 #include <linux/swiotlb.h>
+#include <linux/execmem.h>
 
 #include <asm/cp15.h>
 #include <asm/mach-types.h>
@@ -486,3 +487,38 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 	free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR	(((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#if defined(CONFIG_MMU) && defined(CONFIG_EXECMEM)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
+
+	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+		execmem_params.modules.text.fallback_start = VMALLOC_START;
+		execmem_params.modules.text.fallback_end = VMALLOC_END;
+	}
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c735afdf639b..962ad2d8b5e5 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -214,6 +214,14 @@ static inline bool kaslr_enabled(void)
 	return kaslr_offset() >= MIN_KIMG_ALIGN;
 }
 
+#ifdef CONFIG_RANDOMIZE_BASE
+extern u64 module_alloc_base;
+int kaslr_init(void);
+#else
+static int kaslr_init(void) {}
+#define module_alloc_base	((u64)_etext - MODULES_VSIZE)
+#endif
+
 /*
  * Allow all memory at the discovery stage. We will clip it later.
  */
diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
index 18734fed3bdd..3e7dcc5fa2f2 100644
--- a/arch/arm64/include/asm/module.h
+++ b/arch/arm64/include/asm/module.h
@@ -30,12 +30,6 @@ u64 module_emit_plt_entry(struct module *mod, Elf64_Shdr *sechdrs,
 u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
 				void *loc, u64 val);
 
-#ifdef CONFIG_RANDOMIZE_BASE
-extern u64 module_alloc_base;
-#else
-#define module_alloc_base	((u64)_etext - MODULES_VSIZE)
-#endif
-
 struct plt_entry {
 	/*
 	 * A program that conforms to the AArch64 Procedure Call Standard
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index e7477f21a4c9..1f5ce4e41e59 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -25,7 +25,7 @@ u16 __initdata memstart_offset_seed;
 
 struct arm64_ftr_override kaslr_feature_override __initdata;
 
-static int __init kaslr_init(void)
+int __init kaslr_init(void)
 {
 	u64 module_range;
 	u32 seed;
@@ -90,4 +90,3 @@ static int __init kaslr_init(void)
 
 	return 0;
 }
-subsys_initcall(kaslr_init)
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 52b09626bc0f..6d09b29fe9db 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -17,61 +17,11 @@
 #include <linux/moduleloader.h>
 #include <linux/scs.h>
 #include <linux/vmalloc.h>
-#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/insn.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
 
-static struct execmem_params execmem_params = {
-	.modules = {
-		.flags = EXECMEM_KASAN_SHADOW,
-		.text = {
-			.alignment = MODULE_ALIGN,
-		},
-	},
-	.jit = {
-		.text = {
-			.start = VMALLOC_START,
-			.end = VMALLOC_END,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-
-	execmem_params.modules.text.pgprot = PAGE_KERNEL;
-	execmem_params.modules.text.start = module_alloc_base;
-	execmem_params.modules.text.end = module_alloc_end;
-
-	execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
-
-	/*
-	 * KASAN without KASAN_VMALLOC can only deal with module
-	 * allocations being served from the reserved module region,
-	 * since the remainder of the vmalloc region is already
-	 * backed by zero shadow pages, and punching holes into it
-	 * is non-trivial. Since the module region is not randomized
-	 * when KASAN is enabled without KASAN_VMALLOC, it is even
-	 * less likely that the module region gets exhausted, so we
-	 * can simply omit this fallback in that case.
-	 */
-	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
-	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
-	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
-		unsigned long end = module_alloc_base + SZ_2G;
-
-		execmem_params.modules.text.fallback_start = module_alloc_base;
-		execmem_params.modules.text.fallback_end = end;
-	}
-
-	return &execmem_params;
-}
-
 enum aarch64_reloc_op {
 	RELOC_OP_NONE,
 	RELOC_OP_ABS,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 66e70ca47680..7e3c202ff6ea 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -31,6 +31,7 @@
 #include <linux/hugetlb.h>
 #include <linux/acpi_iort.h>
 #include <linux/kmemleak.h>
+#include <linux/execmem.h>
 
 #include <asm/boot.h>
 #include <asm/fixmap.h>
@@ -493,3 +494,58 @@ void dump_mem_limit(void)
 		pr_emerg("Memory Limit: none\n");
 	}
 }
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_params execmem_params = {
+	.modules = {
+		.flags = EXECMEM_KASAN_SHADOW,
+		.text = {
+			.alignment = MODULE_ALIGN,
+		},
+	},
+	.jit = {
+		.text = {
+			.start = VMALLOC_START,
+			.end = VMALLOC_END,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	u64 module_alloc_end;
+
+	kaslr_init();
+
+	module_alloc_end = module_alloc_base + MODULES_VSIZE;
+
+	execmem_params.modules.text.pgprot = PAGE_KERNEL;
+	execmem_params.modules.text.start = module_alloc_base;
+	execmem_params.modules.text.end = module_alloc_end;
+
+	execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+
+	/*
+	 * KASAN without KASAN_VMALLOC can only deal with module
+	 * allocations being served from the reserved module region,
+	 * since the remainder of the vmalloc region is already
+	 * backed by zero shadow pages, and punching holes into it
+	 * is non-trivial. Since the module region is not randomized
+	 * when KASAN is enabled without KASAN_VMALLOC, it is even
+	 * less likely that the module region gets exhausted, so we
+	 * can simply omit this fallback in that case.
+	 */
+	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
+	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
+	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
+	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
+		unsigned long end = module_alloc_base + SZ_2G;
+
+		execmem_params.modules.text.fallback_start = module_alloc_base;
+		execmem_params.modules.text.fallback_end = end;
+	}
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index 32b167722c2b..181b5f8b09f1 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,7 +18,6 @@
 #include <linux/ftrace.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
-#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/inst.h>
 
@@ -470,23 +469,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.pgprot = PAGE_KERNEL,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	execmem_params.modules.text.start = MODULES_VADDR;
-	execmem_params.modules.text.end = MODULES_END;
-
-	return &execmem_params;
-}
-
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
 				   const Elf_Shdr *sechdrs, struct module *mod)
 {
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index 3b7d8129570b..e752f94f87ed 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -24,6 +24,7 @@
 #include <linux/gfp.h>
 #include <linux/hugetlb.h>
 #include <linux/mmzone.h>
+#include <linux/execmem.h>
 
 #include <asm/asm-offsets.h>
 #include <asm/bootinfo.h>
@@ -274,3 +275,22 @@ EXPORT_SYMBOL(invalid_pmd_table);
 #endif
 pte_t invalid_pte_table[PTRS_PER_PTE] __page_aligned_bss;
 EXPORT_SYMBOL(invalid_pte_table);
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.pgprot = PAGE_KERNEL,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	execmem_params.modules.text.start = MODULES_VADDR;
+	execmem_params.modules.text.end = MODULES_END;
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 2d370de67383..ebf9496f5db0 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -33,24 +33,6 @@ struct mips_hi16 {
 static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
-#ifdef MODULE_START
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.start = MODULES_VADDR,
-			.end = MODULES_END,
-			.pgprot = PAGE_KERNEL,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	return &execmem_params;
-}
-#endif
-
 static void apply_r_mips_32(u32 *location, u32 base, Elf_Addr v)
 {
 	*location = base + v;
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 5a8002839550..630656921a87 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -31,6 +31,7 @@
 #include <linux/gfp.h>
 #include <linux/kcore.h>
 #include <linux/initrd.h>
+#include <linux/execmem.h>
 
 #include <asm/bootinfo.h>
 #include <asm/cachectl.h>
@@ -568,3 +569,21 @@ EXPORT_SYMBOL_GPL(invalid_pmd_table);
 #endif
 pte_t invalid_pte_table[PTRS_PER_PTE] __page_aligned_bss;
 EXPORT_SYMBOL(invalid_pte_table);
+
+#if defined(CONFIG_EXECMEM) && defined(MODULE_START)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+			.pgprot = PAGE_KERNEL,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	return &execmem_params;
+}
+#endif
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index 569b8f52a24b..b975c9d583f1 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -174,23 +174,6 @@ static inline int reassemble_22(int as22)
 		((as22 & 0x0003ff) << 3));
 }
 
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.pgprot = PAGE_KERNEL_RWX,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	execmem_params.modules.text.start = VMALLOC_START;
-	execmem_params.modules.text.end = VMALLOC_END;
-
-	return &execmem_params;
-}
-
 #ifndef CONFIG_64BIT
 static inline unsigned long count_gots(const Elf_Rela *rela, unsigned long n)
 {
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index b0c43f3b0a5f..659eddaa492d 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -24,6 +24,7 @@
 #include <linux/nodemask.h>	/* for node_online_map */
 #include <linux/pagemap.h>	/* for release_pages */
 #include <linux/compat.h>
+#include <linux/execmem.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -479,7 +480,7 @@ void free_initmem(void)
 	/* finally dump all the instructions which were cached, since the
 	 * pages are no-longer executable */
 	flush_icache_range(init_begin, init_end);
-	
+
 	free_initmem_default(POISON_FREE_INITMEM);
 
 	/* set up a new led state on systems shipped LED State panel */
@@ -891,3 +892,22 @@ static const pgprot_t protection_map[16] = {
 	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_RWX
 };
 DECLARE_VM_GET_PAGE_PROT
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.pgprot = PAGE_KERNEL_RWX,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 8e5b379d6da1..b30e00964a60 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -10,7 +10,6 @@
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
 #include <linux/bug.h>
-#include <linux/execmem.h>
 #include <asm/module.h>
 #include <linux/uaccess.h>
 #include <asm/firmware.h>
@@ -89,60 +88,3 @@ int module_finalize(const Elf_Ehdr *hdr,
 
 	return 0;
 }
-
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.alignment = 1,
-		},
-	},
-	.jit = {
-		.text = {
-			.alignment = 1,
-		},
-	},
-};
-
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
-
-	/*
-	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
-	 * allow allocating data in the entire vmalloc space
-	 */
-#ifdef MODULES_VADDR
-	unsigned long limit = (unsigned long)_etext - SZ_32M;
-
-	/* First try within 32M limit from _etext to avoid branch trampolines */
-	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
-		execmem_params.modules.text.start = limit;
-		execmem_params.modules.text.end = MODULES_END;
-		execmem_params.modules.text.fallback_start = MODULES_VADDR;
-		execmem_params.modules.text.fallback_end = MODULES_END;
-	} else {
-		execmem_params.modules.text.start = MODULES_VADDR;
-		execmem_params.modules.text.end = MODULES_END;
-	}
-	execmem_params.modules.data.start = VMALLOC_START;
-	execmem_params.modules.data.end = VMALLOC_END;
-	execmem_params.modules.data.pgprot = PAGE_KERNEL;
-	execmem_params.modules.data.alignment = 1;
-#else
-	execmem_params.modules.text.start = VMALLOC_START;
-	execmem_params.modules.text.end = VMALLOC_END;
-#endif
-
-	execmem_params.modules.text.pgprot = prot;
-
-	execmem_params.jit.text.start = VMALLOC_START;
-	execmem_params.jit.text.end = VMALLOC_END;
-
-	if (strict_module_rwx_enabled())
-		execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
-	else
-		execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
-
-	return &execmem_params;
-}
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8b121df7b08f..84cc805d780d 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -16,6 +16,7 @@
 #include <linux/highmem.h>
 #include <linux/suspend.h>
 #include <linux/dma-direct.h>
+#include <linux/execmem.h>
 
 #include <asm/swiotlb.h>
 #include <asm/machdep.h>
@@ -406,3 +407,61 @@ int devmem_is_allowed(unsigned long pfn)
  * the EHEA driver. Drop this when drivers/net/ethernet/ibm/ehea is removed.
  */
 EXPORT_SYMBOL_GPL(walk_system_ram_range);
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.alignment = 1,
+		},
+	},
+	.jit = {
+		.text = {
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
+
+	/*
+	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+	 * allow allocating data in the entire vmalloc space
+	 */
+#ifdef MODULES_VADDR
+	unsigned long limit = (unsigned long)_etext - SZ_32M;
+
+	/* First try within 32M limit from _etext to avoid branch trampolines */
+	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
+		execmem_params.modules.text.start = limit;
+		execmem_params.modules.text.end = MODULES_END;
+		execmem_params.modules.text.fallback_start = MODULES_VADDR;
+		execmem_params.modules.text.fallback_end = MODULES_END;
+	} else {
+		execmem_params.modules.text.start = MODULES_VADDR;
+		execmem_params.modules.text.end = MODULES_END;
+	}
+	execmem_params.modules.data.start = VMALLOC_START;
+	execmem_params.modules.data.end = VMALLOC_END;
+	execmem_params.modules.data.pgprot = PAGE_KERNEL;
+	execmem_params.modules.data.alignment = 1;
+#else
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
+#endif
+
+	execmem_params.modules.text.pgprot = prot;
+
+	execmem_params.jit.text.start = VMALLOC_START;
+	execmem_params.jit.text.end = VMALLOC_END;
+
+	if (strict_module_rwx_enabled())
+		execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+	else
+		execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index cca6ed4e9340..8af08d5449bf 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -11,7 +11,6 @@
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
 #include <linux/pgtable.h>
-#include <linux/execmem.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 
@@ -436,39 +435,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-#ifdef CONFIG_MMU
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.pgprot = PAGE_KERNEL,
-			.alignment = 1,
-		},
-	},
-	.jit = {
-		.text = {
-			.pgprot = PAGE_KERNEL_READ_EXEC,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-#ifdef CONFIG_64BIT
-	execmem_params.modules.text.start = MODULES_VADDR;
-	execmem_params.modules.text.end = MODULES_END;
-#else
-	execmem_params.modules.text.start = VMALLOC_START;
-	execmem_params.modules.text.end = VMALLOC_END;
-#endif
-
-	execmem_params.jit.text.start = VMALLOC_START;
-	execmem_params.jit.text.end = VMALLOC_END;
-
-	return &execmem_params;
-}
-#endif
-
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 747e5b1ef02d..d7b88d89f021 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -23,6 +23,7 @@
 #ifdef CONFIG_RELOCATABLE
 #include <linux/elf.h>
 #endif
+#include <linux/execmem.h>
 
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
@@ -1363,3 +1364,36 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return vmemmap_populate_basepages(start, end, node, NULL);
 }
 #endif
+
+#if defined(CONFIG_MMU) && defined(CONFIG_EXECMEM)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.pgprot = PAGE_KERNEL,
+			.alignment = 1,
+		},
+	},
+	.jit = {
+		.text = {
+			.pgprot = PAGE_KERNEL_READ_EXEC,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+#ifdef CONFIG_64BIT
+	execmem_params.modules.text.start = MODULES_VADDR;
+	execmem_params.modules.text.end = MODULES_END;
+#else
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
+#endif
+
+	execmem_params.jit.text.start = VMALLOC_START;
+	execmem_params.jit.text.end = VMALLOC_END;
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 7fff395d26ea..3773c842fc01 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -37,44 +37,6 @@
 
 #define PLT_ENTRY_SIZE 22
 
-static unsigned long get_module_load_offset(void)
-{
-	static DEFINE_MUTEX(module_kaslr_mutex);
-	static unsigned long module_load_offset;
-
-	if (!kaslr_enabled())
-		return 0;
-	/*
-	 * Calculate the module_load_offset the first time this code
-	 * is called. Once calculated it stays the same until reboot.
-	 */
-	mutex_lock(&module_kaslr_mutex);
-	if (!module_load_offset)
-		module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-	mutex_unlock(&module_kaslr_mutex);
-	return module_load_offset;
-}
-
-static struct execmem_params execmem_params = {
-	.modules = {
-		.flags = EXECMEM_KASAN_SHADOW,
-		.text = {
-			.alignment = MODULE_ALIGN,
-			.pgprot = PAGE_KERNEL,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	unsigned long start = MODULES_VADDR + get_module_load_offset();
-
-	execmem_params.modules.text.start = start;
-	execmem_params.modules.text.end = MODULES_END;
-
-	return &execmem_params;
-}
-
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 8d94e29adcdb..cab5c2d041d1 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -34,6 +34,7 @@
 #include <linux/percpu.h>
 #include <asm/processor.h>
 #include <linux/uaccess.h>
+#include <linux/execmem.h>
 #include <asm/pgalloc.h>
 #include <asm/kfence.h>
 #include <asm/ptdump.h>
@@ -311,3 +312,43 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
+
+#ifdef CONFIG_EXECMEM
+static unsigned long get_module_load_offset(void)
+{
+	static DEFINE_MUTEX(module_kaslr_mutex);
+	static unsigned long module_load_offset;
+
+	if (!kaslr_enabled())
+		return 0;
+	/*
+	 * Calculate the module_load_offset the first time this code
+	 * is called. Once calculated it stays the same until reboot.
+	 */
+	mutex_lock(&module_kaslr_mutex);
+	if (!module_load_offset)
+		module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+	mutex_unlock(&module_kaslr_mutex);
+	return module_load_offset;
+}
+
+static struct execmem_params execmem_params = {
+	.modules = {
+		.flags = EXECMEM_KASAN_SHADOW,
+		.text = {
+			.alignment = MODULE_ALIGN,
+			.pgprot = PAGE_KERNEL,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	unsigned long start = MODULES_VADDR + get_module_load_offset();
+
+	execmem_params.modules.text.start = start;
+	execmem_params.modules.text.end = MODULES_END;
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index ab75e3e69834..dff1d85ba202 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -14,7 +14,6 @@
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <linux/mm.h>
-#include <linux/execmem.h>
 #ifdef CONFIG_SPARC64
 #include <linux/jump_label.h>
 #endif
@@ -25,28 +24,6 @@
 
 #include "entry.h"
 
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-#ifdef CONFIG_SPARC64
-			.start = MODULES_VADDR,
-			.end = MODULES_END,
-#else
-			.start = VMALLOC_START,
-			.end = VMALLOC_END,
-#endif
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	execmem_params.modules.text.pgprot = PAGE_KERNEL;
-
-	return &execmem_params;
-}
-
 /* Make generic code ignore STT_REGISTER dummy undefined symbols.  */
 int module_frob_arch_sections(Elf_Ehdr *hdr,
 			      Elf_Shdr *sechdrs,
diff --git a/arch/sparc/mm/Makefile b/arch/sparc/mm/Makefile
index 871354aa3c00..87e2cf7efb5b 100644
--- a/arch/sparc/mm/Makefile
+++ b/arch/sparc/mm/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_SPARC32)   += leon_mm.o
 
 # Only used by sparc64
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
+
+obj-$(CONFIG_EXECMEM) += execmem.o
diff --git a/arch/sparc/mm/execmem.c b/arch/sparc/mm/execmem.c
new file mode 100644
index 000000000000..74708b7b1f84
--- /dev/null
+++ b/arch/sparc/mm/execmem.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/execmem.h>
+
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+#ifdef CONFIG_SPARC64
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+#else
+			.start = VMALLOC_START,
+			.end = VMALLOC_END,
+#endif
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	execmem_params.modules.text.pgprot = PAGE_KERNEL;
+
+	return &execmem_params;
+}
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index cf9a7d0a8b62..94a00dc103cd 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,7 +19,6 @@
 #include <linux/jump_label.h>
 #include <linux/random.h>
 #include <linux/memory.h>
-#include <linux/execmem.h>
 
 #include <asm/text-patching.h>
 #include <asm/page.h>
@@ -37,55 +36,6 @@ do {							\
 } while (0)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_BASE
-static unsigned long module_load_offset;
-
-/* Mutex protects the module_load_offset. */
-static DEFINE_MUTEX(module_kaslr_mutex);
-
-static unsigned long int get_module_load_offset(void)
-{
-	if (kaslr_enabled()) {
-		mutex_lock(&module_kaslr_mutex);
-		/*
-		 * Calculate the module_load_offset the first time this
-		 * code is called. Once calculated it stays the same until
-		 * reboot.
-		 */
-		if (module_load_offset == 0)
-			module_load_offset =
-				get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-		mutex_unlock(&module_kaslr_mutex);
-	}
-	return module_load_offset;
-}
-#else
-static unsigned long int get_module_load_offset(void)
-{
-	return 0;
-}
-#endif
-
-static struct execmem_params execmem_params = {
-	.modules = {
-		.flags = EXECMEM_KASAN_SHADOW,
-		.text = {
-			.alignment = MODULE_ALIGN,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	unsigned long start = MODULES_VADDR + get_module_load_offset();
-
-	execmem_params.modules.text.start = start;
-	execmem_params.modules.text.end = MODULES_END;
-	execmem_params.modules.text.pgprot = PAGE_KERNEL;
-
-	return &execmem_params;
-}
-
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
 		   const char *strtab,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3cdac0f0055d..b3833cca68bc 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -7,6 +7,7 @@
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
 #include <linux/sched/task.h>
+#include <linux/execmem.h>
 
 #include <asm/set_memory.h>
 #include <asm/e820/api.h>
@@ -1084,3 +1085,56 @@ unsigned long arch_max_swapfile_size(void)
 	return pages;
 }
 #endif
+
+#ifdef CONFIG_EXECMEM
+
+#ifdef CONFIG_RANDOMIZE_BASE
+static unsigned long module_load_offset;
+
+/* Mutex protects the module_load_offset. */
+static DEFINE_MUTEX(module_kaslr_mutex);
+
+static unsigned long int get_module_load_offset(void)
+{
+	if (kaslr_enabled()) {
+		mutex_lock(&module_kaslr_mutex);
+		/*
+		 * Calculate the module_load_offset the first time this
+		 * code is called. Once calculated it stays the same until
+		 * reboot.
+		 */
+		if (module_load_offset == 0)
+			module_load_offset =
+				get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+		mutex_unlock(&module_kaslr_mutex);
+	}
+	return module_load_offset;
+}
+#else
+static unsigned long int get_module_load_offset(void)
+{
+	return 0;
+}
+#endif
+
+static struct execmem_params execmem_params = {
+	.modules = {
+		.flags = EXECMEM_KASAN_SHADOW,
+		.text = {
+			.alignment = MODULE_ALIGN,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	unsigned long start = MODULES_VADDR + get_module_load_offset();
+
+	execmem_params.modules.text.start = start;
+	execmem_params.modules.text.end = MODULES_END;
+	execmem_params.modules.text.pgprot = PAGE_KERNEL;
+
+	return &execmem_params;
+}
+
+#endif /* CONFIG_EXECMEM */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (9 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 20:18   ` Song Liu
  2023-06-16  8:50 ` [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
  2023-06-16 17:02 ` [PATCH v2 00/12] mm: jit/text allocator Edgecombe, Rick P
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_text_alloc() is
available regardless of CONFIG_MODULES.

Remove dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/x86/Kconfig         |  1 +
 arch/x86/kernel/ftrace.c | 10 ----------
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..ab64bbef9e50 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select EXECMEM if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index f77c63bb3203..a824a5d3b129 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
 	return execmem_text_alloc(size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
 {
 	execmem_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-	return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (10 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
@ 2023-06-16  8:50 ` Mike Rapoport
  2023-06-16 11:44   ` Björn Töpel
  2023-06-16 17:02 ` [PATCH v2 00/12] mm: jit/text allocator Edgecombe, Rick P
  12 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-16  8:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

kprobes depended on CONFIG_MODULES because it has to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/Kconfig                |  2 +-
 kernel/kprobes.c            | 43 +++++++++++++++++++++----------------
 kernel/trace/trace_kprobe.c | 11 ++++++++++
 3 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..f2e9f82c7d0d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -39,9 +39,9 @@ config GENERIC_ENTRY
 
 config KPROBES
 	bool "Kprobes"
-	depends on MODULES
 	depends on HAVE_KPROBES
 	select KALLSYMS
+	select EXECMEM
 	select TASKS_RCU if PREEMPTION
 	help
 	  Kprobes allows you to trap at almost any kernel address and
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 37c928d5deaf..2c2ba29d3f9a 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1568,6 +1568,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
 		goto out;
 	}
 
+#ifdef CONFIG_MODULES
 	/* Check if 'p' is probing a module. */
 	*probed_mod = __module_text_address((unsigned long) p->addr);
 	if (*probed_mod) {
@@ -1591,6 +1592,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
 			ret = -ENOENT;
 		}
 	}
+#endif
+
 out:
 	preempt_enable();
 	jump_label_unlock();
@@ -2484,24 +2487,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
 	return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
-{
-	struct kprobe_blacklist_entry *ent, *n;
-
-	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-		if (ent->start_addr < start || ent->start_addr >= end)
-			continue;
-		list_del(&ent->list);
-		kfree(ent);
-	}
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-	kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
 				   char *type, char *sym)
 {
@@ -2566,6 +2551,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 	return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+	struct kprobe_blacklist_entry *ent, *n;
+
+	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+		if (ent->start_addr < start || ent->start_addr >= end)
+			continue;
+		list_del(&ent->list);
+		kfree(ent);
+	}
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+	kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
@@ -2667,6 +2671,7 @@ static struct notifier_block kprobe_module_nb = {
 	.notifier_call = kprobes_module_callback,
 	.priority = 0
 };
+#endif
 
 void kprobe_free_init_mem(void)
 {
@@ -2726,8 +2731,10 @@ static int __init init_kprobes(void)
 	err = arch_init_kprobes();
 	if (!err)
 		err = register_die_notifier(&kprobe_exceptions_nb);
+#ifdef CONFIG_MODULES
 	if (!err)
 		err = register_module_notifier(&kprobe_module_nb);
+#endif
 
 	kprobes_initialized = (err == 0);
 	kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 59cda19a9033..cf804e372554 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -111,6 +111,7 @@ static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
 	return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
 }
 
+#ifdef CONFIG_MODULES
 static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 {
 	char *p;
@@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 
 	return ret;
 }
+#else
+static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+	return false;
+}
+#endif
 
 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
 	return ret;
 }
 
+#ifdef CONFIG_MODULES
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
 				       unsigned long val, void *data)
@@ -704,6 +712,7 @@ static struct notifier_block trace_kprobe_module_nb = {
 	.notifier_call = trace_kprobe_module_callback,
 	.priority = 1	/* Invoked after kprobe module callback */
 };
+#endif
 
 static int __trace_kprobe_create(int argc, const char *argv[])
 {
@@ -1797,8 +1806,10 @@ static __init int init_kprobe_trace_early(void)
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_MODULES
 	if (register_module_notifier(&trace_kprobe_module_nb))
 		return -EINVAL;
+#endif
 
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES
  2023-06-16  8:50 ` [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
@ 2023-06-16 11:44   ` Björn Töpel
  2023-06-17  6:52     ` Mike Rapoport
  0 siblings, 1 reply; 65+ messages in thread
From: Björn Töpel @ 2023-06-16 11:44 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

Mike Rapoport <rppt@kernel.org> writes:

> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.

I think you can remove the MODULES dependency from BPF_JIT as well:

--8<--
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 2dfe1079f772..fa4587027f8b 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -41,7 +41,6 @@ config BPF_JIT
        bool "Enable BPF Just In Time compiler"
        depends on BPF
        depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-       depends on MODULES
        help
          BPF programs are normally handled by a BPF interpreter. This option
          allows the kernel to generate native code when a program is loaded
--8<--


Björn

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/12] nios2: define virtual address space for modules
  2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
@ 2023-06-16 16:00   ` Edgecombe, Rick P
  2023-06-17  5:52     ` Mike Rapoport
  2023-06-16 18:14   ` Song Liu
  1 sibling, 1 reply; 65+ messages in thread
From: Edgecombe, Rick P @ 2023-06-16 16:00 UTC (permalink / raw)
  To: linux-kernel, rppt
  Cc: tglx, mcgrof, deller, davem, nadav.amit, linux, netdev,
	linux-mips, linux-riscv, hca, catalin.marinas, kent.overstreet,
	puranjay12, linux-s390, palmer, chenhuacai, tsbogend,
	linux-trace-kernel, linux-parisc, christophe.leroy, x86, mpe,
	mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
>  void *module_alloc(unsigned long size)
>  {
> -       if (size == 0)
> -               return NULL;
> -       return kmalloc(size, GFP_KERNEL);
> -}
> -
> -/* Free memory returned from module_alloc */
> -void module_memfree(void *module_region)
> -{
> -       kfree(module_region);
> +       return __vmalloc_node_range(size, 1, MODULES_VADDR,
> MODULES_END,
> +                                   GFP_KERNEL, PAGE_KERNEL_EXEC,
> +                                   VM_FLUSH_RESET_PERMS,
> NUMA_NO_NODE,
> +                                   __builtin_return_address(0));
>  }
>  
>  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *s

I wonder if the (size == 0) check is really needed, but
__vmalloc_node_range() will WARN on this case where the old code won't.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
@ 2023-06-16 16:16   ` Edgecombe, Rick P
  2023-06-17  6:10     ` Mike Rapoport
  2023-06-16 18:53   ` Song Liu
  1 sibling, 1 reply; 65+ messages in thread
From: Edgecombe, Rick P @ 2023-06-16 16:16 UTC (permalink / raw)
  To: linux-kernel, rppt
  Cc: tglx, mcgrof, deller, davem, nadav.amit, linux, netdev,
	linux-mips, linux-riscv, hca, catalin.marinas, kent.overstreet,
	puranjay12, linux-s390, palmer, chenhuacai, tsbogend,
	linux-trace-kernel, linux-parisc, christophe.leroy, x86, mpe,
	mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> -void *module_alloc(unsigned long size)
> -{
> -       gfp_t gfp_mask = GFP_KERNEL;
> -       void *p;
> -
> -       if (PAGE_ALIGN(size) > MODULES_LEN)
> -               return NULL;
> +static struct execmem_params execmem_params = {
> +       .modules = {
> +               .flags = EXECMEM_KASAN_SHADOW,
> +               .text = {
> +                       .alignment = MODULE_ALIGN,
> +               },
> +       },
> +};

Did you consider making these execmem_params's ro_after_init? Not that
it is security sensitive, but it's a nice hint to the reader that it is
only modified at init. And I guess basically free sanitizing of buggy
writes to it.

>  
> -       p = __vmalloc_node_range(size, MODULE_ALIGN,
> -                                MODULES_VADDR +
> get_module_load_offset(),
> -                                MODULES_END, gfp_mask, PAGE_KERNEL,
> -                                VM_FLUSH_RESET_PERMS |
> VM_DEFER_KMEMLEAK,
> -                                NUMA_NO_NODE,
> __builtin_return_address(0));
> +struct execmem_params __init *execmem_arch_params(void)
> +{
> +       unsigned long start = MODULES_VADDR +
> get_module_load_offset();

I think we can drop the mutex's in get_module_load_offset() now, since
execmem_arch_params() should only be called once at init.

>  
> -       if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0))
> {
> -               vfree(p);
> -               return NULL;
> -       }
> +       execmem_params.modules.text.start = start;
> +       execmem_params.modules.text.end = MODULES_END;
> +       execmem_params.modules.text.pgprot = PAGE_KERNEL;
>  
> -       return p;
> +       return &execmem_params;
>  }
>  


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
@ 2023-06-16 16:48   ` Kent Overstreet
  2023-06-16 18:18     ` Song Liu
  2023-06-17  5:57     ` Mike Rapoport
  2023-06-17 20:38   ` Andy Lutomirski
  1 sibling, 2 replies; 65+ messages in thread
From: Kent Overstreet @ 2023-06-16 16:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Luis Chamberlain, Mark Rutland, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick Edgecombe, Russell King, Song Liu, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> 
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
> 
> The intention semantics for new allocation APIs:
> 
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
> 
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
> 
> The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> allocated memory is for executable code, the allocations of the
> associated data, like data sections of a module will use
> execmem_data_alloc() interface that will be added later.

I like the API split - at the risk of further bikeshedding, perhaps
near_text_alloc() and far_text_alloc()? Would be more explicit.

Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
@ 2023-06-16 16:55   ` Edgecombe, Rick P
  2023-06-17  6:44     ` Mike Rapoport
  2023-06-16 20:01   ` Song Liu
  2023-06-18 22:32   ` Thomas Gleixner
  2 siblings, 1 reply; 65+ messages in thread
From: Edgecombe, Rick P @ 2023-06-16 16:55 UTC (permalink / raw)
  To: linux-kernel, rppt
  Cc: tglx, mcgrof, deller, davem, nadav.amit, linux, netdev,
	linux-mips, linux-riscv, hca, catalin.marinas, kent.overstreet,
	puranjay12, linux-s390, palmer, chenhuacai, tsbogend,
	linux-trace-kernel, linux-parisc, christophe.leroy, x86, mpe,
	mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> Data related to code allocations, such as module data section, need
> to
> comply with architecture constraints for its placement and its
> allocation right now was done using execmem_text_alloc().
> 
> Create a dedicated API for allocating data related to code
> allocations
> and allow architectures to define address ranges for data
> allocations.

Right now the cross-arch way to specify kernel memory permissions is
encoded in the function names of all the set_memory_foo()'s. You can't
just have unified prot names because some arch's have NX and some have
X bits, etc. CPA wouldn't know if it needs to set or unset a bit if you
pass in a PROT.

But then you end up with a new function for *each* combination (i.e.
set_memory_rox()). I wish CPA has flags like mmap() does, and I wonder
if it makes sense here instead of execmem_data_alloc().

Maybe that is an overhaul for another day though...

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 00/12] mm: jit/text allocator
  2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
                   ` (11 preceding siblings ...)
  2023-06-16  8:50 ` [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
@ 2023-06-16 17:02 ` Edgecombe, Rick P
  12 siblings, 0 replies; 65+ messages in thread
From: Edgecombe, Rick P @ 2023-06-16 17:02 UTC (permalink / raw)
  To: linux-kernel, rppt
  Cc: tglx, mcgrof, deller, davem, nadav.amit, linux, netdev,
	linux-mips, linux-riscv, hca, catalin.marinas, kent.overstreet,
	puranjay12, linux-s390, palmer, chenhuacai, tsbogend,
	linux-trace-kernel, linux-parisc, christophe.leroy, x86, mpe,
	mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> Hi,
> 
> module_alloc() is used everywhere as a mean to allocate memory for
> code.
> 
> Beside being semantically wrong, this unnecessarily ties all
> subsystmes
> that need to allocate code, such as ftrace, kprobes and BPF to
> modules and
> puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this
> causes
> additional obstacles for improvements of code allocation.

I like how this series leaves the allocation code centralized at the
end of it because it will be much easier when we get to ROX, huge page,
text_poking() type stuff.

I guess that's the idea. I'm just catching up on what you and Song have
been up to.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/12] nios2: define virtual address space for modules
  2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
  2023-06-16 16:00   ` Edgecombe, Rick P
@ 2023-06-16 18:14   ` Song Liu
  1 sibling, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 18:14 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
> cannot reach all of vmalloc address space.
>
> Define module space as 32MiB below the kernel base and switch nios2 to
> use vmalloc for module allocations.
>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Acked-by: Dinh Nguyen <dinguyen@kernel.org>

Acked-by: Song Liu <song@kernel.org>


> ---
>  arch/nios2/include/asm/pgtable.h |  5 ++++-
>  arch/nios2/kernel/module.c       | 19 ++++---------------
>  2 files changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
> index 0f5c2564e9f5..0073b289c6a4 100644
> --- a/arch/nios2/include/asm/pgtable.h
> +++ b/arch/nios2/include/asm/pgtable.h
> @@ -25,7 +25,10 @@
>  #include <asm-generic/pgtable-nopmd.h>
>
>  #define VMALLOC_START          CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
> -#define VMALLOC_END            (CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
> +#define VMALLOC_END            (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
> +
> +#define MODULES_VADDR          (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
> +#define MODULES_END            (CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
>
>  struct mm_struct;
>
> diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
> index 76e0a42d6e36..9c97b7513853 100644
> --- a/arch/nios2/kernel/module.c
> +++ b/arch/nios2/kernel/module.c
> @@ -21,23 +21,12 @@
>
>  #include <asm/cacheflush.h>
>
> -/*
> - * Modules should NOT be allocated with kmalloc for (obvious) reasons.
> - * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
> - * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
> - * addresses in 0xc0000000)
> - */
>  void *module_alloc(unsigned long size)
>  {
> -       if (size == 0)
> -               return NULL;
> -       return kmalloc(size, GFP_KERNEL);
> -}
> -
> -/* Free memory returned from module_alloc */
> -void module_memfree(void *module_region)
> -{
> -       kfree(module_region);
> +       return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> +                                   GFP_KERNEL, PAGE_KERNEL_EXEC,
> +                                   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> +                                   __builtin_return_address(0));
>  }
>
>  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-16 16:48   ` Kent Overstreet
@ 2023-06-16 18:18     ` Song Liu
  2023-06-17  5:57     ` Mike Rapoport
  1 sibling, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 18:18 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 9:48 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
>
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.
>
> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

Acked-by: Song Liu <song@kernel.org>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem
  2023-06-16  8:50 ` [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem Mike Rapoport
@ 2023-06-16 18:36   ` Song Liu
  0 siblings, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 18:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
>
> Provide a generic implementation in execmem that uses the parameters
> for address space ranges, required alignment and page protections
> provided by architectures.
>
> The architecures must fill execmem_params structure and implement
> execmem_arch_params() that returns a pointer to that structure. This
> way the execmem initialization won't be called from every architecure,
> but rather from a central place, namely initialization of the core
> memory management.
>
> The execmem provides execmem_text_alloc() API that wraps
> __vmalloc_node_range() with the parameters defined by the architecures.
> If an architeture does not implement execmem_arch_params(),
> execmem_text_alloc() will fall back to module_alloc().
>
> The name execmem_text_alloc() emphasizes that the allocated memory is
> for executable code, the allocations of the associated data, like data
> sections of a module will use execmem_data_alloc() interface that will
> be added later.
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Acked-by: Song Liu <song@kernel.org>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
  2023-06-16 16:16   ` Edgecombe, Rick P
@ 2023-06-16 18:53   ` Song Liu
  2023-06-17  6:14     ` Mike Rapoport
  1 sibling, 1 reply; 65+ messages in thread
From: Song Liu @ 2023-06-16 18:53 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
[...]
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 5af4975caeb5..c3d999f3a3dd 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -17,56 +17,50 @@
>  #include <linux/moduleloader.h>
>  #include <linux/scs.h>
>  #include <linux/vmalloc.h>
> +#include <linux/execmem.h>
>  #include <asm/alternative.h>
>  #include <asm/insn.h>
>  #include <asm/scs.h>
>  #include <asm/sections.h>
>
> -void *module_alloc(unsigned long size)
> +static struct execmem_params execmem_params = {
> +       .modules = {
> +               .flags = EXECMEM_KASAN_SHADOW,
> +               .text = {
> +                       .alignment = MODULE_ALIGN,
> +               },
> +       },
> +};
> +
> +struct execmem_params __init *execmem_arch_params(void)
>  {
>         u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> -       gfp_t gfp_mask = GFP_KERNEL;
> -       void *p;
> -
> -       /* Silence the initial allocation */
> -       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> -               gfp_mask |= __GFP_NOWARN;
>
> -       if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> -           IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> -               /* don't exceed the static module region - see below */
> -               module_alloc_end = MODULES_END;
> +       execmem_params.modules.text.pgprot = PAGE_KERNEL;
> +       execmem_params.modules.text.start = module_alloc_base;

I think I mentioned this earlier. For arm64 with CONFIG_RANDOMIZE_BASE,
module_alloc_base is not yet set when execmem_arch_params() is
called. So we will need some extra logic for this.

Thanks,
Song


> +       execmem_params.modules.text.end = module_alloc_end;
>
> -       p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -                               module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
> -                               NUMA_NO_NODE, __builtin_return_address(0));
> -
> -       if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> +       /*
> +        * KASAN without KASAN_VMALLOC can only deal with module
> +        * allocations being served from the reserved module region,
> +        * since the remainder of the vmalloc region is already
> +        * backed by zero shadow pages, and punching holes into it
> +        * is non-trivial. Since the module region is not randomized
> +        * when KASAN is enabled without KASAN_VMALLOC, it is even
> +        * less likely that the module region gets exhausted, so we
> +        * can simply omit this fallback in that case.
> +        */
> +       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
>             (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
>              (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> -             !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
> -               /*
> -                * KASAN without KASAN_VMALLOC can only deal with module
> -                * allocations being served from the reserved module region,
> -                * since the remainder of the vmalloc region is already
> -                * backed by zero shadow pages, and punching holes into it
> -                * is non-trivial. Since the module region is not randomized
> -                * when KASAN is enabled without KASAN_VMALLOC, it is even
> -                * less likely that the module region gets exhausted, so we
> -                * can simply omit this fallback in that case.
> -                */
> -               p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -                               module_alloc_base + SZ_2G, GFP_KERNEL,
> -                               PAGE_KERNEL, 0, NUMA_NO_NODE,
> -                               __builtin_return_address(0));
> -
> -       if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> -               vfree(p);
> -               return NULL;
> +             !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
> +               unsigned long end = module_alloc_base + SZ_2G;
> +
> +               execmem_params.modules.text.fallback_start = module_alloc_base;
> +               execmem_params.modules.text.fallback_end = end;
>         }
>
> -       /* Memory is intended to be executable, reset the pointer tag. */
> -       return kasan_reset_tag(p);
> +       return &execmem_params;
>  }
>
>  enum aarch64_reloc_op {

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 05/12] modules, execmem: drop module_alloc
  2023-06-16  8:50 ` [PATCH v2 05/12] modules, execmem: drop module_alloc Mike Rapoport
@ 2023-06-16 18:56   ` Song Liu
  0 siblings, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 18:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Define default parameters for address range for code allocations using
> the current values in module_alloc() and make execmem_text_alloc() use
> these defaults when an architecure does not supply its specific
> parameters.
>
> With this, execmem_text_alloc() implements memory allocation in a way
> compatible with module_alloc() and can be used as a replacement for
> module_alloc().
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Acked-by: Song Liu <song@kernel.org>

> ---
>  include/linux/execmem.h      |  8 ++++++++
>  include/linux/moduleloader.h | 12 ------------
>  kernel/module/main.c         |  7 -------
>  mm/execmem.c                 | 12 ++++++++----
>  4 files changed, 16 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 68b2bfc79993..b9a97fcdf3c5 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -4,6 +4,14 @@
>
>  #include <linux/types.h>
>
> +#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> +               !defined(CONFIG_KASAN_VMALLOC)
> +#include <linux/kasan.h>
> +#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
> +#else
> +#define MODULE_ALIGN PAGE_SIZE
> +#endif
> +
>  /**
>   * struct execmem_range - definition of a memory range suitable for code and
>   *                       related data allocations
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index b3374342f7af..4321682fe849 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
>  /* Additional bytes needed by arch in front of individual sections */
>  unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
>
> -/* Allocator used for allocating struct module, core sections and init
> -   sections.  Returns NULL on failure. */
> -void *module_alloc(unsigned long size);
> -
>  /* Determines if the section name is an init section (that is only used during
>   * module loading).
>   */
> @@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
>  /* Any cleanup before freeing mod->module_init */
>  void module_arch_freeing_init(struct module *mod);
>
> -#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> -               !defined(CONFIG_KASAN_VMALLOC)
> -#include <linux/kasan.h>
> -#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
> -#else
> -#define MODULE_ALIGN PAGE_SIZE
> -#endif
> -
>  #endif
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 43810a3bdb81..b445c5ad863a 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
>         }
>  }
>
> -void * __weak module_alloc(unsigned long size)
> -{
> -       return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -                       GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -                       NUMA_NO_NODE, __builtin_return_address(0));
> -}
> -
>  bool __weak module_init_section(const char *name)
>  {
>         return strstarts(name, ".init");
> diff --git a/mm/execmem.c b/mm/execmem.c
> index 2fe36dcc7bdf..a67acd75ffef 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -59,9 +59,6 @@ void *execmem_text_alloc(size_t size)
>         unsigned long fallback_end = execmem_params.modules.text.fallback_end;
>         bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
>
> -       if (!execmem_params.modules.text.start)
> -               return module_alloc(size);
> -
>         return execmem_alloc(size, start, end, align, pgprot,
>                              fallback_start, fallback_end, kasan);
>  }
> @@ -108,8 +105,15 @@ void __init execmem_init(void)
>  {
>         struct execmem_params *p = execmem_arch_params();
>
> -       if (!p)
> +       if (!p) {
> +               p = &execmem_params;
> +               p->modules.text.start = VMALLOC_START;
> +               p->modules.text.end = VMALLOC_END;
> +               p->modules.text.pgprot = PAGE_KERNEL_EXEC;
> +               p->modules.text.alignment = 1;
> +
>                 return;
> +       }
>
>         if (!execmem_validate_params(p))
>                 return;
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
  2023-06-16 16:55   ` Edgecombe, Rick P
@ 2023-06-16 20:01   ` Song Liu
  2023-06-17  6:51     ` Mike Rapoport
  2023-06-18 22:32   ` Thomas Gleixner
  2 siblings, 1 reply; 65+ messages in thread
From: Song Liu @ 2023-06-16 20:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Data related to code allocations, such as module data section, need to
> comply with architecture constraints for its placement and its
> allocation right now was done using execmem_text_alloc().
>
> Create a dedicated API for allocating data related to code allocations
> and allow architectures to define address ranges for data allocations.
>
> Since currently this is only relevant for powerpc variants that use the
> VMALLOC address space for module data allocations, automatically reuse
> address ranges defined for text unless address range for data is
> explicitly defined by an architecture.
>
> With separation of code and data allocations, data sections of the
> modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC which
> was a default on many architectures.
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
[...]
>  static void free_mod_mem(struct module *mod)
> diff --git a/mm/execmem.c b/mm/execmem.c
> index a67acd75ffef..f7bf496ad4c3 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
>                              fallback_start, fallback_end, kasan);
>  }
>
> +void *execmem_data_alloc(size_t size)
> +{
> +       unsigned long start = execmem_params.modules.data.start;
> +       unsigned long end = execmem_params.modules.data.end;
> +       pgprot_t pgprot = execmem_params.modules.data.pgprot;
> +       unsigned int align = execmem_params.modules.data.alignment;
> +       unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> +       unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> +       bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
> +
> +       return execmem_alloc(size, start, end, align, pgprot,
> +                            fallback_start, fallback_end, kasan);
> +}
> +
>  void execmem_free(void *ptr)
>  {
>         /*
> @@ -101,6 +115,28 @@ static bool execmem_validate_params(struct execmem_params *p)
>         return true;
>  }
>
> +static void execmem_init_missing(struct execmem_params *p)

Shall we call this execmem_default_init_data?

> +{
> +       struct execmem_modules_range *m = &p->modules;
> +
> +       if (!pgprot_val(execmem_params.modules.data.pgprot))
> +               execmem_params.modules.data.pgprot = PAGE_KERNEL;

Do we really need to check each of these? IOW, can we do:

if (!pgprot_val(execmem_params.modules.data.pgprot)) {
       execmem_params.modules.data.pgprot = PAGE_KERNEL;
       execmem_params.modules.data.alignment = m->text.alignment;
       execmem_params.modules.data.start = m->text.start;
       execmem_params.modules.data.end = m->text.end;
       execmem_params.modules.data.fallback_start = m->text.fallback_start;
      execmem_params.modules.data.fallback_end = m->text.fallback_end;
}

Thanks,
Song

[...]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
  2023-06-16  8:50 ` [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions Mike Rapoport
@ 2023-06-16 20:05   ` Song Liu
  2023-06-17  6:57     ` Mike Rapoport
  0 siblings, 1 reply; 65+ messages in thread
From: Song Liu @ 2023-06-16 20:05 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> The memory allocations for kprobes on arm64 can be placed anywhere in
> vmalloc address space and currently this is implemented with an override
> of alloc_insn_page() in arm64.
>
> Extend execmem_params with a range for generated code allocations and
> make kprobes on arm64 use this extension rather than override
> alloc_insn_page().
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
>  arch/arm64/kernel/module.c         |  9 +++++++++
>  arch/arm64/kernel/probes/kprobes.c |  7 -------
>  include/linux/execmem.h            | 11 +++++++++++
>  mm/execmem.c                       | 14 +++++++++++++-
>  4 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index c3d999f3a3dd..52b09626bc0f 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
>                         .alignment = MODULE_ALIGN,
>                 },
>         },
> +       .jit = {
> +               .text = {
> +                       .start = VMALLOC_START,
> +                       .end = VMALLOC_END,
> +                       .alignment = 1,
> +               },
> +       },
>  };

This is growing fast. :) We have 3 now: text, data, jit. And it will be
5 when we split data into rw data, ro data, ro after init data. I wonder
whether we should still do some type enum here. But we can revisit
this topic later.

Other than that

Acked-by: Song Liu <song@kernel.org>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations
  2023-06-16  8:50 ` [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations Mike Rapoport
@ 2023-06-16 20:09   ` Song Liu
  0 siblings, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 20:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> RISC-V overrides kprobes::alloc_insn_range() to use the entire vmalloc area
> rather than limit the allocations to the modules area.
>
> Slightly reorder execmem_params initialization to support both 32 and 64
> bit variantsi and add definition of jit area to execmem_params to support
> generic kprobes::alloc_insn_page().
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Acked-by: Song Liu <song@kernel.org>


> ---
>  arch/riscv/kernel/module.c         | 16 +++++++++++++++-
>  arch/riscv/kernel/probes/kprobes.c | 10 ----------
>  2 files changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index ee5e04cd3f21..cca6ed4e9340 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -436,7 +436,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>         return 0;
>  }
>
> -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> +#ifdef CONFIG_MMU
>  static struct execmem_params execmem_params = {
>         .modules = {
>                 .text = {
> @@ -444,12 +444,26 @@ static struct execmem_params execmem_params = {
>                         .alignment = 1,
>                 },
>         },
> +       .jit = {
> +               .text = {
> +                       .pgprot = PAGE_KERNEL_READ_EXEC,
> +                       .alignment = 1,
> +               },
> +       },
>  };
>
>  struct execmem_params __init *execmem_arch_params(void)
>  {
> +#ifdef CONFIG_64BIT
>         execmem_params.modules.text.start = MODULES_VADDR;
>         execmem_params.modules.text.end = MODULES_END;
> +#else
> +       execmem_params.modules.text.start = VMALLOC_START;
> +       execmem_params.modules.text.end = VMALLOC_END;
> +#endif
> +
> +       execmem_params.jit.text.start = VMALLOC_START;
> +       execmem_params.jit.text.end = VMALLOC_END;
>
>         return &execmem_params;
>  }
> diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
> index 2f08c14a933d..e64f2f3064eb 100644
> --- a/arch/riscv/kernel/probes/kprobes.c
> +++ b/arch/riscv/kernel/probes/kprobes.c
> @@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
>         return 0;
>  }
>
> -#ifdef CONFIG_MMU
> -void *alloc_insn_page(void)
> -{
> -       return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> -                                    GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
> -                                    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> -                                    __builtin_return_address(0));
> -}
> -#endif
> -
>  /* install breakpoint in text */
>  void __kprobes arch_arm_kprobe(struct kprobe *p)
>  {
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 09/12] powerpc: extend execmem_params for kprobes allocations
  2023-06-16  8:50 ` [PATCH v2 09/12] powerpc: " Mike Rapoport
@ 2023-06-16 20:09   ` Song Liu
  0 siblings, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 20:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> powerpc overrides kprobes::alloc_insn_page() to remove writable
> permissions when STRICT_MODULE_RWX is on.
>
> Add definition of jit area to execmem_params to allow using the generic
> kprobes::alloc_insn_page() with the desired permissions.
>
> As powerpc uses breakpoint instructions to inject kprobes, it does not
> need to constrain kprobe allocations to the modules area and can use the
> entire vmalloc address space.
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Acked-by: Song Liu <song@kernel.org>


> ---
>  arch/powerpc/kernel/kprobes.c | 14 --------------
>  arch/powerpc/kernel/module.c  | 13 +++++++++++++
>  2 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 5db8df5e3657..14c5ddec3056 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
>         return (kprobe_opcode_t *)(addr + offset);
>  }
>
> -void *alloc_insn_page(void)
> -{
> -       void *page;
> -
> -       page = jit_text_alloc(PAGE_SIZE);
> -       if (!page)
> -               return NULL;
> -
> -       if (strict_module_rwx_enabled())
> -               set_memory_rox((unsigned long)page, 1);
> -
> -       return page;
> -}
> -
>  int arch_prepare_kprobe(struct kprobe *p)
>  {
>         int ret = 0;
> diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
> index 4c6c15bf3947..8e5b379d6da1 100644
> --- a/arch/powerpc/kernel/module.c
> +++ b/arch/powerpc/kernel/module.c
> @@ -96,6 +96,11 @@ static struct execmem_params execmem_params = {
>                         .alignment = 1,
>                 },
>         },
> +       .jit = {
> +               .text = {
> +                       .alignment = 1,
> +               },
> +       },
>  };
>
>
> @@ -131,5 +136,13 @@ struct execmem_params __init *execmem_arch_params(void)
>
>         execmem_params.modules.text.pgprot = prot;
>
> +       execmem_params.jit.text.start = VMALLOC_START;
> +       execmem_params.jit.text.end = VMALLOC_END;
> +
> +       if (strict_module_rwx_enabled())
> +               execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
> +       else
> +               execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
> +
>         return &execmem_params;
>  }
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES
  2023-06-16  8:50 ` [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES Mike Rapoport
@ 2023-06-16 20:17   ` Song Liu
  0 siblings, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 20:17 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> execmem does not depend on modules, on the contrary modules use
> execmem.
>
> To make execmem available when CONFIG_MODULES=n, for instance for
> kprobes, split execmem_params initialization out from
> arch/kernel/module.c and compile it when CONFIG_EXECMEM=y
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
[...]
> +
> +struct execmem_params __init *execmem_arch_params(void)
> +{
> +       u64 module_alloc_end;
> +
> +       kaslr_init();

Aha, this addresses my comment on the earlier patch. Thanks!

Acked-by: Song Liu <song@kernel.org>


> +
> +       module_alloc_end = module_alloc_base + MODULES_VSIZE;
> +
> +       execmem_params.modules.text.pgprot = PAGE_KERNEL;
> +       execmem_params.modules.text.start = module_alloc_base;
> +       execmem_params.modules.text.end = module_alloc_end;
> +
> +       execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
[...]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  2023-06-16  8:50 ` [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
@ 2023-06-16 20:18   ` Song Liu
  0 siblings, 0 replies; 65+ messages in thread
From: Song Liu @ 2023-06-16 20:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Dynamic ftrace must allocate memory for code and this was impossible
> without CONFIG_MODULES.
>
> With execmem separated from the modules code, execmem_text_alloc() is
> available regardless of CONFIG_MODULES.
>
> Remove dependency of dynamic ftrace on CONFIG_MODULES and make
> CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

Acked-by: Song Liu <song@kernel.org>

> ---
>  arch/x86/Kconfig         |  1 +
>  arch/x86/kernel/ftrace.c | 10 ----------
>  2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..ab64bbef9e50 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -35,6 +35,7 @@ config X86_64
>         select SWIOTLB
>         select ARCH_HAS_ELFCORE_COMPAT
>         select ZONE_DMA32
> +       select EXECMEM if DYNAMIC_FTRACE
>
>  config FORCE_DYNAMIC_FTRACE
>         def_bool y
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index f77c63bb3203..a824a5d3b129 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
>  /* Currently only x86_64 supports dynamic trampolines */
>  #ifdef CONFIG_X86_64
>
> -#ifdef CONFIG_MODULES
> -/* Module allocation simplifies allocating memory for code */
>  static inline void *alloc_tramp(unsigned long size)
>  {
>         return execmem_text_alloc(size);
> @@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
>  {
>         execmem_free(tramp);
>  }
> -#else
> -/* Trampolines can only be created if modules are supported */
> -static inline void *alloc_tramp(unsigned long size)
> -{
> -       return NULL;
> -}
> -static inline void tramp_free(void *tramp) { }
> -#endif
>
>  /* Defined as markers to the end of the ftrace default trampolines */
>  extern void ftrace_regs_caller_end(void);
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 01/12] nios2: define virtual address space for modules
  2023-06-16 16:00   ` Edgecombe, Rick P
@ 2023-06-17  5:52     ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  5:52 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, tglx, mcgrof, deller, davem, nadav.amit, linux,
	netdev, linux-mips, linux-riscv, hca, catalin.marinas,
	kent.overstreet, puranjay12, linux-s390, palmer, chenhuacai,
	tsbogend, linux-trace-kernel, linux-parisc, christophe.leroy,
	x86, mpe, mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, Jun 16, 2023 at 04:00:19PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> >  void *module_alloc(unsigned long size)
> >  {
> > -       if (size == 0)
> > -               return NULL;
> > -       return kmalloc(size, GFP_KERNEL);
> > -}
> > -
> > -/* Free memory returned from module_alloc */
> > -void module_memfree(void *module_region)
> > -{
> > -       kfree(module_region);
> > +       return __vmalloc_node_range(size, 1, MODULES_VADDR,
> > MODULES_END,
> > +                                   GFP_KERNEL, PAGE_KERNEL_EXEC,
> > +                                   VM_FLUSH_RESET_PERMS,
> > NUMA_NO_NODE,
> > +                                   __builtin_return_address(0));
> >  }
> >  
> >  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *s
> 
> I wonder if the (size == 0) check is really needed, but
> __vmalloc_node_range() will WARN on this case where the old code won't.

module_alloc() should not be called with zero size, so a warning there
would be appropriate.
Besides, no other module_alloc() had this check.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-16 16:48   ` Kent Overstreet
  2023-06-16 18:18     ` Song Liu
@ 2023-06-17  5:57     ` Mike Rapoport
  1 sibling, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  5:57 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Luis Chamberlain, Mark Rutland, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick Edgecombe, Russell King, Song Liu, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 12:48:02PM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > 
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> > 
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> > 
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > 
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> > 
> > The intention semantics for new allocation APIs:
> > 
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> > 
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> > 
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
> 
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.

With near and far it should mention from where and that's getting too long.
I don't mind changing the names, but I couldn't think about something
better than Song's execmem and your jit.
 
> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

Thanks!

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  2023-06-16 16:16   ` Edgecombe, Rick P
@ 2023-06-17  6:10     ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  6:10 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, tglx, mcgrof, deller, davem, nadav.amit, linux,
	netdev, linux-mips, linux-riscv, hca, catalin.marinas,
	kent.overstreet, puranjay12, linux-s390, palmer, chenhuacai,
	tsbogend, linux-trace-kernel, linux-parisc, christophe.leroy,
	x86, mpe, mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, Jun 16, 2023 at 04:16:28PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> > -void *module_alloc(unsigned long size)
> > -{
> > -       gfp_t gfp_mask = GFP_KERNEL;
> > -       void *p;
> > -
> > -       if (PAGE_ALIGN(size) > MODULES_LEN)
> > -               return NULL;
> > +static struct execmem_params execmem_params = {
> > +       .modules = {
> > +               .flags = EXECMEM_KASAN_SHADOW,
> > +               .text = {
> > +                       .alignment = MODULE_ALIGN,
> > +               },
> > +       },
> > +};
> 
> Did you consider making these execmem_params's ro_after_init? Not that
> it is security sensitive, but it's a nice hint to the reader that it is
> only modified at init. And I guess basically free sanitizing of buggy
> writes to it.

Makes sense.
 
> >  
> > -       p = __vmalloc_node_range(size, MODULE_ALIGN,
> > -                                MODULES_VADDR +
> > get_module_load_offset(),
> > -                                MODULES_END, gfp_mask, PAGE_KERNEL,
> > -                                VM_FLUSH_RESET_PERMS |
> > VM_DEFER_KMEMLEAK,
> > -                                NUMA_NO_NODE,
> > __builtin_return_address(0));
> > +struct execmem_params __init *execmem_arch_params(void)
> > +{
> > +       unsigned long start = MODULES_VADDR +
> > get_module_load_offset();
> 
> I think we can drop the mutex's in get_module_load_offset() now, since
> execmem_arch_params() should only be called once at init.

Right. Even more, the entire get_module_load_offset() can be folded into
execmem_arch_params() as 

	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && kaslr_enabled())
		module_load_offset =
			get_random_u32_inclusive(1, 1024) * PAGE_SIZE;

> >  
> > -       if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0))
> > {
> > -               vfree(p);
> > -               return NULL;
> > -       }
> > +       execmem_params.modules.text.start = start;
> > +       execmem_params.modules.text.end = MODULES_END;
> > +       execmem_params.modules.text.pgprot = PAGE_KERNEL;
> >  
> > -       return p;
> > +       return &execmem_params;
> >  }
> >  
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  2023-06-16 18:53   ` Song Liu
@ 2023-06-17  6:14     ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  6:14 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 11:53:54AM -0700, Song Liu wrote:
> On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
> [...]
> > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > index 5af4975caeb5..c3d999f3a3dd 100644
> > --- a/arch/arm64/kernel/module.c
> > +++ b/arch/arm64/kernel/module.c
> > @@ -17,56 +17,50 @@
> >  #include <linux/moduleloader.h>
> >  #include <linux/scs.h>
> >  #include <linux/vmalloc.h>
> > +#include <linux/execmem.h>
> >  #include <asm/alternative.h>
> >  #include <asm/insn.h>
> >  #include <asm/scs.h>
> >  #include <asm/sections.h>
> >
> > -void *module_alloc(unsigned long size)
> > +static struct execmem_params execmem_params = {
> > +       .modules = {
> > +               .flags = EXECMEM_KASAN_SHADOW,
> > +               .text = {
> > +                       .alignment = MODULE_ALIGN,
> > +               },
> > +       },
> > +};
> > +
> > +struct execmem_params __init *execmem_arch_params(void)
> >  {
> >         u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> > -       gfp_t gfp_mask = GFP_KERNEL;
> > -       void *p;
> > -
> > -       /* Silence the initial allocation */
> > -       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> > -               gfp_mask |= __GFP_NOWARN;
> >
> > -       if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> > -           IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> > -               /* don't exceed the static module region - see below */
> > -               module_alloc_end = MODULES_END;
> > +       execmem_params.modules.text.pgprot = PAGE_KERNEL;
> > +       execmem_params.modules.text.start = module_alloc_base;
> 
> I think I mentioned this earlier. For arm64 with CONFIG_RANDOMIZE_BASE,
> module_alloc_base is not yet set when execmem_arch_params() is
> called. So we will need some extra logic for this.

Right, this is taken care of in a later patch, but it actually belongs here.
Good catch!
 
> Thanks,
> Song
> 
> 
> > +       execmem_params.modules.text.end = module_alloc_end;
> >
> > -       p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> > -                               module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
> > -                               NUMA_NO_NODE, __builtin_return_address(0));
> > -
> > -       if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> > +       /*
> > +        * KASAN without KASAN_VMALLOC can only deal with module
> > +        * allocations being served from the reserved module region,
> > +        * since the remainder of the vmalloc region is already
> > +        * backed by zero shadow pages, and punching holes into it
> > +        * is non-trivial. Since the module region is not randomized
> > +        * when KASAN is enabled without KASAN_VMALLOC, it is even
> > +        * less likely that the module region gets exhausted, so we
> > +        * can simply omit this fallback in that case.
> > +        */
> > +       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> >             (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
> >              (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> > -             !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
> > -               /*
> > -                * KASAN without KASAN_VMALLOC can only deal with module
> > -                * allocations being served from the reserved module region,
> > -                * since the remainder of the vmalloc region is already
> > -                * backed by zero shadow pages, and punching holes into it
> > -                * is non-trivial. Since the module region is not randomized
> > -                * when KASAN is enabled without KASAN_VMALLOC, it is even
> > -                * less likely that the module region gets exhausted, so we
> > -                * can simply omit this fallback in that case.
> > -                */
> > -               p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> > -                               module_alloc_base + SZ_2G, GFP_KERNEL,
> > -                               PAGE_KERNEL, 0, NUMA_NO_NODE,
> > -                               __builtin_return_address(0));
> > -
> > -       if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> > -               vfree(p);
> > -               return NULL;
> > +             !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
> > +               unsigned long end = module_alloc_base + SZ_2G;
> > +
> > +               execmem_params.modules.text.fallback_start = module_alloc_base;
> > +               execmem_params.modules.text.fallback_end = end;
> >         }
> >
> > -       /* Memory is intended to be executable, reset the pointer tag. */
> > -       return kasan_reset_tag(p);
> > +       return &execmem_params;
> >  }
> >
> >  enum aarch64_reloc_op {

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-16 16:55   ` Edgecombe, Rick P
@ 2023-06-17  6:44     ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  6:44 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, tglx, mcgrof, deller, davem, nadav.amit, linux,
	netdev, linux-mips, linux-riscv, hca, catalin.marinas,
	kent.overstreet, puranjay12, linux-s390, palmer, chenhuacai,
	tsbogend, linux-trace-kernel, linux-parisc, christophe.leroy,
	x86, mpe, mark.rutland, rostedt, linuxppc-dev, will, dinguyen,
	naveen.n.rao, sparclinux, linux-modules, bpf, linux-arm-kernel,
	song, linux-mm, loongarch, akpm

On Fri, Jun 16, 2023 at 04:55:53PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > Data related to code allocations, such as module data section, need
> > to
> > comply with architecture constraints for its placement and its
> > allocation right now was done using execmem_text_alloc().
> > 
> > Create a dedicated API for allocating data related to code
> > allocations
> > and allow architectures to define address ranges for data
> > allocations.
> 
> Right now the cross-arch way to specify kernel memory permissions is
> encoded in the function names of all the set_memory_foo()'s. You can't
> just have unified prot names because some arch's have NX and some have
> X bits, etc. CPA wouldn't know if it needs to set or unset a bit if you
> pass in a PROT.

The idea is that allocator never uses CPA. It allocates with the
permissions defined by architecture at the first place and then the callers
might use CPA to change them. Ideally, that shouldn't be needed for
anything but ro_after_init, but we are far from there.

> But then you end up with a new function for *each* combination (i.e.
> set_memory_rox()). I wish CPA has flags like mmap() does, and I wonder
> if it makes sense here instead of execmem_data_alloc().

I don't think execmem should handle all the combinations. The code is
always allocated as ROX for architectures that support text poking and as
RWX for those that don't.

Maybe execmem_data_alloc() will need to able to handle RW and RO data
differently at some point, but that's the only variant I can think of, but
even then I don't expect CPA will be used by execmem. 

We also might move resetting the permissions to default from vmalloc to
execmem, but again, we are far from there.
 
> Maybe that is an overhaul for another day though...

CPA surely needs some love :)

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-16 20:01   ` Song Liu
@ 2023-06-17  6:51     ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  6:51 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 01:01:08PM -0700, Song Liu wrote:
> On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > Data related to code allocations, such as module data section, need to
> > comply with architecture constraints for its placement and its
> > allocation right now was done using execmem_text_alloc().
> >
> > Create a dedicated API for allocating data related to code allocations
> > and allow architectures to define address ranges for data allocations.
> >
> > Since currently this is only relevant for powerpc variants that use the
> > VMALLOC address space for module data allocations, automatically reuse
> > address ranges defined for text unless address range for data is
> > explicitly defined by an architecture.
> >
> > With separation of code and data allocations, data sections of the
> > modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC which
> > was a default on many architectures.
> >
> > Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> [...]
> >  static void free_mod_mem(struct module *mod)
> > diff --git a/mm/execmem.c b/mm/execmem.c
> > index a67acd75ffef..f7bf496ad4c3 100644
> > --- a/mm/execmem.c
> > +++ b/mm/execmem.c
> > @@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
> >                              fallback_start, fallback_end, kasan);
> >  }
> >
> > +void *execmem_data_alloc(size_t size)
> > +{
> > +       unsigned long start = execmem_params.modules.data.start;
> > +       unsigned long end = execmem_params.modules.data.end;
> > +       pgprot_t pgprot = execmem_params.modules.data.pgprot;
> > +       unsigned int align = execmem_params.modules.data.alignment;
> > +       unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> > +       unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> > +       bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
> > +
> > +       return execmem_alloc(size, start, end, align, pgprot,
> > +                            fallback_start, fallback_end, kasan);
> > +}
> > +
> >  void execmem_free(void *ptr)
> >  {
> >         /*
> > @@ -101,6 +115,28 @@ static bool execmem_validate_params(struct execmem_params *p)
> >         return true;
> >  }
> >
> > +static void execmem_init_missing(struct execmem_params *p)
> 
> Shall we call this execmem_default_init_data?

This also fills in jit.text (next patch), so _data doesn't work here :)
And it's not really a default, the defaults are set explicitly for arches
that don't have execmem_arch_params.
 
> > +{
> > +       struct execmem_modules_range *m = &p->modules;
> > +
> > +       if (!pgprot_val(execmem_params.modules.data.pgprot))
> > +               execmem_params.modules.data.pgprot = PAGE_KERNEL;
> 
> Do we really need to check each of these? IOW, can we do:
> 
> if (!pgprot_val(execmem_params.modules.data.pgprot)) {
>        execmem_params.modules.data.pgprot = PAGE_KERNEL;
>        execmem_params.modules.data.alignment = m->text.alignment;
>        execmem_params.modules.data.start = m->text.start;
>        execmem_params.modules.data.end = m->text.end;
>        execmem_params.modules.data.fallback_start = m->text.fallback_start;
>       execmem_params.modules.data.fallback_end = m->text.fallback_end;
> }

Yes, we can have a single check here.
 
> Thanks,
> Song
> 
> [...]

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES
  2023-06-16 11:44   ` Björn Töpel
@ 2023-06-17  6:52     ` Mike Rapoport
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  6:52 UTC (permalink / raw)
  To: Björn Töpel
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

On Fri, Jun 16, 2023 at 01:44:55PM +0200, Björn Töpel wrote:
> Mike Rapoport <rppt@kernel.org> writes:
> 
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> 
> I think you can remove the MODULES dependency from BPF_JIT as well:

Yeah, I think so. Thanks!
 
> --8<--
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 2dfe1079f772..fa4587027f8b 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -41,7 +41,6 @@ config BPF_JIT
>         bool "Enable BPF Just In Time compiler"
>         depends on BPF
>         depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
> -       depends on MODULES
>         help
>           BPF programs are normally handled by a BPF interpreter. This option
>           allows the kernel to generate native code when a program is loaded
> --8<--
> 
> 
> Björn

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
  2023-06-16 20:05   ` Song Liu
@ 2023-06-17  6:57     ` Mike Rapoport
  2023-06-17 15:36       ` Kent Overstreet
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-17  6:57 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 16, 2023 at 01:05:29PM -0700, Song Liu wrote:
> On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > The memory allocations for kprobes on arm64 can be placed anywhere in
> > vmalloc address space and currently this is implemented with an override
> > of alloc_insn_page() in arm64.
> >
> > Extend execmem_params with a range for generated code allocations and
> > make kprobes on arm64 use this extension rather than override
> > alloc_insn_page().
> >
> > Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> > ---
> >  arch/arm64/kernel/module.c         |  9 +++++++++
> >  arch/arm64/kernel/probes/kprobes.c |  7 -------
> >  include/linux/execmem.h            | 11 +++++++++++
> >  mm/execmem.c                       | 14 +++++++++++++-
> >  4 files changed, 33 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > index c3d999f3a3dd..52b09626bc0f 100644
> > --- a/arch/arm64/kernel/module.c
> > +++ b/arch/arm64/kernel/module.c
> > @@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
> >                         .alignment = MODULE_ALIGN,
> >                 },
> >         },
> > +       .jit = {
> > +               .text = {
> > +                       .start = VMALLOC_START,
> > +                       .end = VMALLOC_END,
> > +                       .alignment = 1,
> > +               },
> > +       },
> >  };
> 
> This is growing fast. :) We have 3 now: text, data, jit. And it will be
> 5 when we split data into rw data, ro data, ro after init data. I wonder
> whether we should still do some type enum here. But we can revisit
> this topic later.

I don't think we'd need 5. Four at most :)

I don't know yet what would be the best way to differentiate RW and RO
data, but ro_after_init surely won't need a new type. It either will be
allocated as RW and then the caller will have to set it RO after
initialization is done, or it will be allocated as RO and the caller will
have to do something like text_poke to update it.
 
> Other than that
> 
> Acked-by: Song Liu <song@kernel.org>

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
  2023-06-17  6:57     ` Mike Rapoport
@ 2023-06-17 15:36       ` Kent Overstreet
  2023-06-17 16:38         ` Song Liu
  0 siblings, 1 reply; 65+ messages in thread
From: Kent Overstreet @ 2023-06-17 15:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Sat, Jun 17, 2023 at 09:57:59AM +0300, Mike Rapoport wrote:
> > This is growing fast. :) We have 3 now: text, data, jit. And it will be
> > 5 when we split data into rw data, ro data, ro after init data. I wonder
> > whether we should still do some type enum here. But we can revisit
> > this topic later.
> 
> I don't think we'd need 5. Four at most :)
> 
> I don't know yet what would be the best way to differentiate RW and RO
> data, but ro_after_init surely won't need a new type. It either will be
> allocated as RW and then the caller will have to set it RO after
> initialization is done, or it will be allocated as RO and the caller will
> have to do something like text_poke to update it.

Perhaps ro_after_init could use the same allocation interface and share
pages with ro pages - if we just added a refcount for "this page
currently needs to be rw, module is still loading?"

text_poke() approach wouldn't be workable, you'd have to audit and fix
all module init code in the entire kernel.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
  2023-06-17 15:36       ` Kent Overstreet
@ 2023-06-17 16:38         ` Song Liu
  2023-06-17 20:37           ` Kent Overstreet
  0 siblings, 1 reply; 65+ messages in thread
From: Song Liu @ 2023-06-17 16:38 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Sat, Jun 17, 2023 at 8:37 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Sat, Jun 17, 2023 at 09:57:59AM +0300, Mike Rapoport wrote:
> > > This is growing fast. :) We have 3 now: text, data, jit. And it will be
> > > 5 when we split data into rw data, ro data, ro after init data. I wonder
> > > whether we should still do some type enum here. But we can revisit
> > > this topic later.
> >
> > I don't think we'd need 5. Four at most :)
> >
> > I don't know yet what would be the best way to differentiate RW and RO
> > data, but ro_after_init surely won't need a new type. It either will be
> > allocated as RW and then the caller will have to set it RO after
> > initialization is done, or it will be allocated as RO and the caller will
> > have to do something like text_poke to update it.
>
> Perhaps ro_after_init could use the same allocation interface and share
> pages with ro pages - if we just added a refcount for "this page
> currently needs to be rw, module is still loading?"

If we don't relax rules with read only, we will have to separate rw, ro,
and ro_after_init. But we can still have page sharing:

Two modules can put rw data on the same page.
With text poke (ro data poke to be accurate), two modules can put
ro data on the same page.

> text_poke() approach wouldn't be workable, you'd have to audit and fix
> all module init code in the entire kernel.

Agreed. For this reason, each module has to have its own page(s) for
ro_after_init data.

To eventually remove VM_FLUSH_RESET_PERMS, we want
ro_after_init data to share the same allocation interface.

Thanks,
Song

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
  2023-06-17 16:38         ` Song Liu
@ 2023-06-17 20:37           ` Kent Overstreet
  0 siblings, 0 replies; 65+ messages in thread
From: Kent Overstreet @ 2023-06-17 20:37 UTC (permalink / raw)
  To: Song Liu
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Sat, Jun 17, 2023 at 09:38:17AM -0700, Song Liu wrote:
> On Sat, Jun 17, 2023 at 8:37 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Sat, Jun 17, 2023 at 09:57:59AM +0300, Mike Rapoport wrote:
> > > > This is growing fast. :) We have 3 now: text, data, jit. And it will be
> > > > 5 when we split data into rw data, ro data, ro after init data. I wonder
> > > > whether we should still do some type enum here. But we can revisit
> > > > this topic later.
> > >
> > > I don't think we'd need 5. Four at most :)
> > >
> > > I don't know yet what would be the best way to differentiate RW and RO
> > > data, but ro_after_init surely won't need a new type. It either will be
> > > allocated as RW and then the caller will have to set it RO after
> > > initialization is done, or it will be allocated as RO and the caller will
> > > have to do something like text_poke to update it.
> >
> > Perhaps ro_after_init could use the same allocation interface and share
> > pages with ro pages - if we just added a refcount for "this page
> > currently needs to be rw, module is still loading?"
> 
> If we don't relax rules with read only, we will have to separate rw, ro,
> and ro_after_init. But we can still have page sharing:
> 
> Two modules can put rw data on the same page.
> With text poke (ro data poke to be accurate), two modules can put
> ro data on the same page.
> 
> > text_poke() approach wouldn't be workable, you'd have to audit and fix
> > all module init code in the entire kernel.
> 
> Agreed. For this reason, each module has to have its own page(s) for
> ro_after_init data.

Relaxing page permissions to allow for page sharing could also be a
config option. For archs with 64k pages it seems worthwhile.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
  2023-06-16 16:48   ` Kent Overstreet
@ 2023-06-17 20:38   ` Andy Lutomirski
  2023-06-18  8:00     ` Mike Rapoport
  2023-06-19 11:34     ` Kent Overstreet
  1 sibling, 2 replies; 65+ messages in thread
From: Andy Lutomirski @ 2023-06-17 20:38 UTC (permalink / raw)
  To: Mike Rapoport, Linux Kernel Mailing List
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
>
> The intention semantics for new allocation APIs:
>
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
>
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
>

Is there anything in this series to help users do the appropriate synchronization when the actually populate the allocated memory with code?  See here, for example:

https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab405@app.fastmail.com/T/#u

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-17 20:38   ` Andy Lutomirski
@ 2023-06-18  8:00     ` Mike Rapoport
  2023-06-19 17:09       ` Andy Lutomirski
  2023-06-19 11:34     ` Kent Overstreet
  1 sibling, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-18  8:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Kent Overstreet, Luis Chamberlain,
	Mark Rutland, Michael Ellerman, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick P Edgecombe,
	Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate
> synchronization when the actually populate the allocated memory with
> code?  See here, for example:

This series only factors out the executable allocations from modules and
puts them in a central place.
Anything else would go on top after this lands.
 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab405@app.fastmail.com/T/#u

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
  2023-06-16 16:55   ` Edgecombe, Rick P
  2023-06-16 20:01   ` Song Liu
@ 2023-06-18 22:32   ` Thomas Gleixner
  2023-06-18 23:14     ` Kent Overstreet
  2023-06-19 15:23     ` Mike Rapoport
  2 siblings, 2 replies; 65+ messages in thread
From: Thomas Gleixner @ 2023-06-18 22:32 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Mike Rapoport, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

Mike!

Sorry for being late on this ...

On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote:
>  
> +void *execmem_data_alloc(size_t size)
> +{
> +	unsigned long start = execmem_params.modules.data.start;
> +	unsigned long end = execmem_params.modules.data.end;
> +	pgprot_t pgprot = execmem_params.modules.data.pgprot;
> +	unsigned int align = execmem_params.modules.data.alignment;
> +	unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> +	unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> +	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;

While I know for sure that you read up on the discussion I had with Song
about data structures, it seems you completely failed to understand it.

> +	return execmem_alloc(size, start, end, align, pgprot,
> +			     fallback_start, fallback_end, kasan);

Having _seven_ intermediate variables to fill _eight_ arguments of a
function instead of handing in @size and a proper struct pointer is
tasteless and disgusting at best.

Six out of those seven parameters are from:

    execmem_params.module.data

while the KASAN shadow part is retrieved from

    execmem_params.module.flags

So what prevents you from having a uniform data structure, which is
extensible and decribes _all_ types of allocations?

Absolutely nothing. The flags part can either be in the type dependend
part or you make the type configs an array as I had suggested originally
and then execmem_alloc() becomes:

void *execmem_alloc(type, size)

and

static inline void *execmem_data_alloc(size_t size)
{
        return execmem_alloc(EXECMEM_TYPE_DATA, size);
}

which gets the type independent parts from @execmem_param.

Just read through your own series and watch the evolution of
execmem_alloc():

  static void *execmem_alloc(size_t size)

  static void *execmem_alloc(size_t size, unsigned long start,
                             unsigned long end, unsigned int align,
                             pgprot_t pgprot)

  static void *execmem_alloc(size_t len, unsigned long start,
                             unsigned long end, unsigned int align,
                             pgprot_t pgprot,
                             unsigned long fallback_start,
                             unsigned long fallback_end,
                             bool kasan)

In a month from now this function will have _ten_ parameters and tons of
horrible wrappers which convert an already existing data structure into
individual function arguments.

Seriously?

If you want this function to be [ab]used outside of the exec_param
configuration space for whatever non-sensical reasons then this still
can be either:

void *execmem_alloc(params, type, size)

static inline void *execmem_data_alloc(size_t size)
{
        return execmem_alloc(&exec_param, EXECMEM_TYPE_DATA, size);
}

or

void *execmem_alloc(type_params, size);

static inline void *execmem_data_alloc(size_t size)
{
        return execmem_alloc(&exec_param.data, size);
}

which both allows you to provide alternative params, right?

Coming back to my conversation with Song:

   "Bad programmers worry about the code. Good programmers worry about
    data structures and their relationships."

You might want to reread:

    https://lore.kernel.org/r/87lenuukj0.ffs@tglx

and the subsequent thread.

The fact that my suggestions had a 'mod_' namespace prefix does not make
any of my points moot.

Song did an extremly good job in abstracting things out, but you decided
to ditch his ground work instead of building on it and keeping the good
parts. That's beyond sad.

Worse, you went the full 'not invented here' path with an outcome which is
even worse than the original hackery from Song which started the above
referenced thread.

I don't know what caused you to decide that ad hoc hackery is better
than proper data structure based design patterns. I actually don't want
to know.

As much as my voice counts:

  NAK-ed-by: Thomas Gleixner <tglx@linutronix.de>

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-18 22:32   ` Thomas Gleixner
@ 2023-06-18 23:14     ` Kent Overstreet
  2023-06-19  0:43       ` Thomas Gleixner
  2023-06-19 15:23     ` Mike Rapoport
  1 sibling, 1 reply; 65+ messages in thread
From: Kent Overstreet @ 2023-06-18 23:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote:
> Mike!
> 
> Sorry for being late on this ...
> 
> On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote:
> >  
> > +void *execmem_data_alloc(size_t size)
> > +{
> > +	unsigned long start = execmem_params.modules.data.start;
> > +	unsigned long end = execmem_params.modules.data.end;
> > +	pgprot_t pgprot = execmem_params.modules.data.pgprot;
> > +	unsigned int align = execmem_params.modules.data.alignment;
> > +	unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> > +	unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> > +	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
> 
> While I know for sure that you read up on the discussion I had with Song
> about data structures, it seems you completely failed to understand it.
> 
> > +	return execmem_alloc(size, start, end, align, pgprot,
> > +			     fallback_start, fallback_end, kasan);
> 
> Having _seven_ intermediate variables to fill _eight_ arguments of a
> function instead of handing in @size and a proper struct pointer is
> tasteless and disgusting at best.
> 
> Six out of those seven parameters are from:
> 
>     execmem_params.module.data
> 
> while the KASAN shadow part is retrieved from
> 
>     execmem_params.module.flags
> 
> So what prevents you from having a uniform data structure, which is
> extensible and decribes _all_ types of allocations?
> 
> Absolutely nothing. The flags part can either be in the type dependend
> part or you make the type configs an array as I had suggested originally
> and then execmem_alloc() becomes:
> 
> void *execmem_alloc(type, size)
> 
> and
> 
> static inline void *execmem_data_alloc(size_t size)
> {
>         return execmem_alloc(EXECMEM_TYPE_DATA, size);
> }
> 
> which gets the type independent parts from @execmem_param.
> 
> Just read through your own series and watch the evolution of
> execmem_alloc():
> 
>   static void *execmem_alloc(size_t size)
> 
>   static void *execmem_alloc(size_t size, unsigned long start,
>                              unsigned long end, unsigned int align,
>                              pgprot_t pgprot)
> 
>   static void *execmem_alloc(size_t len, unsigned long start,
>                              unsigned long end, unsigned int align,
>                              pgprot_t pgprot,
>                              unsigned long fallback_start,
>                              unsigned long fallback_end,
>                              bool kasan)
> 
> In a month from now this function will have _ten_ parameters and tons of
> horrible wrappers which convert an already existing data structure into
> individual function arguments.
> 
> Seriously?
> 
> If you want this function to be [ab]used outside of the exec_param
> configuration space for whatever non-sensical reasons then this still
> can be either:
> 
> void *execmem_alloc(params, type, size)
> 
> static inline void *execmem_data_alloc(size_t size)
> {
>         return execmem_alloc(&exec_param, EXECMEM_TYPE_DATA, size);
> }
> 
> or
> 
> void *execmem_alloc(type_params, size);
> 
> static inline void *execmem_data_alloc(size_t size)
> {
>         return execmem_alloc(&exec_param.data, size);
> }
> 
> which both allows you to provide alternative params, right?
> 
> Coming back to my conversation with Song:
> 
>    "Bad programmers worry about the code. Good programmers worry about
>     data structures and their relationships."

Thomas, you're confusing an internal interface with external, I made the
same mistake reviewing Song's patchset...

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-18 23:14     ` Kent Overstreet
@ 2023-06-19  0:43       ` Thomas Gleixner
  2023-06-19  2:12         ` Kent Overstreet
  2023-06-20 14:51         ` Steven Rostedt
  0 siblings, 2 replies; 65+ messages in thread
From: Thomas Gleixner @ 2023-06-19  0:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

Kent!

On Sun, Jun 18 2023 at 19:14, Kent Overstreet wrote:
> On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote:
>
> Thomas, you're confusing an internal interface with external

No. I am not.

Whether that's an internal function or not does not make any difference
at all.

Having a function growing _eight_ parameters where _six_ of them are
derived from a well defined data structure is a clear sign of design
fail.

It's not rocket science to do:

struct generic_allocation_info {
       ....
};

struct execmem_info {
       ....
       struct generic_allocation_info alloc_info;
;

struct execmem_param {
       ...
       struct execmem_info[NTYPES];
};

and having a function which can either operate on execmem_param and type
or on generic_allocation_info itself. It does not matter as long as the
data structure which is handed into this internal function is
describing it completely or needs a supplementary argument, i.e. flags.

Having tons of wrappers which do:

       a = generic_info.a;
       b = generic_info.b;
       ....
       n = generic_info.n;

       internal_func(a, b, ....,, n);

is just hillarious and to repeat myself tasteless and therefore
disgusting.

That's CS course first semester hackery, but TBH, I can only tell from
imagination because I did not take CS courses - maybe that's the
problem...

Data structure driven design works not from the usage site down to the
internals. It's the other way round:

  1) Define a data structure which describes what the internal function
     needs to know

  2) Implement use case specific variants which describe that

  3) Hand the use case specific variant to the internal function
     eventually with some minimal supplementary information.

Object oriented basics, right?

There is absolutely nothing wrong with having

      internal_func(generic_info *, modifier);

but having

      internal_func(a, b, ... n)

is fundamentally wrong in the context of an extensible and future proof
internal function.

Guess what. Today it's sufficient to have _eight_ arguments and we are
happy to have 10 nonsensical wrappers around this internal
function. Tomorrow there happens to be a use case which needs another
argument so you end up:

  Changing 10 wrappers plus the function declaration and definition in
  one go

instead of adding

  One new naturally 0 initialized member to generic_info and be done
  with it.

Look at the evolution of execmem_alloc() in this very pathset which I
pointed out. That very patchset covers _two_ of at least _six_ cases
Song and myself identified. It already had _three_ steps of evolution
from _one_ to _five_ to _eight_ parameters.

C is not like e.g. python where you can "solve" that problem by simply
doing:

- internal_func(a, b, c):
+ internal_func(a, b, c, d=None, ..., n=None):

But that's not a solution either. That's a horrible workaround even in
python once your parameter space gets sufficiently large. The way out in
python is to have **kwargs. But that's not an option in C, and not
necessarily the best option for python either.

Even in python or any other object oriented language you get to the
point where you have to rethink your approach, go back to the drawing
board and think about data representation.

But creating a new interface based on "let's see what we need over
time and add parameters as we see fit" is simply wrong to begin with
independent of the programming language.

Even if the _eight_ parameters are the end of the range, then they are
beyond justifyable because that's way beyond the natural register
argument space of any architecture and you are offloading your lazyness
to wrappers and the compiler to emit pointlessly horrible code.

There is a reason why new syscalls which need more than a few parameters
are based on 'struct DATA_WHICH_I_NEED_TO_KNOW' and 'flags'.

We've got burned on the non-extensibilty often enough. Why would a new
internal function have any different requirements especially as it is
neither implemented to the full extent nor a hotpath function?

Now you might argue that it _is_ a "hotpath" due to the BPF usage, but
then even more so as any intermediate wrapper which converts from one
data representation to another data representation is not going to
increase performance, right?

> ... I made the same mistake reviewing Song's patchset...

Songs series had rough edges, but was way more data structure driven
and palatable than this hackery.

The fact that you made a mistake while reviewing Songs series has
absolutely nothing to do with the above or my previous reply to Mike.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-19  0:43       ` Thomas Gleixner
@ 2023-06-19  2:12         ` Kent Overstreet
  2023-06-20 14:51         ` Steven Rostedt
  1 sibling, 0 replies; 65+ messages in thread
From: Kent Overstreet @ 2023-06-19  2:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 19, 2023 at 02:43:58AM +0200, Thomas Gleixner wrote:
> Kent!

Hi Thomas :)

> No. I am not.

Ok.

> Whether that's an internal function or not does not make any difference
> at all.

Well, at the risk of this discussion going completely off the rails, I
have to disagree with you there. External interfaces and high level
semantics are more important to get right from the outset, internal
implementation details can be cleaned up later, within reason.

And the discussion on this patchset has been more focused on those
external interfaces, which seems like the right approach to me.

> > ... I made the same mistake reviewing Song's patchset...
> 
> Songs series had rough edges, but was way more data structure driven
> and palatable than this hackery.

I liked that aspect of Song's patchset too, and I'm actually inclined to
agree with you that this patchset might get a bit cleaner with more of
that, but really, this semes like just quibbling over calling convention
for an internal helper function.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-17 20:38   ` Andy Lutomirski
  2023-06-18  8:00     ` Mike Rapoport
@ 2023-06-19 11:34     ` Kent Overstreet
  1 sibling, 0 replies; 65+ messages in thread
From: Kent Overstreet @ 2023-06-19 11:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Rapoport, Linux Kernel Mailing List, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Mark Rutland, Michael Ellerman, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick P Edgecombe,
	Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate synchronization when the actually populate the allocated memory with code?  See here, for example:
> 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab405@app.fastmail.com/T/#u

We're still in need of an arch independent text_poke() api.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-18 22:32   ` Thomas Gleixner
  2023-06-18 23:14     ` Kent Overstreet
@ 2023-06-19 15:23     ` Mike Rapoport
  1 sibling, 0 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-19 15:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Mark Rutland,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick Edgecombe, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote:
> Mike!
> 
> Sorry for being late on this ...
> 
> On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote:
> 
> The fact that my suggestions had a 'mod_' namespace prefix does not make
> any of my points moot.

The prefix does not matter. What matters is what we are trying to abstract.
Your suggestion is based of the memory used by modules. I'm abstracting
address spaces for different types of executable and related memory. They
are similar, yes, but they are not the same.

The TEXT, INIT_TEXT and *_DATA do not match to what we have from arch POV.
They have modules with text, rw data, ro data and ro after init data and
the memory for the generated code. The memory for modules and memory for
other users have different restrictions for their placement, so using a
single TEXT type for them is semantically wrong. BPF and kprobes do not
necessarily must be at the same address range as modules and init text does
not differ from normal text.

> Song did an extremly good job in abstracting things out, but you decided
> to ditch his ground work instead of building on it and keeping the good
> parts. That's beyond sad.

Actually not. The core idea to describe address range suitable for code
allocations with a structure and have arch code initialize this structure
at boot and be done with it is the same. But I don't think vmalloc
parameters belong there, they should be completely encapsulated in the
allocator. Having fallback range named explicitly is IMO clearer than an
array of address spaces.

I accept your point that the structures describing ranges for different
types should be unified and I've got carried away with making the wrappers
to convert that structure to parameters to the core allocation function.

I've chosen to define ranges as fields in the containing structure rather
than enum with types and an array because I strongly feel that the callers
should not care about these parameters. These parameters are defined by
architecture and the callers should not need to know how each and every
arch defines restrictions suitable for modules, bpf or kprobes.

That's also the reason to have different names for API calls, exactly to
avoid having alloc(KPROBES,...), alloc(BPF, ...), alloc(MODULES, ...) an so
on.

All in all, if I filter all the ranting, this boils down to having a
unified structure for all the address ranges and passing this structure
from the wrappers to the core alloc as is rather that translating it to
separate parameters, with which I agree.

> Thanks,
> 
>         tglx

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-18  8:00     ` Mike Rapoport
@ 2023-06-19 17:09       ` Andy Lutomirski
  2023-06-19 20:18         ` Nadav Amit
                           ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Andy Lutomirski @ 2023-06-19 17:09 UTC (permalink / raw)
  To: Mike Rapoport, Mark Rutland, Kees Cook
  Cc: Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Kent Overstreet, Luis Chamberlain,
	Mark Rutland, Michael Ellerman, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick P Edgecombe,
	Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the other hand, this is probably the right time to at least start thinking about synchronization, at least to the extent that it might make us want to change this API.  (I'm not at all saying that this series should require changes -- I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look like the one think in the Linux ecosystem that actually intelligently and efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and jump to it with minimal synchronization (just the standard implicit ordering in the kernel that populates the pages before setting up the PTEs and whatever user synchronization is needed to avoid jumping into the mapping before mmap() finishes).  It works across CPUs, and the only possible way userspace can screw it up (for a read-only mapping of read-only text, anyway) is to jump to the mapping too early, in which case userspace gets a page fault.  Incoherence is impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other architectures, too, although I think more cache management is needed on the kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather fraught, and I bet many things do it wrong when userspace is multithreaded.  But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra heavy weight RCU to get "serialization") or send an IPI.

This is slower than the alloc, write, map solution, but allows alternatives to be applied at the final address.


Even fancier variants where the writing is some using something like use_temporary_mm() might even make sense.


To what extent does performance matter for the various users?  module loading is slow, and I don't think we care that much.  eBPF loaded is not super fast, and we care to a limited extent.  I *think* the bcachefs use case needs to be very fast, but I'm not sure it can be fast and supportable.

Anyway, food for thought.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-19 17:09       ` Andy Lutomirski
@ 2023-06-19 20:18         ` Nadav Amit
  2023-06-20 17:24           ` Andy Lutomirski
  2023-06-25 16:14         ` Mike Rapoport
  2023-06-26 13:01         ` Mark Rutland
  2 siblings, 1 reply; 65+ messages in thread
From: Nadav Amit @ 2023-06-19 20:18 UTC (permalink / raw)
  To: Andy Lutomirski, Song Liu
  Cc: Mike Rapoport, Mark Rutland, Kees Cook,
	Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Kent Overstreet, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers



> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

Andy, would you mind explaining why you think a sync is not needed? I mean I have a “feeling” that perhaps TSO can guarantee something based on the order of write and page-table update. Is that the argument?

On this regard, one thing that I clearly do not understand is why *today* it is ok for users of bpf_arch_text_copy() not to call text_poke_sync(). Am I missing something?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-19  0:43       ` Thomas Gleixner
  2023-06-19  2:12         ` Kent Overstreet
@ 2023-06-20 14:51         ` Steven Rostedt
  2023-06-20 15:32           ` Alexei Starovoitov
  1 sibling, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2023-06-20 14:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kent Overstreet, Mike Rapoport, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Mark Rutland, Michael Ellerman, Nadav Amit, Naveen N. Rao,
	Palmer Dabbelt, Puranjay Mohan, Rick Edgecombe, Russell King,
	Song Liu, Thomas Bogendoerfer, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, 19 Jun 2023 02:43:58 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Now you might argue that it _is_ a "hotpath" due to the BPF usage, but
> then even more so as any intermediate wrapper which converts from one
> data representation to another data representation is not going to
> increase performance, right?

Just as a side note. BPF can not attach its return calling code to
functions that have more than 6 parameters (3 on 32 bit x86), because of
the way BPF return path trampoline works. It is a requirement that all
parameters live in registers, and none on the stack.

-- Steve

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
  2023-06-20 14:51         ` Steven Rostedt
@ 2023-06-20 15:32           ` Alexei Starovoitov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2023-06-20 15:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Kent Overstreet, Mike Rapoport, LKML,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Luis Chamberlain, Mark Rutland, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick Edgecombe, Russell King, Song Liu, Thomas Bogendoerfer,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, ppc-dev, loongarch, Network Development,
	sparclinux, X86 ML

On Tue, Jun 20, 2023 at 7:51 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 19 Jun 2023 02:43:58 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
> > Now you might argue that it _is_ a "hotpath" due to the BPF usage, but
> > then even more so as any intermediate wrapper which converts from one
> > data representation to another data representation is not going to
> > increase performance, right?
>
> Just as a side note. BPF can not attach its return calling code to
> functions that have more than 6 parameters (3 on 32 bit x86), because of
> the way BPF return path trampoline works. It is a requirement that all
> parameters live in registers, and none on the stack.

It's actually 7 and that restriction is being lifted.
The patch set to attach to <= 12 is being discussed.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-19 20:18         ` Nadav Amit
@ 2023-06-20 17:24           ` Andy Lutomirski
  0 siblings, 0 replies; 65+ messages in thread
From: Andy Lutomirski @ 2023-06-20 17:24 UTC (permalink / raw)
  To: Nadav Amit, Song Liu
  Cc: Mike Rapoport, Mark Rutland, Kees Cook,
	Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Kent Overstreet, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers



On Mon, Jun 19, 2023, at 1:18 PM, Nadav Amit wrote:
>> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)
>
> Andy, would you mind explaining why you think a sync is not needed? I 
> mean I have a “feeling” that perhaps TSO can guarantee something based 
> on the order of write and page-table update. Is that the argument?

Sorry, when I say "no sync" I mean no cross-CPU synchronization.  I'm assuming the underlying sequence of events is:

allocate physical pages (jit_text_alloc)

write to them (with MOV, memcpy, whatever), via the direct map or via a temporary mm

do an appropriate *local* barrier (which, on x86, is probably implied by TSO, as the subsequent pagetable change is at least a release; also, any any previous temporary mm stuff would have done MOV CR3 afterwards, which is a full "serializing" barrier)

optionally zap the direct map via IPI, assuming the pages are direct mapped (but this could be avoided with a smart enough allocator and temporary_mm above)

install the final RX PTE (jit_text_map), which does a MOV or maybe a LOCK CMPXCHG16B.  Note that the virtual address in question was not readable or executable before this, and all CPUs have serialized since the last time it was executable.

either jump to the new text locally, or:

1. Do a store-release to tell other CPUs that the text is mapped
2. Other CPU does a load-acquire to detect that the text is mapped and jumps to the text

This is all approximately the same thing that plain old mmap(..., PROT_EXEC, ...) does.

>
> On this regard, one thing that I clearly do not understand is why 
> *today* it is ok for users of bpf_arch_text_copy() not to call 
> text_poke_sync(). Am I missing something?

I cannot explain this, because I suspect the current code is wrong.  But it's only wrong across CPUs, because bpf_arch_text_copy goes through text_poke_copy, which calls unuse_temporary_mm(), which is serializing.  And it's plausible that most eBPF use cases don't actually cause the loaded program to get used on a different CPU without first serializing on the CPU that ends up using it.  (Context switches and interrupts are serializing.)

FRED could make interrupts non-serializing. I sincerely hope that FRED doesn't cause this all to fall apart.

--Andy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-19 17:09       ` Andy Lutomirski
  2023-06-19 20:18         ` Nadav Amit
@ 2023-06-25 16:14         ` Mike Rapoport
  2023-06-25 16:59           ` Andy Lutomirski
  2023-06-26 12:31           ` Mark Rutland
  2023-06-26 13:01         ` Mark Rutland
  2 siblings, 2 replies; 65+ messages in thread
From: Mike Rapoport @ 2023-06-25 16:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Rutland, Kees Cook, Linux Kernel Mailing List,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> 
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On
> the other hand, this is probably the right time to at least start
> thinking about synchronization, at least to the extent that it might make
> us want to change this API.  (I'm not at all saying that this series
> should require changes -- I'm just saying that this is a good time to
> think about how this should work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually
> intelligently and efficiently maps new text into an address space:
> mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> and jump to it with minimal synchronization (just the standard implicit
> ordering in the kernel that populates the pages before setting up the
> PTEs and whatever user synchronization is needed to avoid jumping into
> the mapping before mmap() finishes).  It works across CPUs, and the only
> possible way userspace can screw it up (for a read-only mapping of
> read-only text, anyway) is to jump to the mapping too early, in which
> case userspace gets a page fault.  Incoherence is impossible, and no one
> needs to "serialize" (in the SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on
> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> IPI to map executable text into usermode, but I could easily be wrong.
> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> remember the details.)
> 
> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> rather fraught, and I bet many things do it wrong when userspace is
> multithreaded.  But not in production because it's mostly not used in
> production.)
> 
> But jit_text_alloc() can't do this, because the order of operations
> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> before the text is populated, so there is no atomic change from not-there
> to populated-and-executable.  Which means that there is an opportunity
> for CPUs, speculatively or otherwise, to start filling various caches
> with intermediate states of the text, which means that various
> architectures (even x86!) may need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address,
> but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on
> x86, I think)
> 
> could be more efficient and/or safer.
> 
> (Modules could use this too.  Getting alternatives right might take some
> fiddling, because off the top of my head, this doesn't match how it works
> now.)
> 
> To make alternatives easier, this could work, maybe (haven't fully
> thought it through):
> 
> jit_text_alloc()
> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> 
> write the text and apply alternatives
> 
> jit_text_finalize() -> change from RW to RX *and synchronize*
> 
> jit_text_finalize() would either need to wait for RCU (possibly extra
> heavy weight RCU to get "serialization") or send an IPI.

This essentially how modules work now. The memory is allocated RW, written
and updated with alternatives and then made ROX in the end with set_memory
APIs.

The issue with not having the memory mapped X when it's written is that we
cannot use large pages to map it. One of the goals is to have executable
memory mapped with large pages and make code allocator able to divide that
page among several callers.

So the idea was that jit_text_alloc() will have a cache of large pages
mapped ROX, will allocate memory from those caches and there will be
jit_update() that uses text poking for writing to that memory.

Upon allocation of a large page to increase the cache, that large page will
be "invalidated" by filling it with breakpoint instructions (e.g int3 on
x86)

To improve the performance of this process, we can write to !X copy and
then text_poke it to the actual address in one go. This will require some
changes to get the alternatives right.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-25 16:14         ` Mike Rapoport
@ 2023-06-25 16:59           ` Andy Lutomirski
  2023-06-25 17:42             ` Mike Rapoport
  2023-06-26 12:31           ` Mark Rutland
  1 sibling, 1 reply; 65+ messages in thread
From: Andy Lutomirski @ 2023-06-25 16:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mark Rutland, Kees Cook, Linux Kernel Mailing List,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers



On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
>> 
>> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
>> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >> >
>> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >> >
>> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> >> > and puts the burden of code allocation to the modules code.
>> >> >
>> >> > Several architectures override module_alloc() because of various
>> >> > constraints where the executable memory can be located and this causes
>> >> > additional obstacles for improvements of code allocation.
>> >> >
>> >> > Start splitting code allocation from modules by introducing
>> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >> >
>> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> >> > module_alloc() and execmem_free() and jit_free() are replacements of
>> >> > module_memfree() to allow updating all call sites to use the new APIs.
>> >> >
>> >> > The intention semantics for new allocation APIs:
>> >> >
>> >> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >> >   close to the kernel image, like loadable kernel modules and generated
>> >> >   code that is restricted by relative addressing.
>> >> >
>> >> > * jit_text_alloc() should be used to allocate memory for generated code
>> >> >   when there are no restrictions for the code placement. For
>> >> >   architectures that require that any code is within certain distance
>> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >> >   execmem_text_alloc().
>> >> >
>> >> 
>> >> Is there anything in this series to help users do the appropriate
>> >> synchronization when the actually populate the allocated memory with
>> >> code?  See here, for example:
>> >
>> > This series only factors out the executable allocations from modules and
>> > puts them in a central place.
>> > Anything else would go on top after this lands.
>> 
>> Hmm.
>> 
>> On the one hand, there's nothing wrong with factoring out common code. On
>> the other hand, this is probably the right time to at least start
>> thinking about synchronization, at least to the extent that it might make
>> us want to change this API.  (I'm not at all saying that this series
>> should require changes -- I'm just saying that this is a good time to
>> think about how this should work.)
>> 
>> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
>> look like the one think in the Linux ecosystem that actually
>> intelligently and efficiently maps new text into an address space:
>> mmap().
>> 
>> On x86, you can mmap() an existing file full of executable code PROT_EXEC
>> and jump to it with minimal synchronization (just the standard implicit
>> ordering in the kernel that populates the pages before setting up the
>> PTEs and whatever user synchronization is needed to avoid jumping into
>> the mapping before mmap() finishes).  It works across CPUs, and the only
>> possible way userspace can screw it up (for a read-only mapping of
>> read-only text, anyway) is to jump to the mapping too early, in which
>> case userspace gets a page fault.  Incoherence is impossible, and no one
>> needs to "serialize" (in the SDM sense).
>> 
>> I think the same sequence (from userspace's perspective) works on other
>> architectures, too, although I think more cache management is needed on
>> the kernel's end.  As far as I know, no Linux SMP architecture needs an
>> IPI to map executable text into usermode, but I could easily be wrong.
>> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
>> remember the details.)
>> 
>> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
>> rather fraught, and I bet many things do it wrong when userspace is
>> multithreaded.  But not in production because it's mostly not used in
>> production.)
>> 
>> But jit_text_alloc() can't do this, because the order of operations
>> doesn't match.  With jit_text_alloc(), the executable mapping shows up
>> before the text is populated, so there is no atomic change from not-there
>> to populated-and-executable.  Which means that there is an opportunity
>> for CPUs, speculatively or otherwise, to start filling various caches
>> with intermediate states of the text, which means that various
>> architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite
>> coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address,
>> but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on
>> x86, I think)
>> 
>> could be more efficient and/or safer.
>> 
>> (Modules could use this too.  Getting alternatives right might take some
>> fiddling, because off the top of my head, this doesn't match how it works
>> now.)
>> 
>> To make alternatives easier, this could work, maybe (haven't fully
>> thought it through):
>> 
>> jit_text_alloc()
>> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
>> 
>> write the text and apply alternatives
>> 
>> jit_text_finalize() -> change from RW to RX *and synchronize*
>> 
>> jit_text_finalize() would either need to wait for RCU (possibly extra
>> heavy weight RCU to get "serialization") or send an IPI.
>
> This essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
>
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make code allocator able to divide that
> page among several callers.
>
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
>
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> x86)

Is this actually valid?  In between int3 and real code, there’s a potential torn read of real code mixed up with 0xcc.

>
> To improve the performance of this process, we can write to !X copy and
> then text_poke it to the actual address in one go. This will require some
> changes to get the alternatives right.
>
> -- 
> Sincerely yours,
> Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-25 16:59           ` Andy Lutomirski
@ 2023-06-25 17:42             ` Mike Rapoport
  2023-06-25 18:07               ` Kent Overstreet
  0 siblings, 1 reply; 65+ messages in thread
From: Mike Rapoport @ 2023-06-25 17:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Rutland, Kees Cook, Linux Kernel Mailing List,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> 
> 
> On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> >> 
> >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >> >
> >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >> >
> >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> >> > and puts the burden of code allocation to the modules code.
> >> >> >
> >> >> > Several architectures override module_alloc() because of various
> >> >> > constraints where the executable memory can be located and this causes
> >> >> > additional obstacles for improvements of code allocation.
> >> >> >
> >> >> > Start splitting code allocation from modules by introducing
> >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >> >
> >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >> >
> >> >> > The intention semantics for new allocation APIs:
> >> >> >
> >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >> >   code that is restricted by relative addressing.
> >> >> >
> >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >> >   when there are no restrictions for the code placement. For
> >> >> >   architectures that require that any code is within certain distance
> >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >> >   execmem_text_alloc().
> >> >> >
> >> >> 
> >> >> Is there anything in this series to help users do the appropriate
> >> >> synchronization when the actually populate the allocated memory with
> >> >> code?  See here, for example:
> >> >
> >> > This series only factors out the executable allocations from modules and
> >> > puts them in a central place.
> >> > Anything else would go on top after this lands.
> >> 
> >> Hmm.
> >> 
> >> On the one hand, there's nothing wrong with factoring out common code. On
> >> the other hand, this is probably the right time to at least start
> >> thinking about synchronization, at least to the extent that it might make
> >> us want to change this API.  (I'm not at all saying that this series
> >> should require changes -- I'm just saying that this is a good time to
> >> think about how this should work.)
> >> 
> >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> >> look like the one think in the Linux ecosystem that actually
> >> intelligently and efficiently maps new text into an address space:
> >> mmap().
> >> 
> >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> >> and jump to it with minimal synchronization (just the standard implicit
> >> ordering in the kernel that populates the pages before setting up the
> >> PTEs and whatever user synchronization is needed to avoid jumping into
> >> the mapping before mmap() finishes).  It works across CPUs, and the only
> >> possible way userspace can screw it up (for a read-only mapping of
> >> read-only text, anyway) is to jump to the mapping too early, in which
> >> case userspace gets a page fault.  Incoherence is impossible, and no one
> >> needs to "serialize" (in the SDM sense).
> >> 
> >> I think the same sequence (from userspace's perspective) works on other
> >> architectures, too, although I think more cache management is needed on
> >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> >> IPI to map executable text into usermode, but I could easily be wrong.
> >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> >> remember the details.)
> >> 
> >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> >> rather fraught, and I bet many things do it wrong when userspace is
> >> multithreaded.  But not in production because it's mostly not used in
> >> production.)
> >> 
> >> But jit_text_alloc() can't do this, because the order of operations
> >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> >> before the text is populated, so there is no atomic change from not-there
> >> to populated-and-executable.  Which means that there is an opportunity
> >> for CPUs, speculatively or otherwise, to start filling various caches
> >> with intermediate states of the text, which means that various
> >> architectures (even x86!) may need serialization.
> >> 
> >> For eBPF- and module- like use cases, where JITting/code gen is quite
> >> coarse-grained, perhaps something vaguely like:
> >> 
> >> jit_text_alloc() -> returns a handle and an executable virtual address,
> >> but does *not* map it there
> >> jit_text_write() -> write to that handle
> >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> >> x86, I think)
> >> 
> >> could be more efficient and/or safer.
> >> 
> >> (Modules could use this too.  Getting alternatives right might take some
> >> fiddling, because off the top of my head, this doesn't match how it works
> >> now.)
> >> 
> >> To make alternatives easier, this could work, maybe (haven't fully
> >> thought it through):
> >> 
> >> jit_text_alloc()
> >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> >> 
> >> write the text and apply alternatives
> >> 
> >> jit_text_finalize() -> change from RW to RX *and synchronize*
> >> 
> >> jit_text_finalize() would either need to wait for RCU (possibly extra
> >> heavy weight RCU to get "serialization") or send an IPI.
> >
> > This essentially how modules work now. The memory is allocated RW, written
> > and updated with alternatives and then made ROX in the end with set_memory
> > APIs.
> >
> > The issue with not having the memory mapped X when it's written is that we
> > cannot use large pages to map it. One of the goals is to have executable
> > memory mapped with large pages and make code allocator able to divide that
> > page among several callers.
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
> 
> Is this actually valid?  In between int3 and real code, there’s a
> potential torn read of real code mixed up with 0xcc.
 
You mean while doing text poking?

> > To improve the performance of this process, we can write to !X copy and
> > then text_poke it to the actual address in one go. This will require some
> > changes to get the alternatives right.
> >
> > -- 
> > Sincerely yours,
> > Mike.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-25 17:42             ` Mike Rapoport
@ 2023-06-25 18:07               ` Kent Overstreet
  2023-06-26  6:13                 ` Song Liu
  0 siblings, 1 reply; 65+ messages in thread
From: Kent Overstreet @ 2023-06-25 18:07 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andy Lutomirski, Mark Rutland, Kees Cook,
	Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers, pjt, torvalds

On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > >> 
> > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > >> >> >
> > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > >> >> >
> > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> >> > and puts the burden of code allocation to the modules code.
> > >> >> >
> > >> >> > Several architectures override module_alloc() because of various
> > >> >> > constraints where the executable memory can be located and this causes
> > >> >> > additional obstacles for improvements of code allocation.
> > >> >> >
> > >> >> > Start splitting code allocation from modules by introducing
> > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > >> >> >
> > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >> >
> > >> >> > The intention semantics for new allocation APIs:
> > >> >> >
> > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >> >   code that is restricted by relative addressing.
> > >> >> >
> > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >> >   when there are no restrictions for the code placement. For
> > >> >> >   architectures that require that any code is within certain distance
> > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > >> >> >   execmem_text_alloc().
> > >> >> >
> > >> >> 
> > >> >> Is there anything in this series to help users do the appropriate
> > >> >> synchronization when the actually populate the allocated memory with
> > >> >> code?  See here, for example:
> > >> >
> > >> > This series only factors out the executable allocations from modules and
> > >> > puts them in a central place.
> > >> > Anything else would go on top after this lands.
> > >> 
> > >> Hmm.
> > >> 
> > >> On the one hand, there's nothing wrong with factoring out common code. On
> > >> the other hand, this is probably the right time to at least start
> > >> thinking about synchronization, at least to the extent that it might make
> > >> us want to change this API.  (I'm not at all saying that this series
> > >> should require changes -- I'm just saying that this is a good time to
> > >> think about how this should work.)
> > >> 
> > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > >> look like the one think in the Linux ecosystem that actually
> > >> intelligently and efficiently maps new text into an address space:
> > >> mmap().
> > >> 
> > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > >> and jump to it with minimal synchronization (just the standard implicit
> > >> ordering in the kernel that populates the pages before setting up the
> > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > >> possible way userspace can screw it up (for a read-only mapping of
> > >> read-only text, anyway) is to jump to the mapping too early, in which
> > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > >> needs to "serialize" (in the SDM sense).
> > >> 
> > >> I think the same sequence (from userspace's perspective) works on other
> > >> architectures, too, although I think more cache management is needed on
> > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > >> IPI to map executable text into usermode, but I could easily be wrong.
> > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > >> remember the details.)
> > >> 
> > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > >> rather fraught, and I bet many things do it wrong when userspace is
> > >> multithreaded.  But not in production because it's mostly not used in
> > >> production.)
> > >> 
> > >> But jit_text_alloc() can't do this, because the order of operations
> > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > >> before the text is populated, so there is no atomic change from not-there
> > >> to populated-and-executable.  Which means that there is an opportunity
> > >> for CPUs, speculatively or otherwise, to start filling various caches
> > >> with intermediate states of the text, which means that various
> > >> architectures (even x86!) may need serialization.
> > >> 
> > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > >> coarse-grained, perhaps something vaguely like:
> > >> 
> > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > >> but does *not* map it there
> > >> jit_text_write() -> write to that handle
> > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > >> x86, I think)
> > >> 
> > >> could be more efficient and/or safer.
> > >> 
> > >> (Modules could use this too.  Getting alternatives right might take some
> > >> fiddling, because off the top of my head, this doesn't match how it works
> > >> now.)
> > >> 
> > >> To make alternatives easier, this could work, maybe (haven't fully
> > >> thought it through):
> > >> 
> > >> jit_text_alloc()
> > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > >> 
> > >> write the text and apply alternatives
> > >> 
> > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > >> 
> > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > >> heavy weight RCU to get "serialization") or send an IPI.
> > >
> > > This essentially how modules work now. The memory is allocated RW, written
> > > and updated with alternatives and then made ROX in the end with set_memory
> > > APIs.
> > >
> > > The issue with not having the memory mapped X when it's written is that we
> > > cannot use large pages to map it. One of the goals is to have executable
> > > memory mapped with large pages and make code allocator able to divide that
> > > page among several callers.
> > >
> > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > mapped ROX, will allocate memory from those caches and there will be
> > > jit_update() that uses text poking for writing to that memory.
> > >
> > > Upon allocation of a large page to increase the cache, that large page will
> > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > x86)
> > 
> > Is this actually valid?  In between int3 and real code, there’s a
> > potential torn read of real code mixed up with 0xcc.
>  
> You mean while doing text poking?

I think we've been getting distracted by text_poke(). text_poke() does
updates via a different virtual address which introduce new
synchroniation wrinkles, but it's not the main issue.

As _think_ I understand it, the root of the issue is that speculative
execution - and that per Andy, speculative execution doesn't obey memory
barriers.

I have _not_ dug into the details of how retpolines work and all the
spectre stuff that was going on, but - retpoline uses lfence, doesn't
it? And if speculative execution is the issue here, isn't retpoline what
we need?

For this particular issue, I'm not sure "invalidate by filling with
illegal instructions" makes sense. For that to work, would the processor
have to execute a serialize operation and a retry on hitting an illegal
instruction - or perhaps we do in the interrupt handler?

But if filling with illegal instructions does act as a speculation
barrier, then the issue is that a torn read could generate a legal but
incorrect instruction.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-25 18:07               ` Kent Overstreet
@ 2023-06-26  6:13                 ` Song Liu
  2023-06-26  9:54                   ` Puranjay Mohan
  0 siblings, 1 reply; 65+ messages in thread
From: Song Liu @ 2023-06-26  6:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Rapoport, Andy Lutomirski, Mark Rutland, Kees Cook,
	Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers, pjt, torvalds

On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > >
> > >
> > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > >>
> > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > >> >> >
> > > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > > >> >> >
> > > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > > >> >> > and puts the burden of code allocation to the modules code.
> > > >> >> >
> > > >> >> > Several architectures override module_alloc() because of various
> > > >> >> > constraints where the executable memory can be located and this causes
> > > >> >> > additional obstacles for improvements of code allocation.
> > > >> >> >
> > > >> >> > Start splitting code allocation from modules by introducing
> > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > > >> >> >
> > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > > >> >> >
> > > >> >> > The intention semantics for new allocation APIs:
> > > >> >> >
> > > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > > >> >> >   code that is restricted by relative addressing.
> > > >> >> >
> > > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > > >> >> >   when there are no restrictions for the code placement. For
> > > >> >> >   architectures that require that any code is within certain distance
> > > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > > >> >> >   execmem_text_alloc().
> > > >> >> >
> > > >> >>
> > > >> >> Is there anything in this series to help users do the appropriate
> > > >> >> synchronization when the actually populate the allocated memory with
> > > >> >> code?  See here, for example:
> > > >> >
> > > >> > This series only factors out the executable allocations from modules and
> > > >> > puts them in a central place.
> > > >> > Anything else would go on top after this lands.
> > > >>
> > > >> Hmm.
> > > >>
> > > >> On the one hand, there's nothing wrong with factoring out common code. On
> > > >> the other hand, this is probably the right time to at least start
> > > >> thinking about synchronization, at least to the extent that it might make
> > > >> us want to change this API.  (I'm not at all saying that this series
> > > >> should require changes -- I'm just saying that this is a good time to
> > > >> think about how this should work.)
> > > >>
> > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > > >> look like the one think in the Linux ecosystem that actually
> > > >> intelligently and efficiently maps new text into an address space:
> > > >> mmap().
> > > >>
> > > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > > >> and jump to it with minimal synchronization (just the standard implicit
> > > >> ordering in the kernel that populates the pages before setting up the
> > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > > >> possible way userspace can screw it up (for a read-only mapping of
> > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > > >> needs to "serialize" (in the SDM sense).
> > > >>
> > > >> I think the same sequence (from userspace's perspective) works on other
> > > >> architectures, too, although I think more cache management is needed on
> > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > > >> remember the details.)
> > > >>
> > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > > >> rather fraught, and I bet many things do it wrong when userspace is
> > > >> multithreaded.  But not in production because it's mostly not used in
> > > >> production.)
> > > >>
> > > >> But jit_text_alloc() can't do this, because the order of operations
> > > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > > >> before the text is populated, so there is no atomic change from not-there
> > > >> to populated-and-executable.  Which means that there is an opportunity
> > > >> for CPUs, speculatively or otherwise, to start filling various caches
> > > >> with intermediate states of the text, which means that various
> > > >> architectures (even x86!) may need serialization.
> > > >>
> > > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > > >> coarse-grained, perhaps something vaguely like:
> > > >>
> > > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > > >> but does *not* map it there
> > > >> jit_text_write() -> write to that handle
> > > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > > >> x86, I think)
> > > >>
> > > >> could be more efficient and/or safer.
> > > >>
> > > >> (Modules could use this too.  Getting alternatives right might take some
> > > >> fiddling, because off the top of my head, this doesn't match how it works
> > > >> now.)
> > > >>
> > > >> To make alternatives easier, this could work, maybe (haven't fully
> > > >> thought it through):
> > > >>
> > > >> jit_text_alloc()
> > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > > >>
> > > >> write the text and apply alternatives
> > > >>
> > > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > > >>
> > > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > > >> heavy weight RCU to get "serialization") or send an IPI.
> > > >
> > > > This essentially how modules work now. The memory is allocated RW, written
> > > > and updated with alternatives and then made ROX in the end with set_memory
> > > > APIs.
> > > >
> > > > The issue with not having the memory mapped X when it's written is that we
> > > > cannot use large pages to map it. One of the goals is to have executable
> > > > memory mapped with large pages and make code allocator able to divide that
> > > > page among several callers.
> > > >
> > > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > > mapped ROX, will allocate memory from those caches and there will be
> > > > jit_update() that uses text poking for writing to that memory.
> > > >
> > > > Upon allocation of a large page to increase the cache, that large page will
> > > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > > x86)
> > >
> > > Is this actually valid?  In between int3 and real code, there’s a
> > > potential torn read of real code mixed up with 0xcc.
> >
> > You mean while doing text poking?
>
> I think we've been getting distracted by text_poke(). text_poke() does
> updates via a different virtual address which introduce new
> synchroniation wrinkles, but it's not the main issue.
>
> As _think_ I understand it, the root of the issue is that speculative
> execution - and that per Andy, speculative execution doesn't obey memory
> barriers.
>
> I have _not_ dug into the details of how retpolines work and all the
> spectre stuff that was going on, but - retpoline uses lfence, doesn't
> it? And if speculative execution is the issue here, isn't retpoline what
> we need?
>
> For this particular issue, I'm not sure "invalidate by filling with
> illegal instructions" makes sense. For that to work, would the processor
> have to execute a serialize operation and a retry on hitting an illegal
> instruction - or perhaps we do in the interrupt handler?
>
> But if filling with illegal instructions does act as a speculation
> barrier, then the issue is that a torn read could generate a legal but
> incorrect instruction.

What is a "torn read" here? I assume it is an instruction read that
goes at the wrong instruction boundary (CISC). If this is correct, do
we need to handle torn read caused by software bug, or hardware
bit flip, or both?

Thanks,
Song

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-26  6:13                 ` Song Liu
@ 2023-06-26  9:54                   ` Puranjay Mohan
  2023-06-26 12:23                     ` Mark Rutland
  0 siblings, 1 reply; 65+ messages in thread
From: Puranjay Mohan @ 2023-06-26  9:54 UTC (permalink / raw)
  To: Song Liu
  Cc: Kent Overstreet, Mike Rapoport, Andy Lutomirski, Mark Rutland,
	Kees Cook, Linux Kernel Mailing List, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers, pjt, torvalds

On Mon, Jun 26, 2023 at 8:13 AM Song Liu <song@kernel.org> wrote:
>
> On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > >
> > > >
> > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > >>
> > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > >> >> >
> > > > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > > > >> >> >
> > > > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > >> >> >
> > > > >> >> > Several architectures override module_alloc() because of various
> > > > >> >> > constraints where the executable memory can be located and this causes
> > > > >> >> > additional obstacles for improvements of code allocation.
> > > > >> >> >
> > > > >> >> > Start splitting code allocation from modules by introducing
> > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > > > >> >> >
> > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > > > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > > > >> >> >
> > > > >> >> > The intention semantics for new allocation APIs:
> > > > >> >> >
> > > > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > > > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > > > >> >> >   code that is restricted by relative addressing.
> > > > >> >> >
> > > > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > > > >> >> >   when there are no restrictions for the code placement. For
> > > > >> >> >   architectures that require that any code is within certain distance
> > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > > > >> >> >   execmem_text_alloc().
> > > > >> >> >
> > > > >> >>
> > > > >> >> Is there anything in this series to help users do the appropriate
> > > > >> >> synchronization when the actually populate the allocated memory with
> > > > >> >> code?  See here, for example:
> > > > >> >
> > > > >> > This series only factors out the executable allocations from modules and
> > > > >> > puts them in a central place.
> > > > >> > Anything else would go on top after this lands.
> > > > >>
> > > > >> Hmm.
> > > > >>
> > > > >> On the one hand, there's nothing wrong with factoring out common code. On
> > > > >> the other hand, this is probably the right time to at least start
> > > > >> thinking about synchronization, at least to the extent that it might make
> > > > >> us want to change this API.  (I'm not at all saying that this series
> > > > >> should require changes -- I'm just saying that this is a good time to
> > > > >> think about how this should work.)
> > > > >>
> > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > > > >> look like the one think in the Linux ecosystem that actually
> > > > >> intelligently and efficiently maps new text into an address space:
> > > > >> mmap().
> > > > >>
> > > > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > > > >> and jump to it with minimal synchronization (just the standard implicit
> > > > >> ordering in the kernel that populates the pages before setting up the
> > > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > > > >> needs to "serialize" (in the SDM sense).
> > > > >>
> > > > >> I think the same sequence (from userspace's perspective) works on other
> > > > >> architectures, too, although I think more cache management is needed on
> > > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > > > >> remember the details.)
> > > > >>
> > > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > > > >> rather fraught, and I bet many things do it wrong when userspace is
> > > > >> multithreaded.  But not in production because it's mostly not used in
> > > > >> production.)
> > > > >>
> > > > >> But jit_text_alloc() can't do this, because the order of operations
> > > > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > > > >> before the text is populated, so there is no atomic change from not-there
> > > > >> to populated-and-executable.  Which means that there is an opportunity
> > > > >> for CPUs, speculatively or otherwise, to start filling various caches
> > > > >> with intermediate states of the text, which means that various
> > > > >> architectures (even x86!) may need serialization.
> > > > >>
> > > > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > > > >> coarse-grained, perhaps something vaguely like:
> > > > >>
> > > > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > > > >> but does *not* map it there
> > > > >> jit_text_write() -> write to that handle
> > > > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > > > >> x86, I think)
> > > > >>
> > > > >> could be more efficient and/or safer.
> > > > >>
> > > > >> (Modules could use this too.  Getting alternatives right might take some
> > > > >> fiddling, because off the top of my head, this doesn't match how it works
> > > > >> now.)
> > > > >>
> > > > >> To make alternatives easier, this could work, maybe (haven't fully
> > > > >> thought it through):
> > > > >>
> > > > >> jit_text_alloc()
> > > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > > > >>
> > > > >> write the text and apply alternatives
> > > > >>
> > > > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > > > >>
> > > > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > > > >> heavy weight RCU to get "serialization") or send an IPI.
> > > > >
> > > > > This essentially how modules work now. The memory is allocated RW, written
> > > > > and updated with alternatives and then made ROX in the end with set_memory
> > > > > APIs.
> > > > >
> > > > > The issue with not having the memory mapped X when it's written is that we
> > > > > cannot use large pages to map it. One of the goals is to have executable
> > > > > memory mapped with large pages and make code allocator able to divide that
> > > > > page among several callers.
> > > > >
> > > > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > > > mapped ROX, will allocate memory from those caches and there will be
> > > > > jit_update() that uses text poking for writing to that memory.
> > > > >
> > > > > Upon allocation of a large page to increase the cache, that large page will
> > > > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > > > x86)
> > > >
> > > > Is this actually valid?  In between int3 and real code, there’s a
> > > > potential torn read of real code mixed up with 0xcc.
> > >
> > > You mean while doing text poking?
> >
> > I think we've been getting distracted by text_poke(). text_poke() does
> > updates via a different virtual address which introduce new
> > synchroniation wrinkles, but it's not the main issue.
> >
> > As _think_ I understand it, the root of the issue is that speculative
> > execution - and that per Andy, speculative execution doesn't obey memory
> > barriers.
> >
> > I have _not_ dug into the details of how retpolines work and all the
> > spectre stuff that was going on, but - retpoline uses lfence, doesn't
> > it? And if speculative execution is the issue here, isn't retpoline what
> > we need?
> >
> > For this particular issue, I'm not sure "invalidate by filling with
> > illegal instructions" makes sense. For that to work, would the processor
> > have to execute a serialize operation and a retry on hitting an illegal
> > instruction - or perhaps we do in the interrupt handler?
> >
> > But if filling with illegal instructions does act as a speculation
> > barrier, then the issue is that a torn read could generate a legal but
> > incorrect instruction.
>
> What is a "torn read" here? I assume it is an instruction read that
> goes at the wrong instruction boundary (CISC). If this is correct, do
> we need to handle torn read caused by software bug, or hardware
> bit flip, or both?

On ARM64 (RISC), torn reads can't happen because the instruction fetch
is word aligned. If we replace the whole instruction atomically then there
won't be half old - half new instruction fetches.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-26  9:54                   ` Puranjay Mohan
@ 2023-06-26 12:23                     ` Mark Rutland
  0 siblings, 0 replies; 65+ messages in thread
From: Mark Rutland @ 2023-06-26 12:23 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Song Liu, Kent Overstreet, Mike Rapoport, Andy Lutomirski,
	Kees Cook, Linux Kernel Mailing List, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers, pjt, torvalds

On Mon, Jun 26, 2023 at 11:54:02AM +0200, Puranjay Mohan wrote:
> On Mon, Jun 26, 2023 at 8:13 AM Song Liu <song@kernel.org> wrote:
> >
> > On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > > >>
> > > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > > >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > > >> >> >
> > > > > >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > > > > >> >> >
> > > > > >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > > >> >> >
> > > > > >> >> > Several architectures override module_alloc() because of various
> > > > > >> >> > constraints where the executable memory can be located and this causes
> > > > > >> >> > additional obstacles for improvements of code allocation.
> > > > > >> >> >
> > > > > >> >> > Start splitting code allocation from modules by introducing
> > > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > > > > >> >> >
> > > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > > > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > > > > >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> > > > > >> >> >
> > > > > >> >> > The intention semantics for new allocation APIs:
> > > > > >> >> >
> > > > > >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > > > > >> >> >   close to the kernel image, like loadable kernel modules and generated
> > > > > >> >> >   code that is restricted by relative addressing.
> > > > > >> >> >
> > > > > >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> > > > > >> >> >   when there are no restrictions for the code placement. For
> > > > > >> >> >   architectures that require that any code is within certain distance
> > > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > > > > >> >> >   execmem_text_alloc().
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Is there anything in this series to help users do the appropriate
> > > > > >> >> synchronization when the actually populate the allocated memory with
> > > > > >> >> code?  See here, for example:
> > > > > >> >
> > > > > >> > This series only factors out the executable allocations from modules and
> > > > > >> > puts them in a central place.
> > > > > >> > Anything else would go on top after this lands.
> > > > > >>
> > > > > >> Hmm.
> > > > > >>
> > > > > >> On the one hand, there's nothing wrong with factoring out common code. On
> > > > > >> the other hand, this is probably the right time to at least start
> > > > > >> thinking about synchronization, at least to the extent that it might make
> > > > > >> us want to change this API.  (I'm not at all saying that this series
> > > > > >> should require changes -- I'm just saying that this is a good time to
> > > > > >> think about how this should work.)
> > > > > >>
> > > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > > > > >> look like the one think in the Linux ecosystem that actually
> > > > > >> intelligently and efficiently maps new text into an address space:
> > > > > >> mmap().
> > > > > >>
> > > > > >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > > > > >> and jump to it with minimal synchronization (just the standard implicit
> > > > > >> ordering in the kernel that populates the pages before setting up the
> > > > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > > > >> the mapping before mmap() finishes).  It works across CPUs, and the only
> > > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > > > >> case userspace gets a page fault.  Incoherence is impossible, and no one
> > > > > >> needs to "serialize" (in the SDM sense).
> > > > > >>
> > > > > >> I think the same sequence (from userspace's perspective) works on other
> > > > > >> architectures, too, although I think more cache management is needed on
> > > > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > > > > >> remember the details.)
> > > > > >>
> > > > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > > > > >> rather fraught, and I bet many things do it wrong when userspace is
> > > > > >> multithreaded.  But not in production because it's mostly not used in
> > > > > >> production.)
> > > > > >>
> > > > > >> But jit_text_alloc() can't do this, because the order of operations
> > > > > >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > > > > >> before the text is populated, so there is no atomic change from not-there
> > > > > >> to populated-and-executable.  Which means that there is an opportunity
> > > > > >> for CPUs, speculatively or otherwise, to start filling various caches
> > > > > >> with intermediate states of the text, which means that various
> > > > > >> architectures (even x86!) may need serialization.
> > > > > >>
> > > > > >> For eBPF- and module- like use cases, where JITting/code gen is quite
> > > > > >> coarse-grained, perhaps something vaguely like:
> > > > > >>
> > > > > >> jit_text_alloc() -> returns a handle and an executable virtual address,
> > > > > >> but does *not* map it there
> > > > > >> jit_text_write() -> write to that handle
> > > > > >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> > > > > >> x86, I think)
> > > > > >>
> > > > > >> could be more efficient and/or safer.
> > > > > >>
> > > > > >> (Modules could use this too.  Getting alternatives right might take some
> > > > > >> fiddling, because off the top of my head, this doesn't match how it works
> > > > > >> now.)
> > > > > >>
> > > > > >> To make alternatives easier, this could work, maybe (haven't fully
> > > > > >> thought it through):
> > > > > >>
> > > > > >> jit_text_alloc()
> > > > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > > > > >>
> > > > > >> write the text and apply alternatives
> > > > > >>
> > > > > >> jit_text_finalize() -> change from RW to RX *and synchronize*
> > > > > >>
> > > > > >> jit_text_finalize() would either need to wait for RCU (possibly extra
> > > > > >> heavy weight RCU to get "serialization") or send an IPI.
> > > > > >
> > > > > > This essentially how modules work now. The memory is allocated RW, written
> > > > > > and updated with alternatives and then made ROX in the end with set_memory
> > > > > > APIs.
> > > > > >
> > > > > > The issue with not having the memory mapped X when it's written is that we
> > > > > > cannot use large pages to map it. One of the goals is to have executable
> > > > > > memory mapped with large pages and make code allocator able to divide that
> > > > > > page among several callers.
> > > > > >
> > > > > > So the idea was that jit_text_alloc() will have a cache of large pages
> > > > > > mapped ROX, will allocate memory from those caches and there will be
> > > > > > jit_update() that uses text poking for writing to that memory.
> > > > > >
> > > > > > Upon allocation of a large page to increase the cache, that large page will
> > > > > > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > > > > > x86)
> > > > >
> > > > > Is this actually valid?  In between int3 and real code, there’s a
> > > > > potential torn read of real code mixed up with 0xcc.
> > > >
> > > > You mean while doing text poking?
> > >
> > > I think we've been getting distracted by text_poke(). text_poke() does
> > > updates via a different virtual address which introduce new
> > > synchroniation wrinkles, but it's not the main issue.
> > >
> > > As _think_ I understand it, the root of the issue is that speculative
> > > execution - and that per Andy, speculative execution doesn't obey memory
> > > barriers.
> > >
> > > I have _not_ dug into the details of how retpolines work and all the
> > > spectre stuff that was going on, but - retpoline uses lfence, doesn't
> > > it? And if speculative execution is the issue here, isn't retpoline what
> > > we need?
> > >
> > > For this particular issue, I'm not sure "invalidate by filling with
> > > illegal instructions" makes sense. For that to work, would the processor
> > > have to execute a serialize operation and a retry on hitting an illegal
> > > instruction - or perhaps we do in the interrupt handler?
> > >
> > > But if filling with illegal instructions does act as a speculation
> > > barrier, then the issue is that a torn read could generate a legal but
> > > incorrect instruction.
> >
> > What is a "torn read" here? I assume it is an instruction read that
> > goes at the wrong instruction boundary (CISC). If this is correct, do
> > we need to handle torn read caused by software bug, or hardware
> > bit flip, or both?
> 
> On ARM64 (RISC), torn reads can't happen because the instruction fetch
> is word aligned. If we replace the whole instruction atomically then there
> won't be half old - half new instruction fetches.

Unfortunately, that's only guaranteed for a subset of instructions (e.g. B,
NOP), and in general CPUs can fetch an instruction multiple times, and could
fetch arbitrary portions of the instruction each time.

Please see the "Concurrent Modification and Execution of instructions" rules in
the ARM ARM.

For arm64, in general, you need to inhibit any concurrent execution (e.g. by
stopping-the-world) when patching text, and afterwards you need cache maintence
followed by a context-syncrhonization-event (aking to an x86 serializing
instruction) on CPUs which will execute the new instruction(s).

There are a bunch of special cases where we can omit some of that, but in
general the architectural guarnatees are *very* weak and require SW to perform
several bits of work to guarantee the new instructions will be executed without issues.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-25 16:14         ` Mike Rapoport
  2023-06-25 16:59           ` Andy Lutomirski
@ 2023-06-26 12:31           ` Mark Rutland
  2023-06-26 17:48             ` Song Liu
  1 sibling, 1 reply; 65+ messages in thread
From: Mark Rutland @ 2023-06-26 12:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andy Lutomirski, Kees Cook, Linux Kernel Mailing List,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Sun, Jun 25, 2023 at 07:14:17PM +0300, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > 
> > On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > >> >
> > >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> > >> >
> > >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> > and puts the burden of code allocation to the modules code.
> > >> >
> > >> > Several architectures override module_alloc() because of various
> > >> > constraints where the executable memory can be located and this causes
> > >> > additional obstacles for improvements of code allocation.
> > >> >
> > >> > Start splitting code allocation from modules by introducing
> > >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > >> >
> > >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >
> > >> > The intention semantics for new allocation APIs:
> > >> >
> > >> > * execmem_text_alloc() should be used to allocate memory that must reside
> > >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >   code that is restricted by relative addressing.
> > >> >
> > >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >   when there are no restrictions for the code placement. For
> > >> >   architectures that require that any code is within certain distance
> > >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> > >> >   execmem_text_alloc().
> > >> >
> > >> 
> > >> Is there anything in this series to help users do the appropriate
> > >> synchronization when the actually populate the allocated memory with
> > >> code?  See here, for example:
> > >
> > > This series only factors out the executable allocations from modules and
> > > puts them in a central place.
> > > Anything else would go on top after this lands.
> > 
> > Hmm.
> > 
> > On the one hand, there's nothing wrong with factoring out common code. On
> > the other hand, this is probably the right time to at least start
> > thinking about synchronization, at least to the extent that it might make
> > us want to change this API.  (I'm not at all saying that this series
> > should require changes -- I'm just saying that this is a good time to
> > think about how this should work.)
> > 
> > The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > look like the one think in the Linux ecosystem that actually
> > intelligently and efficiently maps new text into an address space:
> > mmap().
> > 
> > On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > and jump to it with minimal synchronization (just the standard implicit
> > ordering in the kernel that populates the pages before setting up the
> > PTEs and whatever user synchronization is needed to avoid jumping into
> > the mapping before mmap() finishes).  It works across CPUs, and the only
> > possible way userspace can screw it up (for a read-only mapping of
> > read-only text, anyway) is to jump to the mapping too early, in which
> > case userspace gets a page fault.  Incoherence is impossible, and no one
> > needs to "serialize" (in the SDM sense).
> > 
> > I think the same sequence (from userspace's perspective) works on other
> > architectures, too, although I think more cache management is needed on
> > the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > IPI to map executable text into usermode, but I could easily be wrong.
> > (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > remember the details.)
> > 
> > Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > rather fraught, and I bet many things do it wrong when userspace is
> > multithreaded.  But not in production because it's mostly not used in
> > production.)
> > 
> > But jit_text_alloc() can't do this, because the order of operations
> > doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > before the text is populated, so there is no atomic change from not-there
> > to populated-and-executable.  Which means that there is an opportunity
> > for CPUs, speculatively or otherwise, to start filling various caches
> > with intermediate states of the text, which means that various
> > architectures (even x86!) may need serialization.
> > 
> > For eBPF- and module- like use cases, where JITting/code gen is quite
> > coarse-grained, perhaps something vaguely like:
> > 
> > jit_text_alloc() -> returns a handle and an executable virtual address,
> > but does *not* map it there
> > jit_text_write() -> write to that handle
> > jit_text_map() -> map it and synchronize if needed (no sync needed on
> > x86, I think)
> > 
> > could be more efficient and/or safer.
> > 
> > (Modules could use this too.  Getting alternatives right might take some
> > fiddling, because off the top of my head, this doesn't match how it works
> > now.)
> > 
> > To make alternatives easier, this could work, maybe (haven't fully
> > thought it through):
> > 
> > jit_text_alloc()
> > jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> > 
> > write the text and apply alternatives
> > 
> > jit_text_finalize() -> change from RW to RX *and synchronize*
> > 
> > jit_text_finalize() would either need to wait for RCU (possibly extra
> > heavy weight RCU to get "serialization") or send an IPI.
> 
> This essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
> 
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make code allocator able to divide that
> page among several callers.
> 
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
> 
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> x86)

Does that work on x86?

That is in no way gauranteed for other architectures; on arm64 you need
explicit cache maintenance (with I-cache maintenance at the VA to be executed
from) followed by context-synchronization-events (e.g. via ISB instructions, or
IPIs).

Mark.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-19 17:09       ` Andy Lutomirski
  2023-06-19 20:18         ` Nadav Amit
  2023-06-25 16:14         ` Mike Rapoport
@ 2023-06-26 13:01         ` Mark Rutland
  2 siblings, 0 replies; 65+ messages in thread
From: Mark Rutland @ 2023-06-26 13:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Rapoport, Kees Cook, Linux Kernel Mailing List,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On the
> other hand, this is probably the right time to at least start thinking about
> synchronization, at least to the extent that it might make us want to change
> this API.  (I'm not at all saying that this series should require changes --
> I'm just saying that this is a good time to think about how this should
> work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually intelligently
> and efficiently maps new text into an address space: mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC and
> jump to it with minimal synchronization (just the standard implicit ordering
> in the kernel that populates the pages before setting up the PTEs and
> whatever user synchronization is needed to avoid jumping into the mapping
> before mmap() finishes).  It works across CPUs, and the only possible way
> userspace can screw it up (for a read-only mapping of read-only text, anyway)
> is to jump to the mapping too early, in which case userspace gets a page
> fault.  Incoherence is impossible, and no one needs to "serialize" (in the
> SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on the
> kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to
> map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V
> has very developer-unfriendly icache management, but I don't remember the
> details.)

That's my understanding too, with a couple of details:

1) After the copy we perform and complete all the data + instruction cache
   maintenance *before* marking the mapping as executable.

2) Even *after* the mapping is marked executable, a thread could take a
   spurious fault on an instruction fetch for the new instructions. One way to
   think about this is that the CPU attempted to speculate the instructions
   earlier, saw that the mapping was faulting, and placed a "generate a fault
   here" operation into its pipeline to generate that later.

   The CPU pipeline/OoO-engine/whatever is effectively a transient cache for
   operations in-flight which is only ever "invalidated" by a
   context-synchronization-event (akin to an x86 serializing effect).

   We're only guarnateed to have a new instruction fetch (from the I-cache into
   the CPU pipeline) after the next context synchronization event (akin to an x86
   serializing effect), and luckily out exception entry/exit is architecturally
   guarnateed to provide that (unless we explicitly opt out via a control bit).

I know we're a bit lax with that today: I think we omit the
context-synchronization-event when enabling ftrace callsites, and worse, for
static keys. Those are both on my TODO list of nasty problems that require
careful auditing...

> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> rather fraught, and I bet many things do it wrong when userspace is
> multithreaded.  But not in production because it's mostly not used in
> production.)

I suspect uprobes needs a look too...

I'll need to go dig into all that a bit before I have more of an opinion on the
shape of the API.

Thanks,
Mark.

> But jit_text_alloc() can't do this, because the order of operations doesn't
> match.  With jit_text_alloc(), the executable mapping shows up before the
> text is populated, so there is no atomic change from not-there to
> populated-and-executable.  Which means that there is an opportunity for CPUs,
> speculatively or otherwise, to start filling various caches with intermediate
> states of the text, which means that various architectures (even x86!) may
> need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)
> 
> could be more efficient and/or safer.
> 
> (Modules could use this too.  Getting alternatives right might take some
> fiddling, because off the top of my head, this doesn't match how it works
> now.)
> 
> To make alternatives easier, this could work, maybe (haven't fully thought it through):
> 
> jit_text_alloc()
> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> 
> write the text and apply alternatives
> 
> jit_text_finalize() -> change from RW to RX *and synchronize*
> 
> jit_text_finalize() would either need to wait for RCU (possibly extra heavy
> weight RCU to get "serialization") or send an IPI.
> 
> This is slower than the alloc, write, map solution, but allows alternatives
> to be applied at the final address.
> 
> 
> Even fancier variants where the writing is some using something like
> use_temporary_mm() might even make sense.
> 
> 
> To what extent does performance matter for the various users?  module loading
> is slow, and I don't think we care that much.  eBPF loaded is not super fast,
> and we care to a limited extent.  I *think* the bcachefs use case needs to be
> very fast, but I'm not sure it can be fast and supportable.
> 
> Anyway, food for thought.
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-26 12:31           ` Mark Rutland
@ 2023-06-26 17:48             ` Song Liu
  2023-07-17 17:23               ` Andy Lutomirski
  0 siblings, 1 reply; 65+ messages in thread
From: Song Liu @ 2023-06-26 17:48 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Mike Rapoport, Andy Lutomirski, Kees Cook,
	Linux Kernel Mailing List, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Kent Overstreet, Luis Chamberlain,
	Michael Ellerman, Nadav Amit, Naveen N. Rao, Palmer Dabbelt,
	Puranjay Mohan, Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers

On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
[...]
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
>
> Does that work on x86?
>
> That is in no way gauranteed for other architectures; on arm64 you need
> explicit cache maintenance (with I-cache maintenance at the VA to be executed
> from) followed by context-synchronization-events (e.g. via ISB instructions, or
> IPIs).

I guess we need:
1) Invalidate unused part of the huge ROX pages;
2) Do not put two jit users (including module text, bpf, etc.) in the
same cache line;
3) Explicit cache maintenance;
4) context-synchronization-events.

Would these (or a subset of them) be sufficient to protect us from torn read?

Thanks,
Song

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
  2023-06-26 17:48             ` Song Liu
@ 2023-07-17 17:23               ` Andy Lutomirski
  0 siblings, 0 replies; 65+ messages in thread
From: Andy Lutomirski @ 2023-07-17 17:23 UTC (permalink / raw)
  To: Song Liu, Mark Rutland
  Cc: Mike Rapoport, Kees Cook, Linux Kernel Mailing List,
	Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Nadav Amit, Naveen N. Rao, Palmer Dabbelt, Puranjay Mohan,
	Rick P Edgecombe, Russell King (Oracle),
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	the arch/x86 maintainers



On Mon, Jun 26, 2023, at 10:48 AM, Song Liu wrote:
> On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland <mark.rutland@arm.com> wrote:
>>
> [...]
>> >
>> > So the idea was that jit_text_alloc() will have a cache of large pages
>> > mapped ROX, will allocate memory from those caches and there will be
>> > jit_update() that uses text poking for writing to that memory.
>> >
>> > Upon allocation of a large page to increase the cache, that large page will
>> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
>> > x86)
>>
>> Does that work on x86?
>>
>> That is in no way gauranteed for other architectures; on arm64 you need
>> explicit cache maintenance (with I-cache maintenance at the VA to be executed
>> from) followed by context-synchronization-events (e.g. via ISB instructions, or
>> IPIs).
>
> I guess we need:
> 1) Invalidate unused part of the huge ROX pages;
> 2) Do not put two jit users (including module text, bpf, etc.) in the
> same cache line;
> 3) Explicit cache maintenance;
> 4) context-synchronization-events.
>
> Would these (or a subset of them) be sufficient to protect us from torn read?

Maybe?  #4 is sufficiently vague that I can't really interpret it.

I have a half-drafted email asking for official clarification on the rules that might help shed light on this.  I find that this type of request works best when it's really well written :)

>
> Thanks,
> Song

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2023-07-17 17:24 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
2023-06-16 16:00   ` Edgecombe, Rick P
2023-06-17  5:52     ` Mike Rapoport
2023-06-16 18:14   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
2023-06-16 16:48   ` Kent Overstreet
2023-06-16 18:18     ` Song Liu
2023-06-17  5:57     ` Mike Rapoport
2023-06-17 20:38   ` Andy Lutomirski
2023-06-18  8:00     ` Mike Rapoport
2023-06-19 17:09       ` Andy Lutomirski
2023-06-19 20:18         ` Nadav Amit
2023-06-20 17:24           ` Andy Lutomirski
2023-06-25 16:14         ` Mike Rapoport
2023-06-25 16:59           ` Andy Lutomirski
2023-06-25 17:42             ` Mike Rapoport
2023-06-25 18:07               ` Kent Overstreet
2023-06-26  6:13                 ` Song Liu
2023-06-26  9:54                   ` Puranjay Mohan
2023-06-26 12:23                     ` Mark Rutland
2023-06-26 12:31           ` Mark Rutland
2023-06-26 17:48             ` Song Liu
2023-07-17 17:23               ` Andy Lutomirski
2023-06-26 13:01         ` Mark Rutland
2023-06-19 11:34     ` Kent Overstreet
2023-06-16  8:50 ` [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem Mike Rapoport
2023-06-16 18:36   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
2023-06-16 16:16   ` Edgecombe, Rick P
2023-06-17  6:10     ` Mike Rapoport
2023-06-16 18:53   ` Song Liu
2023-06-17  6:14     ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 05/12] modules, execmem: drop module_alloc Mike Rapoport
2023-06-16 18:56   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
2023-06-16 16:55   ` Edgecombe, Rick P
2023-06-17  6:44     ` Mike Rapoport
2023-06-16 20:01   ` Song Liu
2023-06-17  6:51     ` Mike Rapoport
2023-06-18 22:32   ` Thomas Gleixner
2023-06-18 23:14     ` Kent Overstreet
2023-06-19  0:43       ` Thomas Gleixner
2023-06-19  2:12         ` Kent Overstreet
2023-06-20 14:51         ` Steven Rostedt
2023-06-20 15:32           ` Alexei Starovoitov
2023-06-19 15:23     ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions Mike Rapoport
2023-06-16 20:05   ` Song Liu
2023-06-17  6:57     ` Mike Rapoport
2023-06-17 15:36       ` Kent Overstreet
2023-06-17 16:38         ` Song Liu
2023-06-17 20:37           ` Kent Overstreet
2023-06-16  8:50 ` [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations Mike Rapoport
2023-06-16 20:09   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 09/12] powerpc: " Mike Rapoport
2023-06-16 20:09   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES Mike Rapoport
2023-06-16 20:17   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
2023-06-16 20:18   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
2023-06-16 11:44   ` Björn Töpel
2023-06-17  6:52     ` Mike Rapoport
2023-06-16 17:02 ` [PATCH v2 00/12] mm: jit/text allocator Edgecombe, Rick P

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).