linux-efi.vger.kernel.org archive mirror
* [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
@ 2024-02-20 20:32 Maxwell Bland
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

Reworks arm64's virtual memory allocation infrastructure to support
dynamic enforcement of page middle directory (PMD) PXNTable
restrictions rather than enforcing them only during the initial memory
mapping. Runtime enforcement of this bit prevents write-then-execute
attacks, where malicious code is staged in vmalloc'd data regions and
the page table is later changed to make this code executable.

Previously the entire region from VMALLOC_START to VMALLOC_END was
vulnerable, but now the vulnerable region is restricted to the 2GB
reserved by module_alloc, a region which is generally read-only and more
difficult to inject staging code into, e.g., data must pass the BPF
verifier. These changes also set the stage for other systems, such as
KVM-level (EL2) changes to mark page tables immutable and code page
verification changes, forging a path toward complete mitigation of
kernel exploits on ARM.

Implementing this required minimal changes to the generic vmalloc
interface in the kernel to allow architecture overrides of some vmalloc
wrapper functions, refactoring vmalloc calls to use a standard interface
in the generic kernel, and passing the address parameter already passed
into PTE allocation to the pte_allocate child function call.
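
To give a feel for the generic-side change, __vmalloc_node now takes a
vm_flags argument:

	void *__vmalloc_node(unsigned long size, unsigned long align,
			     gfp_t gfp_mask, unsigned long vm_flags, int node,
			     const void *caller);

and callers that open-coded the default range are converted, e.g. in
kernel/fork.c (see patch 1):

	/* before */
	stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
				     VMALLOC_START, VMALLOC_END,
				     THREADINFO_GFP & ~__GFP_ACCOUNT,
				     PAGE_KERNEL,
				     0, node, __builtin_return_address(0));
	/* after */
	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
			       THREADINFO_GFP & ~__GFP_ACCOUNT,
			       0, node, __builtin_return_address(0));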

The new arm64 vmalloc wrapper functions ensure vmalloc data is not
allocated into the region reserved for module_alloc. arm64's BPF and
kprobe code also see a two-line change ensuring their allocations abide
by the segmentation of code from data. Finally, arm64's
pmd_populate_kernel function is modified to set the PXNTable bit
appropriately.
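
The crux of that enforcement (patch 4) is small; roughly:

	pmdval_t pmd = PMD_TYPE_TABLE | PMD_TABLE_UXN;

	if (IS_DATA_VMALLOC_ADDR(address) &&
	    IS_DATA_VMALLOC_ADDR(address + PMD_SIZE))
		pmd |= PMD_TABLE_PXN;
	__pmd_populate(pmdp, __pa(ptep), pmd);

where IS_DATA_VMALLOC_ADDR() checks that the range lies within vmalloc
space but outside the reserved code region.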

Signed-off-by: Maxwell Bland <mbland@motorola.com>

---

After Mark Rutland's feedback last week on my more minimal patch, see

<CAP5Mv+ydhk=Ob4b40ZahGMgT-5+-VEHxtmA=-LkJiEOOU+K6hw@mail.gmail.com>

I adopted a more sweeping and more correct overhaul of arm64's virtual
memory allocation infrastructure to support these changes. This series
guarantees our ability to build future systems with a strong and
accessible distinction between code and data at the page allocation
layer, bolstering the guarantees of complementary contributions, e.g.
W^X and kCFI.

The current patch only minimally reduces available vmalloc space,
removing the 2GB that should be reserved for code allocations
regardless, and I feel it really benefits the kernel by making several
memory allocation interfaces more uniform and by providing hooks for
non-ARM architectures to follow suit.

I have done some minimal runtime testing using Torvalds' test-tlb
script on a QEMU VM, but maybe more extensive benchmarking is needed?

Size: Before Patch -> After Patch
4k: 4.09ns  4.15ns  4.41ns  4.43ns -> 3.68ns  3.73ns  3.67ns  3.73ns 
8k: 4.22ns  4.19ns  4.30ns  4.15ns -> 3.99ns  3.89ns  4.12ns  4.04ns 
16k: 3.97ns  4.31ns  4.30ns  4.28ns -> 4.03ns  3.98ns  4.06ns  4.06ns 
32k: 3.82ns  4.51ns  4.25ns  4.31ns -> 3.99ns  4.09ns  4.07ns  5.17ns 
64k: 4.50ns  5.59ns  6.13ns  6.14ns -> 4.23ns  4.26ns  5.91ns  5.93ns 
128k: 5.06ns  4.47ns  6.75ns  6.69ns -> 4.47ns  4.71ns  6.54ns  6.44ns 
256k: 4.83ns  4.43ns  6.62ns  6.21ns -> 4.39ns  4.62ns  6.71ns  6.65ns 
512k: 4.45ns  4.75ns  6.19ns  6.65ns -> 4.86ns  5.26ns  7.77ns  6.68ns 
1M: 4.72ns  4.73ns  6.74ns  6.47ns -> 4.29ns  4.45ns  6.87ns  6.59ns 
2M: 4.66ns  4.86ns  14.49ns  15.00ns -> 4.53ns  4.57ns  15.91ns  15.90ns 
4M: 4.85ns  4.95ns  15.90ns  15.98ns -> 4.48ns  4.74ns  17.27ns  17.36ns 
6M: 4.94ns  5.03ns  17.19ns  17.31ns -> 4.70ns  4.93ns  18.02ns  18.23ns 
8M: 5.05ns  5.18ns  17.49ns  17.64ns -> 4.96ns  5.07ns  18.84ns  18.72ns 
16M: 5.55ns  5.79ns  20.99ns  23.70ns -> 5.46ns  5.72ns  22.76ns  26.51ns
32M: 8.54ns  9.06ns  124.61ns 125.07ns -> 8.43ns  8.59ns  116.83ns 138.83ns
64M: 8.42ns  8.63ns  196.17ns 204.52ns -> 8.26ns  8.43ns  193.49ns 203.85ns
128M: 8.31ns  8.58ns  230.46ns 242.63ns -> 8.22ns  8.39ns  227.99ns 240.29ns
256M: 8.80ns  8.80ns  248.24ns 261.68ns -> 8.35ns  8.55ns  250.18ns 262.20ns

Note I also chose to enforce PXNTable at the PMD layer only (for now),
since the 194 descriptors which are affected by this change on my
testing setup are not sufficient to warrant enforcement at a coarser
granularity.

The architecture-independent changes (which I term "generic") can be
classified only as refactoring, but I feel they are also major
improvements in that they standardize most uses of the vmalloc
interface across the kernel.

Note this patch reduces the arm64 allocated region for BPF and
kprobes, but only to match the existing allocation choices made by the
generic kernel. I will admit I do not understand why BPF JIT allocation
code was duplicated into arm64, but I feel this was either an artifact
or that these overrides of generic allocation should require a specific
Kconfig option, as they trade off between security and space. That
said, I have chosen not to wrap this patch in a Kconfig interface, as I
feel the changes provide significant benefit to the arm64 kernel's
baseline security, though a Kconfig option could certainly be added if
the maintainers see the need.

Maxwell Bland (4):
  mm/vmalloc: allow arch-specific vmalloc_node overrides
  mm: pgalloc: support address-conditional pmd allocation
  arm64: separate code and data virtual memory allocation
  arm64: dynamic enforcement of pmd-level PXNTable

 arch/arm/kernel/irq.c               |  2 +-
 arch/arm64/include/asm/pgalloc.h    | 11 +++++-
 arch/arm64/include/asm/vmalloc.h    |  8 ++++
 arch/arm64/include/asm/vmap_stack.h |  2 +-
 arch/arm64/kernel/efi.c             |  2 +-
 arch/arm64/kernel/module.c          |  7 ++++
 arch/arm64/kernel/probes/kprobes.c  |  2 +-
 arch/arm64/mm/Makefile              |  3 +-
 arch/arm64/mm/trans_pgd.c           |  2 +-
 arch/arm64/mm/vmalloc.c             | 57 +++++++++++++++++++++++++++++
 arch/arm64/net/bpf_jit_comp.c       |  5 ++-
 arch/powerpc/kernel/irq.c           |  2 +-
 arch/riscv/include/asm/irq_stack.h  |  2 +-
 arch/s390/hypfs/hypfs_diag.c        |  2 +-
 arch/s390/kernel/setup.c            |  6 +--
 arch/s390/kernel/sthyi.c            |  2 +-
 include/asm-generic/pgalloc.h       | 18 +++++++++
 include/linux/mm.h                  |  4 +-
 include/linux/vmalloc.h             | 15 +++++++-
 kernel/bpf/syscall.c                |  4 +-
 kernel/fork.c                       |  4 +-
 kernel/scs.c                        |  3 +-
 lib/objpool.c                       |  2 +-
 lib/test_vmalloc.c                  |  6 +--
 mm/hugetlb_vmemmap.c                |  4 +-
 mm/kasan/init.c                     | 22 ++++++-----
 mm/memory.c                         |  4 +-
 mm/percpu.c                         |  2 +-
 mm/pgalloc-track.h                  |  3 +-
 mm/sparse-vmemmap.c                 |  2 +-
 mm/util.c                           |  3 +-
 mm/vmalloc.c                        | 39 +++++++-------------
 32 files changed, 176 insertions(+), 74 deletions(-)
 create mode 100644 arch/arm64/mm/vmalloc.c


base-commit: b401b621758e46812da61fa58a67c3fd8d91de0d
-- 
2.39.2



* [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  5:43   ` Christoph Hellwig
  2024-02-21  6:59   ` Christophe Leroy
  2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

Present non-uniform use of __vmalloc_node and __vmalloc_node_range
makes enforcing appropriate code and data separation untenable on
certain microarchitectures, as VMALLOC_START and VMALLOC_END are
monolithic while use of the vmalloc interface is not: in particular,
appropriate randomness in ASLR requires that code regions can fall
anywhere between VMALLOC_START and VMALLOC_END, but this necessitates
that code pages are intermingled with data pages, meaning code-specific
protections, such as arm64's PXNTable, cannot be performantly enforced
at runtime.

The solution to this problem allows architectures to override the
vmalloc wrapper functions. It does so by ensuring the rest of the
kernel does not reimplement __vmalloc_node by calling
__vmalloc_node_range with the same parameters as __vmalloc_node, and by
adding a __weak tag to those wrapper functions which still call
__vmalloc_node_range with parameters matching those of __vmalloc_node.

Two benefits of this approach are (1) greater flexibility for each
architecture in handling virtual memory without compromising the
kernel's vmalloc logic and (2) more uniform use of the __vmalloc_node
interface, reserving the more specialized __vmalloc_node_range for
genuinely specialized cases, such as kasan's shadow memory.
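
As an illustration of the override this enables, an architecture can
now supply a strong definition of the __weak generic function; a
minimal sketch (the range macros here are illustrative placeholders,
see the arm64 implementation in a later patch for the real thing):

	void *__vmalloc_node(unsigned long size, unsigned long align,
			     gfp_t gfp_mask, unsigned long vm_flags, int node,
			     const void *caller)
	{
		/* restrict generic vmalloc data to an arch-chosen window */
		return __vmalloc_node_range(size, align, ARCH_VMALLOC_DATA_START,
					    ARCH_VMALLOC_DATA_END, gfp_mask,
					    PAGE_KERNEL, vm_flags, node, caller);
	}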

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 arch/arm/kernel/irq.c               |  2 +-
 arch/arm64/include/asm/vmap_stack.h |  2 +-
 arch/arm64/kernel/efi.c             |  2 +-
 arch/powerpc/kernel/irq.c           |  2 +-
 arch/riscv/include/asm/irq_stack.h  |  2 +-
 arch/s390/hypfs/hypfs_diag.c        |  2 +-
 arch/s390/kernel/setup.c            |  6 ++---
 arch/s390/kernel/sthyi.c            |  2 +-
 include/linux/vmalloc.h             | 15 ++++++++++-
 kernel/bpf/syscall.c                |  4 +--
 kernel/fork.c                       |  4 +--
 kernel/scs.c                        |  3 +--
 lib/objpool.c                       |  2 +-
 lib/test_vmalloc.c                  |  6 ++---
 mm/util.c                           |  3 +--
 mm/vmalloc.c                        | 39 +++++++++++------------------
 16 files changed, 47 insertions(+), 49 deletions(-)

diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index fe28fc1f759d..109f4f363621 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -61,7 +61,7 @@ static void __init init_irq_stacks(void)
 						       THREAD_SIZE_ORDER);
 		else
 			stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
-					       THREADINFO_GFP, NUMA_NO_NODE,
+					       THREADINFO_GFP, 0, NUMA_NO_NODE,
 					       __builtin_return_address(0));
 
 		if (WARN_ON(!stack))
diff --git a/arch/arm64/include/asm/vmap_stack.h b/arch/arm64/include/asm/vmap_stack.h
index 20873099c035..57a7eaa720d5 100644
--- a/arch/arm64/include/asm/vmap_stack.h
+++ b/arch/arm64/include/asm/vmap_stack.h
@@ -21,7 +21,7 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
 
 	BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK));
 
-	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
+	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, 0, node,
 			__builtin_return_address(0));
 	return kasan_reset_tag(p);
 }
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..48efa31a9161 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -205,7 +205,7 @@ static int __init arm64_efi_rt_init(void)
 		return 0;
 
 	p = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_KERNEL,
-			   NUMA_NO_NODE, &&l);
+			   0, NUMA_NO_NODE, &&l);
 l:	if (!p) {
 		pr_warn("Failed to allocate EFI runtime stack\n");
 		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 6f7d4edaa0bc..ceb7ea07ca28 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -308,7 +308,7 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
 static void *__init alloc_vm_stack(void)
 {
 	return __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP,
-			      NUMA_NO_NODE, (void *)_RET_IP_);
+			      0, NUMA_NO_NODE, (void *)_RET_IP_);
 }
 
 static void __init vmap_irqstack_init(void)
diff --git a/arch/riscv/include/asm/irq_stack.h b/arch/riscv/include/asm/irq_stack.h
index 6441ded3b0cf..d2410735bde0 100644
--- a/arch/riscv/include/asm/irq_stack.h
+++ b/arch/riscv/include/asm/irq_stack.h
@@ -24,7 +24,7 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
 {
 	void *p;
 
-	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
+	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, 0, node,
 			__builtin_return_address(0));
 	return kasan_reset_tag(p);
 }
diff --git a/arch/s390/hypfs/hypfs_diag.c b/arch/s390/hypfs/hypfs_diag.c
index 279b7bba4d43..16359d854288 100644
--- a/arch/s390/hypfs/hypfs_diag.c
+++ b/arch/s390/hypfs/hypfs_diag.c
@@ -70,7 +70,7 @@ void *diag204_get_buffer(enum diag204_format fmt, int *pages)
 			return ERR_PTR(-EOPNOTSUPP);
 	}
 	diag204_buf = __vmalloc_node(array_size(*pages, PAGE_SIZE),
-				     PAGE_SIZE, GFP_KERNEL, NUMA_NO_NODE,
+				     PAGE_SIZE, GFP_KERNEL, 0, NUMA_NO_NODE,
 				     __builtin_return_address(0));
 	if (!diag204_buf)
 		return ERR_PTR(-ENOMEM);
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index d1f3b56e7afc..2c25b4e9f20a 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -254,7 +254,7 @@ static void __init conmode_default(void)
 		cpcmd("QUERY TERM", query_buffer, 1024, NULL);
 		ptr = strstr(query_buffer, "CONMODE");
 		/*
-		 * Set the conmode to 3215 so that the device recognition 
+		 * Set the conmode to 3215 so that the device recognition
 		 * will set the cu_type of the console to 3215. If the
 		 * conmode is 3270 and we don't set it back then both
 		 * 3215 and the 3270 driver will try to access the console
@@ -314,7 +314,7 @@ static inline void setup_zfcpdump(void) {}
 
  /*
  * Reboot, halt and power_off stubs. They just call _machine_restart,
- * _machine_halt or _machine_power_off. 
+ * _machine_halt or _machine_power_off.
  */
 
 void machine_restart(char *command)
@@ -364,7 +364,7 @@ unsigned long stack_alloc(void)
 	void *ret;
 
 	ret = __vmalloc_node(THREAD_SIZE, THREAD_SIZE, THREADINFO_GFP,
-			     NUMA_NO_NODE, __builtin_return_address(0));
+			     0, NUMA_NO_NODE, __builtin_return_address(0));
 	kmemleak_not_leak(ret);
 	return (unsigned long)ret;
 #else
diff --git a/arch/s390/kernel/sthyi.c b/arch/s390/kernel/sthyi.c
index 30bb20461db4..5bf239bcdae9 100644
--- a/arch/s390/kernel/sthyi.c
+++ b/arch/s390/kernel/sthyi.c
@@ -318,7 +318,7 @@ static void fill_diag(struct sthyi_sctns *sctns)
 		return;
 
 	diag204_buf = __vmalloc_node(array_size(pages, PAGE_SIZE),
-				     PAGE_SIZE, GFP_KERNEL, NUMA_NO_NODE,
+				     PAGE_SIZE, GFP_KERNEL, 0, NUMA_NO_NODE,
 				     __builtin_return_address(0));
 	if (!diag204_buf)
 		return;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..f13bd711ad7d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -150,7 +150,8 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller) __alloc_size(1);
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
-		int node, const void *caller) __alloc_size(1);
+		unsigned long vm_flags, int node, const void *caller)
+		__alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
@@ -295,4 +296,16 @@ bool vmalloc_dump_obj(void *object);
 static inline bool vmalloc_dump_obj(void *object) { return false; }
 #endif
 
+#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
+#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
+#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
+#define GFP_VMALLOC32 (GFP_DMA | GFP_KERNEL)
+#else
+/*
+ * 64b systems should always have either DMA or DMA32 zones. For others
+ * GFP_DMA32 should do the right thing and use the normal zone.
+ */
+#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
+#endif
+
 #endif /* _LINUX_VMALLOC_H */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a1f18681721c..79c11307ff40 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -303,8 +303,8 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 			return area;
 	}
 
-	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
-			gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL, PAGE_KERNEL,
+	return __vmalloc_node(size, align,
+			gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 			flags, numa_node, __builtin_return_address(0));
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43f..800bb1c76000 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -304,10 +304,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	 * so memcg accounting is performed manually on assigning/releasing
 	 * stacks to tasks. Drop __GFP_ACCOUNT.
 	 */
-	stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
-				     VMALLOC_START, VMALLOC_END,
+	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
 				     THREADINFO_GFP & ~__GFP_ACCOUNT,
-				     PAGE_KERNEL,
 				     0, node, __builtin_return_address(0));
 	if (!stack)
 		return -ENOMEM;
diff --git a/kernel/scs.c b/kernel/scs.c
index d7809affe740..5b89fb08a392 100644
--- a/kernel/scs.c
+++ b/kernel/scs.c
@@ -43,8 +43,7 @@ static void *__scs_alloc(int node)
 		}
 	}
 
-	s = __vmalloc_node_range(SCS_SIZE, 1, VMALLOC_START, VMALLOC_END,
-				    GFP_SCS, PAGE_KERNEL, 0, node,
+	s = __vmalloc_node(SCS_SIZE, 1, GFP_SCS, 0, node,
 				    __builtin_return_address(0));
 
 out:
diff --git a/lib/objpool.c b/lib/objpool.c
index cfdc02420884..f0acd421a652 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -80,7 +80,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs,
 			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
 		else
 			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
-				cpu_to_node(i), __builtin_return_address(0));
+				0, cpu_to_node(i), __builtin_return_address(0));
 		if (!slot)
 			return -ENOMEM;
 		memset(slot, 0, size);
diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index 3718d9886407..6bde73f892f9 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -97,7 +97,7 @@ static int random_size_align_alloc_test(void)
 		size = ((rnd % 10) + 1) * PAGE_SIZE;
 
 		ptr = __vmalloc_node(size, align, GFP_KERNEL | __GFP_ZERO, 0,
-				__builtin_return_address(0));
+				0, __builtin_return_address(0));
 		if (!ptr)
 			return -1;
 
@@ -120,7 +120,7 @@ static int align_shift_alloc_test(void)
 		align = ((unsigned long) 1) << i;
 
 		ptr = __vmalloc_node(PAGE_SIZE, align, GFP_KERNEL|__GFP_ZERO, 0,
-				__builtin_return_address(0));
+				0, __builtin_return_address(0));
 		if (!ptr)
 			return -1;
 
@@ -138,7 +138,7 @@ static int fix_align_alloc_test(void)
 	for (i = 0; i < test_loop_count; i++) {
 		ptr = __vmalloc_node(5 * PAGE_SIZE, THREAD_ALIGN << 1,
 				GFP_KERNEL | __GFP_ZERO, 0,
-				__builtin_return_address(0));
+				0, __builtin_return_address(0));
 		if (!ptr)
 			return -1;
 
diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..c6b7111215e2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -639,8 +639,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 * about the resulting pointer, and cannot play
 	 * protection games.
 	 */
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+	return __vmalloc_node(size, 1, flags, VM_ALLOW_HUGE_VMAP,
 			node, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kvmalloc_node);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..18ece28e79d3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3119,7 +3119,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE) {
-		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
+		area->pages = __vmalloc_node(array_size, 1, nested_gfp, 0, node,
 					area->caller);
 	} else {
 		area->pages = kmalloc_node(array_size, nested_gfp, node);
@@ -3379,11 +3379,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *__vmalloc_node(unsigned long size, unsigned long align,
-			    gfp_t gfp_mask, int node, const void *caller)
+__weak void *__vmalloc_node(unsigned long size, unsigned long align,
+			    gfp_t gfp_mask, unsigned long vm_flags, int node,
+			    const void *caller)
 {
 	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
-				gfp_mask, PAGE_KERNEL, 0, node, caller);
+				gfp_mask, PAGE_KERNEL, vm_flags, node, caller);
 }
 /*
  * This is only for performance analysis of vmalloc and stress purpose.
@@ -3396,7 +3397,7 @@ EXPORT_SYMBOL_GPL(__vmalloc_node);
 
 void *__vmalloc(unsigned long size, gfp_t gfp_mask)
 {
-	return __vmalloc_node(size, 1, gfp_mask, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, gfp_mask, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(__vmalloc);
@@ -3415,7 +3416,7 @@ EXPORT_SYMBOL(__vmalloc);
  */
 void *vmalloc(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, GFP_KERNEL, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc);
@@ -3432,7 +3433,7 @@ EXPORT_SYMBOL(vmalloc);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+__weak void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
 {
 	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
 				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
@@ -3455,7 +3456,7 @@ EXPORT_SYMBOL_GPL(vmalloc_huge);
  */
 void *vzalloc(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vzalloc);
@@ -3469,7 +3470,7 @@ EXPORT_SYMBOL(vzalloc);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_user(unsigned long size)
+__weak void *vmalloc_user(unsigned long size)
 {
 	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
@@ -3493,7 +3494,7 @@ EXPORT_SYMBOL(vmalloc_user);
  */
 void *vmalloc_node(unsigned long size, int node)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL, node,
+	return __vmalloc_node(size, 1, GFP_KERNEL, 0, node,
 			__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_node);
@@ -3511,23 +3512,11 @@ EXPORT_SYMBOL(vmalloc_node);
  */
 void *vzalloc_node(unsigned long size, int node)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, node,
+	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, 0, node,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vzalloc_node);
 
-#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
-#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
-#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
-#define GFP_VMALLOC32 (GFP_DMA | GFP_KERNEL)
-#else
-/*
- * 64b systems should always have either DMA or DMA32 zones. For others
- * GFP_DMA32 should do the right thing and use the normal zone.
- */
-#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
-#endif
-
 /**
  * vmalloc_32 - allocate virtually contiguous memory (32bit addressable)
  * @size:	allocation size
@@ -3539,7 +3528,7 @@ EXPORT_SYMBOL(vzalloc_node);
  */
 void *vmalloc_32(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_VMALLOC32, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, GFP_VMALLOC32, 0, NUMA_NO_NODE,
 			__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_32);
@@ -3553,7 +3542,7 @@ EXPORT_SYMBOL(vmalloc_32);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_32_user(unsigned long size)
+__weak void *vmalloc_32_user(unsigned long size)
 {
 	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
-- 
2.39.2


* [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  7:13   ` Christophe Leroy
  2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

While other descriptors (e.g. pud) allow allocations conditional on
which virtual address is allocated, pmd descriptor allocations do not.
However, adding support for this is straightforward and is beneficial to
future kernel development targeting the PMD memory granularity.

As many architectures already implement pmd_populate_kernel in an
address-generic manner, it is necessary to roll out support
incrementally. For this purpose a preprocessor flag,
__HAVE_ARCH_ADDR_COND_PMD, is introduced to capture whether the
architecture supports some feature requiring PMD allocation conditional
on the virtual address. Some microarchitectures (e.g. arm64) support
configurations for table descriptors, for example to enforce Privilege
eXecute Never, which benefit from knowing the virtual memory addresses
referenced by PMDs.

Thus two major arguments in favor of this change are (1) uniformity of
allocation between PMD and other table descriptor types and (2) the
capability of address-specific PMD allocation.
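
To make the opt-in concrete, a sketch assembled from this patch and
the later arm64 patch:

	/* arch/<arch>/include/asm/pgalloc.h: define the flag before the
	 * generic header and provide the four-argument variant */
	#define __HAVE_ARCH_ADDR_COND_PMD
	#include <asm-generic/pgalloc.h>

	/* generic call sites are switched to the _at helper, e.g. in
	 * mm/memory.c, which forwards the address only when the flag is set */
	pmd_populate_kernel_at(&init_mm, pmd, new, address);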

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
 include/linux/mm.h            |  4 ++--
 mm/hugetlb_vmemmap.c          |  4 ++--
 mm/kasan/init.c               | 22 +++++++++++++---------
 mm/memory.c                   |  4 ++--
 mm/percpu.c                   |  2 +-
 mm/pgalloc-track.h            |  3 ++-
 mm/sparse-vmemmap.c           |  2 +-
 8 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 879e5f8aa5e9..e5cdce77c6e4 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -142,6 +142,24 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 }
 #endif
 
+#ifdef __HAVE_ARCH_ADDR_COND_PMD
+static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
+			pte_t *ptep, unsigned long address);
+#else
+static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
+			pte_t *ptep);
+#endif
+
+static inline void pmd_populate_kernel_at(struct mm_struct *mm, pmd_t *pmdp,
+			pte_t *ptep, unsigned long address)
+{
+#ifdef __HAVE_ARCH_ADDR_COND_PMD
+	pmd_populate_kernel(mm, pmdp, ptep, address);
+#else
+	pmd_populate_kernel(mm, pmdp, ptep);
+#endif
+}
+
 #ifndef __HAVE_ARCH_PMD_FREE
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..6a9d5ded428d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2782,7 +2782,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 #if defined(CONFIG_MMU)
 
@@ -2977,7 +2977,7 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
-	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address)) ? \
 		NULL: pte_offset_kernel(pmd, address))
 
 #if USE_SPLIT_PMD_PTLOCKS
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index da177e49d956..1f5664b656f1 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -58,7 +58,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 	if (!pgtable)
 		return -ENOMEM;
 
-	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
+	pmd_populate_kernel_at(&init_mm, &__pmd, pgtable, addr);
 
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
 		pte_t entry, *pte;
@@ -81,7 +81,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 
 		/* Make pte visible before pmd. See comment in pmd_install(). */
 		smp_wmb();
-		pmd_populate_kernel(&init_mm, pmd, pgtable);
+		pmd_populate_kernel_at(&init_mm, pmd, pgtable, addr);
 		if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
 			flush_tlb_kernel_range(start, start + PMD_SIZE);
 	} else {
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index 89895f38f722..1e31d965a14e 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -116,8 +116,9 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 		next = pmd_addr_end(addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
@@ -131,7 +132,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 			if (!p)
 				return -ENOMEM;
 
-			pmd_populate_kernel(&init_mm, pmd, p);
+			pmd_populate_kernel_at(&init_mm, pmd, p, addr);
 		}
 		zero_pte_populate(pmd, addr, next);
 	} while (pmd++, addr = next, addr != end);
@@ -157,8 +158,9 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 			pud_populate(&init_mm, pud,
 					lm_alias(kasan_early_shadow_pmd));
 			pmd = pmd_offset(pud, addr);
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
@@ -203,8 +205,9 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 			pud_populate(&init_mm, pud,
 					lm_alias(kasan_early_shadow_pmd));
 			pmd = pmd_offset(pud, addr);
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
@@ -266,8 +269,9 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 			pud_populate(&init_mm, pud,
 					lm_alias(kasan_early_shadow_pmd));
 			pmd = pmd_offset(pud, addr);
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 15f8b10ea17c..15702822d904 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -447,7 +447,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 	return 0;
 }
 
-int __pte_alloc_kernel(pmd_t *pmd)
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 {
 	pte_t *new = pte_alloc_one_kernel(&init_mm);
 	if (!new)
@@ -456,7 +456,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		smp_wmb(); /* See comment in pmd_install() */
-		pmd_populate_kernel(&init_mm, pmd, new);
+		pmd_populate_kernel_at(&init_mm, pmd, new, address);
 		new = NULL;
 	}
 	spin_unlock(&init_mm.page_table_lock);
diff --git a/mm/percpu.c b/mm/percpu.c
index 4e11fc1e6def..7312e584c1b5 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3238,7 +3238,7 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
 		new = memblock_alloc(PTE_TABLE_SIZE, PTE_TABLE_SIZE);
 		if (!new)
 			goto err_alloc;
-		pmd_populate_kernel(&init_mm, pmd, new);
+		pmd_populate_kernel_at(&init_mm, pmd, new, addr);
 	}
 
 	return;
diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
index e9e879de8649..0984681c03d4 100644
--- a/mm/pgalloc-track.h
+++ b/mm/pgalloc-track.h
@@ -45,7 +45,8 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
 
 #define pte_alloc_kernel_track(pmd, address, mask)			\
 	((unlikely(pmd_none(*(pmd))) &&					\
-	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+	  (__pte_alloc_kernel(pmd, address) ||				\
+		({*(mask) |= PGTBL_PMD_MODIFIED; 0; }))) ?		\
 		NULL: pte_offset_kernel(pmd, address))
 
 #endif /* _LINUX_PGALLOC_TRACK_H */
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a2cbe44c48e1..d876cc4dc700 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -191,7 +191,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
-		pmd_populate_kernel(&init_mm, pmd, p);
+		pmd_populate_kernel_at(&init_mm, pmd, p, addr);
 	}
 	return pmd;
 }
-- 
2.39.2



* [PATCH 3/4] arm64: separate code and data virtual memory allocation
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
  2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  7:20   ` Christophe Leroy
  2024-02-20 20:32 ` [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable Maxwell Bland
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

Current BPF and kprobe instruction allocation interfaces do not match
the base kernel and intermingle code and data pages within the same
sections. In the case of BPF, this appears to be a result of code
duplication between the kernel's JIT compiler and arm64's JIT. However,
this is no longer necessary given the possibility of overriding the
vmalloc wrapper functions.

arm64's vmalloc_node routines now include a layer of indirection which
splits the vmalloc region into two segments surrounding the middle
module_alloc region determined by ASLR. To support this,
code_region_start and code_region_end are defined to match the 2GB
boundary chosen by the kernel module ASLR initialization routine.

The result is a large benefit to overall kernel security, as code
pages now remain protected by this ASLR routine and protections can be
defined linearly for code regions rather than through PTE-level
tracking.
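
With the boundaries chosen at boot by the module ASLR code, the
resulting vmalloc layout is roughly:

	VMALLOC_START ... code_region_start    : data-only vmalloc
	code_region_start ... code_region_end  : 2GB code region (modules,
	                                         BPF, kprobes via module_alloc)
	code_region_end ... VMALLOC_END        : data-only vmalloc

The overridden __vmalloc_node and friends first try the range below
code_region_start and fall back to the range above code_region_end.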

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 arch/arm64/include/asm/vmalloc.h   |  3 ++
 arch/arm64/kernel/module.c         |  7 ++++
 arch/arm64/kernel/probes/kprobes.c |  2 +-
 arch/arm64/mm/Makefile             |  3 +-
 arch/arm64/mm/vmalloc.c            | 57 ++++++++++++++++++++++++++++++
 arch/arm64/net/bpf_jit_comp.c      |  5 +--
 6 files changed, 73 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/mm/vmalloc.c

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 38fafffe699f..dbcf8ad20265 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -31,4 +31,7 @@ static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot)
 	return pgprot_tagged(prot);
 }
 
+extern unsigned long code_region_start __ro_after_init;
+extern unsigned long code_region_end __ro_after_init;
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index dd851297596e..c4fe753a71a9 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -29,6 +29,10 @@
 static u64 module_direct_base __ro_after_init = 0;
 static u64 module_plt_base __ro_after_init = 0;
 
+/* For pre-init vmalloc, assume the worst-case code range */
+unsigned long code_region_start __ro_after_init = (u64) (_end - SZ_2G);
+unsigned long code_region_end __ro_after_init = (u64) (_text + SZ_2G);
+
 /*
  * Choose a random page-aligned base address for a window of 'size' bytes which
  * entirely contains the interval [start, end - 1].
@@ -101,6 +105,9 @@ static int __init module_init_limits(void)
 		module_plt_base = random_bounding_box(SZ_2G, min, max);
 	}
 
+	code_region_start = module_plt_base;
+	code_region_end = module_plt_base + SZ_2G;
+
 	pr_info("%llu pages in range for non-PLT usage",
 		module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0);
 	pr_info("%llu pages in range for PLT usage",
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 70b91a8c6bb3..c9e109d6c8bc 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -131,7 +131,7 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 
 void *alloc_insn_page(void)
 {
-	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
+	return __vmalloc_node_range(PAGE_SIZE, 1, code_region_start, code_region_end,
 			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
 			NUMA_NO_NODE, __builtin_return_address(0));
 }
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..730b805d8388 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -2,7 +2,8 @@
 obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
-				   context.o proc.o pageattr.o fixmap.o
+				   context.o proc.o pageattr.o fixmap.o \
+				   vmalloc.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
diff --git a/arch/arm64/mm/vmalloc.c b/arch/arm64/mm/vmalloc.c
new file mode 100644
index 000000000000..b6d2fa841f90
--- /dev/null
+++ b/arch/arm64/mm/vmalloc.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+
+static void *__vmalloc_node_range_split(unsigned long size, unsigned long align,
+			unsigned long start, unsigned long end,
+			unsigned long exclusion_start, unsigned long exclusion_end, gfp_t gfp_mask,
+			pgprot_t prot, unsigned long vm_flags, int node,
+			const void *caller)
+{
+	void *res = NULL;
+
+	res = __vmalloc_node_range(size, align, start, exclusion_start,
+				gfp_mask, prot, vm_flags, node, caller);
+	if (!res)
+		res = __vmalloc_node_range(size, align, exclusion_end, end,
+				gfp_mask, prot, vm_flags, node, caller);
+
+	return res;
+}
+
+void *__vmalloc_node(unsigned long size, unsigned long align,
+			    gfp_t gfp_mask, unsigned long vm_flags, int node,
+			    const void *caller)
+{
+	return __vmalloc_node_range_split(size, align, VMALLOC_START,
+				VMALLOC_END, code_region_start, code_region_end,
+				gfp_mask, PAGE_KERNEL, vm_flags, node, caller);
+}
+
+void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+{
+	return __vmalloc_node_range_split(size, 1, VMALLOC_START, VMALLOC_END,
+				code_region_start, code_region_end,
+				gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+				NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+void *vmalloc_user(unsigned long size)
+{
+	return __vmalloc_node_range_split(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				code_region_start, code_region_end,
+				GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
+				VM_USERMAP, NUMA_NO_NODE,
+				__builtin_return_address(0));
+}
+
+void *vmalloc_32_user(unsigned long size)
+{
+	return __vmalloc_node_range_split(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				code_region_start, code_region_end,
+				GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
+				VM_USERMAP, NUMA_NO_NODE,
+				__builtin_return_address(0));
+}
+
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 8955da5c47cf..40426f3a9bdf 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -13,6 +13,7 @@
 #include <linux/memory.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
+#include <linux/moduleloader.h>
 
 #include <asm/asm-extable.h>
 #include <asm/byteorder.h>
@@ -1690,12 +1691,12 @@ u64 bpf_jit_alloc_exec_limit(void)
 void *bpf_jit_alloc_exec(unsigned long size)
 {
 	/* Memory is intended to be executable, reset the pointer tag. */
-	return kasan_reset_tag(vmalloc(size));
+	return kasan_reset_tag(module_alloc(size));
 }
 
 void bpf_jit_free_exec(void *addr)
 {
-	return vfree(addr);
+	return module_memfree(addr);
 }
 
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
-- 
2.39.2



* [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
                   ` (2 preceding siblings ...)
  2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
  2024-02-21 14:50 ` Conor Dooley
  5 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

In an attempt to protect against write-then-execute attacks, wherein
an adversary stages malicious code into a data page and then later uses
a write gadget to mark the data page executable, arm64 enforces
PXNTable when allocating pmd descriptors during the init process.
However, these protections are not maintained for dynamic memory
allocations, creating an extensive attack surface for
write-then-execute attacks targeting pages allocated through the
vmalloc interface.

Straightforward modifications to the pgalloc interface allow for the
dynamic enforcement of PXNTable, restricting writable and
privileged-executable code pages to known kernel text, bpf-allocated
programs, and kprobe-allocated pages, all of which have more extensive
verification interfaces than the generic vmalloc region.

This patch adds a preprocessor define to check whether a pmd is
allocated by vmalloc and exists outside of a known code region, and if
so, marks the pmd as PXNTable, protecting over 100 last-level page
tables from manipulation in the process.

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 arch/arm64/include/asm/pgalloc.h | 11 +++++++++--
 arch/arm64/include/asm/vmalloc.h |  5 +++++
 arch/arm64/mm/trans_pgd.c        |  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 237224484d0f..5e9262241e8b 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -13,6 +13,7 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
+#define __HAVE_ARCH_ADDR_COND_PMD
 #define __HAVE_ARCH_PGD_FREE
 #include <asm-generic/pgalloc.h>
 
@@ -74,10 +75,16 @@ static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t ptep,
  * of the mm address space.
  */
 static inline void
-pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep)
+pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep,
+			unsigned long address)
 {
+	pmdval_t pmd = PMD_TYPE_TABLE | PMD_TABLE_UXN;
 	VM_BUG_ON(mm && mm != &init_mm);
-	__pmd_populate(pmdp, __pa(ptep), PMD_TYPE_TABLE | PMD_TABLE_UXN);
+	if (IS_DATA_VMALLOC_ADDR(address) &&
+		IS_DATA_VMALLOC_ADDR(address + PMD_SIZE)) {
+		pmd |= PMD_TABLE_PXN;
+	}
+	__pmd_populate(pmdp, __pa(ptep), pmd);
 }
 
 static inline void
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index dbcf8ad20265..6f254ab83f4a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -34,4 +34,9 @@ static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot)
 extern unsigned long code_region_start __ro_after_init;
 extern unsigned long code_region_end __ro_after_init;
 
+#define IS_DATA_VMALLOC_ADDR(vaddr) (((vaddr) < code_region_start || \
+				      (vaddr) > code_region_end) && \
+				      ((vaddr) >= VMALLOC_START && \
+				       (vaddr) < VMALLOC_END))
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..7f903c51e1eb 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -69,7 +69,7 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 	dst_ptep = trans_alloc(info);
 	if (!dst_ptep)
 		return -ENOMEM;
-	pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
+	pmd_populate_kernel_at(NULL, dst_pmdp, dst_ptep, addr);
 	dst_ptep = pte_offset_kernel(dst_pmdp, start);
 
 	src_ptep = pte_offset_kernel(src_pmdp, start);
-- 
2.39.2



* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
@ 2024-02-21  5:43   ` Christoph Hellwig
  2024-02-21  7:38     ` Christophe Leroy
  2024-02-21  6:59   ` Christophe Leroy
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-02-21  5:43 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, christophe.leroy, cl, daniel, dave.hansen,
	david, dennis, dvyukov, glider, gor, guoren, haoluo, hca, hch,
	john.fastabend, jolsa, kasan-dev, kpsingh, linux-arch, linux,
	linux-efi, linux-kernel, linux-mm, linuxppc-dev, linux-riscv,
	linux-s390, lstoakes, mark.rutland, martin.lau, meted,
	michael.christie, mjguzik, mpe, mst, muchun.song, naveen.n.rao,
	npiggin, palmer, paul.walmsley, quic_nprakash, quic_pkondeti,
	rick.p.edgecombe, ryabinin.a.a, ryan.roberts, samitolvanen, sdf,
	song, surenb, svens, tj, urezki, vincenzo.frascino, will,
	wuqiang.matt, yonghong.song, zlim.lnx, awheeler

On Tue, Feb 20, 2024 at 02:32:53PM -0600, Maxwell Bland wrote:
> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
> enforcing appropriate code and data seperation untenable on certain
> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
> while the use of the vmalloc interface is non-monolithic: in particular,
> appropriate randomness in ASLR makes it such that code regions must fall
> in some region between VMALLOC_START and VMALLOC_end, but this
> necessitates that code pages are intermingled with data pages, meaning
> code-specific protections, such as arm64's PXNTable, cannot be
> performantly runtime enforced.

That's not actually true.  We have MODULE_START/END to separate them,
which is used by mips only for now.

> 
> The solution to this problem allows architectures to override the
> vmalloc wrapper functions by enforcing that the rest of the kernel does
> not reimplement __vmalloc_node by using __vmalloc_node_range with the
> same parameters as __vmalloc_node or provides a __weak tag to those
> functions using __vmalloc_node_range with parameters repeating those of
> __vmalloc_node.

I'm really not too happy about overriding the functions.  Especially
as the separation is a generally good idea and it would be good to
move everyone (or at least all modern architectures) over to a scheme
like this.


* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
  2024-02-21  5:43   ` Christoph Hellwig
@ 2024-02-21  6:59   ` Christophe Leroy
  2024-02-21 17:19     ` Maxwell Bland
  1 sibling, 1 reply; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  6:59 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
> enforcing appropriate code and data seperation untenable on certain
> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
> while the use of the vmalloc interface is non-monolithic: in particular,
> appropriate randomness in ASLR makes it such that code regions must fall
> in some region between VMALLOC_START and VMALLOC_end, but this
> necessitates that code pages are intermingled with data pages, meaning
> code-specific protections, such as arm64's PXNTable, cannot be
> performantly runtime enforced.
> 
> The solution to this problem allows architectures to override the
> vmalloc wrapper functions by enforcing that the rest of the kernel does
> not reimplement __vmalloc_node by using __vmalloc_node_range with the
> same parameters as __vmalloc_node or provides a __weak tag to those
> functions using __vmalloc_node_range with parameters repeating those of
> __vmalloc_node.
> 
> Two benefits of this approach are (1) greater flexibility to each
> architecture for handling of virtual memory while not compromising the
> kernel's vmalloc logic and (2) more uniform use of the __vmalloc_node
> interface, reserving the more specialized __vmalloc_node_range for more
> specialized cases, such as kasan's shadow memory.

I'm not sure I understand the message. What I understand is that you 
allow architectures to override vmalloc_node().

In the code you add __weak for that. But you also add the flags to the 
parameters and I can't understand why when reading the above description.

Christophe


* Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
@ 2024-02-21  7:13   ` Christophe Leroy
  2024-02-21  9:27     ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:13 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> While other descriptors (e.g. pud) allow allocations conditional on
> which virtual address is allocated, pmd descriptor allocations do not.
> However, adding support for this is straightforward and is beneficial to
> future kernel development targeting the PMD memory granularity.
> 
> As many architectures already implement pmd_populate_kernel in an
> address-generic manner, it is necessary to roll out support
> incrementally. For this purpose a preprocessor flag,

Is it really worth it? Only 48 call sites need to be updated. It would
avoid that preprocessor flag and avoid introducing that
pmd_populate_kernel_at() in the kernel core.

$ git grep -l pmd_populate_kernel -- arch/ | wc -l
48


> __HAVE_ARCH_ADDR_COND_PMD is introduced to capture whether the
> architecture supports some feature requiring PMD allocation conditional
> on virtual address. Some microarchitectures (e.g. arm64) support
> configurations for table descriptors, for example to enforce Privilege
> eXecute Never, which benefit from knowing the virtual memory addresses
> referenced by PMDs.
> 
> Thus two major arguments in favor of this change are (1) unformity of
> allocation between PMD and other table descriptor types and (2) the
> capability of address-specific PMD allocation.

Can you give more details on that uniformity ? I can't find any function 
called pud_populate_kernel().

Previously, pmd_populate_kernel() had the same arguments as pmd_populate().
Why not also update pmd_populate() to keep consistency? (This can be done
in a follow-up patch, not in this patch.)

> 
> Signed-off-by: Maxwell Bland <mbland@motorola.com>
> ---
>   include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
>   include/linux/mm.h            |  4 ++--
>   mm/hugetlb_vmemmap.c          |  4 ++--
>   mm/kasan/init.c               | 22 +++++++++++++---------
>   mm/memory.c                   |  4 ++--
>   mm/percpu.c                   |  2 +-
>   mm/pgalloc-track.h            |  3 ++-
>   mm/sparse-vmemmap.c           |  2 +-
>   8 files changed, 41 insertions(+), 18 deletions(-)
> 
> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
> index 879e5f8aa5e9..e5cdce77c6e4 100644
> --- a/include/asm-generic/pgalloc.h
> +++ b/include/asm-generic/pgalloc.h
> @@ -142,6 +142,24 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
>   }
>   #endif
> 
> +#ifdef __HAVE_ARCH_ADDR_COND_PMD
> +static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
> +                       pte_t *ptep, unsigned long address);
> +#else
> +static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
> +                       pte_t *ptep);
> +#endif
> +
> +static inline void pmd_populate_kernel_at(struct mm_struct *mm, pmd_t *pmdp,
> +                       pte_t *ptep, unsigned long address)
> +{
> +#ifdef __HAVE_ARCH_ADDR_COND_PMD
> +       pmd_populate_kernel(mm, pmdp, ptep, address);
> +#else
> +       pmd_populate_kernel(mm, pmdp, ptep);
> +#endif
> +}
> +
>   #ifndef __HAVE_ARCH_PMD_FREE
>   static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
>   {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f5a97dec5169..6a9d5ded428d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2782,7 +2782,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
>   #endif
> 
>   int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
> -int __pte_alloc_kernel(pmd_t *pmd);
> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
> 
>   #if defined(CONFIG_MMU)
> 
> @@ -2977,7 +2977,7 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>                   NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
> 
>   #define pte_alloc_kernel(pmd, address)                 \
> -       ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
> +       ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address)) ? \
>                  NULL: pte_offset_kernel(pmd, address))
> 
>   #if USE_SPLIT_PMD_PTLOCKS
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index da177e49d956..1f5664b656f1 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -58,7 +58,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
>          if (!pgtable)
>                  return -ENOMEM;
> 
> -       pmd_populate_kernel(&init_mm, &__pmd, pgtable);
> +       pmd_populate_kernel_at(&init_mm, &__pmd, pgtable, addr);
> 
>          for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
>                  pte_t entry, *pte;
> @@ -81,7 +81,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
> 
>                  /* Make pte visible before pmd. See comment in pmd_install(). */
>                  smp_wmb();
> -               pmd_populate_kernel(&init_mm, pmd, pgtable);
> +               pmd_populate_kernel_at(&init_mm, pmd, pgtable, addr);
>                  if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
>                          flush_tlb_kernel_range(start, start + PMD_SIZE);
>          } else {
> diff --git a/mm/kasan/init.c b/mm/kasan/init.c
> index 89895f38f722..1e31d965a14e 100644
> --- a/mm/kasan/init.c
> +++ b/mm/kasan/init.c
> @@ -116,8 +116,9 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
>                  next = pmd_addr_end(addr, end);
> 
>                  if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> @@ -131,7 +132,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
>                          if (!p)
>                                  return -ENOMEM;
> 
> -                       pmd_populate_kernel(&init_mm, pmd, p);
> +                       pmd_populate_kernel_at(&init_mm, pmd, p, addr);
>                  }
>                  zero_pte_populate(pmd, addr, next);
>          } while (pmd++, addr = next, addr != end);
> @@ -157,8 +158,9 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
>                          pud_populate(&init_mm, pud,
>                                          lm_alias(kasan_early_shadow_pmd));
>                          pmd = pmd_offset(pud, addr);
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> @@ -203,8 +205,9 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
>                          pud_populate(&init_mm, pud,
>                                          lm_alias(kasan_early_shadow_pmd));
>                          pmd = pmd_offset(pud, addr);
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> @@ -266,8 +269,9 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
>                          pud_populate(&init_mm, pud,
>                                          lm_alias(kasan_early_shadow_pmd));
>                          pmd = pmd_offset(pud, addr);
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 15f8b10ea17c..15702822d904 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -447,7 +447,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>          return 0;
>   }
> 
> -int __pte_alloc_kernel(pmd_t *pmd)
> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
>   {
>          pte_t *new = pte_alloc_one_kernel(&init_mm);
>          if (!new)
> @@ -456,7 +456,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
>          spin_lock(&init_mm.page_table_lock);
>          if (likely(pmd_none(*pmd))) {   /* Has another populated it ? */
>                  smp_wmb(); /* See comment in pmd_install() */
> -               pmd_populate_kernel(&init_mm, pmd, new);
> +               pmd_populate_kernel_at(&init_mm, pmd, new, address);
>                  new = NULL;
>          }
>          spin_unlock(&init_mm.page_table_lock);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 4e11fc1e6def..7312e584c1b5 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -3238,7 +3238,7 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
>                  new = memblock_alloc(PTE_TABLE_SIZE, PTE_TABLE_SIZE);
>                  if (!new)
>                          goto err_alloc;
> -               pmd_populate_kernel(&init_mm, pmd, new);
> +               pmd_populate_kernel_at(&init_mm, pmd, new, addr);
>          }
> 
>          return;
> diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
> index e9e879de8649..0984681c03d4 100644
> --- a/mm/pgalloc-track.h
> +++ b/mm/pgalloc-track.h
> @@ -45,7 +45,8 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
> 
>   #define pte_alloc_kernel_track(pmd, address, mask)                     \
>          ((unlikely(pmd_none(*(pmd))) &&                                 \
> -         (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
> +         (__pte_alloc_kernel(pmd, address) ||                          \
> +               ({*(mask) |= PGTBL_PMD_MODIFIED; 0; }))) ?              \
>                  NULL: pte_offset_kernel(pmd, address))
> 
>   #endif /* _LINUX_PGALLOC_TRACK_H */
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a2cbe44c48e1..d876cc4dc700 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -191,7 +191,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>                  void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>                  if (!p)
>                          return NULL;
> -               pmd_populate_kernel(&init_mm, pmd, p);
> +               pmd_populate_kernel_at(&init_mm, pmd, p, addr);
>          }
>          return pmd;
>   }
> --
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] arm64: separate code and data virtual memory allocation
  2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
@ 2024-02-21  7:20   ` Christophe Leroy
  0 siblings, 0 replies; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:20 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> Current BPF and kprobe instruction allocation interfaces do not match
> the base kernel and intermingle code and data pages within the same
> sections. In the case of BPF, this appears to be a result of code
> duplication between the kernel's JIT compiler and arm64's JIT.  However,
> this is no longer necessary given the possibility of overriding vmalloc
> wrapper functions.

Why do you need to override vmalloc wrapper functions for that?

See powerpc, for kprobes, alloc_insn_page() uses module_alloc().
On powerpc, the approach is that vmalloc() provides non-exec memory 
while module_alloc() provides executable memory.
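
For reference, a simplified sketch of that kprobes pattern (abridged;
the in-tree powerpc version also checks the strict module RWX
configuration before changing permissions):

/* executable instruction pages come from the module allocator and are
 * then made read-only and executable. */
void *alloc_insn_page(void)
{
	void *page = module_alloc(PAGE_SIZE);

	if (!page)
		return NULL;

	set_memory_rox((unsigned long)page, 1);
	return page;
}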

Christophe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
                   ` (3 preceding siblings ...)
  2024-02-20 20:32 ` [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable Maxwell Bland
@ 2024-02-21  7:32 ` Christophe Leroy
  2024-02-21 17:57   ` Maxwell Bland
  2024-02-21 14:50 ` Conor Dooley
  5 siblings, 1 reply; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:32 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> Reworks ARM's virtual memory allocation infrastructure to support
> dynamic enforcement of page middle directory PXNTable restrictions
> rather than only during the initial memory mapping. Runtime enforcement
> of this bit prevents write-then-execute attacks, where malicious code is
> staged in vmalloc'd data regions, and later the page table is changed to
> make this code executable.
> 
> Previously the entire region from VMALLOC_START to VMALLOC_END was
> vulnerable, but now the vulnerable region is restricted to the 2GB
> reserved by module_alloc, a region which is generally read-only and more
> difficult to inject staging code into, e.g., data must pass the BPF
> verifier. These changes also set the stage for other systems, such as
> KVM-level (EL2) changes to mark page tables immutable and code page
> verification changes, forging a path toward complete mitigation of
> kernel exploits on ARM.
> 
> Implementing this required minimal changes to the generic vmalloc
> interface in the kernel to allow architecture overrides of some vmalloc
> wrapper functions, refactoring vmalloc calls to use a standard interface
> in the generic kernel, and passing the address parameter already passed
> into PTE allocation to the pte_allocate child function call.
> 
> The new arm64 vmalloc wrapper functions ensure vmalloc data is not
> allocated into the region reserved for module_alloc. arm64 BPF and
> kprobe code also see a two-line-change ensuring their allocations abide
> by the segmentation of code from data. Finally, arm64's pmd_populate
> function is modified to set the PXNTable bit appropriately.

On powerpc (book3s/32) we have more or less the same although it is not 
directly linked to PMDs: the virtual 4G address space is split into
segments of 256M. On each segment there's a bit called NX to forbid
execution. Vmalloc space is allocated in a segment with the NX bit set while
module space is allocated in a segment with the NX bit unset. We never have
to override vmalloc wrappers. All consumers of exec memory allocate it 
using module_alloc() while vmalloc() provides non-exec memory.
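
In other words the split lives in the allocator itself, roughly as below
(simplified; a real module_alloc also deals with things like KASLR
placement and fallback ranges):

/* code allocations are placed in the module region; plain vmalloc()
 * allocations land elsewhere and can stay non-executable. */
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL, 0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}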

For modules, all you have to do is select 
ARCH_WANTS_MODULES_DATA_IN_VMALLOC and module data will be allocated 
using vmalloc() hence non-exec memory in our case.

Christophe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-21  5:43   ` Christoph Hellwig
@ 2024-02-21  7:38     ` Christophe Leroy
  0 siblings, 0 replies; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:38 UTC (permalink / raw)
  To: Christoph Hellwig, Maxwell Bland
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, cl, daniel, dave.hansen, david, dennis, dvyukov,
	glider, gor, guoren, haoluo, hca, john.fastabend, jolsa,
	kasan-dev, kpsingh, linux-arch, linux, linux-efi, linux-kernel,
	linux-mm, linuxppc-dev, linux-riscv, linux-s390, lstoakes,
	mark.rutland, martin.lau, meted, michael.christie, mjguzik, mpe,
	mst, muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 21/02/2024 at 06:43, Christoph Hellwig wrote:
> On Tue, Feb 20, 2024 at 02:32:53PM -0600, Maxwell Bland wrote:
>> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
>> enforcing appropriate code and data separation untenable on certain
>> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
>> while the use of the vmalloc interface is non-monolithic: in particular,
>> appropriate randomness in ASLR makes it such that code regions must fall
>> in some region between VMALLOC_START and VMALLOC_END, but this
>> necessitates that code pages are intermingled with data pages, meaning
>> code-specific protections, such as arm64's PXNTable, cannot be
>> enforced performantly at runtime.
> 
> That's not actually true.  We have MODULE_START/END to separate them,
> which is used by mips only for now.

We have MODULES_VADDR and MODULES_END that are used by arm, arm64, 
loongarcg, powerpc, riscv, s390, sparc, x86_64

is_vmalloc_or_module_addr() is using MODULES_VADDR so I guess this 
function fails on mips?
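
For reference, the generic helper in mm/vmalloc.c is roughly the
following, so on an architecture that only defines MODULE_START/END the
module special case would indeed be compiled out:

int is_vmalloc_or_module_addr(const void *x)
{
	/* architectures that define MODULES_VADDR keep modules in a
	 * dedicated range outside the vmalloc area */
#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
	unsigned long addr = (unsigned long)kasan_reset_tag(x);

	if (addr >= MODULES_VADDR && addr < MODULES_END)
		return 1;
#endif
	return is_vmalloc_addr(x);
}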

> 
>>
>> The solution to this problem allows architectures to override the
>> vmalloc wrapper functions by enforcing that the rest of the kernel does
>> not reimplement __vmalloc_node by using __vmalloc_node_range with the
>> same parameters as __vmalloc_node or provides a __weak tag to those
>> functions using __vmalloc_node_range with parameters repeating those of
>> __vmalloc_node.
> 
> I'm really not too happy about overriding the functions.  Especially
> as the separation is a generally good idea and it would be good to
> move everyone (or at least all modern architectures) over to a scheme
> like this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-21  7:13   ` Christophe Leroy
@ 2024-02-21  9:27     ` David Hildenbrand
  2024-02-21 15:54       ` [External] " Maxwell Bland
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-02-21  9:27 UTC (permalink / raw)
  To: Christophe Leroy, Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler

On 21.02.24 08:13, Christophe Leroy wrote:
> 
> 
> On 20/02/2024 at 21:32, Maxwell Bland wrote:
>>
>> While other descriptors (e.g. pud) allow allocations conditional on
>> which virtual address is allocated, pmd descriptor allocations do not.
>> However, adding support for this is straightforward and is beneficial to
>> future kernel development targeting the PMD memory granularity.
>>
>> As many architectures already implement pmd_populate_kernel in an
>> address-generic manner, it is necessary to roll out support
>> incrementally. For this purpose a preprocessor flag,
> 
> Is it really worth it ? It is only 48 call sites that need to be
> updated. It would avoid that processor flag and avoid introducing that
> pmd_populate_kernel_at() in kernel core.

+1, let's avoid that if possible.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
                   ` (4 preceding siblings ...)
  2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
@ 2024-02-21 14:50 ` Conor Dooley
  2024-02-21 15:42   ` [External] " Maxwell Bland
  5 siblings, 1 reply; 17+ messages in thread
From: Conor Dooley @ 2024-02-21 14:50 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, christophe.leroy, cl, daniel, dave.hansen,
	david, dennis, dvyukov, glider, gor, guoren, haoluo, hca, hch,
	john.fastabend, jolsa, kasan-dev, kpsingh, linux-arch, linux,
	linux-efi, linux-kernel, linux-mm, linuxppc-dev, linux-riscv,
	linux-s390, lstoakes, mark.rutland, martin.lau, meted,
	michael.christie, mjguzik, mpe, mst, muchun.song, naveen.n.rao,
	npiggin, palmer, paul.walmsley, quic_nprakash, quic_pkondeti,
	rick.p.edgecombe, ryabinin.a.a, ryan.roberts, samitolvanen, sdf,
	song, surenb, svens, tj, urezki, vincenzo.frascino, will,
	wuqiang.matt, yonghong.song, zlim.lnx, awheeler


Hey Maxwell,

FYI:

>   mm/vmalloc: allow arch-specific vmalloc_node overrides
>   mm: pgalloc: support address-conditional pmd allocation

With these two, the arch/riscv/configs/* builds are broken by calls to
undeclared functions.

>   arm64: separate code and data virtual memory allocation
>   arm64: dynamic enforcement of pmd-level PXNTable

And with these two, the 32-bit and nommu builds are broken.

Cheers,
Conor.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [External] Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-21 14:50 ` Conor Dooley
@ 2024-02-21 15:42   ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 15:42 UTC (permalink / raw)
  To: Conor Dooley
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, christophe.leroy, cl, daniel, dave.hansen,
	david, dennis, dvyukov, glider, gor, guoren, haoluo, hca, hch,
	john.fastabend, jolsa, kasan-dev, kpsingh, linux-arch, linux,
	linux-efi, linux-kernel, linux-mm, linuxppc-dev, linux-riscv,
	linux-s390, lstoakes, mark.rutland, martin.lau, meted,
	michael.christie, mjguzik, mpe, mst, muchun.song, naveen.n.rao,
	npiggin, palmer, paul.walmsley, quic_nprakash, quic_pkondeti,
	rick.p.edgecombe, ryabinin.a.a, ryan.roberts, samitolvanen, sdf,
	song, surenb, svens, tj, urezki, vincenzo.frascino, will,
	wuqiang.matt, yonghong.song, zlim.lnx, Andrew Wheeler

> From: Conor Dooley <conor@kernel.org>
> FYI:
> 
> >   mm/vmalloc: allow arch-specific vmalloc_node overrides
> >   mm: pgalloc: support address-conditional pmd allocation
> 
> With these two arch/riscv/configs/* are broken with calls to undeclared
> functions.

Will fix, thanks! I will also figure out how to make sure this doesn't
happen again on other architectures.

> >   arm64: separate code and data virtual memory allocation
> >   arm64: dynamic enforcement of pmd-level PXNTable
> 
> And with these two the 32-bit and nommu builds are broken.

Was not aware there was a dependency here. I will see what I can do.

Thank you,
Maxwell

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [External] Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-21  9:27     ` David Hildenbrand
@ 2024-02-21 15:54       ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 15:54 UTC (permalink / raw)
  To: David Hildenbrand, Christophe Leroy, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	Andrew Wheeler

> On February 21, 2024 3:27 AM David Hildenbrand wrote
> On 21.02.24 08:13, Christophe Leroy wrote:
> > On 20/02/2024 at 21:32, Maxwell Bland wrote:
> >>
> >> While other descriptors (e.g. pud) allow allocations conditional on
> >> which virtual address is allocated, pmd descriptor allocations do not.
> >> However, adding support for this is straightforward and is beneficial to
> >> future kernel development targeting the PMD memory granularity.
> >>
> >> As many architectures already implement pmd_populate_kernel in an
> >> address-generic manner, it is necessary to roll out support
> >> incrementally. For this purpose a preprocessor flag,
> >
> > Is it really worth it ? It is only 48 call sites that need to be
> > updated. It would avoid that processor flag and avoid introducing that
> > pmd_populate_kernel_at() in kernel core.
> 
> +1, let's avoid that if possible.

Will fix, thank you!

Maxwell

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-21  6:59   ` Christophe Leroy
@ 2024-02-21 17:19     ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 17:19 UTC (permalink / raw)
  To: Christophe Leroy, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	Andrew Wheeler

> On Wednesday, February 21, 2024 12:59 AM, Christophe Leroy wrote:
> 
> In the code you add __weak for that. But you also add the flags to the
> parameters and I can't understand why when reading the above description.

This change was made to let most kernel interfaces use vmalloc_node and
enable the overrides to work. It also reduces the number of kernel locations
that would need to be changed if the vmalloc_node_range interface ever
changed.

However, there is pushback against overriding the vmalloc interface, so this
change will likely not show up in my final patch.

Regards,
Maxwell


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
@ 2024-02-21 17:57   ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 17:57 UTC (permalink / raw)
  To: Christophe Leroy, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	Andrew Wheeler

> On Wednesday, February 21, 2024 at 1:32 AM, Christophe Leroy wrote:
> 
> On powerpc (book3s/32) we have more or less the same although it is not
> directly linked to PMDs: the virtual 4G address space is split into
> segments of 256M. On each segment there's a bit called NX to forbid
> execution. Vmalloc space is allocated in a segment with the NX bit set while
> module space is allocated in a segment with the NX bit unset. We never have
> to override vmalloc wrappers. All consumers of exec memory allocate it
> using module_alloc() while vmalloc() provides non-exec memory.
> 
> For modules, all you have to do is select
> ARCH_WANTS_MODULES_DATA_IN_VMALLOC and module data will be allocated
> using vmalloc() hence non-exec memory in our case.

This critique has led me to some valuable ideas, and I can definitely find a simpler
approach without overrides.

I do want to mention that changes to how the VMALLOC_* and MODULE_* constants
are used on arm64 may introduce other issues. See the discussion and code on
the patch that motivated this series at:

https://lore.kernel.org/all/CAP5Mv+ydhk=Ob4b40ZahGMgT-5+-VEHxtmA=-LkJiEOOU+K6hw@mail.gmail.com/

In short, the issue of code/data intermixing may require a rework of arm64's
memory infrastructure, but I see a potentially elegant solution here based on
the comments given on this patch.

Thanks,
Maxwell


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-02-21 17:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
2024-02-21  5:43   ` Christoph Hellwig
2024-02-21  7:38     ` Christophe Leroy
2024-02-21  6:59   ` Christophe Leroy
2024-02-21 17:19     ` Maxwell Bland
2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
2024-02-21  7:13   ` Christophe Leroy
2024-02-21  9:27     ` David Hildenbrand
2024-02-21 15:54       ` [External] " Maxwell Bland
2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
2024-02-21  7:20   ` Christophe Leroy
2024-02-20 20:32 ` [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable Maxwell Bland
2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
2024-02-21 17:57   ` Maxwell Bland
2024-02-21 14:50 ` Conor Dooley
2024-02-21 15:42   ` [External] " Maxwell Bland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).