* [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN
@ 2022-06-13 14:45 Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file Ard Biesheuvel
                   ` (26 more replies)
  0 siblings, 27 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

[ TL;DR this series does the following:
  - move variable definitions and assignments out of early asm code
    where possible, and get rid of explicit cache maintenance;
  - convert initial ID map so it covers the entire loaded image as well
    as the DT blob;
  - create the kernel mapping only once instead of twice (for KASLR),
    and do it with the MMU and caches on;
  - avoid mappings that are both writable and executable entirely;
  - avoid parsing the DT while the kernel text and rodata are still
    mapped writable;
  - allow WXN to be enabled (with an opt-out) so writable mappings are
    never executable. ]

This is a follow-up to a previous series of mine [0][1], and it aims to
streamline the boot flow with respect to cache maintenance and redundant
copying of data in memory, as well as to eliminate writable, executable
mappings at any time during or after boot.

Additionally, this series removes the little dance we do to create a
kernel mapping, relocate the kernel, run the KASLR init code, tear down
the old mapping and create a new one, relocate the kernel again, and
finally enter the kernel proper. Instead, it invokes a minimal C
function 'kaslr_early_init()' while running from the ID map, which
includes a temporary mapping of the FDT. This change represents a
substantial chunk of the diffstat, as it requires some work to
instantiate code that can run safely from an arbitrary load address.

The WXN support was tested using a Debian bullseye mixed AArch64/armhf
user space, running Gnome shell, Chromium, Firefox, LibreOffice, etc.
Some minimal tweaks are needed to avoid entries appearing in the kernel
log regarding attempts from user space to create PROT_EXEC+PROT_WRITE
mappings. In most cases (libffi, for instance), the library in question
already carries a workaround, but did not enable it by default because
it did not detect selinux as being active (which it was not, in this
case).
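
For illustration, a minimal user space sketch of the behaviour described
above (not part of this series; the exact error returned when such a
request is rejected depends on the kernel, so only failure is checked):

/*
 * With WXN in effect, a request for a mapping that is both writable and
 * executable is rejected outright rather than silently downgraded.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                perror("mmap(PROT_WRITE|PROT_EXEC)");   /* expected with WXN */
        else
                munmap(p, 4096);
        return 0;
}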

Changes since v3:
- drop changes for entering with the MMU enabled for now;
- reject mmap() and mprotect() calls with PROT_WRITE and PROT_EXEC flags
  passed when WXN is in effect; this essentially matches the behavior of
  both selinux and PaX, and most distros (including Android) can already
  deal with this just fine;
- defer KASLR initialization to an initcall() to the extent possible.

Changes since v2:
- create a separate, initial ID map that is discarded after boot, and
  create the permanent ID map from C code using the ordinary memory
  mapping code;
- refactor the extended ID map handling, and along with it, simplify the
  early memory mapping macros, so that we can deal with an extended ID
  map that requires multiple table entries at intermediate levels;
- eliminate all variable assignments with the MMU off from the happy
  flow;
- replace temporary FDT mapping in TTBR1 with a FDT mapping in the
  initial ID map;
- use read-only attributes for all code mappings, so we can boot with
  WXN enabled if we elect to do so.

Changes since v1:
- Remove the dodgy handling of the KASLR seed, which was necessary to
  avoid doing two iterations of the setup/teardown of the page tables.
  This is now dealt with by creating the TTBR1 page tables while
  executing from TTBR0, and so all memory manipulations are still done
  with the MMU and caches on.
- Only boot from EFI with the MMU and caches on if the image was not
  moved around in memory. Otherwise, we cannot rely on the firmware's ID
  map to have created an executable mapping for the copied code.

[0] https://lore.kernel.org/all/20220304175657.2744400-1-ardb@kernel.org/
[1] https://lore.kernel.org/all/20220330154205.2483167-1-ardb@kernel.org/
[2] https://lore.kernel.org/all/20220411094824.4176877-1-ardb@kernel.org/

Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>

Ard Biesheuvel (26):
  arm64: head: move kimage_vaddr variable into C file
  arm64: mm: make vabits_actual a build time constant if possible
  arm64: head: move assignment of idmap_t0sz to C code
  arm64: head: drop idmap_ptrs_per_pgd
  arm64: head: simplify page table mapping macros (slightly)
  arm64: head: switch to map_memory macro for the extended ID map
  arm64: head: split off idmap creation code
  arm64: kernel: drop unnecessary PoC cache clean+invalidate
  arm64: head: pass ID map root table address to __enable_mmu()
  arm64: mm: provide idmap pointer to cpu_replace_ttbr1()
  arm64: head: add helper function to remap regions in early page tables
  arm64: head: cover entire kernel image in initial ID map
  arm64: head: use relative references to the RELA and RELR tables
  arm64: head: create a temporary FDT mapping in the initial ID map
  arm64: idreg-override: use early FDT mapping in ID map
  arm64: head: factor out TTBR1 assignment into a macro
  arm64: head: populate kernel page tables with MMU and caches on
  arm64: head: record CPU boot mode after enabling the MMU
  arm64: kaslr: defer initialization to late initcall where permitted
  arm64: head: avoid relocating the kernel twice for KASLR
  arm64: setup: drop early FDT pointer helpers
  arm64: mm: move ro_after_init section into the data segment
  arm64: head: remap the kernel text/inittext region read-only
  mm: add arch hook to validate mmap() prot flags
  arm64: mm: add support for WXN memory translation attribute
  arm64: kernel: move ID map out of .text mapping

 arch/arm64/Kconfig                      |  11 +
 arch/arm64/include/asm/assembler.h      |  37 +-
 arch/arm64/include/asm/cpufeature.h     |   8 +
 arch/arm64/include/asm/kernel-pgtable.h |  18 +-
 arch/arm64/include/asm/memory.h         |   4 +
 arch/arm64/include/asm/mman.h           |  36 ++
 arch/arm64/include/asm/mmu_context.h    |  46 +-
 arch/arm64/include/asm/setup.h          |   3 -
 arch/arm64/kernel/Makefile              |   2 +-
 arch/arm64/kernel/cpufeature.c          |   2 +-
 arch/arm64/kernel/head.S                | 536 ++++++++++----------
 arch/arm64/kernel/hyp-stub.S            |   4 +-
 arch/arm64/kernel/idreg-override.c      |  33 +-
 arch/arm64/kernel/image-vars.h          |   4 +
 arch/arm64/kernel/kaslr.c               | 149 +-----
 arch/arm64/kernel/pi/Makefile           |  33 ++
 arch/arm64/kernel/pi/kaslr_early.c      | 112 ++++
 arch/arm64/kernel/setup.c               |  15 -
 arch/arm64/kernel/sleep.S               |   1 +
 arch/arm64/kernel/suspend.c             |   2 +-
 arch/arm64/kernel/vmlinux.lds.S         |  63 ++-
 arch/arm64/mm/kasan_init.c              |   4 +-
 arch/arm64/mm/mmu.c                     |  96 +++-
 arch/arm64/mm/proc.S                    |  29 +-
 include/linux/mman.h                    |  15 +
 mm/mmap.c                               |   3 +
 26 files changed, 763 insertions(+), 503 deletions(-)
 create mode 100644 arch/arm64/kernel/pi/Makefile
 create mode 100644 arch/arm64/kernel/pi/kaslr_early.c

-- 
2.30.2


* [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-14  8:26   ` Anshuman Khandual
  2022-06-13 14:45 ` [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible Ard Biesheuvel
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

This variable definition does not need to be in head.S, so move it out.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S | 7 -------
 arch/arm64/mm/mmu.c      | 3 +++
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 6a98f1a38c29..1cdecce552bb 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -469,13 +469,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	ASM_BUG()
 SYM_FUNC_END(__primary_switched)
 
-	.pushsection ".rodata", "a"
-SYM_DATA_START(kimage_vaddr)
-	.quad		_text
-SYM_DATA_END(kimage_vaddr)
-EXPORT_SYMBOL(kimage_vaddr)
-	.popsection
-
 /*
  * end early head section, begin head code that is also used for
  * hotplug and needs to have the same protections as the text region
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index c5563ff990da..7148928e3932 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -49,6 +49,9 @@ u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
 u64 __section(".mmuoff.data.write") vabits_actual;
 EXPORT_SYMBOL(vabits_actual);
 
+u64 kimage_vaddr __ro_after_init = (u64)&_text;
+EXPORT_SYMBOL(kimage_vaddr);
+
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
 
-- 
2.30.2


* [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-14  8:25   ` Anshuman Khandual
  2022-06-13 14:45 ` [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code Ard Biesheuvel
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Currently, we only support 52-bit virtual addressing on 64k page
configurations, and in all other cases, vabits_actual is guaranteed to
equal VA_BITS (== VA_BITS_MIN). So get rid of the variable entirely in
that case.

While at it, move the assignment out of the asm entry code - it has no
need to be there.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/memory.h |  4 ++++
 arch/arm64/kernel/head.S        | 15 +--------------
 arch/arm64/mm/mmu.c             | 15 ++++++++++++++-
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0af70d9abede..c751cd9b94f8 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -174,7 +174,11 @@
 #include <linux/types.h>
 #include <asm/bug.h>
 
+#if VA_BITS > 48
 extern u64			vabits_actual;
+#else
+#define vabits_actual		((u64)VA_BITS)
+#endif
 
 extern s64			memstart_addr;
 /* PHYS_OFFSET - the physical address of the start of memory. */
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 1cdecce552bb..dc07858eb673 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -293,19 +293,6 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	adrp	x0, idmap_pg_dir
 	adrp	x3, __idmap_text_start		// __pa(__idmap_text_start)
 
-#ifdef CONFIG_ARM64_VA_BITS_52
-	mrs_s	x6, SYS_ID_AA64MMFR2_EL1
-	and	x6, x6, #(0xf << ID_AA64MMFR2_LVA_SHIFT)
-	mov	x5, #52
-	cbnz	x6, 1f
-#endif
-	mov	x5, #VA_BITS_MIN
-1:
-	adr_l	x6, vabits_actual
-	str	x5, [x6]
-	dmb	sy
-	dc	ivac, x6		// Invalidate potentially stale cache line
-
 	/*
 	 * VA_BITS may be too small to allow for an ID mapping to be created
 	 * that covers system RAM if that is located sufficiently high in the
@@ -713,7 +700,7 @@ SYM_FUNC_START(__enable_mmu)
 SYM_FUNC_END(__enable_mmu)
 
 SYM_FUNC_START(__cpu_secondary_check52bitva)
-#ifdef CONFIG_ARM64_VA_BITS_52
+#if VA_BITS > 48
 	ldr_l	x0, vabits_actual
 	cmp	x0, #52
 	b.ne	2f
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7148928e3932..17b339c1a326 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -46,8 +46,10 @@
 u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
 u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
 
-u64 __section(".mmuoff.data.write") vabits_actual;
+#if VA_BITS > 48
+u64 vabits_actual __ro_after_init = VA_BITS_MIN;
 EXPORT_SYMBOL(vabits_actual);
+#endif
 
 u64 kimage_vaddr __ro_after_init = (u64)&_text;
 EXPORT_SYMBOL(kimage_vaddr);
@@ -772,6 +774,17 @@ void __init paging_init(void)
 {
 	pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
 
+#if VA_BITS > 48
+	if (cpuid_feature_extract_unsigned_field(
+				read_sysreg_s(SYS_ID_AA64MMFR2_EL1),
+				ID_AA64MMFR2_LVA_SHIFT))
+		vabits_actual = VA_BITS;
+
+	/* make the variable visible to secondaries with the MMU off */
+	dcache_clean_inval_poc((u64)&vabits_actual,
+			       (u64)&vabits_actual + sizeof(vabits_actual));
+#endif
+
 	map_kernel(pgdp);
 	map_mem(pgdp);
 
-- 
2.30.2


* [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-14  9:22   ` Anshuman Khandual
  2022-06-24 12:36   ` Will Deacon
  2022-06-13 14:45 ` [PATCH v4 04/26] arm64: head: drop idmap_ptrs_per_pgd Ard Biesheuvel
                   ` (23 subsequent siblings)
  26 siblings, 2 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Setting idmap_t0sz involves fiddling with the caches if done with the
MMU off. Since we will be creating an initial ID map with the MMU and
caches off, and the permanent ID map with the MMU and caches on, let's
move this assignment of idmap_t0sz out of the startup code, and replace
it with a macro that simply issues the three instructions needed to
calculate the value wherever it is needed before the MMU is turned on.
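
For illustration, the clamping behaviour of the macro can be modelled in
C as follows (a standalone sketch with made-up example values, not part
of the patch):

/*
 * ORing in (1 << VA_BITS_MIN) - 1 clamps the result so that T0SZ never
 * exceeds 64 - VA_BITS_MIN; CLZ then yields 64 minus the number of
 * address bits the ID map has to cover.
 */
#include <stdio.h>

static unsigned int idmap_t0sz_for(unsigned long long pa_end,
                                   unsigned int va_bits_min)
{
        unsigned long long v = pa_end | ((1ULL << va_bits_min) - 1);

        return __builtin_clzll(v);      /* T0SZ == 64 - #bits used */
}

int main(void)
{
        /* image ends below 2^48: T0SZ stays at 64 - 48 == 16 */
        printf("%u\n", idmap_t0sz_for(0xbadcafe000ULL, 48));
        /* image ends above 2^48 (52-bit PA): T0SZ becomes 12 */
        printf("%u\n", idmap_t0sz_for(0x8000000000000ULL, 48));
        return 0;
}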

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/assembler.h   | 14 ++++++++++++++
 arch/arm64/include/asm/mmu_context.h |  2 +-
 arch/arm64/kernel/head.S             | 13 +------------
 arch/arm64/mm/mmu.c                  |  5 ++++-
 arch/arm64/mm/proc.S                 |  2 +-
 5 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 8c5a61aeaf8e..9468f45c07a6 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -359,6 +359,20 @@ alternative_cb_end
 	bfi	\valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
 	.endm
 
+/*
+ * idmap_get_t0sz - get the T0SZ value needed to cover the ID map
+ *
+ * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
+ * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
+ * this number conveniently equals the number of leading zeroes in
+ * the physical address of _end.
+ */
+	.macro	idmap_get_t0sz, reg
+	adrp	\reg, _end
+	orr	\reg, \reg, #(1 << VA_BITS_MIN) - 1
+	clz	\reg, \reg
+	.endm
+
 /*
  * tcr_compute_pa_size - set TCR.(I)PS to the highest supported
  * ID_AA64MMFR0_EL1.PARange value
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 6770667b34a3..6ac0086ebb1a 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -60,7 +60,7 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
  * TCR_T0SZ(VA_BITS), unless system RAM is positioned very high in
  * physical memory, in which case it will be smaller.
  */
-extern u64 idmap_t0sz;
+extern int idmap_t0sz;
 extern u64 idmap_ptrs_per_pgd;
 
 /*
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index dc07858eb673..7f361bc72d12 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -299,22 +299,11 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	 * physical address space. So for the ID map, use an extended virtual
 	 * range in that case, and configure an additional translation level
 	 * if needed.
-	 *
-	 * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
-	 * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
-	 * this number conveniently equals the number of leading zeroes in
-	 * the physical address of __idmap_text_end.
 	 */
-	adrp	x5, __idmap_text_end
-	clz	x5, x5
+	idmap_get_t0sz x5
 	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
 	b.ge	1f			// .. then skip VA range extension
 
-	adr_l	x6, idmap_t0sz
-	str	x5, [x6]
-	dmb	sy
-	dc	ivac, x6		// Invalidate potentially stale cache line
-
 #if (VA_BITS < 48)
 #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
 #define EXTRA_PTRS	(1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17b339c1a326..103bf4ae408d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -43,7 +43,7 @@
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
 
-u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
+int idmap_t0sz __ro_after_init;
 u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
 
 #if VA_BITS > 48
@@ -785,6 +785,9 @@ void __init paging_init(void)
 			       (u64)&vabits_actual + sizeof(vabits_actual));
 #endif
 
+	idmap_t0sz = min(63UL - __fls(__pa_symbol(_end)),
+			 TCR_T0SZ(VA_BITS_MIN));
+
 	map_kernel(pgdp);
 	map_mem(pgdp);
 
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 972ce8d7f2c5..97cd67697212 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -470,7 +470,7 @@ SYM_FUNC_START(__cpu_setup)
 	add		x9, x9, #64
 	tcr_set_t1sz	tcr, x9
 #else
-	ldr_l		x9, idmap_t0sz
+	idmap_get_t0sz	x9
 #endif
 	tcr_set_t0sz	tcr, x9
 
-- 
2.30.2


* [PATCH v4 04/26] arm64: head: drop idmap_ptrs_per_pgd
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-15  4:07   ` Anshuman Khandual
  2022-06-13 14:45 ` [PATCH v4 05/26] arm64: head: simplify page table mapping macros (slightly) Ard Biesheuvel
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

The assignment of idmap_ptrs_per_pgd lacks any cache invalidation, even
though it is updated with the MMU and caches disabled. However, we never
bother to read the value again except in the very next instruction, and
so we can just drop the variable entirely.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/mmu_context.h | 1 -
 arch/arm64/kernel/head.S             | 7 +++----
 arch/arm64/mm/mmu.c                  | 1 -
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 6ac0086ebb1a..7b387c3b312a 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -61,7 +61,6 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
  * physical memory, in which case it will be smaller.
  */
 extern int idmap_t0sz;
-extern u64 idmap_ptrs_per_pgd;
 
 /*
  * Ensure TCR.T0SZ is set to the provided value.
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 7f361bc72d12..53126a35d73c 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -300,6 +300,7 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	 * range in that case, and configure an additional translation level
 	 * if needed.
 	 */
+	mov	x4, #PTRS_PER_PGD
 	idmap_get_t0sz x5
 	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
 	b.ge	1f			// .. then skip VA range extension
@@ -319,18 +320,16 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 #error "Mismatch between VA_BITS and page size/number of translation levels"
 #endif
 
-	mov	x4, EXTRA_PTRS
-	create_table_entry x0, x3, EXTRA_SHIFT, x4, x5, x6
+	mov	x2, EXTRA_PTRS
+	create_table_entry x0, x3, EXTRA_SHIFT, x2, x5, x6
 #else
 	/*
 	 * If VA_BITS == 48, we don't have to configure an additional
 	 * translation level, but the top-level table has more entries.
 	 */
 	mov	x4, #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT)
-	str_l	x4, idmap_ptrs_per_pgd, x5
 #endif
 1:
-	ldr_l	x4, idmap_ptrs_per_pgd
 	adr_l	x6, __idmap_text_end		// __pa(__idmap_text_end)
 
 	map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 103bf4ae408d..0f95c91e5a8e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -44,7 +44,6 @@
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
 
 int idmap_t0sz __ro_after_init;
-u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
 
 #if VA_BITS > 48
 u64 vabits_actual __ro_after_init = VA_BITS_MIN;
-- 
2.30.2


* [PATCH v4 05/26] arm64: head: simplify page table mapping macros (slightly)
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 04/26] arm64: head: drop idmap_ptrs_per_pgd Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 06/26] arm64: head: switch to map_memory macro for the extended ID map Ard Biesheuvel
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Simplify the macros in head.S that are used to set up the early page
tables, by switching to immediates for the number of bits that are
interpreted as the table index at each level. This makes it much
easier to infer from the instruction stream what is going on, and
reduces the number of instructions emitted substantially.

Note that the extended ID map for cases where no additional level needs
to be configured now uses a compile-time size as well, which means that
we interpret up to 10 bits as the table index at the root level (for
52-bit physical addressing), without taking into account whether or not
this is supported on the current system. However, those bits can only be
set if we are executing the image from an address that exceeds the
48-bit PA range, and they are guaranteed to be cleared otherwise. Given
that we are dealing with a mapping in the lower TTBR0 range of the
address space, the result is therefore the same as if we had masked off
only 6 bits.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S | 55 ++++++++------------
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 53126a35d73c..9fdde2f9cc0f 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -179,31 +179,20 @@ SYM_CODE_END(preserve_boot_args)
  *	vstart:	virtual address of start of range
  *	vend:	virtual address of end of range - we map [vstart, vend]
  *	shift:	shift used to transform virtual address into index
- *	ptrs:	number of entries in page table
+ *	order:  #imm 2log(number of entries in page table)
  *	istart:	index in table corresponding to vstart
  *	iend:	index in table corresponding to vend
  *	count:	On entry: how many extra entries were required in previous level, scales
  *			  our end index.
  *		On exit: returns how many extra entries required for next page table level
  *
- * Preserves:	vstart, vend, shift, ptrs
+ * Preserves:	vstart, vend
  * Returns:	istart, iend, count
  */
-	.macro compute_indices, vstart, vend, shift, ptrs, istart, iend, count
-	lsr	\iend, \vend, \shift
-	mov	\istart, \ptrs
-	sub	\istart, \istart, #1
-	and	\iend, \iend, \istart	// iend = (vend >> shift) & (ptrs - 1)
-	mov	\istart, \ptrs
-	mul	\istart, \istart, \count
-	add	\iend, \iend, \istart	// iend += count * ptrs
-					// our entries span multiple tables
-
-	lsr	\istart, \vstart, \shift
-	mov	\count, \ptrs
-	sub	\count, \count, #1
-	and	\istart, \istart, \count
-
+	.macro compute_indices, vstart, vend, shift, order, istart, iend, count
+	ubfx	\istart, \vstart, \shift, \order
+	ubfx	\iend, \vend, \shift, \order
+	add	\iend, \iend, \count, lsl \order
 	sub	\count, \iend, \istart
 	.endm
 
@@ -218,38 +207,39 @@ SYM_CODE_END(preserve_boot_args)
  *	vend:	virtual address of end of range - we map [vstart, vend - 1]
  *	flags:	flags to use to map last level entries
  *	phys:	physical address corresponding to vstart - physical memory is contiguous
- *	pgds:	the number of pgd entries
+ *	order:  #imm 2log(number of entries in PGD table)
  *
  * Temporaries:	istart, iend, tmp, count, sv - these need to be different registers
  * Preserves:	vstart, flags
  * Corrupts:	tbl, rtbl, vend, istart, iend, tmp, count, sv
  */
-	.macro map_memory, tbl, rtbl, vstart, vend, flags, phys, pgds, istart, iend, tmp, count, sv
+	.macro map_memory, tbl, rtbl, vstart, vend, flags, phys, order, istart, iend, tmp, count, sv
 	sub \vend, \vend, #1
 	add \rtbl, \tbl, #PAGE_SIZE
-	mov \sv, \rtbl
 	mov \count, #0
-	compute_indices \vstart, \vend, #PGDIR_SHIFT, \pgds, \istart, \iend, \count
+
+	compute_indices \vstart, \vend, #PGDIR_SHIFT, #\order, \istart, \iend, \count
+	mov \sv, \rtbl
 	populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
 	mov \tbl, \sv
-	mov \sv, \rtbl
 
 #if SWAPPER_PGTABLE_LEVELS > 3
-	compute_indices \vstart, \vend, #PUD_SHIFT, #PTRS_PER_PUD, \istart, \iend, \count
+	compute_indices \vstart, \vend, #PUD_SHIFT, #(PAGE_SHIFT - 3), \istart, \iend, \count
+	mov \sv, \rtbl
 	populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
 	mov \tbl, \sv
-	mov \sv, \rtbl
 #endif
 
 #if SWAPPER_PGTABLE_LEVELS > 2
-	compute_indices \vstart, \vend, #SWAPPER_TABLE_SHIFT, #PTRS_PER_PMD, \istart, \iend, \count
+	compute_indices \vstart, \vend, #SWAPPER_TABLE_SHIFT, #(PAGE_SHIFT - 3), \istart, \iend, \count
+	mov \sv, \rtbl
 	populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
 	mov \tbl, \sv
 #endif
 
-	compute_indices \vstart, \vend, #SWAPPER_BLOCK_SHIFT, #PTRS_PER_PTE, \istart, \iend, \count
-	bic \count, \phys, #SWAPPER_BLOCK_SIZE - 1
-	populate_entries \tbl, \count, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
+	compute_indices \vstart, \vend, #SWAPPER_BLOCK_SHIFT, #(PAGE_SHIFT - 3), \istart, \iend, \count
+	bic \rtbl, \phys, #SWAPPER_BLOCK_SIZE - 1
+	populate_entries \tbl, \rtbl, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
 	.endm
 
 /*
@@ -300,12 +290,12 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	 * range in that case, and configure an additional translation level
 	 * if needed.
 	 */
-	mov	x4, #PTRS_PER_PGD
 	idmap_get_t0sz x5
 	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
 	b.ge	1f			// .. then skip VA range extension
 
 #if (VA_BITS < 48)
+#define IDMAP_PGD_ORDER	(VA_BITS - PGDIR_SHIFT)
 #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
 #define EXTRA_PTRS	(1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
 
@@ -323,16 +313,16 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	mov	x2, EXTRA_PTRS
 	create_table_entry x0, x3, EXTRA_SHIFT, x2, x5, x6
 #else
+#define IDMAP_PGD_ORDER	(PHYS_MASK_SHIFT - PGDIR_SHIFT)
 	/*
 	 * If VA_BITS == 48, we don't have to configure an additional
 	 * translation level, but the top-level table has more entries.
 	 */
-	mov	x4, #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT)
 #endif
 1:
 	adr_l	x6, __idmap_text_end		// __pa(__idmap_text_end)
 
-	map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14
+	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14
 
 	/*
 	 * Map the kernel image (starting with PHYS_OFFSET).
@@ -340,13 +330,12 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	adrp	x0, init_pg_dir
 	mov_q	x5, KIMAGE_VADDR		// compile time __va(_text)
 	add	x5, x5, x23			// add KASLR displacement
-	mov	x4, PTRS_PER_PGD
 	adrp	x6, _end			// runtime __pa(_end)
 	adrp	x3, _text			// runtime __pa(_text)
 	sub	x6, x6, x3			// _end - _text
 	add	x6, x6, x5			// runtime __va(_end)
 
-	map_memory x0, x1, x5, x6, x7, x3, x4, x10, x11, x12, x13, x14
+	map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
 
 	/*
 	 * Since the page tables have been populated with non-cacheable
-- 
2.30.2


* [PATCH v4 06/26] arm64: head: switch to map_memory macro for the extended ID map
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 05/26] arm64: head: simplify page table mapping macros (slightly) Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 07/26] arm64: head: split off idmap creation code Ard Biesheuvel
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

In a future patch, we will start using an ID map that covers the entire
image, rather than a single page. This means that we need to deal with
the pathological case of an extended ID map where the kernel image does
not fit neatly inside a single entry at the root level, and so we will
need to create additional table entries and map additional pages for
page tables.

The existing map_memory macro already takes care of most of that, so
let's just extend it to deal with this case as well. While at it, drop
the conditional branch on the value of T0SZ: we don't set the variable
anymore in the entry code, and so we can just let the map_memory macro
deal with the case where the output address exceeds VA_BITS.
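
For illustration, the condition under which the new 'extra_shift' path
kicks in can be expressed as follows (a sketch with illustrative
numbers, not part of the patch):

/*
 * The extra level is only instantiated when the (inclusive) end of the
 * region does not fit in 'extra_shift' bits; e.g. with 4k pages and
 * VA_BITS == 39, EXTRA_SHIFT is 39, so the extra level is only needed
 * when the image is loaded at or above 2^39 (512 GiB).
 */
static int needs_extra_level(unsigned long long vend, unsigned int extra_shift)
{
        return (vend & ~((1ULL << extra_shift) - 1)) != 0;  /* tst + b.eq */
}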

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S | 76 ++++++++++----------
 1 file changed, 37 insertions(+), 39 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 9fdde2f9cc0f..eb54c0289c8a 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -122,29 +122,6 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
 	b	dcache_inval_poc		// tail call
 SYM_CODE_END(preserve_boot_args)
 
-/*
- * Macro to create a table entry to the next page.
- *
- *	tbl:	page table address
- *	virt:	virtual address
- *	shift:	#imm page table shift
- *	ptrs:	#imm pointers per table page
- *
- * Preserves:	virt
- * Corrupts:	ptrs, tmp1, tmp2
- * Returns:	tbl -> next level table page address
- */
-	.macro	create_table_entry, tbl, virt, shift, ptrs, tmp1, tmp2
-	add	\tmp1, \tbl, #PAGE_SIZE
-	phys_to_pte \tmp2, \tmp1
-	orr	\tmp2, \tmp2, #PMD_TYPE_TABLE	// address of next table and entry type
-	lsr	\tmp1, \virt, #\shift
-	sub	\ptrs, \ptrs, #1
-	and	\tmp1, \tmp1, \ptrs		// table index
-	str	\tmp2, [\tbl, \tmp1, lsl #3]
-	add	\tbl, \tbl, #PAGE_SIZE		// next level table page
-	.endm
-
 /*
  * Macro to populate page table entries, these entries can be pointers to the next level
  * or last level entries pointing to physical memory.
@@ -209,15 +186,27 @@ SYM_CODE_END(preserve_boot_args)
  *	phys:	physical address corresponding to vstart - physical memory is contiguous
  *	order:  #imm 2log(number of entries in PGD table)
  *
+ * If extra_shift is set, an extra level will be populated if the end address does
+ * not fit in 'extra_shift' bits. This assumes vend is in the TTBR0 range.
+ *
  * Temporaries:	istart, iend, tmp, count, sv - these need to be different registers
  * Preserves:	vstart, flags
  * Corrupts:	tbl, rtbl, vend, istart, iend, tmp, count, sv
  */
-	.macro map_memory, tbl, rtbl, vstart, vend, flags, phys, order, istart, iend, tmp, count, sv
+	.macro map_memory, tbl, rtbl, vstart, vend, flags, phys, order, istart, iend, tmp, count, sv, extra_shift
 	sub \vend, \vend, #1
 	add \rtbl, \tbl, #PAGE_SIZE
 	mov \count, #0
 
+	.ifnb	\extra_shift
+	tst	\vend, #~((1 << (\extra_shift)) - 1)
+	b.eq	.L_\@
+	compute_indices \vstart, \vend, #\extra_shift, #(PAGE_SHIFT - 3), \istart, \iend, \count
+	mov \sv, \rtbl
+	populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
+	mov \tbl, \sv
+	.endif
+.L_\@:
 	compute_indices \vstart, \vend, #PGDIR_SHIFT, #\order, \istart, \iend, \count
 	mov \sv, \rtbl
 	populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
@@ -284,20 +273,32 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	adrp	x3, __idmap_text_start		// __pa(__idmap_text_start)
 
 	/*
-	 * VA_BITS may be too small to allow for an ID mapping to be created
-	 * that covers system RAM if that is located sufficiently high in the
-	 * physical address space. So for the ID map, use an extended virtual
-	 * range in that case, and configure an additional translation level
-	 * if needed.
+	 * The ID map carries a 1:1 mapping of the physical address range
+	 * covered by the loaded image, which could be anywhere in DRAM. This
+	 * means that the required size of the VA (== PA) space is decided at
+	 * boot time, and could be more than the configured size of the VA
+	 * space for ordinary kernel and user space mappings.
+	 *
+	 * There are three cases to consider here:
+	 * - 39 <= VA_BITS < 48, and the ID map needs up to 48 VA bits to cover
+	 *   the placement of the image. In this case, we configure one extra
+	 *   level of translation on the fly for the ID map only. (This case
+	 *   also covers 42-bit VA/52-bit PA on 64k pages).
+	 *
+	 * - VA_BITS == 48, and the ID map needs more than 48 VA bits. This can
+	 *   only happen when using 64k pages, in which case we need to extend
+	 *   the root level table rather than add a level. Note that we can
+	 *   treat this case as 'always extended' as long as we take care not
+	 *   to program an unsupported T0SZ value into the TCR register.
+	 *
+	 * - Combinations that would require two additional levels of
+	 *   translation are not supported, e.g., VA_BITS==36 on 16k pages, or
+	 *   VA_BITS==39/4k pages with 5-level paging, where the input address
+	 *   requires more than 47 or 48 bits, respectively.
 	 */
-	idmap_get_t0sz x5
-	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
-	b.ge	1f			// .. then skip VA range extension
-
 #if (VA_BITS < 48)
 #define IDMAP_PGD_ORDER	(VA_BITS - PGDIR_SHIFT)
 #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
-#define EXTRA_PTRS	(1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
 
 	/*
 	 * If VA_BITS < 48, we have to configure an additional table level.
@@ -309,20 +310,17 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 #if VA_BITS != EXTRA_SHIFT
 #error "Mismatch between VA_BITS and page size/number of translation levels"
 #endif
-
-	mov	x2, EXTRA_PTRS
-	create_table_entry x0, x3, EXTRA_SHIFT, x2, x5, x6
 #else
 #define IDMAP_PGD_ORDER	(PHYS_MASK_SHIFT - PGDIR_SHIFT)
+#define EXTRA_SHIFT
 	/*
 	 * If VA_BITS == 48, we don't have to configure an additional
 	 * translation level, but the top-level table has more entries.
 	 */
 #endif
-1:
 	adr_l	x6, __idmap_text_end		// __pa(__idmap_text_end)
 
-	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14
+	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
 
 	/*
 	 * Map the kernel image (starting with PHYS_OFFSET).
-- 
2.30.2


* [PATCH v4 07/26] arm64: head: split off idmap creation code
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 06/26] arm64: head: switch to map_memory macro for the extended ID map Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 08/26] arm64: kernel: drop unnecessary PoC cache clean+invalidate Ard Biesheuvel
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Split off the creation of the ID map page tables, so that we can avoid
running it again unnecessarily when KASLR is in effect (which only
randomizes the virtual placement). This will permit us to drop some
explicit cache maintenance to the PoC, which was necessary because the
cache invalidation performed on some global variables might otherwise
clobber unrelated variables that happen to share a cacheline.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S | 101 ++++++++++----------
 1 file changed, 52 insertions(+), 49 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index eb54c0289c8a..1cbc52097bf9 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -84,7 +84,7 @@
 	 *  Register   Scope                      Purpose
 	 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
 	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
-	 *  x28        __create_page_tables()                   callee preserved temp register
+	 *  x28        clear_page_tables()                      callee preserved temp register
 	 *  x19/x20    __primary_switch()                       callee preserved temp registers
 	 *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
 	 */
@@ -94,7 +94,10 @@ SYM_CODE_START(primary_entry)
 	adrp	x23, __PHYS_OFFSET
 	and	x23, x23, MIN_KIMG_ALIGN - 1	// KASLR offset, defaults to 0
 	bl	set_cpu_boot_mode_flag
-	bl	__create_page_tables
+	bl	clear_page_tables
+	bl	create_idmap
+	bl	create_kernel_mapping
+
 	/*
 	 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
 	 * details.
@@ -122,6 +125,35 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
 	b	dcache_inval_poc		// tail call
 SYM_CODE_END(preserve_boot_args)
 
+SYM_FUNC_START_LOCAL(clear_page_tables)
+	mov	x28, lr
+
+	/*
+	 * Invalidate the init page tables to avoid potential dirty cache lines
+	 * being evicted. Other page tables are allocated in rodata as part of
+	 * the kernel image, and thus are clean to the PoC per the boot
+	 * protocol.
+	 */
+	adrp	x0, init_pg_dir
+	adrp	x1, init_pg_end
+	bl	dcache_inval_poc
+
+	/*
+	 * Clear the init page tables.
+	 */
+	adrp	x0, init_pg_dir
+	adrp	x1, init_pg_end
+	sub	x1, x1, x0
+1:	stp	xzr, xzr, [x0], #16
+	stp	xzr, xzr, [x0], #16
+	stp	xzr, xzr, [x0], #16
+	stp	xzr, xzr, [x0], #16
+	subs	x1, x1, #64
+	b.ne	1b
+
+	ret	x28
+SYM_FUNC_END(clear_page_tables)
+
 /*
  * Macro to populate page table entries, these entries can be pointers to the next level
  * or last level entries pointing to physical memory.
@@ -231,44 +263,8 @@ SYM_CODE_END(preserve_boot_args)
 	populate_entries \tbl, \rtbl, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
 	.endm
 
-/*
- * Setup the initial page tables. We only setup the barest amount which is
- * required to get the kernel running. The following sections are required:
- *   - identity mapping to enable the MMU (low address, TTBR0)
- *   - first few MB of the kernel linear mapping to jump to once the MMU has
- *     been enabled
- */
-SYM_FUNC_START_LOCAL(__create_page_tables)
-	mov	x28, lr
 
-	/*
-	 * Invalidate the init page tables to avoid potential dirty cache lines
-	 * being evicted. Other page tables are allocated in rodata as part of
-	 * the kernel image, and thus are clean to the PoC per the boot
-	 * protocol.
-	 */
-	adrp	x0, init_pg_dir
-	adrp	x1, init_pg_end
-	bl	dcache_inval_poc
-
-	/*
-	 * Clear the init page tables.
-	 */
-	adrp	x0, init_pg_dir
-	adrp	x1, init_pg_end
-	sub	x1, x1, x0
-1:	stp	xzr, xzr, [x0], #16
-	stp	xzr, xzr, [x0], #16
-	stp	xzr, xzr, [x0], #16
-	stp	xzr, xzr, [x0], #16
-	subs	x1, x1, #64
-	b.ne	1b
-
-	mov	x7, SWAPPER_MM_MMUFLAGS
-
-	/*
-	 * Create the identity mapping.
-	 */
+SYM_FUNC_START_LOCAL(create_idmap)
 	adrp	x0, idmap_pg_dir
 	adrp	x3, __idmap_text_start		// __pa(__idmap_text_start)
 
@@ -319,12 +315,23 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	 */
 #endif
 	adr_l	x6, __idmap_text_end		// __pa(__idmap_text_end)
+	mov	x7, SWAPPER_MM_MMUFLAGS
 
 	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
 
 	/*
-	 * Map the kernel image (starting with PHYS_OFFSET).
+	 * Since the page tables have been populated with non-cacheable
+	 * accesses (MMU disabled), invalidate those tables again to
+	 * remove any speculatively loaded cache lines.
 	 */
+	dmb	sy
+
+	adrp	x0, idmap_pg_dir
+	adrp	x1, idmap_pg_end
+	b	dcache_inval_poc		// tail call
+SYM_FUNC_END(create_idmap)
+
+SYM_FUNC_START_LOCAL(create_kernel_mapping)
 	adrp	x0, init_pg_dir
 	mov_q	x5, KIMAGE_VADDR		// compile time __va(_text)
 	add	x5, x5, x23			// add KASLR displacement
@@ -332,6 +339,7 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	adrp	x3, _text			// runtime __pa(_text)
 	sub	x6, x6, x3			// _end - _text
 	add	x6, x6, x5			// runtime __va(_end)
+	mov	x7, SWAPPER_MM_MMUFLAGS
 
 	map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
 
@@ -342,16 +350,10 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
 	 */
 	dmb	sy
 
-	adrp	x0, idmap_pg_dir
-	adrp	x1, idmap_pg_end
-	bl	dcache_inval_poc
-
 	adrp	x0, init_pg_dir
 	adrp	x1, init_pg_end
-	bl	dcache_inval_poc
-
-	ret	x28
-SYM_FUNC_END(__create_page_tables)
+	b	dcache_inval_poc		// tail call
+SYM_FUNC_END(create_kernel_mapping)
 
 	/*
 	 * Initialize CPU registers with task-specific and cpu-specific context.
@@ -836,7 +838,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 	pre_disable_mmu_workaround
 	msr	sctlr_el1, x20			// disable the MMU
 	isb
-	bl	__create_page_tables		// recreate kernel mapping
+	bl	clear_page_tables
+	bl	create_kernel_mapping		// Recreate kernel mapping
 
 	tlbi	vmalle1				// Remove any stale TLB entries
 	dsb	nsh
-- 
2.30.2


* [PATCH v4 08/26] arm64: kernel: drop unnecessary PoC cache clean+invalidate
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 07/26] arm64: head: split off idmap creation code Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-15  4:32   ` Anshuman Khandual
  2022-06-13 14:45 ` [PATCH v4 09/26] arm64: head: pass ID map root table address to __enable_mmu() Ard Biesheuvel
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Some early boot code runs before the virtual placement of the kernel is
finalized. We used to go back to the very start and recreate the ID map
along with the page tables describing the virtual kernel mapping, which
involved setting some global variables with the caches off.

In order to ensure that global state created by the KASLR code is not
corrupted by the cache invalidation that occurs in that case, we needed
to clean those global variables to the PoC explicitly.

This is no longer needed now that the ID map is created only once (and
the associated global variable updates are no longer repeated). So drop
the cache maintenance that is no longer necessary.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/kaslr.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 418b2bba1521..d5542666182f 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -13,7 +13,6 @@
 #include <linux/pgtable.h>
 #include <linux/random.h>
 
-#include <asm/cacheflush.h>
 #include <asm/fixmap.h>
 #include <asm/kernel-pgtable.h>
 #include <asm/memory.h>
@@ -72,9 +71,6 @@ u64 __init kaslr_early_init(void)
 	 * we end up running with module randomization disabled.
 	 */
 	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-	dcache_clean_inval_poc((unsigned long)&module_alloc_base,
-			    (unsigned long)&module_alloc_base +
-				    sizeof(module_alloc_base));
 
 	/*
 	 * Try to map the FDT early. If this fails, we simply bail,
@@ -174,13 +170,6 @@ u64 __init kaslr_early_init(void)
 	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
 	module_alloc_base &= PAGE_MASK;
 
-	dcache_clean_inval_poc((unsigned long)&module_alloc_base,
-			    (unsigned long)&module_alloc_base +
-				    sizeof(module_alloc_base));
-	dcache_clean_inval_poc((unsigned long)&memstart_offset_seed,
-			    (unsigned long)&memstart_offset_seed +
-				    sizeof(memstart_offset_seed));
-
 	return offset;
 }
 
-- 
2.30.2


* [PATCH v4 09/26] arm64: head: pass ID map root table address to __enable_mmu()
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 08/26] arm64: kernel: drop unnecessary PoC cache clean+invalidate Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 10/26] arm64: mm: provide idmap pointer to cpu_replace_ttbr1() Ard Biesheuvel
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

We will be adding an initial ID map that covers the entire kernel image,
so we will pass the actual ID map root table to use to __enable_mmu(),
rather than hard code it.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S  | 14 ++++++++------
 arch/arm64/kernel/sleep.S |  1 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 1cbc52097bf9..70c462bbd6bf 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -595,6 +595,7 @@ SYM_FUNC_START_LOCAL(secondary_startup)
 	bl	__cpu_secondary_check52bitva
 	bl	__cpu_setup			// initialise processor
 	adrp	x1, swapper_pg_dir
+	adrp	x2, idmap_pg_dir
 	bl	__enable_mmu
 	ldr	x8, =__secondary_switched
 	br	x8
@@ -648,6 +649,7 @@ SYM_FUNC_END(__secondary_too_slow)
  *
  *  x0  = SCTLR_EL1 value for turning on the MMU.
  *  x1  = TTBR1_EL1 value
+ *  x2  = ID map root table address
  *
  * Returns to the caller via x30/lr. This requires the caller to be covered
  * by the .idmap.text section.
@@ -656,14 +658,13 @@ SYM_FUNC_END(__secondary_too_slow)
  * If it isn't, park the CPU
  */
 SYM_FUNC_START(__enable_mmu)
-	mrs	x2, ID_AA64MMFR0_EL1
-	ubfx	x2, x2, #ID_AA64MMFR0_TGRAN_SHIFT, 4
-	cmp     x2, #ID_AA64MMFR0_TGRAN_SUPPORTED_MIN
+	mrs	x3, ID_AA64MMFR0_EL1
+	ubfx	x3, x3, #ID_AA64MMFR0_TGRAN_SHIFT, 4
+	cmp     x3, #ID_AA64MMFR0_TGRAN_SUPPORTED_MIN
 	b.lt    __no_granule_support
-	cmp     x2, #ID_AA64MMFR0_TGRAN_SUPPORTED_MAX
+	cmp     x3, #ID_AA64MMFR0_TGRAN_SUPPORTED_MAX
 	b.gt    __no_granule_support
-	update_early_cpu_boot_status 0, x2, x3
-	adrp	x2, idmap_pg_dir
+	update_early_cpu_boot_status 0, x3, x4
 	phys_to_ttbr x1, x1
 	phys_to_ttbr x2, x2
 	msr	ttbr0_el1, x2			// load TTBR0
@@ -819,6 +820,7 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 #endif
 
 	adrp	x1, init_pg_dir
+	adrp	x2, idmap_pg_dir
 	bl	__enable_mmu
 #ifdef CONFIG_RELOCATABLE
 #ifdef CONFIG_RELR
diff --git a/arch/arm64/kernel/sleep.S b/arch/arm64/kernel/sleep.S
index 4ea9392f86e0..e36b09d942f7 100644
--- a/arch/arm64/kernel/sleep.S
+++ b/arch/arm64/kernel/sleep.S
@@ -104,6 +104,7 @@ SYM_CODE_START(cpu_resume)
 	bl	__cpu_setup
 	/* enable the MMU early - so we can access sleep_save_stash by va */
 	adrp	x1, swapper_pg_dir
+	adrp	x2, idmap_pg_dir
 	bl	__enable_mmu
 	ldr	x8, =_cpu_resume
 	br	x8
-- 
2.30.2


* [PATCH v4 10/26] arm64: mm: provide idmap pointer to cpu_replace_ttbr1()
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 09/26] arm64: head: pass ID map root table address to __enable_mmu() Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 11/26] arm64: head: add helper function to remap regions in early page tables Ard Biesheuvel
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

In preparation for changing the way we initialize the permanent ID map,
update cpu_replace_ttbr1() so we can use it with the initial ID map as
well.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/mmu_context.h | 13 +++++++++----
 arch/arm64/kernel/cpufeature.c       |  2 +-
 arch/arm64/kernel/suspend.c          |  2 +-
 arch/arm64/mm/kasan_init.c           |  4 ++--
 arch/arm64/mm/mmu.c                  |  2 +-
 5 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 7b387c3b312a..c7ccd82db1d2 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -105,13 +105,18 @@ static inline void cpu_uninstall_idmap(void)
 		cpu_switch_mm(mm->pgd, mm);
 }
 
-static inline void cpu_install_idmap(void)
+static inline void __cpu_install_idmap(pgd_t *idmap)
 {
 	cpu_set_reserved_ttbr0();
 	local_flush_tlb_all();
 	cpu_set_idmap_tcr_t0sz();
 
-	cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
+	cpu_switch_mm(lm_alias(idmap), &init_mm);
+}
+
+static inline void cpu_install_idmap(void)
+{
+	__cpu_install_idmap(idmap_pg_dir);
 }
 
 /*
@@ -142,7 +147,7 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
  * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
  * avoiding the possibility of conflicting TLB entries being allocated.
  */
-static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp)
+static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
 {
 	typedef void (ttbr_replace_func)(phys_addr_t);
 	extern ttbr_replace_func idmap_cpu_replace_ttbr1;
@@ -165,7 +170,7 @@ static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp)
 
 	replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1));
 
-	cpu_install_idmap();
+	__cpu_install_idmap(idmap);
 	replace_phys(ttbr1);
 	cpu_uninstall_idmap();
 }
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index c2a64c9e451e..f37d8f69c339 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -3275,7 +3275,7 @@ subsys_initcall_sync(init_32bit_el0_mask);
 
 static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
 {
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 }
 
 /*
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 2b0887e58a7c..9135fe0f3df5 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -52,7 +52,7 @@ void notrace __cpu_suspend_exit(void)
 
 	/* Restore CnP bit in TTBR1_EL1 */
 	if (system_supports_cnp())
-		cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+		cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 
 	/*
 	 * PSTATE was not saved over suspend/resume, re-enable any detected
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index c12cd700598f..e969e68de005 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -236,7 +236,7 @@ static void __init kasan_init_shadow(void)
 	 */
 	memcpy(tmp_pg_dir, swapper_pg_dir, sizeof(tmp_pg_dir));
 	dsb(ishst);
-	cpu_replace_ttbr1(lm_alias(tmp_pg_dir));
+	cpu_replace_ttbr1(lm_alias(tmp_pg_dir), idmap_pg_dir);
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
@@ -280,7 +280,7 @@ static void __init kasan_init_shadow(void)
 				PAGE_KERNEL_RO));
 
 	memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE);
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 }
 
 static void __init kasan_init_depth(void)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 0f95c91e5a8e..74f9982c30a7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -792,7 +792,7 @@ void __init paging_init(void)
 
 	pgd_clear_fixmap();
 
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
 	init_mm.pgd = swapper_pg_dir;
 
 	memblock_phys_free(__pa_symbol(init_pg_dir),
-- 
2.30.2


* [PATCH v4 11/26] arm64: head: add helper function to remap regions in early page tables
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (9 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 10/26] arm64: mm: provide idmap pointer to cpu_replace_ttbr1() Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 12/26] arm64: head: cover entire kernel image in initial ID map Ard Biesheuvel
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

The asm macros used to create the initial ID map and kernel mappings
don't support randomly remapping parts of the address space after it has
been populated. What we can do, however, given that all block or page
mappings are created at the final level, is take a subset of the mapped
range and update its attributes or output address. This will permit us
to make parts of these page tables read-only, or remap a part of it to
cover the device tree.

So add a helper that encapsulates this.
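
Roughly, the helper amounts to the following (a C-level sketch only; the
table geometry and entry encoding are simplified assumptions):

/*
 * Given the last-level table returned by map_memory, recompute the
 * indexes covering [start, end) and overwrite just those entries with a
 * new output address and attributes.
 */
#define PTRS_PER_ENTRY  512ULL          /* 1 << (PAGE_SHIFT - 3) for 4k pages */

static void remap_region_sketch(unsigned long long *tbl,
                                unsigned long long mapped_start, /* VA of existing mapping */
                                unsigned long long start,        /* VA of region to update */
                                unsigned long long end,          /* exclusive */
                                unsigned long long pa,
                                unsigned long long attrs,
                                unsigned int block_shift)        /* log2(block size) */
{
        unsigned long long block = 1ULL << block_shift;
        /* block index corresponding to entry 0 of the last-level table */
        unsigned long long base = (mapped_start >> block_shift) & ~(PTRS_PER_ENTRY - 1);
        unsigned long long i;

        for (i = (start >> block_shift) - base;
             i <= ((end - 1) >> block_shift) - base; i++, pa += block)
                tbl[i] = (pa & ~(block - 1)) | attrs;
}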

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S | 33 ++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 70c462bbd6bf..7397555f8437 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -263,6 +263,39 @@ SYM_FUNC_END(clear_page_tables)
 	populate_entries \tbl, \rtbl, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
 	.endm
 
+/*
+ * Remap a subregion created with the map_memory macro with modified attributes
+ * or output address. The entire remapped region must have been covered in the
+ * invocation of map_memory.
+ *
+ * x0: last level table address (returned in first argument to map_memory)
+ * x1: start VA of the existing mapping
+ * x2: start VA of the region to update
+ * x3: end VA of the region to update (exclusive)
+ * x4: start PA associated with the region to update
+ * x5: attributes to set on the updated region
+ * x6: order of the last level mappings
+ */
+SYM_FUNC_START_LOCAL(remap_region)
+	sub	x3, x3, #1		// make end inclusive
+
+	// Get the index offset for the start of the last level table
+	lsr	x1, x1, x6
+	bfi	x1, xzr, #0, #PAGE_SHIFT - 3
+
+	// Derive the start and end indexes into the last level table
+	// associated with the provided region
+	lsr	x2, x2, x6
+	lsr	x3, x3, x6
+	sub	x2, x2, x1
+	sub	x3, x3, x1
+
+	mov	x1, #1
+	lsl	x6, x1, x6		// block size at this level
+
+	populate_entries x0, x4, x2, x3, x5, x6, x7
+	ret
+SYM_FUNC_END(remap_region)
 
 SYM_FUNC_START_LOCAL(create_idmap)
 	adrp	x0, idmap_pg_dir
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 12/26] arm64: head: cover entire kernel image in initial ID map
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (10 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 11/26] arm64: head: add helper function to remap regions in early page tables Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 13/26] arm64: head: use relative references to the RELA and RELR tables Ard Biesheuvel
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

As a first step towards avoiding the need to create, tear down and
recreate the kernel virtual mapping with MMU and caches disabled, start
by expanding the ID map so it covers the page tables as well as all
executable code. This will allow us to populate the page tables with the
MMU and caches on, and call KASLR init code before setting up the
virtual mapping.

Since this ID map is only needed at boot, create it as a temporary set
of page tables, and populate the permanent ID map after enabling the MMU
and caches. While at it, switch to read-only attributes where possible,
as writable permissions are only needed for the initial kernel page
tables. Note that on 4k granule configurations, the permanent ID map
will now be reduced to a single page rather than a 2M block mapping.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/kernel-pgtable.h | 16 ++++++---
 arch/arm64/kernel/head.S                | 31 +++++++++++------
 arch/arm64/kernel/vmlinux.lds.S         |  7 ++--
 arch/arm64/mm/mmu.c                     | 35 +++++++++++++++++++-
 arch/arm64/mm/proc.S                    |  8 +++--
 5 files changed, 76 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 96dc0f7da258..5395e5a04f35 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -35,10 +35,8 @@
  */
 #if ARM64_KERNEL_USES_PMD_MAPS
 #define SWAPPER_PGTABLE_LEVELS	(CONFIG_PGTABLE_LEVELS - 1)
-#define IDMAP_PGTABLE_LEVELS	(ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT) - 1)
 #else
 #define SWAPPER_PGTABLE_LEVELS	(CONFIG_PGTABLE_LEVELS)
-#define IDMAP_PGTABLE_LEVELS	(ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT))
 #endif
 
 
@@ -87,7 +85,13 @@
 			+ EARLY_PUDS((vstart), (vend))	/* each PUD needs a next level page table */	\
 			+ EARLY_PMDS((vstart), (vend)))	/* each PMD needs a next level page table */
 #define INIT_DIR_SIZE (PAGE_SIZE * EARLY_PAGES(KIMAGE_VADDR, _end))
-#define IDMAP_DIR_SIZE		(IDMAP_PGTABLE_LEVELS * PAGE_SIZE)
+
+/* the initial ID map may need two extra pages if it needs to be extended */
+#if VA_BITS < 48
+#define INIT_IDMAP_DIR_SIZE	(INIT_DIR_SIZE + (2 * PAGE_SIZE))
+#else
+#define INIT_IDMAP_DIR_SIZE	INIT_DIR_SIZE
+#endif
 
 /* Initial memory map size */
 #if ARM64_KERNEL_USES_PMD_MAPS
@@ -107,9 +111,11 @@
 #define SWAPPER_PMD_FLAGS	(PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
 
 #if ARM64_KERNEL_USES_PMD_MAPS
-#define SWAPPER_MM_MMUFLAGS	(PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS)
+#define SWAPPER_RW_MMUFLAGS	(PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS)
+#define SWAPPER_RX_MMUFLAGS	(SWAPPER_RW_MMUFLAGS | PMD_SECT_RDONLY)
 #else
-#define SWAPPER_MM_MMUFLAGS	(PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)
+#define SWAPPER_RW_MMUFLAGS	(PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)
+#define SWAPPER_RX_MMUFLAGS	(SWAPPER_RW_MMUFLAGS | PTE_RDONLY)
 #endif
 
 /*
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 7397555f8437..93734c91a29a 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -87,6 +87,7 @@
 	 *  x28        clear_page_tables()                      callee preserved temp register
 	 *  x19/x20    __primary_switch()                       callee preserved temp registers
 	 *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
+	 *  x28        create_idmap()                           callee preserved temp register
 	 */
 SYM_CODE_START(primary_entry)
 	bl	preserve_boot_args
@@ -298,9 +299,7 @@ SYM_FUNC_START_LOCAL(remap_region)
 SYM_FUNC_END(remap_region)
 
 SYM_FUNC_START_LOCAL(create_idmap)
-	adrp	x0, idmap_pg_dir
-	adrp	x3, __idmap_text_start		// __pa(__idmap_text_start)
-
+	mov	x28, lr
 	/*
 	 * The ID map carries a 1:1 mapping of the physical address range
 	 * covered by the loaded image, which could be anywhere in DRAM. This
@@ -347,11 +346,22 @@ SYM_FUNC_START_LOCAL(create_idmap)
 	 * translation level, but the top-level table has more entries.
 	 */
 #endif
-	adr_l	x6, __idmap_text_end		// __pa(__idmap_text_end)
-	mov	x7, SWAPPER_MM_MMUFLAGS
+	adrp	x0, init_idmap_pg_dir
+	adrp	x3, _text
+	adrp	x6, _end
+	mov	x7, SWAPPER_RX_MMUFLAGS
 
 	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
 
+	/* Remap the kernel page tables r/w in the ID map */
+	adrp	x1, _text
+	adrp	x2, init_pg_dir
+	adrp	x3, init_pg_end
+	bic	x4, x2, #SWAPPER_BLOCK_SIZE - 1
+	mov	x5, SWAPPER_RW_MMUFLAGS
+	mov	x6, #SWAPPER_BLOCK_SHIFT
+	bl	remap_region
+
 	/*
 	 * Since the page tables have been populated with non-cacheable
 	 * accesses (MMU disabled), invalidate those tables again to
@@ -359,9 +369,10 @@ SYM_FUNC_START_LOCAL(create_idmap)
 	 */
 	dmb	sy
 
-	adrp	x0, idmap_pg_dir
-	adrp	x1, idmap_pg_end
-	b	dcache_inval_poc		// tail call
+	adrp	x0, init_idmap_pg_dir
+	adrp	x1, init_idmap_pg_end
+	bl	dcache_inval_poc
+	ret	x28
 SYM_FUNC_END(create_idmap)
 
 SYM_FUNC_START_LOCAL(create_kernel_mapping)
@@ -372,7 +383,7 @@ SYM_FUNC_START_LOCAL(create_kernel_mapping)
 	adrp	x3, _text			// runtime __pa(_text)
 	sub	x6, x6, x3			// _end - _text
 	add	x6, x6, x5			// runtime __va(_end)
-	mov	x7, SWAPPER_MM_MMUFLAGS
+	mov	x7, SWAPPER_RW_MMUFLAGS
 
 	map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
 
@@ -853,7 +864,7 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 #endif
 
 	adrp	x1, init_pg_dir
-	adrp	x2, idmap_pg_dir
+	adrp	x2, init_idmap_pg_dir
 	bl	__enable_mmu
 #ifdef CONFIG_RELOCATABLE
 #ifdef CONFIG_RELR
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 8a078c0ee140..0ce3a7c9f8c4 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -199,8 +199,7 @@ SECTIONS
 	}
 
 	idmap_pg_dir = .;
-	. += IDMAP_DIR_SIZE;
-	idmap_pg_end = .;
+	. += PAGE_SIZE;
 
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 	tramp_pg_dir = .;
@@ -236,6 +235,10 @@ SECTIONS
 	__inittext_end = .;
 	__initdata_begin = .;
 
+	init_idmap_pg_dir = .;
+	. += INIT_IDMAP_DIR_SIZE;
+	init_idmap_pg_end = .;
+
 	.init.data : {
 		INIT_DATA
 		INIT_SETUP(16)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 74f9982c30a7..ed3a4b87529b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -769,9 +769,40 @@ static void __init map_kernel(pgd_t *pgdp)
 	kasan_copy_shadow(pgdp);
 }
 
+static void __init create_idmap(void)
+{
+	u64 start = __pa_symbol(__idmap_text_start);
+	u64 size = __pa_symbol(__idmap_text_end) - start;
+	pgd_t *pgd = idmap_pg_dir;
+	u64 pgd_phys;
+
+	/* check if we need an additional level of translation */
+	if (VA_BITS < 48 && idmap_t0sz < TCR_T0SZ(VA_BITS_MIN)) {
+		pgd_phys = early_pgtable_alloc(PAGE_SHIFT);
+		set_pgd(&idmap_pg_dir[start >> VA_BITS],
+			__pgd(pgd_phys | P4D_TYPE_TABLE));
+		pgd = __va(pgd_phys);
+	}
+	__create_pgd_mapping(pgd, start, start, size, PAGE_KERNEL_ROX,
+			     early_pgtable_alloc, 0);
+
+	if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) {
+		extern u32 __idmap_kpti_flag;
+		u64 pa = __pa_symbol(&__idmap_kpti_flag);
+
+		/*
+		 * The KPTI G-to-nG conversion code needs a read-write mapping
+		 * of its synchronization flag in the ID map.
+		 */
+		__create_pgd_mapping(pgd, pa, pa, sizeof(u32), PAGE_KERNEL,
+				     early_pgtable_alloc, 0);
+	}
+}
+
 void __init paging_init(void)
 {
 	pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
+	extern pgd_t init_idmap_pg_dir[];
 
 #if VA_BITS > 48
 	if (cpuid_feature_extract_unsigned_field(
@@ -792,13 +823,15 @@ void __init paging_init(void)
 
 	pgd_clear_fixmap();
 
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
+	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), init_idmap_pg_dir);
 	init_mm.pgd = swapper_pg_dir;
 
 	memblock_phys_free(__pa_symbol(init_pg_dir),
 			   __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
 
 	memblock_allow_resize();
+
+	create_idmap();
 }
 
 /*
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 97cd67697212..493b8ffc9be5 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -249,8 +249,10 @@ SYM_FUNC_END(idmap_cpu_replace_ttbr1)
  *
  * Called exactly once from stop_machine context by each CPU found during boot.
  */
-__idmap_kpti_flag:
-	.long	1
+	.pushsection	".data", "aw", %progbits
+SYM_DATA(__idmap_kpti_flag, .long 1)
+	.popsection
+
 SYM_FUNC_START(idmap_kpti_install_ng_mappings)
 	cpu		.req	w0
 	temp_pte	.req	x0
@@ -273,7 +275,7 @@ SYM_FUNC_START(idmap_kpti_install_ng_mappings)
 
 	mov	x5, x3				// preserve temp_pte arg
 	mrs	swapper_ttb, ttbr1_el1
-	adr	flag_ptr, __idmap_kpti_flag
+	adr_l	flag_ptr, __idmap_kpti_flag
 
 	cbnz	cpu, __idmap_kpti_secondary
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 13/26] arm64: head: use relative references to the RELA and RELR tables
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (11 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 12/26] arm64: head: cover entire kernel image in initial ID map Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 14/26] arm64: head: create a temporary FDT mapping in the initial ID map Ard Biesheuvel
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Formerly, we had to access the RELA and RELR tables via the kernel
mapping that was being relocated, and so deriving the start and end
addresses using ADRP/ADD references was not possible, as the relocation
code runs from the ID map.

Now that we map the entire kernel image via the ID map, we can simplify
this, and just load the entries via the ID map as well.
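
For context, the loop that consumes these boundaries effectively does
the following for each R_AARCH64_RELATIVE entry (hedged C sketch, not
the actual implementation; va_offset is the physical misalignment plus
KASLR displacement held in x23):

  extern Elf64_Rela __rela_start[], __rela_end[];

  static void apply_rela_sketch(u64 va_offset)
  {
          Elf64_Rela *r;

          for (r = __rela_start; r < __rela_end; r++) {
                  if (ELF64_R_TYPE(r->r_info) != R_AARCH64_RELATIVE)
                          continue;
                  /* both the patched location and the value it holds are
                   * shifted by va_offset w.r.t. the link-time layout */
                  *(u64 *)(r->r_offset + va_offset) = r->r_addend + va_offset;
          }
  }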

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S        | 13 ++++---------
 arch/arm64/kernel/vmlinux.lds.S | 12 ++++--------
 2 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 93734c91a29a..f1497f7b4da0 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -757,13 +757,10 @@ SYM_FUNC_START_LOCAL(__relocate_kernel)
 	 * Iterate over each entry in the relocation table, and apply the
 	 * relocations in place.
 	 */
-	ldr	w9, =__rela_offset		// offset to reloc table
-	ldr	w10, =__rela_size		// size of reloc table
-
+	adr_l	x9, __rela_start
+	adr_l	x10, __rela_end
 	mov_q	x11, KIMAGE_VADDR		// default virtual offset
 	add	x11, x11, x23			// actual virtual offset
-	add	x9, x9, x11			// __va(.rela)
-	add	x10, x9, x10			// __va(.rela) + sizeof(.rela)
 
 0:	cmp	x9, x10
 	b.hs	1f
@@ -813,10 +810,8 @@ SYM_FUNC_START_LOCAL(__relocate_kernel)
 	 * __relocate_kernel is called twice with non-zero displacements (i.e.
 	 * if there is both a physical misalignment and a KASLR displacement).
 	 */
-	ldr	w9, =__relr_offset		// offset to reloc table
-	ldr	w10, =__relr_size		// size of reloc table
-	add	x9, x9, x11			// __va(.relr)
-	add	x10, x9, x10			// __va(.relr) + sizeof(.relr)
+	adr_l	x9, __relr_start
+	adr_l	x10, __relr_end
 
 	sub	x15, x23, x24			// delta from previous offset
 	cbz	x15, 7f				// nothing to do if unchanged
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 0ce3a7c9f8c4..45131e354e27 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -257,21 +257,17 @@ SECTIONS
 	HYPERVISOR_RELOC_SECTION
 
 	.rela.dyn : ALIGN(8) {
+		__rela_start = .;
 		*(.rela .rela*)
+		__rela_end = .;
 	}
 
-	__rela_offset	= ABSOLUTE(ADDR(.rela.dyn) - KIMAGE_VADDR);
-	__rela_size	= SIZEOF(.rela.dyn);
-
-#ifdef CONFIG_RELR
 	.relr.dyn : ALIGN(8) {
+		__relr_start = .;
 		*(.relr.dyn)
+		__relr_end = .;
 	}
 
-	__relr_offset	= ABSOLUTE(ADDR(.relr.dyn) - KIMAGE_VADDR);
-	__relr_size	= SIZEOF(.relr.dyn);
-#endif
-
 	. = ALIGN(SEGMENT_ALIGN);
 	__initdata_end = .;
 	__init_end = .;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 14/26] arm64: head: create a temporary FDT mapping in the initial ID map
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (12 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 13/26] arm64: head: use relative references to the RELA and RELR tables Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 15/26] arm64: idreg-override: use early FDT mapping in " Ard Biesheuvel
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

We need to access the DT very early to get at the command line and the
KASLR seed. Currently, this means we rely on some hacks to call into the
kernel before really calling into the kernel, which is undesirable.

So instead, let's create a mapping for the FDT in the initial ID map,
which is feasible now that it has been extended to cover more than a
single page or block, and can be updated in place to remap other output
addresses.
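
In C terms, the address arithmetic added to create_idmap() amounts to
the following (a sketch with assumed names, the real code is the asm
below; _end and boot_fdt_pa stand for the runtime address of the end of
the image and the physical FDT address passed in x21, respectively):

  u64 fdt_pa    = boot_fdt_pa;
  u64 window_va = round_down((u64)_end + SWAPPER_BLOCK_SIZE,
                             SWAPPER_BLOCK_SIZE);      /* above the image */
  u64 window_pa = round_down(fdt_pa, SWAPPER_BLOCK_SIZE);

  /*
   * [window_va, window_va + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE) is remapped
   * r/w onto window_pa, so the blob fits even when it does not start on a
   * block boundary. The blob itself is then accessible at
   */
  u64 fdt_va    = window_va + (fdt_pa & (SWAPPER_BLOCK_SIZE - 1)); /* x22 */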

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/kernel-pgtable.h |  6 ++++--
 arch/arm64/kernel/head.S                | 14 +++++++++++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 5395e5a04f35..02e59fa8f293 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -8,6 +8,7 @@
 #ifndef __ASM_KERNEL_PGTABLE_H
 #define __ASM_KERNEL_PGTABLE_H
 
+#include <asm/boot.h>
 #include <asm/pgtable-hwdef.h>
 #include <asm/sparsemem.h>
 
@@ -88,10 +89,11 @@
 
 /* the initial ID map may need two extra pages if it needs to be extended */
 #if VA_BITS < 48
-#define INIT_IDMAP_DIR_SIZE	(INIT_DIR_SIZE + (2 * PAGE_SIZE))
+#define INIT_IDMAP_DIR_SIZE	((INIT_IDMAP_DIR_PAGES + 2) * PAGE_SIZE)
 #else
-#define INIT_IDMAP_DIR_SIZE	INIT_DIR_SIZE
+#define INIT_IDMAP_DIR_SIZE	(INIT_IDMAP_DIR_PAGES * PAGE_SIZE)
 #endif
+#define INIT_IDMAP_DIR_PAGES	EARLY_PAGES(KIMAGE_VADDR, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE)
 
 /* Initial memory map size */
 #if ARM64_KERNEL_USES_PMD_MAPS
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index f1497f7b4da0..8283ff848328 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -83,6 +83,7 @@
 	 *
 	 *  Register   Scope                      Purpose
 	 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
+	 *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
 	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
 	 *  x28        clear_page_tables()                      callee preserved temp register
 	 *  x19/x20    __primary_switch()                       callee preserved temp registers
@@ -348,7 +349,7 @@ SYM_FUNC_START_LOCAL(create_idmap)
 #endif
 	adrp	x0, init_idmap_pg_dir
 	adrp	x3, _text
-	adrp	x6, _end
+	adrp	x6, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
 	mov	x7, SWAPPER_RX_MMUFLAGS
 
 	map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
@@ -362,6 +363,17 @@ SYM_FUNC_START_LOCAL(create_idmap)
 	mov	x6, #SWAPPER_BLOCK_SHIFT
 	bl	remap_region
 
+	/* Remap the FDT after the kernel image */
+	adrp	x1, _text
+	adrp	x22, _end + SWAPPER_BLOCK_SIZE
+	bic	x2, x22, #SWAPPER_BLOCK_SIZE - 1
+	bfi	x22, x21, #0, #SWAPPER_BLOCK_SHIFT		// remapped FDT address
+	add	x3, x2, #MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
+	bic	x4, x21, #SWAPPER_BLOCK_SIZE - 1
+	mov	x5, SWAPPER_RW_MMUFLAGS
+	mov	x6, #SWAPPER_BLOCK_SHIFT
+	bl	remap_region
+
 	/*
 	 * Since the page tables have been populated with non-cacheable
 	 * accesses (MMU disabled), invalidate those tables again to
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 15/26] arm64: idreg-override: use early FDT mapping in ID map
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (13 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 14/26] arm64: head: create a temporary FDT mapping in the initial ID map Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 16/26] arm64: head: factor out TTBR1 assignment into a macro Ard Biesheuvel
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Instead of calling into the kernel to map the FDT into the kernel page
tables before even calling start_kernel(), let's switch to the initial,
temporary mapping of the device tree that has been added to the ID map.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S           |  1 +
 arch/arm64/kernel/idreg-override.c | 17 ++++++-----------
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 8283ff848328..64ebff634b83 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -472,6 +472,7 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 #endif
 	mov	x0, x21				// pass FDT address in x0
 	bl	early_fdt_map			// Try mapping the FDT early
+	mov	x0, x22				// pass FDT address in x0
 	bl	init_feature_override		// Parse cpu feature overrides
 #ifdef CONFIG_RANDOMIZE_BASE
 	tst	x23, ~(MIN_KIMG_ALIGN - 1)	// already running randomized?
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index 8a2ceb591686..f92836e196e5 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -201,16 +201,11 @@ static __init void __parse_cmdline(const char *cmdline, bool parse_aliases)
 	} while (1);
 }
 
-static __init const u8 *get_bootargs_cmdline(void)
+static __init const u8 *get_bootargs_cmdline(const void *fdt)
 {
 	const u8 *prop;
-	void *fdt;
 	int node;
 
-	fdt = get_early_fdt_ptr();
-	if (!fdt)
-		return NULL;
-
 	node = fdt_path_offset(fdt, "/chosen");
 	if (node < 0)
 		return NULL;
@@ -222,9 +217,9 @@ static __init const u8 *get_bootargs_cmdline(void)
 	return strlen(prop) ? prop : NULL;
 }
 
-static __init void parse_cmdline(void)
+static __init void parse_cmdline(const void *fdt)
 {
-	const u8 *prop = get_bootargs_cmdline();
+	const u8 *prop = get_bootargs_cmdline(fdt);
 
 	if (IS_ENABLED(CONFIG_CMDLINE_FORCE) || !prop)
 		__parse_cmdline(CONFIG_CMDLINE, true);
@@ -234,9 +229,9 @@ static __init void parse_cmdline(void)
 }
 
 /* Keep checkers quiet */
-void init_feature_override(void);
+void init_feature_override(const void *fdt);
 
-asmlinkage void __init init_feature_override(void)
+asmlinkage void __init init_feature_override(const void *fdt)
 {
 	int i;
 
@@ -247,7 +242,7 @@ asmlinkage void __init init_feature_override(void)
 		}
 	}
 
-	parse_cmdline();
+	parse_cmdline(fdt);
 
 	for (i = 0; i < ARRAY_SIZE(regs); i++) {
 		if (regs[i]->override)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 16/26] arm64: head: factor out TTBR1 assignment into a macro
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (14 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 15/26] arm64: idreg-override: use early FDT mapping in " Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on Ard Biesheuvel
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Create a macro load_ttbr1 to avoid having to repeat the same instruction
sequence 3 times in a subsequent patch. No functional change intended.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/assembler.h | 17 +++++++++++++----
 arch/arm64/kernel/head.S           |  5 +----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 9468f45c07a6..b2584709c332 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -479,6 +479,18 @@ alternative_endif
 	_cond_extable .Licache_op\@, \fixup
 	.endm
 
+/*
+ * load_ttbr1 - install @pgtbl as a TTBR1 page table
+ * pgtbl preserved
+ * tmp1/tmp2 clobbered, either may overlap with pgtbl
+ */
+	.macro		load_ttbr1, pgtbl, tmp1, tmp2
+	phys_to_ttbr	\tmp1, \pgtbl
+	offset_ttbr1 	\tmp1, \tmp2
+	msr		ttbr1_el1, \tmp1
+	isb
+	.endm
+
 /*
  * To prevent the possibility of old and new partial table walks being visible
  * in the tlb, switch the ttbr to a zero page when we invalidate the old
@@ -492,10 +504,7 @@ alternative_endif
 	isb
 	tlbi	vmalle1
 	dsb	nsh
-	phys_to_ttbr \tmp, \page_table
-	offset_ttbr1 \tmp, \tmp2
-	msr	ttbr1_el1, \tmp
-	isb
+	load_ttbr1 \page_table, \tmp, \tmp2
 	.endm
 
 /*
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 64ebff634b83..d704d0bd8ffc 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -722,12 +722,9 @@ SYM_FUNC_START(__enable_mmu)
 	cmp     x3, #ID_AA64MMFR0_TGRAN_SUPPORTED_MAX
 	b.gt    __no_granule_support
 	update_early_cpu_boot_status 0, x3, x4
-	phys_to_ttbr x1, x1
 	phys_to_ttbr x2, x2
 	msr	ttbr0_el1, x2			// load TTBR0
-	offset_ttbr1 x1, x3
-	msr	ttbr1_el1, x1			// load TTBR1
-	isb
+	load_ttbr1 x1, x1, x3
 
 	set_sctlr_el1	x0
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (15 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 16/26] arm64: head: factor out TTBR1 assignment into a macro Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-24 12:56   ` Will Deacon
  2022-06-13 14:45 ` [PATCH v4 18/26] arm64: head: record CPU boot mode after enabling the MMU Ard Biesheuvel
                   ` (9 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Now that we can access the entire kernel image via the ID map, we can
execute the page table population code with the MMU and caches enabled.
The only thing we need to ensure is that translations via TTBR1 remain
disabled while we are updating the page tables the second time around,
in case KASLR wants them to be randomized.
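
Schematically, the resulting __primary_switch flow is the following
(C-style sketch; enable_mmu(), write_ttbr1() and so on are stand-ins
for the asm routines and macros used below, not real functions):

  enable_mmu(init_idmap_pg_dir /* TTBR0 */, reserved_pg_dir /* TTBR1 */);

  clear_page_tables();          /* now a plain memset(), caches are on  */
  create_kernel_mapping();      /* populate init_pg_dir via the ID map  */
  write_ttbr1(init_pg_dir);     /* only now enable TTBR1 translations   */
  relocate_kernel();

  /* KASLR only: point TTBR1 back at reserved_pg_dir while init_pg_dir is
   * cleared and repopulated at the randomized offset, invalidate the TLB,
   * switch back to init_pg_dir and apply the relocations again */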

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S | 62 +++++---------------
 1 file changed, 16 insertions(+), 46 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index d704d0bd8ffc..583cbea865e1 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -85,8 +85,6 @@
 	 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
 	 *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
 	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
-	 *  x28        clear_page_tables()                      callee preserved temp register
-	 *  x19/x20    __primary_switch()                       callee preserved temp registers
 	 *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
 	 *  x28        create_idmap()                           callee preserved temp register
 	 */
@@ -96,9 +94,7 @@ SYM_CODE_START(primary_entry)
 	adrp	x23, __PHYS_OFFSET
 	and	x23, x23, MIN_KIMG_ALIGN - 1	// KASLR offset, defaults to 0
 	bl	set_cpu_boot_mode_flag
-	bl	clear_page_tables
 	bl	create_idmap
-	bl	create_kernel_mapping
 
 	/*
 	 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
@@ -128,32 +124,14 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
 SYM_CODE_END(preserve_boot_args)
 
 SYM_FUNC_START_LOCAL(clear_page_tables)
-	mov	x28, lr
-
-	/*
-	 * Invalidate the init page tables to avoid potential dirty cache lines
-	 * being evicted. Other page tables are allocated in rodata as part of
-	 * the kernel image, and thus are clean to the PoC per the boot
-	 * protocol.
-	 */
-	adrp	x0, init_pg_dir
-	adrp	x1, init_pg_end
-	bl	dcache_inval_poc
-
 	/*
 	 * Clear the init page tables.
 	 */
 	adrp	x0, init_pg_dir
 	adrp	x1, init_pg_end
-	sub	x1, x1, x0
-1:	stp	xzr, xzr, [x0], #16
-	stp	xzr, xzr, [x0], #16
-	stp	xzr, xzr, [x0], #16
-	stp	xzr, xzr, [x0], #16
-	subs	x1, x1, #64
-	b.ne	1b
-
-	ret	x28
+	sub	x2, x1, x0
+	mov	x1, xzr
+	b	__pi_memset			// tail call
 SYM_FUNC_END(clear_page_tables)
 
 /*
@@ -399,16 +377,8 @@ SYM_FUNC_START_LOCAL(create_kernel_mapping)
 
 	map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
 
-	/*
-	 * Since the page tables have been populated with non-cacheable
-	 * accesses (MMU disabled), invalidate those tables again to
-	 * remove any speculatively loaded cache lines.
-	 */
-	dmb	sy
-
-	adrp	x0, init_pg_dir
-	adrp	x1, init_pg_end
-	b	dcache_inval_poc		// tail call
+	dsb	ishst				// sync with page table walker
+	ret
 SYM_FUNC_END(create_kernel_mapping)
 
 	/*
@@ -863,14 +833,15 @@ SYM_FUNC_END(__relocate_kernel)
 #endif
 
 SYM_FUNC_START_LOCAL(__primary_switch)
-#ifdef CONFIG_RANDOMIZE_BASE
-	mov	x19, x0				// preserve new SCTLR_EL1 value
-	mrs	x20, sctlr_el1			// preserve old SCTLR_EL1 value
-#endif
-
-	adrp	x1, init_pg_dir
+	adrp	x1, reserved_pg_dir
 	adrp	x2, init_idmap_pg_dir
 	bl	__enable_mmu
+
+	bl	clear_page_tables
+	bl	create_kernel_mapping
+
+	adrp	x1, init_pg_dir
+	load_ttbr1 x1, x1, x2
 #ifdef CONFIG_RELOCATABLE
 #ifdef CONFIG_RELR
 	mov	x24, #0				// no RELR displacement yet
@@ -886,9 +857,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 	 * to take into account by discarding the current kernel mapping and
 	 * creating a new one.
 	 */
-	pre_disable_mmu_workaround
-	msr	sctlr_el1, x20			// disable the MMU
-	isb
+	adrp	x1, reserved_pg_dir		// Disable translations via TTBR1
+	load_ttbr1 x1, x1, x2
 	bl	clear_page_tables
 	bl	create_kernel_mapping		// Recreate kernel mapping
 
@@ -896,8 +866,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 	dsb	nsh
 	isb
 
-	set_sctlr_el1	x19			// re-enable the MMU
-
+	adrp	x1, init_pg_dir			// Re-enable translations via TTBR1
+	load_ttbr1 x1, x1, x2
 	bl	__relocate_kernel
 #endif
 #endif
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 18/26] arm64: head: record CPU boot mode after enabling the MMU
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (16 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted Ard Biesheuvel
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

In order to avoid having to touch memory with the MMU and caches
disabled, and therefore having to invalidate it from the caches
explicitly, just defer storing the CPU boot mode value until after the
MMU has been turned on, unless we are giving up with an error.

While at it, move the associated variable definitions into C code.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S     | 50 +++++---------------
 arch/arm64/kernel/hyp-stub.S |  4 +-
 arch/arm64/mm/mmu.c          |  8 ++++
 3 files changed, 23 insertions(+), 39 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 583cbea865e1..8de346dd4470 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -82,6 +82,7 @@
 	 * primary lowlevel boot path:
 	 *
 	 *  Register   Scope                      Purpose
+	 *  x20        primary_entry() .. __primary_switch()    CPU boot mode
 	 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
 	 *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
 	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
@@ -91,9 +92,9 @@
 SYM_CODE_START(primary_entry)
 	bl	preserve_boot_args
 	bl	init_kernel_el			// w0=cpu_boot_mode
+	mov	x20, x0
 	adrp	x23, __PHYS_OFFSET
 	and	x23, x23, MIN_KIMG_ALIGN - 1	// KASLR offset, defaults to 0
-	bl	set_cpu_boot_mode_flag
 	bl	create_idmap
 
 	/*
@@ -429,6 +430,9 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	sub	x4, x4, x0			// the kernel virtual and
 	str_l	x4, kimage_voffset, x5		// physical mappings
 
+	mov	x0, x20
+	bl	set_cpu_boot_mode_flag
+
 	// Clear BSS
 	adr_l	x0, __bss_start
 	mov	x1, xzr
@@ -454,6 +458,7 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	ret					// to __primary_switch()
 0:
 #endif
+	mov	x0, x20
 	bl	switch_to_vhe			// Prefer VHE if possible
 	ldp	x29, x30, [sp], #16
 	bl	start_kernel
@@ -553,52 +558,21 @@ SYM_FUNC_START_LOCAL(set_cpu_boot_mode_flag)
 	b.ne	1f
 	add	x1, x1, #4
 1:	str	w0, [x1]			// Save CPU boot mode
-	dmb	sy
-	dc	ivac, x1			// Invalidate potentially stale cache line
 	ret
 SYM_FUNC_END(set_cpu_boot_mode_flag)
 
-/*
- * These values are written with the MMU off, but read with the MMU on.
- * Writers will invalidate the corresponding address, discarding up to a
- * 'Cache Writeback Granule' (CWG) worth of data. The linker script ensures
- * sufficient alignment that the CWG doesn't overlap another section.
- */
-	.pushsection ".mmuoff.data.write", "aw"
-/*
- * We need to find out the CPU boot mode long after boot, so we need to
- * store it in a writable variable.
- *
- * This is not in .bss, because we set it sufficiently early that the boot-time
- * zeroing of .bss would clobber it.
- */
-SYM_DATA_START(__boot_cpu_mode)
-	.long	BOOT_CPU_MODE_EL2
-	.long	BOOT_CPU_MODE_EL1
-SYM_DATA_END(__boot_cpu_mode)
-/*
- * The booting CPU updates the failed status @__early_cpu_boot_status,
- * with MMU turned off.
- */
-SYM_DATA_START(__early_cpu_boot_status)
-	.quad 	0
-SYM_DATA_END(__early_cpu_boot_status)
-
-	.popsection
-
 	/*
 	 * This provides a "holding pen" for platforms to hold all secondary
 	 * cores are held until we're ready for them to initialise.
 	 */
 SYM_FUNC_START(secondary_holding_pen)
 	bl	init_kernel_el			// w0=cpu_boot_mode
-	bl	set_cpu_boot_mode_flag
-	mrs	x0, mpidr_el1
+	mrs	x2, mpidr_el1
 	mov_q	x1, MPIDR_HWID_BITMASK
-	and	x0, x0, x1
+	and	x2, x2, x1
 	adr_l	x3, secondary_holding_pen_release
 pen:	ldr	x4, [x3]
-	cmp	x4, x0
+	cmp	x4, x2
 	b.eq	secondary_startup
 	wfe
 	b	pen
@@ -610,7 +584,6 @@ SYM_FUNC_END(secondary_holding_pen)
 	 */
 SYM_FUNC_START(secondary_entry)
 	bl	init_kernel_el			// w0=cpu_boot_mode
-	bl	set_cpu_boot_mode_flag
 	b	secondary_startup
 SYM_FUNC_END(secondary_entry)
 
@@ -618,6 +591,7 @@ SYM_FUNC_START_LOCAL(secondary_startup)
 	/*
 	 * Common entry point for secondary CPUs.
 	 */
+	mov	x20, x0				// preserve boot mode
 	bl	switch_to_vhe
 	bl	__cpu_secondary_check52bitva
 	bl	__cpu_setup			// initialise processor
@@ -629,6 +603,9 @@ SYM_FUNC_START_LOCAL(secondary_startup)
 SYM_FUNC_END(secondary_startup)
 
 SYM_FUNC_START_LOCAL(__secondary_switched)
+	mov	x0, x20
+	bl	set_cpu_boot_mode_flag
+	str_l	xzr, __early_cpu_boot_status, x3
 	adr_l	x5, vectors
 	msr	vbar_el1, x5
 	isb
@@ -691,7 +668,6 @@ SYM_FUNC_START(__enable_mmu)
 	b.lt    __no_granule_support
 	cmp     x3, #ID_AA64MMFR0_TGRAN_SUPPORTED_MAX
 	b.gt    __no_granule_support
-	update_early_cpu_boot_status 0, x3, x4
 	phys_to_ttbr x2, x2
 	msr	ttbr0_el1, x2			// load TTBR0
 	load_ttbr1 x1, x1, x3
diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index 43d212618834..5bafb53fafb4 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -223,11 +223,11 @@ SYM_FUNC_END(__hyp_reset_vectors)
 
 /*
  * Entry point to switch to VHE if deemed capable
+ *
+ * w0: boot mode, as returned by init_kernel_el()
  */
 SYM_FUNC_START(switch_to_vhe)
 	// Need to have booted at EL2
-	adr_l	x1, __boot_cpu_mode
-	ldr	w0, [x1]
 	cmp	w0, #BOOT_CPU_MODE_EL2
 	b.ne	1f
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ed3a4b87529b..9828ad826837 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -56,6 +56,14 @@ EXPORT_SYMBOL(kimage_vaddr);
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
 
+u32 __boot_cpu_mode[] = { BOOT_CPU_MODE_EL2, BOOT_CPU_MODE_EL1 };
+
+/*
+ * The booting CPU updates the failed status @__early_cpu_boot_status,
+ * with MMU turned off.
+ */
+long __section(".mmuoff.data.write") __early_cpu_boot_status;
+
 /*
  * Empty_zero_page is a special page that is used for zero-initialized data
  * and COW.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (17 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 18/26] arm64: head: record CPU boot mode after enabling the MMU Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-24 13:08   ` Will Deacon
  2022-06-13 14:45 ` [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR Ard Biesheuvel
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

The early KASLR init code runs extremely early, and anything that could
be deferred until later should be. So let's defer the randomization of
the module region until much later - this also simplifies the
arithmetic, given that we no longer have to reason about the link time
vs load time placement of the core kernel explicitly. Also get rid of
the global status variable, and infer the status reported by the
diagnostic print from other KASLR-related context.

While at it, get rid of the special case for KASAN without
KASAN_VMALLOC, which never occurs in practice.
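
The module region randomization that moves into kaslr_init() scales a
21-bit seed into whatever range remains after reserving room for the
image; as a standalone illustration (not kernel code, the function name
is made up):

  static u64 pick_module_base(u64 base, u64 range, u32 seed)
  {
          /* use the lower 21 bits of the seed to pick a page aligned
           * offset within the available range */
          base += (range * (seed & ((1U << 21) - 1))) >> 21;
          return base & PAGE_MASK;
  }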

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/kaslr.c | 95 +++++++++-----------
 1 file changed, 40 insertions(+), 55 deletions(-)

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index d5542666182f..af9ffe4d0f0f 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,14 +20,6 @@
 #include <asm/sections.h>
 #include <asm/setup.h>
 
-enum kaslr_status {
-	KASLR_ENABLED,
-	KASLR_DISABLED_CMDLINE,
-	KASLR_DISABLED_NO_SEED,
-	KASLR_DISABLED_FDT_REMAP,
-};
-
-static enum kaslr_status __initdata kaslr_status;
 u64 __ro_after_init module_alloc_base;
 u16 __initdata memstart_offset_seed;
 
@@ -63,15 +55,9 @@ struct arm64_ftr_override kaslr_feature_override __initdata;
 u64 __init kaslr_early_init(void)
 {
 	void *fdt;
-	u64 seed, offset, mask, module_range;
+	u64 seed, offset, mask;
 	unsigned long raw;
 
-	/*
-	 * Set a reasonable default for module_alloc_base in case
-	 * we end up running with module randomization disabled.
-	 */
-	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 	/*
 	 * Try to map the FDT early. If this fails, we simply bail,
 	 * and proceed with KASLR disabled. We will make another
@@ -79,7 +65,6 @@ u64 __init kaslr_early_init(void)
 	 */
 	fdt = get_early_fdt_ptr();
 	if (!fdt) {
-		kaslr_status = KASLR_DISABLED_FDT_REMAP;
 		return 0;
 	}
 
@@ -93,7 +78,6 @@ u64 __init kaslr_early_init(void)
 	 * return 0 if that is the case.
 	 */
 	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
-		kaslr_status = KASLR_DISABLED_CMDLINE;
 		return 0;
 	}
 
@@ -106,7 +90,6 @@ u64 __init kaslr_early_init(void)
 		seed ^= raw;
 
 	if (!seed) {
-		kaslr_status = KASLR_DISABLED_NO_SEED;
 		return 0;
 	}
 
@@ -126,19 +109,43 @@ u64 __init kaslr_early_init(void)
 	/* use the top 16 bits to randomize the linear region */
 	memstart_offset_seed = seed >> 48;
 
-	if (!IS_ENABLED(CONFIG_KASAN_VMALLOC) &&
-	    (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-	     IS_ENABLED(CONFIG_KASAN_SW_TAGS)))
-		/*
-		 * KASAN without KASAN_VMALLOC does not expect the module region
-		 * to intersect the vmalloc region, since shadow memory is
-		 * allocated for each module at load time, whereas the vmalloc
-		 * region is shadowed by KASAN zero pages. So keep modules
-		 * out of the vmalloc region if KASAN is enabled without
-		 * KASAN_VMALLOC, and put the kernel well within 4 GB of the
-		 * module region.
-		 */
-		return offset % SZ_2G;
+	return offset;
+}
+
+static int __init kaslr_init(void)
+{
+	u64 module_range;
+	u32 seed;
+
+	/*
+	 * Set a reasonable default for module_alloc_base in case
+	 * we end up running with module randomization disabled.
+	 */
+	module_alloc_base = (u64)_etext - MODULES_VSIZE;
+
+	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
+		pr_info("KASLR disabled on command line\n");
+		return 0;
+	}
+
+	if (!kaslr_offset()) {
+		pr_warn("KASLR disabled due to lack of seed\n");
+		return 0;
+	}
+
+	pr_info("KASLR enabled\n");
+
+	/*
+	 * KASAN without KASAN_VMALLOC does not expect the module region to
+	 * intersect the vmalloc region, since shadow memory is allocated for
+	 * each module at load time, whereas the vmalloc region will already be
+	 * shadowed by KASAN zero pages.
+	 */
+	BUILD_BUG_ON((IS_ENABLED(CONFIG_KASAN_GENERIC) ||
+	              IS_ENABLED(CONFIG_KASAN_SW_TAGS)) &&
+		     !IS_ENABLED(CONFIG_KASAN_VMALLOC));
+
+	seed = get_random_u32();
 
 	if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
 		/*
@@ -150,8 +157,7 @@ u64 __init kaslr_early_init(void)
 		 * resolved normally.)
 		 */
 		module_range = SZ_2G - (u64)(_end - _stext);
-		module_alloc_base = max((u64)_end + offset - SZ_2G,
-					(u64)MODULES_VADDR);
+		module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR);
 	} else {
 		/*
 		 * Randomize the module region by setting module_alloc_base to
@@ -163,33 +169,12 @@ u64 __init kaslr_early_init(void)
 		 * when ARM64_MODULE_PLTS is enabled.
 		 */
 		module_range = MODULES_VSIZE - (u64)(_etext - _stext);
-		module_alloc_base = (u64)_etext + offset - MODULES_VSIZE;
 	}
 
 	/* use the lower 21 bits to randomize the base of the module region */
 	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
 	module_alloc_base &= PAGE_MASK;
 
-	return offset;
-}
-
-static int __init kaslr_init(void)
-{
-	switch (kaslr_status) {
-	case KASLR_ENABLED:
-		pr_info("KASLR enabled\n");
-		break;
-	case KASLR_DISABLED_CMDLINE:
-		pr_info("KASLR disabled on command line\n");
-		break;
-	case KASLR_DISABLED_NO_SEED:
-		pr_warn("KASLR disabled due to lack of seed\n");
-		break;
-	case KASLR_DISABLED_FDT_REMAP:
-		pr_warn("KASLR disabled due to FDT remapping failure\n");
-		break;
-	}
-
 	return 0;
 }
-core_initcall(kaslr_init)
+late_initcall(kaslr_init)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (18 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-24 13:16   ` Will Deacon
  2022-06-13 14:45 ` [PATCH v4 21/26] arm64: setup: drop early FDT pointer helpers Ard Biesheuvel
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Currently, when KASLR is in effect, we set up the kernel virtual address
space twice: the first time, the KASLR seed is looked up in the device
tree, and the kernel virtual mapping is torn down and recreated again,
after which the relocations are applied a second time. The latter step
means that statically initialized global pointer variables will be reset
to their initial values, and to ensure that BSS variables are not set to
values based on the initial translation, they are cleared again as well.

All of this is needed because we need the command line (taken from the
DT) to tell us whether or not to randomize the virtual address space
before entering the kernel proper. However, this code has expanded
little by little and now creates global state unrelated to the virtual
randomization of the kernel before the mapping is torn down and set up
again, and the BSS cleared for a second time. This has created some
issues in the past, and it would be better to avoid this little dance if
possible.

So instead, let's use the temporary mapping of the device tree, and
execute the bare minimum of code to decide whether or not KASLR should
be enabled, and what the seed is. Only then, create the virtual kernel
mapping, clear the BSS, etc., and proceed as normal.  This avoids the issues
around inconsistent global state due to BSS being cleared twice, and is
generally more maintainable, as it permits us to defer all the remaining
DT parsing and KASLR initialization to a later time.

This means the relocation fixup code runs only a single time as well,
allowing us to simplify the RELR handling code too, which is not
idempotent and was therefore required to keep track of the offset that
was applied the first time around.

Note that this means we have to clone a pair of FDT library objects, so
that we can control how they are built - we need the stack protector
and other instrumentation disabled so that the code can tolerate being
called this early. Note that only the kernel page tables and the
temporary stack are mapped read-write at this point, which ensures that
the early code does not modify any global state inadvertently.
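
The return value of __pi_kaslr_early_init() carries two things at once;
roughly (a sketch, not part of the patch; 'fdt' is the ID map address of
the DT blob kept in x22):

  u64 v = kaslr_early_init(fdt);

  /* low bits (captured in x24): seed for the linear map randomization;
   * only the bottom 16 bits end up in memstart_offset_seed */
  u16 memstart_seed = v & (SZ_2M - 1);

  /* remaining, 2 MiB aligned bits (ORed into x23): the KASLR offset of
   * the kernel image itself */
  u64 kimage_offset = v & ~(u64)(SZ_2M - 1);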

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/Makefile         |   2 +-
 arch/arm64/kernel/head.S           |  73 ++++---------
 arch/arm64/kernel/image-vars.h     |   4 +
 arch/arm64/kernel/kaslr.c          |  87 ---------------
 arch/arm64/kernel/pi/Makefile      |  33 ++++++
 arch/arm64/kernel/pi/kaslr_early.c | 112 ++++++++++++++++++++
 6 files changed, 171 insertions(+), 140 deletions(-)

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index fa7981d0d917..88a96511580e 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -59,7 +59,7 @@ obj-$(CONFIG_ACPI)			+= acpi.o
 obj-$(CONFIG_ACPI_NUMA)			+= acpi_numa.o
 obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL)	+= acpi_parking_protocol.o
 obj-$(CONFIG_PARAVIRT)			+= paravirt.o
-obj-$(CONFIG_RANDOMIZE_BASE)		+= kaslr.o
+obj-$(CONFIG_RANDOMIZE_BASE)		+= kaslr.o pi/
 obj-$(CONFIG_HIBERNATION)		+= hibernate.o hibernate-asm.o
 obj-$(CONFIG_ELF_CORE)			+= elfcore.o
 obj-$(CONFIG_KEXEC_CORE)		+= machine_kexec.o relocate_kernel.o	\
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 8de346dd4470..5a2ff6466b6b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -86,15 +86,13 @@
 	 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
 	 *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
 	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
-	 *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
+	 *  x24        __primary_switch()                       linear map KASLR seed
 	 *  x28        create_idmap()                           callee preserved temp register
 	 */
 SYM_CODE_START(primary_entry)
 	bl	preserve_boot_args
 	bl	init_kernel_el			// w0=cpu_boot_mode
 	mov	x20, x0
-	adrp	x23, __PHYS_OFFSET
-	and	x23, x23, MIN_KIMG_ALIGN - 1	// KASLR offset, defaults to 0
 	bl	create_idmap
 
 	/*
@@ -441,6 +439,10 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	bl	__pi_memset
 	dsb	ishst				// Make zero page visible to PTW
 
+#ifdef CONFIG_RANDOMIZE_BASE
+	adrp	x5, memstart_offset_seed	// Save KASLR linear map seed
+	strh	w24, [x5, :lo12:memstart_offset_seed]
+#endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	bl	kasan_early_init
 #endif
@@ -448,16 +450,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	bl	early_fdt_map			// Try mapping the FDT early
 	mov	x0, x22				// pass FDT address in x0
 	bl	init_feature_override		// Parse cpu feature overrides
-#ifdef CONFIG_RANDOMIZE_BASE
-	tst	x23, ~(MIN_KIMG_ALIGN - 1)	// already running randomized?
-	b.ne	0f
-	bl	kaslr_early_init		// parse FDT for KASLR options
-	cbz	x0, 0f				// KASLR disabled? just proceed
-	orr	x23, x23, x0			// record KASLR offset
-	ldp	x29, x30, [sp], #16		// we must enable KASLR, return
-	ret					// to __primary_switch()
-0:
-#endif
 	mov	x0, x20
 	bl	switch_to_vhe			// Prefer VHE if possible
 	ldp	x29, x30, [sp], #16
@@ -759,27 +751,17 @@ SYM_FUNC_START_LOCAL(__relocate_kernel)
 	 * entry in x9, the address being relocated by the current address or
 	 * bitmap entry in x13 and the address being relocated by the current
 	 * bit in x14.
-	 *
-	 * Because addends are stored in place in the binary, RELR relocations
-	 * cannot be applied idempotently. We use x24 to keep track of the
-	 * currently applied displacement so that we can correctly relocate if
-	 * __relocate_kernel is called twice with non-zero displacements (i.e.
-	 * if there is both a physical misalignment and a KASLR displacement).
 	 */
 	adr_l	x9, __relr_start
 	adr_l	x10, __relr_end
 
-	sub	x15, x23, x24			// delta from previous offset
-	cbz	x15, 7f				// nothing to do if unchanged
-	mov	x24, x23			// save new offset
-
 2:	cmp	x9, x10
 	b.hs	7f
 	ldr	x11, [x9], #8
 	tbnz	x11, #0, 3f			// branch to handle bitmaps
 	add	x13, x11, x23
 	ldr	x12, [x13]			// relocate address entry
-	add	x12, x12, x15
+	add	x12, x12, x23
 	str	x12, [x13], #8			// adjust to start of bitmap
 	b	2b
 
@@ -788,7 +770,7 @@ SYM_FUNC_START_LOCAL(__relocate_kernel)
 	cbz	x11, 6f
 	tbz	x11, #0, 5f			// skip bit if not set
 	ldr	x12, [x14]			// relocate bit
-	add	x12, x12, x15
+	add	x12, x12, x23
 	str	x12, [x14]
 
 5:	add	x14, x14, #8			// move to next bit's address
@@ -812,40 +794,27 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 	adrp	x1, reserved_pg_dir
 	adrp	x2, init_idmap_pg_dir
 	bl	__enable_mmu
-
+#ifdef CONFIG_RELOCATABLE
+	adrp	x23, __PHYS_OFFSET
+	and	x23, x23, MIN_KIMG_ALIGN - 1
+#ifdef CONFIG_RANDOMIZE_BASE
+	mov	x0, x22
+	adrp	x1, init_pg_end
+	mov	sp, x1
+	mov	x29, xzr
+	bl	__pi_kaslr_early_init
+	and	x24, x0, #SZ_2M - 1		// capture memstart offset seed
+	bic	x0, x0, #SZ_2M - 1
+	orr	x23, x23, x0			// record kernel offset
+#endif
+#endif
 	bl	clear_page_tables
 	bl	create_kernel_mapping
 
 	adrp	x1, init_pg_dir
 	load_ttbr1 x1, x1, x2
 #ifdef CONFIG_RELOCATABLE
-#ifdef CONFIG_RELR
-	mov	x24, #0				// no RELR displacement yet
-#endif
 	bl	__relocate_kernel
-#ifdef CONFIG_RANDOMIZE_BASE
-	ldr	x8, =__primary_switched
-	adrp	x0, __PHYS_OFFSET
-	blr	x8
-
-	/*
-	 * If we return here, we have a KASLR displacement in x23 which we need
-	 * to take into account by discarding the current kernel mapping and
-	 * creating a new one.
-	 */
-	adrp	x1, reserved_pg_dir		// Disable translations via TTBR1
-	load_ttbr1 x1, x1, x2
-	bl	clear_page_tables
-	bl	create_kernel_mapping		// Recreate kernel mapping
-
-	tlbi	vmalle1				// Remove any stale TLB entries
-	dsb	nsh
-	isb
-
-	adrp	x1, init_pg_dir			// Re-enable translations via TTBR1
-	load_ttbr1 x1, x1, x2
-	bl	__relocate_kernel
-#endif
 #endif
 	ldr	x8, =__primary_switched
 	adrp	x0, __PHYS_OFFSET
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 241c86b67d01..0c381a405bf0 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -41,6 +41,10 @@ __efistub_dcache_clean_poc = __pi_dcache_clean_poc;
 __efistub___memcpy		= __pi_memcpy;
 __efistub___memmove		= __pi_memmove;
 __efistub___memset		= __pi_memset;
+
+__pi___memcpy			= __pi_memcpy;
+__pi___memmove			= __pi_memmove;
+__pi___memset			= __pi_memset;
 #endif
 
 __efistub__text			= _text;
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index af9ffe4d0f0f..06515afce692 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -23,95 +23,8 @@
 u64 __ro_after_init module_alloc_base;
 u16 __initdata memstart_offset_seed;
 
-static __init u64 get_kaslr_seed(void *fdt)
-{
-	int node, len;
-	fdt64_t *prop;
-	u64 ret;
-
-	node = fdt_path_offset(fdt, "/chosen");
-	if (node < 0)
-		return 0;
-
-	prop = fdt_getprop_w(fdt, node, "kaslr-seed", &len);
-	if (!prop || len != sizeof(u64))
-		return 0;
-
-	ret = fdt64_to_cpu(*prop);
-	*prop = 0;
-	return ret;
-}
-
 struct arm64_ftr_override kaslr_feature_override __initdata;
 
-/*
- * This routine will be executed with the kernel mapped at its default virtual
- * address, and if it returns successfully, the kernel will be remapped, and
- * start_kernel() will be executed from a randomized virtual offset. The
- * relocation will result in all absolute references (e.g., static variables
- * containing function pointers) to be reinitialized, and zero-initialized
- * .bss variables will be reset to 0.
- */
-u64 __init kaslr_early_init(void)
-{
-	void *fdt;
-	u64 seed, offset, mask;
-	unsigned long raw;
-
-	/*
-	 * Try to map the FDT early. If this fails, we simply bail,
-	 * and proceed with KASLR disabled. We will make another
-	 * attempt at mapping the FDT in setup_machine()
-	 */
-	fdt = get_early_fdt_ptr();
-	if (!fdt) {
-		return 0;
-	}
-
-	/*
-	 * Retrieve (and wipe) the seed from the FDT
-	 */
-	seed = get_kaslr_seed(fdt);
-
-	/*
-	 * Check if 'nokaslr' appears on the command line, and
-	 * return 0 if that is the case.
-	 */
-	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
-		return 0;
-	}
-
-	/*
-	 * Mix in any entropy obtainable architecturally if enabled
-	 * and supported.
-	 */
-
-	if (arch_get_random_seed_long_early(&raw))
-		seed ^= raw;
-
-	if (!seed) {
-		return 0;
-	}
-
-	/*
-	 * OK, so we are proceeding with KASLR enabled. Calculate a suitable
-	 * kernel image offset from the seed. Let's place the kernel in the
-	 * middle half of the VMALLOC area (VA_BITS_MIN - 2), and stay clear of
-	 * the lower and upper quarters to avoid colliding with other
-	 * allocations.
-	 * Even if we could randomize at page granularity for 16k and 64k pages,
-	 * let's always round to 2 MB so we don't interfere with the ability to
-	 * map using contiguous PTEs
-	 */
-	mask = ((1UL << (VA_BITS_MIN - 2)) - 1) & ~(SZ_2M - 1);
-	offset = BIT(VA_BITS_MIN - 3) + (seed & mask);
-
-	/* use the top 16 bits to randomize the linear region */
-	memstart_offset_seed = seed >> 48;
-
-	return offset;
-}
-
 static int __init kaslr_init(void)
 {
 	u64 module_range;
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
new file mode 100644
index 000000000000..839291430cb3
--- /dev/null
+++ b/arch/arm64/kernel/pi/Makefile
@@ -0,0 +1,33 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright 2022 Google LLC
+
+KBUILD_CFLAGS	:= $(subst $(CC_FLAGS_FTRACE),,$(KBUILD_CFLAGS)) -fpie \
+		   -Os -DDISABLE_BRANCH_PROFILING $(DISABLE_STACKLEAK_PLUGIN) \
+		   $(call cc-option,-mbranch-protection=none) \
+		   -I$(srctree)/scripts/dtc/libfdt -fno-stack-protector \
+		   -include $(srctree)/include/linux/hidden.h \
+		   -D__DISABLE_EXPORTS -ffreestanding -D__NO_FORTIFY \
+		   $(call cc-option,-fno-addrsig)
+
+# remove SCS flags from all objects in this directory
+KBUILD_CFLAGS	:= $(filter-out $(CC_FLAGS_SCS), $(KBUILD_CFLAGS))
+# disable LTO
+KBUILD_CFLAGS	:= $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS))
+
+GCOV_PROFILE	:= n
+KASAN_SANITIZE	:= n
+KCSAN_SANITIZE	:= n
+UBSAN_SANITIZE	:= n
+KCOV_INSTRUMENT	:= n
+
+$(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_ \
+			       --remove-section=.note.gnu.property \
+			       --prefix-alloc-sections=.init
+$(obj)/%.pi.o: $(obj)/%.o FORCE
+	$(call if_changed,objcopy)
+
+$(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
+	$(call if_changed_rule,cc_o_c)
+
+obj-y		:= kaslr_early.pi.o lib-fdt.pi.o lib-fdt_ro.pi.o
+extra-y		:= $(patsubst %.pi.o,%.o,$(obj-y))
diff --git a/arch/arm64/kernel/pi/kaslr_early.c b/arch/arm64/kernel/pi/kaslr_early.c
new file mode 100644
index 000000000000..6c3855e69395
--- /dev/null
+++ b/arch/arm64/kernel/pi/kaslr_early.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2022 Google LLC
+// Author: Ard Biesheuvel <ardb@google.com>
+
+// NOTE: code in this file runs *very* early, and is not permitted to use
+// global variables or anything that relies on absolute addressing.
+
+#include <linux/libfdt.h>
+#include <linux/init.h>
+#include <linux/linkage.h>
+#include <linux/types.h>
+#include <linux/sizes.h>
+#include <linux/string.h>
+
+#include <asm/archrandom.h>
+#include <asm/memory.h>
+
+/* taken from lib/string.c */
+static char *__strstr(const char *s1, const char *s2)
+{
+	size_t l1, l2;
+
+	l2 = strlen(s2);
+	if (!l2)
+		return (char *)s1;
+	l1 = strlen(s1);
+	while (l1 >= l2) {
+		l1--;
+		if (!memcmp(s1, s2, l2))
+			return (char *)s1;
+		s1++;
+	}
+	return NULL;
+}
+static bool cmdline_contains_nokaslr(const u8 *cmdline)
+{
+	const u8 *str;
+
+	str = __strstr(cmdline, "nokaslr");
+	return str == cmdline || (str > cmdline && *(str - 1) == ' ');
+}
+
+static bool is_kaslr_disabled_cmdline(void *fdt)
+{
+	if (!IS_ENABLED(CONFIG_CMDLINE_FORCE)) {
+		int node;
+		const u8 *prop;
+
+		node = fdt_path_offset(fdt, "/chosen");
+		if (node < 0)
+			goto out;
+
+		prop = fdt_getprop(fdt, node, "bootargs", NULL);
+		if (!prop)
+			goto out;
+
+		if (cmdline_contains_nokaslr(prop))
+			return true;
+
+		if (IS_ENABLED(CONFIG_CMDLINE_EXTEND))
+			goto out;
+
+		return false;
+	}
+out:
+	return cmdline_contains_nokaslr(CONFIG_CMDLINE);
+}
+
+static u64 get_kaslr_seed(void *fdt)
+{
+	int node, len;
+	fdt64_t *prop;
+	u64 ret;
+
+	node = fdt_path_offset(fdt, "/chosen");
+	if (node < 0)
+		return 0;
+
+	prop = fdt_getprop_w(fdt, node, "kaslr-seed", &len);
+	if (!prop || len != sizeof(u64))
+		return 0;
+
+	ret = fdt64_to_cpu(*prop);
+	*prop = 0;
+	return ret;
+}
+
+asmlinkage u64 kaslr_early_init(void *fdt)
+{
+	u64 seed;
+
+	if (is_kaslr_disabled_cmdline(fdt))
+		return 0;
+
+	seed = get_kaslr_seed(fdt);
+	if (!seed) {
+#ifdef CONFIG_ARCH_RANDOM
+		 if (!__early_cpu_has_rndr() ||
+		     !__arm64_rndr((unsigned long *)&seed))
+#endif
+		return 0;
+	}
+
+	/*
+	 * OK, so we are proceeding with KASLR enabled. Calculate a suitable
+	 * kernel image offset from the seed. Let's place the kernel in the
+	 * middle half of the VMALLOC area (VA_BITS_MIN - 2), and stay clear of
+	 * the lower and upper quarters to avoid colliding with other
+	 * allocations.
+	 */
+	return BIT(VA_BITS_MIN - 3) + (seed & GENMASK(VA_BITS_MIN - 3, 0));
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 21/26] arm64: setup: drop early FDT pointer helpers
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (19 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 14:45 ` [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment Ard Biesheuvel
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

We no longer need to call into C code to map the FDT before entering the
kernel proper, so let's drop the helpers we added for this.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/setup.h |  3 ---
 arch/arm64/kernel/head.S       |  2 --
 arch/arm64/kernel/setup.c      | 15 ---------------
 3 files changed, 20 deletions(-)

diff --git a/arch/arm64/include/asm/setup.h b/arch/arm64/include/asm/setup.h
index 6437df661700..5f147a418281 100644
--- a/arch/arm64/include/asm/setup.h
+++ b/arch/arm64/include/asm/setup.h
@@ -5,9 +5,6 @@
 
 #include <uapi/asm/setup.h>
 
-void *get_early_fdt_ptr(void);
-void early_fdt_map(u64 dt_phys);
-
 /*
  * These two variables are used in the head.S file.
  */
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 5a2ff6466b6b..6bf685f988f1 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -446,8 +446,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	bl	kasan_early_init
 #endif
-	mov	x0, x21				// pass FDT address in x0
-	bl	early_fdt_map			// Try mapping the FDT early
 	mov	x0, x22				// pass FDT address in x0
 	bl	init_feature_override		// Parse cpu feature overrides
 	mov	x0, x20
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index cf3a759f10d4..6c2120afe542 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -163,21 +163,6 @@ static void __init smp_build_mpidr_hash(void)
 		pr_warn("Large number of MPIDR hash buckets detected\n");
 }
 
-static void *early_fdt_ptr __initdata;
-
-void __init *get_early_fdt_ptr(void)
-{
-	return early_fdt_ptr;
-}
-
-asmlinkage void __init early_fdt_map(u64 dt_phys)
-{
-	int fdt_size;
-
-	early_fixmap_init();
-	early_fdt_ptr = fixmap_remap_fdt(dt_phys, &fdt_size, PAGE_KERNEL);
-}
-
 static void __init setup_machine_fdt(phys_addr_t dt_phys)
 {
 	int size;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (20 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 21/26] arm64: setup: drop early FDT pointer helpers Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 17:00   ` Kees Cook
  2022-06-13 14:45 ` [PATCH v4 23/26] arm64: head: remap the kernel text/inittext region read-only Ard Biesheuvel
                   ` (4 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Currently, the ro_after_init section sits right in the middle of the
text/rodata/inittext segment, making it difficult to map any of those
non-writable during early boot. So instead, move it to the start of
.data, and update the init sequences so that the section is remapped
read-only once startup completes.

Note that this moves the entire HYP data section into .data as well -
this likely needs to remain as a single block for now, but could perhaps
be split into a .rodata and a .data..ro_after_init section later.
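
Roughly, the resulting image layout looks like this (a simplified sketch
of the linker script changes below, not the exact link map):

  _stext .. __init_begin    .text, .rodata, .rodata.text      RO/RX
  __inittext_begin ..       .init.text                        RX, freed after init
  __initdata_begin ..       .init.data                        RW, freed after init
  _data ..                  page table dirs (idmap/tramp/
                            reserved/swapper), HYP data,
                            ro_after_init data                RW, remapped RO once
                                                              startup completes
                            remaining .data/.bss              RW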

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/vmlinux.lds.S | 42 ++++++++++++--------
 arch/arm64/mm/mmu.c             | 29 ++++++++------
 2 files changed, 42 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 45131e354e27..736aca63dad1 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -59,6 +59,7 @@
 
 #define RO_EXCEPTION_TABLE_ALIGN	4
 #define RUNTIME_DISCARD_EXIT
+#define RO_AFTER_INIT_DATA
 
 #include <asm-generic/vmlinux.lds.h>
 #include <asm/cache.h>
@@ -188,30 +189,13 @@ SECTIONS
 	/* everything from this point to __init_begin will be marked RO NX */
 	RO_DATA(PAGE_SIZE)
 
-	HYPERVISOR_DATA_SECTIONS
-
 	/* code sections that are never executed via the kernel mapping */
 	.rodata.text : {
 		TRAMP_TEXT
 		HIBERNATE_TEXT
 		KEXEC_TEXT
-		. = ALIGN(PAGE_SIZE);
 	}
 
-	idmap_pg_dir = .;
-	. += PAGE_SIZE;
-
-#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-	tramp_pg_dir = .;
-	. += PAGE_SIZE;
-#endif
-
-	reserved_pg_dir = .;
-	. += PAGE_SIZE;
-
-	swapper_pg_dir = .;
-	. += PAGE_SIZE;
-
 	. = ALIGN(SEGMENT_ALIGN);
 	__init_begin = .;
 	__inittext_begin = .;
@@ -274,6 +258,30 @@ SECTIONS
 
 	_data = .;
 	_sdata = .;
+
+	__start_ro_after_init = .;
+	idmap_pg_dir = .;
+	. += PAGE_SIZE;
+
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+	tramp_pg_dir = .;
+	. += PAGE_SIZE;
+#endif
+	reserved_pg_dir = .;
+	. += PAGE_SIZE;
+
+	swapper_pg_dir = .;
+	. += PAGE_SIZE;
+
+	HYPERVISOR_DATA_SECTIONS
+
+	.data.ro_after_init : {
+		*(.data..ro_after_init)
+		JUMP_TABLE_DATA
+		. = ALIGN(SEGMENT_ALIGN);
+		__end_ro_after_init = .;
+	}
+
 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
 
 	/*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9828ad826837..e9b074ffc768 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -495,11 +495,17 @@ static void __init __map_memblock(pgd_t *pgdp, phys_addr_t start,
 void __init mark_linear_text_alias_ro(void)
 {
 	/*
-	 * Remove the write permissions from the linear alias of .text/.rodata
+	 * Remove the write permissions from the linear alias of .text/.rodata/ro_after_init
 	 */
 	update_mapping_prot(__pa_symbol(_stext), (unsigned long)lm_alias(_stext),
 			    (unsigned long)__init_begin - (unsigned long)_stext,
 			    PAGE_KERNEL_RO);
+
+	update_mapping_prot(__pa_symbol(__start_ro_after_init),
+			    (unsigned long)lm_alias(__start_ro_after_init),
+			    (unsigned long)__end_ro_after_init -
+			    (unsigned long)__start_ro_after_init,
+			    PAGE_KERNEL_RO);
 }
 
 static bool crash_mem_map __initdata;
@@ -608,12 +614,10 @@ void mark_rodata_ro(void)
 {
 	unsigned long section_size;
 
-	/*
-	 * mark .rodata as read only. Use __init_begin rather than __end_rodata
-	 * to cover NOTES and EXCEPTION_TABLE.
-	 */
-	section_size = (unsigned long)__init_begin - (unsigned long)__start_rodata;
-	update_mapping_prot(__pa_symbol(__start_rodata), (unsigned long)__start_rodata,
+	section_size = (unsigned long)__end_ro_after_init -
+		       (unsigned long)__start_ro_after_init;
+	update_mapping_prot(__pa_symbol(__start_ro_after_init),
+			    (unsigned long)__start_ro_after_init,
 			    section_size, PAGE_KERNEL_RO);
 
 	debug_checkwx();
@@ -733,18 +737,19 @@ static void __init map_kernel(pgd_t *pgdp)
 		text_prot = __pgprot_modify(text_prot, PTE_GP, PTE_GP);
 
 	/*
-	 * Only rodata will be remapped with different permissions later on,
-	 * all other segments are allowed to use contiguous mappings.
+	 * Only data will be partially remapped with different permissions
+	 * later on, all other segments are allowed to use contiguous mappings.
 	 */
 	map_kernel_segment(pgdp, _stext, _etext, text_prot, &vmlinux_text, 0,
 			   VM_NO_GUARD);
-	map_kernel_segment(pgdp, __start_rodata, __inittext_begin, PAGE_KERNEL,
-			   &vmlinux_rodata, NO_CONT_MAPPINGS, VM_NO_GUARD);
+	map_kernel_segment(pgdp, __start_rodata, __inittext_begin, PAGE_KERNEL_RO,
+			   &vmlinux_rodata, 0, VM_NO_GUARD);
 	map_kernel_segment(pgdp, __inittext_begin, __inittext_end, text_prot,
 			   &vmlinux_inittext, 0, VM_NO_GUARD);
 	map_kernel_segment(pgdp, __initdata_begin, __initdata_end, PAGE_KERNEL,
 			   &vmlinux_initdata, 0, VM_NO_GUARD);
-	map_kernel_segment(pgdp, _data, _end, PAGE_KERNEL, &vmlinux_data, 0, 0);
+	map_kernel_segment(pgdp, _data, _end, PAGE_KERNEL, &vmlinux_data,
+			   NO_CONT_MAPPINGS | NO_BLOCK_MAPPINGS, 0);
 
 	if (!READ_ONCE(pgd_val(*pgd_offset_pgd(pgdp, FIXADDR_START)))) {
 		/*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 23/26] arm64: head: remap the kernel text/inittext region read-only
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (21 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 16:57   ` Kees Cook
  2022-06-13 14:45 ` [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags Ard Biesheuvel
                   ` (3 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

In order to be able to run with WXN from boot (which could potentially
be under a hypervisor regime that mandates this), update the temporary
kernel page tables with read-only attributes for the text regions before
attempting to execute from them.

This is rather straightforward for 16k and 64k granule configurations,
as the split between executable and writable regions is guaranteed to be
aligned to the granule used for the early kernel page tables. For 4k, it
involves installing a single table entry and populating it accordingly.
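
As an aside, the arithmetic for the 4k case can be illustrated with a
stand-alone user space sketch (not kernel code; the boundary address below
is made up for the sake of the example): within the 2 MiB block that
contains __initdata_begin, pages below the boundary keep RX attributes and
pages at or above it become RW.

  #include <stdio.h>
  #include <stdint.h>

  #define PAGE_SHIFT      12
  #define BLOCK_SHIFT     21                      /* 2 MiB block, 4k granule */
  #define BLOCK_SIZE      (1ULL << BLOCK_SHIFT)
  #define PAGE_SIZE       (1ULL << PAGE_SHIFT)

  int main(void)
  {
          uint64_t initdata_begin = 0x40a52000ULL; /* hypothetical, not 2M aligned */
          uint64_t block = initdata_begin & ~(BLOCK_SIZE - 1);
          unsigned int split = (initdata_begin - block) >> PAGE_SHIFT;
          unsigned int i;

          /* print the attributes each 4k entry of the new table would get */
          for (i = 0; i < (BLOCK_SIZE >> PAGE_SHIFT); i++)
                  printf("pte[%3u] -> %#llx %s\n", i,
                         (unsigned long long)(block + i * PAGE_SIZE),
                         i < split ? "RX" : "RW");
          return 0;
  }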

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/assembler.h |  8 +++
 arch/arm64/kernel/head.S           | 73 ++++++++++++++++++--
 arch/arm64/kernel/vmlinux.lds.S    |  2 +-
 arch/arm64/mm/proc.S               | 11 ---
 4 files changed, 78 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index b2584709c332..e1e652410d7d 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -507,6 +507,14 @@ alternative_endif
 	load_ttbr1 \page_table, \tmp, \tmp2
 	.endm
 
+	.macro		__idmap_cpu_set_reserved_ttbr1, tmp1, tmp2
+	adrp		\tmp1, reserved_pg_dir
+	load_ttbr1 	\tmp1, \tmp1, \tmp2
+	tlbi		vmalle1
+	dsb		nsh
+	isb
+	.endm
+
 /*
  * reset_pmuserenr_el0 - reset PMUSERENR_EL0 if PMUv3 present
  */
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 6bf685f988f1..92cbad41eed8 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -87,7 +87,7 @@
 	 *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
 	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
 	 *  x24        __primary_switch()                       linear map KASLR seed
-	 *  x28        create_idmap()                           callee preserved temp register
+	 *  x28        create_idmap(), remap_kernel_text()      callee preserved temp register
 	 */
 SYM_CODE_START(primary_entry)
 	bl	preserve_boot_args
@@ -380,6 +380,66 @@ SYM_FUNC_START_LOCAL(create_kernel_mapping)
 	ret
 SYM_FUNC_END(create_kernel_mapping)
 
+SYM_FUNC_START_LOCAL(remap_kernel_text)
+	mov	x28, lr
+
+	ldr_l	x1, kimage_vaddr
+	mov	x2, x1
+	ldr_l	x3, .Linitdata_begin
+	adrp	x4, _text
+	bic	x4, x4, #SWAPPER_BLOCK_SIZE - 1
+	mov	x5, SWAPPER_RX_MMUFLAGS
+	mov	x6, #SWAPPER_BLOCK_SHIFT
+	bl	remap_region
+
+#if SWAPPER_BLOCK_SHIFT > PAGE_SHIFT
+	/*
+	 * If the boundary between inittext and initdata happens to be aligned
+	 * sufficiently, we are done here. Otherwise, we have to replace its block
+	 * entry with a table entry, and populate the lower level table accordingly.
+	 */
+	ldr_l	x3, .Linitdata_begin
+	tst	x3, #SWAPPER_BLOCK_SIZE - 1
+	b.eq	0f
+
+	/* First, create a table mapping to replace the block mapping */
+	ldr_l	x1, kimage_vaddr
+	bic	x2, x3, #SWAPPER_BLOCK_SIZE - 1
+	adrp	x4, init_pg_end - PAGE_SIZE
+	mov	x5, #PMD_TYPE_TABLE
+	mov	x6, #SWAPPER_BLOCK_SHIFT
+	bl	remap_region
+
+	/* Apply executable permissions to the first subregion */
+	adrp	x0, init_pg_end - PAGE_SIZE
+	ldr_l	x3, .Linitdata_begin
+	bic	x1, x3, #SWAPPER_BLOCK_SIZE - 1
+	mov	x2, x1
+	adrp	x4, __initdata_begin
+	bic	x4, x4, #SWAPPER_BLOCK_SIZE - 1
+	mov	x5, SWAPPER_RX_MMUFLAGS | PTE_TYPE_PAGE
+	mov	x6, #PAGE_SHIFT
+	bl	remap_region
+
+	/* Apply writable permissions to the second subregion */
+	ldr_l	x2, .Linitdata_begin
+	bic	x1, x2, #SWAPPER_BLOCK_SIZE - 1
+	add	x3, x1, #SWAPPER_BLOCK_SIZE
+	adrp	x4, __initdata_begin
+	mov	x5, SWAPPER_RW_MMUFLAGS | PTE_TYPE_PAGE
+	mov	x6, #PAGE_SHIFT
+	bl	remap_region
+#endif
+0:	dsb	ishst
+	ret	x28
+SYM_FUNC_END(remap_kernel_text)
+
+	__INITDATA
+	.align	3
+.Linitdata_begin:
+	.quad	__initdata_begin
+	.previous
+
 	/*
 	 * Initialize CPU registers with task-specific and cpu-specific context.
 	 *
@@ -808,12 +868,17 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 #endif
 	bl	clear_page_tables
 	bl	create_kernel_mapping
-
+#ifdef CONFIG_RELOCATABLE
 	adrp	x1, init_pg_dir
 	load_ttbr1 x1, x1, x2
-#ifdef CONFIG_RELOCATABLE
-	bl	__relocate_kernel
+	bl	__relocate_kernel		// preserves x0
+
+	__idmap_cpu_set_reserved_ttbr1 x1, x2
 #endif
+	bl	remap_kernel_text
+	adrp	x1, init_pg_dir
+	load_ttbr1 x1, x1, x2
+
 	ldr	x8, =__primary_switched
 	adrp	x0, __PHYS_OFFSET
 	br	x8
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 736aca63dad1..3830c6c66e46 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -310,7 +310,7 @@ SECTIONS
 
 	. = ALIGN(PAGE_SIZE);
 	init_pg_dir = .;
-	. += INIT_DIR_SIZE;
+	. += INIT_DIR_SIZE + PAGE_SIZE;
 	init_pg_end = .;
 
 	. = ALIGN(SEGMENT_ALIGN);
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 493b8ffc9be5..c237e976b138 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -168,17 +168,6 @@ SYM_FUNC_END(cpu_do_resume)
 
 	.pushsection ".idmap.text", "awx"
 
-.macro	__idmap_cpu_set_reserved_ttbr1, tmp1, tmp2
-	adrp	\tmp1, reserved_pg_dir
-	phys_to_ttbr \tmp2, \tmp1
-	offset_ttbr1 \tmp2, \tmp1
-	msr	ttbr1_el1, \tmp2
-	isb
-	tlbi	vmalle1
-	dsb	nsh
-	isb
-.endm
-
 /*
  * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1)
  *
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (22 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 23/26] arm64: head: remap the kernel text/inittext region read-only Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 16:37   ` Kees Cook
  2022-06-13 14:45 ` [PATCH v4 25/26] arm64: mm: add support for WXN memory translation attribute Ard Biesheuvel
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Add a hook to permit architectures to perform validation on the prot
flags passed to mmap(), like arch_validate_prot() does for mprotect().
This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
configurations that run with WXN enabled.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 include/linux/mman.h | 15 +++++++++++++++
 mm/mmap.c            |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 58b3abd457a3..53ac72310ce0 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -120,6 +120,21 @@ static inline bool arch_validate_flags(unsigned long flags)
 #define arch_validate_flags arch_validate_flags
 #endif
 
+#ifndef arch_validate_mmap_prot
+/*
+ * This is called from mmap(), which ignores unknown prot bits so the default
+ * is to accept anything.
+ *
+ * Returns true if the prot flags are valid
+ */
+static inline bool arch_validate_mmap_prot(unsigned long prot,
+					   unsigned long addr)
+{
+	return true;
+}
+#define arch_validate_mmap_prot arch_validate_mmap_prot
+#endif
+
 /*
  * Optimisation macro.  It is equivalent to:
  *      (x & bit1) ? bit2 : 0
diff --git a/mm/mmap.c b/mm/mmap.c
index 61e6135c54ef..4a585879937d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1437,6 +1437,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (!(file && path_noexec(&file->f_path)))
 			prot |= PROT_EXEC;
 
+	if (!arch_validate_mmap_prot(prot, addr))
+		return -EACCES;
+
 	/* force arch specific MAP_FIXED handling in get_unmapped_area */
 	if (flags & MAP_FIXED_NOREPLACE)
 		flags |= MAP_FIXED;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 25/26] arm64: mm: add support for WXN memory translation attribute
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (23 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 16:51   ` Kees Cook
  2022-06-13 14:45 ` [PATCH v4 26/26] arm64: kernel: move ID map out of .text mapping Ard Biesheuvel
  2022-06-24 13:19 ` [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Will Deacon
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

The AArch64 virtual memory system supports a global WXN control, which
can be enabled to make all writable mappings implicitly no-exec. This is
a useful hardening feature, as it prevents mistakes in managing page
table permissions from being exploited to attack the system.

When enabled at EL1, the restrictions apply to both EL1 and EL0. EL1 is
completely under our control, and has been cleaned up to allow WXN to be
enabled from boot onwards. EL0 is not under our control, but given that
widely deployed security features such as selinux or PaX already limit
the ability of user space to create mappings that are writable and
executable at the same time, the impact of enabling this for EL0 is
expected to be limited. (For this reason, common user space libraries
that have a legitimate need for manipulating executable code already
carry fallbacks such as [0].)

If enabled at compile time, the feature can still be disabled at boot if
needed, by passing arm64.nowxn on the kernel command line.

[0] https://github.com/libffi/libffi/blob/master/src/closures.c#L440
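
For illustration, usage with this patch applied boils down to the
following (option names as introduced below):

  # build-time opt-in
  CONFIG_ARM64_WXN=y

  # per-boot opt-out, appended to the kernel command line
  arm64.nowxn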

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/Kconfig                   | 11 ++++++
 arch/arm64/include/asm/cpufeature.h  |  8 +++++
 arch/arm64/include/asm/mman.h        | 36 ++++++++++++++++++++
 arch/arm64/include/asm/mmu_context.h | 30 +++++++++++++++-
 arch/arm64/kernel/head.S             | 28 ++++++++++++++-
 arch/arm64/kernel/idreg-override.c   | 16 +++++++++
 arch/arm64/mm/proc.S                 |  6 ++++
 7 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..d262d5ab4316 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1422,6 +1422,17 @@ config RODATA_FULL_DEFAULT_ENABLED
 	  This requires the linear region to be mapped down to pages,
 	  which may adversely affect performance in some cases.
 
+config ARM64_WXN
+	bool "Enable WXN attribute so all writable mappings are non-exec"
+	help
+	  Set the WXN bit in the SCTLR system register so that all writable
+	  mappings are treated as if the PXN/UXN bit is set as well.
+	  If this is set to Y, it can still be disabled at runtime by
+	  passing 'arm64.nowxn' on the kernel command line.
+
+	  This should only be set if no software needs to be supported that
+	  relies on being able to execute from writable mappings.
+
 config ARM64_SW_TTBR0_PAN
 	bool "Emulate Privileged Access Never using TTBR0_EL1 switching"
 	help
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 14a8f3d93add..fc364c4d31e2 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -911,10 +911,18 @@ extern struct arm64_ftr_override id_aa64mmfr1_override;
 extern struct arm64_ftr_override id_aa64pfr1_override;
 extern struct arm64_ftr_override id_aa64isar1_override;
 extern struct arm64_ftr_override id_aa64isar2_override;
+extern struct arm64_ftr_override sctlr_override;
 
 u32 get_kvm_ipa_limit(void);
 void dump_cpu_features(void);
 
+static inline bool arm64_wxn_enabled(void)
+{
+	if (!IS_ENABLED(CONFIG_ARM64_WXN))
+		return false;
+	return (sctlr_override.val & sctlr_override.mask & 0xf) == 0;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 5966ee4a6154..6d4940342ba7 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -35,11 +35,40 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
 }
 #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
 
+static inline bool arm64_check_wx_prot(unsigned long prot,
+				       struct task_struct *tsk)
+{
+	/*
+	 * When we are running with SCTLR_ELx.WXN==1, writable mappings are
+	 * implicitly non-executable. This means we should reject such mappings
+	 * when user space attempts to create them using mmap() or mprotect().
+	 */
+	if (arm64_wxn_enabled() &&
+	    ((prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC))) {
+		/*
+		 * User space libraries such as libffi carry elaborate
+		 * heuristics to decide whether it is worth it to even attempt
+		 * to create writable executable mappings, as PaX or selinux
+		 * enabled systems will outright reject it. They will usually
+		 * fall back to something else (e.g., two separate shared
+		 * mmap()s of a temporary file) on failure.
+		 */
+		pr_info_ratelimited(
+			"process %s (%d) attempted to create PROT_WRITE+PROT_EXEC mapping\n",
+			tsk->comm, tsk->pid);
+		return false;
+	}
+	return true;
+}
+
 static inline bool arch_validate_prot(unsigned long prot,
 	unsigned long addr __always_unused)
 {
 	unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
 
+	if (!arm64_check_wx_prot(prot, current))
+		return false;
+
 	if (system_supports_bti())
 		supported |= PROT_BTI;
 
@@ -50,6 +79,13 @@ static inline bool arch_validate_prot(unsigned long prot,
 }
 #define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
 
+static inline bool arch_validate_mmap_prot(unsigned long prot,
+					   unsigned long addr)
+{
+	return arm64_check_wx_prot(prot, current);
+}
+#define arch_validate_mmap_prot arch_validate_mmap_prot
+
 static inline bool arch_validate_flags(unsigned long vm_flags)
 {
 	if (!system_supports_mte())
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index c7ccd82db1d2..cd4bb5410a18 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -19,13 +19,41 @@
 #include <asm/cacheflush.h>
 #include <asm/cpufeature.h>
 #include <asm/proc-fns.h>
-#include <asm-generic/mm_hooks.h>
 #include <asm/cputype.h>
 #include <asm/sysreg.h>
 #include <asm/tlbflush.h>
 
 extern bool rodata_full;
 
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+}
+
+static inline void arch_unmap(struct mm_struct *mm,
+			unsigned long start, unsigned long end)
+{
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool execute, bool foreign)
+{
+	if (IS_ENABLED(CONFIG_ARM64_WXN) && execute &&
+	    (vma->vm_flags & (VM_WRITE | VM_EXEC)) == (VM_WRITE | VM_EXEC)) {
+		pr_warn_ratelimited(
+			"process %s (%d) attempted to execute from writable memory\n",
+			current->comm, current->pid);
+		/* disallow unless the nowxn override is set */
+		return !arm64_wxn_enabled();
+	}
+	return true;
+}
+
 static inline void contextidr_thread_switch(struct task_struct *next)
 {
 	if (!IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR))
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 92cbad41eed8..834afdc1c6ff 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -511,6 +511,12 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	mov	x0, x20
 	bl	switch_to_vhe			// Prefer VHE if possible
 	ldp	x29, x30, [sp], #16
+#ifdef CONFIG_ARM64_WXN
+	ldr_l	x1, sctlr_override + FTR_OVR_VAL_OFFSET
+	tbz	x1, #0, 0f
+	blr	lr
+0:
+#endif
 	bl	start_kernel
 	ASM_BUG()
 SYM_FUNC_END(__primary_switched)
@@ -881,5 +887,25 @@ SYM_FUNC_START_LOCAL(__primary_switch)
 
 	ldr	x8, =__primary_switched
 	adrp	x0, __PHYS_OFFSET
-	br	x8
+	blr	x8
+#ifdef CONFIG_ARM64_WXN
+	/*
+	 * If we return here, we need to disable WXN before we proceed. This
+	 * requires the MMU to be disabled, so it needs to occur while running
+	 * from the ID map.
+	 */
+	mrs	x0, sctlr_el1
+	bic	x1, x0, #SCTLR_ELx_M
+	msr	sctlr_el1, x1
+	isb
+
+	tlbi	vmalle1
+	dsb	nsh
+	isb
+
+	bic	x0, x0, #SCTLR_ELx_WXN
+	msr	sctlr_el1, x0
+	isb
+	ret
+#endif
 SYM_FUNC_END(__primary_switch)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index f92836e196e5..85d8fa47d196 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -94,12 +94,27 @@ static const struct ftr_set_desc kaslr __initconst = {
 	},
 };
 
+#ifdef CONFIG_ARM64_WXN
+asmlinkage struct arm64_ftr_override sctlr_override __ro_after_init;
+static const struct ftr_set_desc sctlr __initconst = {
+	.name		= "sctlr",
+	.override	= &sctlr_override,
+	.fields		= {
+		{ "nowxn", 0 },
+		{}
+	},
+};
+#endif
+
 static const struct ftr_set_desc * const regs[] __initconst = {
 	&mmfr1,
 	&pfr1,
 	&isar1,
 	&isar2,
 	&kaslr,
+#ifdef CONFIG_ARM64_WXN
+	&sctlr,
+#endif
 };
 
 static const struct {
@@ -115,6 +130,7 @@ static const struct {
 	  "id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0"	   },
 	{ "arm64.nomte",		"id_aa64pfr1.mte=0" },
 	{ "nokaslr",			"kaslr.disabled=1" },
+	{ "arm64.nowxn",		"sctlr.nowxn=1" },
 };
 
 static int __init find_field(const char *cmdline,
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index c237e976b138..9ffdf1091d97 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -487,6 +487,12 @@ SYM_FUNC_START(__cpu_setup)
 	 * Prepare SCTLR
 	 */
 	mov_q	x0, INIT_SCTLR_EL1_MMU_ON
+#ifdef CONFIG_ARM64_WXN
+	ldr_l	x1, sctlr_override + FTR_OVR_VAL_OFFSET
+	tst	x1, #0x1			// WXN disabled on command line?
+	orr	x1, x0, #SCTLR_ELx_WXN
+	csel	x0, x0, x1, ne
+#endif
 	ret					// return to head.S
 
 	.unreq	mair
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v4 26/26] arm64: kernel: move ID map out of .text mapping
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (24 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 25/26] arm64: mm: add support for WXN memory translation attribute Ard Biesheuvel
@ 2022-06-13 14:45 ` Ard Biesheuvel
  2022-06-13 16:52   ` Kees Cook
  2022-06-24 13:19 ` [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Will Deacon
  26 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 14:45 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-hardening, Ard Biesheuvel, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown,
	Anshuman Khandual

Reorganize the ID map slightly so that only code that is executed via
the 1:1 mapping remains. This allows the ID map to be moved out of the .text
segment, given that it no longer needs exec permissions via the kernel
mapping.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/kernel/head.S        | 5 ++++-
 arch/arm64/kernel/vmlinux.lds.S | 2 +-
 arch/arm64/mm/proc.S            | 2 --
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 834afdc1c6ff..eb959d3387b4 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -525,7 +525,7 @@ SYM_FUNC_END(__primary_switched)
  * end early head section, begin head code that is also used for
  * hotplug and needs to have the same protections as the text region
  */
-	.section ".idmap.text","awx"
+	.text
 
 /*
  * Starting from EL2 or EL1, configure the CPU to execute at the highest
@@ -617,6 +617,7 @@ SYM_FUNC_START_LOCAL(set_cpu_boot_mode_flag)
 	ret
 SYM_FUNC_END(set_cpu_boot_mode_flag)
 
+	.section ".idmap.text","awx"
 	/*
 	 * This provides a "holding pen" for platforms to hold all secondary
 	 * cores are held until we're ready for them to initialise.
@@ -658,6 +659,7 @@ SYM_FUNC_START_LOCAL(secondary_startup)
 	br	x8
 SYM_FUNC_END(secondary_startup)
 
+	.text
 SYM_FUNC_START_LOCAL(__secondary_switched)
 	mov	x0, x20
 	bl	set_cpu_boot_mode_flag
@@ -717,6 +719,7 @@ SYM_FUNC_END(__secondary_too_slow)
  * Checks if the selected granule size is supported by the CPU.
  * If it isn't, park the CPU
  */
+	.section ".idmap.text","awx"
 SYM_FUNC_START(__enable_mmu)
 	mrs	x3, ID_AA64MMFR0_EL1
 	ubfx	x3, x3, #ID_AA64MMFR0_TGRAN_SHIFT, 4
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 3830c6c66e46..d51aa4bbd272 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -169,7 +169,6 @@ SECTIONS
 			LOCK_TEXT
 			KPROBES_TEXT
 			HYPERVISOR_TEXT
-			IDMAP_TEXT
 			*(.gnu.warning)
 		. = ALIGN(16);
 		*(.got)			/* Global offset table		*/
@@ -194,6 +193,7 @@ SECTIONS
 		TRAMP_TEXT
 		HIBERNATE_TEXT
 		KEXEC_TEXT
+		IDMAP_TEXT
 	}
 
 	. = ALIGN(SEGMENT_ALIGN);
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 9ffdf1091d97..7b22e2afe8a0 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -107,7 +107,6 @@ SYM_FUNC_END(cpu_do_suspend)
  *
  * x0: Address of context pointer
  */
-	.pushsection ".idmap.text", "awx"
 SYM_FUNC_START(cpu_do_resume)
 	ldp	x2, x3, [x0]
 	ldp	x4, x5, [x0, #16]
@@ -163,7 +162,6 @@ alternative_else_nop_endif
 	isb
 	ret
 SYM_FUNC_END(cpu_do_resume)
-	.popsection
 #endif
 
 	.pushsection ".idmap.text", "awx"
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags
  2022-06-13 14:45 ` [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags Ard Biesheuvel
@ 2022-06-13 16:37   ` Kees Cook
  2022-06-13 16:44     ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Kees Cook @ 2022-06-13 16:37 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:48PM +0200, Ard Biesheuvel wrote:
> Add a hook to permit architectures to perform validation on the prot
> flags passed to mmap(), like arch_validate_prot() does for mprotect().
> This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
> configurations that run with WXN enabled.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  include/linux/mman.h | 15 +++++++++++++++
>  mm/mmap.c            |  3 +++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 58b3abd457a3..53ac72310ce0 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -120,6 +120,21 @@ static inline bool arch_validate_flags(unsigned long flags)
>  #define arch_validate_flags arch_validate_flags
>  #endif
>  
> +#ifndef arch_validate_mmap_prot
> +/*
> + * This is called from mmap(), which ignores unknown prot bits so the default
> + * is to accept anything.
> + *
> + * Returns true if the prot flags are valid
> + */
> +static inline bool arch_validate_mmap_prot(unsigned long prot,
> +					   unsigned long addr)
> +{
> +	return true;
> +}
> +#define arch_validate_mmap_prot arch_validate_mmap_prot
> +#endif
> +
>  /*
>   * Optimisation macro.  It is equivalent to:
>   *      (x & bit1) ? bit2 : 0
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 61e6135c54ef..4a585879937d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1437,6 +1437,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		if (!(file && path_noexec(&file->f_path)))
>  			prot |= PROT_EXEC;
>  
> +	if (!arch_validate_mmap_prot(prot, addr))
> +		return -EACCES;

I assume yes, but just to be clear, the existing userspace programs that
can switch modes are checking for EACCES? (Or are they just checking for
failure generally?) It looks like, for example, SELinux returns EACCES
too, so this looks correct. (Looking at the mmap man page, it seems the
ship has sailed for this to be EPERM, which looks more correct to me,
but so be it.)

> +
>  	/* force arch specific MAP_FIXED handling in get_unmapped_area */
>  	if (flags & MAP_FIXED_NOREPLACE)
>  		flags |= MAP_FIXED;
> -- 
> 2.30.2
> 

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags
  2022-06-13 16:37   ` Kees Cook
@ 2022-06-13 16:44     ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 16:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, 13 Jun 2022 at 18:37, Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, Jun 13, 2022 at 04:45:48PM +0200, Ard Biesheuvel wrote:
> > Add a hook to permit architectures to perform validation on the prot
> > flags passed to mmap(), like arch_validate_prot() does for mprotect().
> > This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
> > configurations that run with WXN enabled.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  include/linux/mman.h | 15 +++++++++++++++
> >  mm/mmap.c            |  3 +++
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index 58b3abd457a3..53ac72310ce0 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -120,6 +120,21 @@ static inline bool arch_validate_flags(unsigned long flags)
> >  #define arch_validate_flags arch_validate_flags
> >  #endif
> >
> > +#ifndef arch_validate_mmap_prot
> > +/*
> > + * This is called from mmap(), which ignores unknown prot bits so the default
> > + * is to accept anything.
> > + *
> > + * Returns true if the prot flags are valid
> > + */
> > +static inline bool arch_validate_mmap_prot(unsigned long prot,
> > +                                        unsigned long addr)
> > +{
> > +     return true;
> > +}
> > +#define arch_validate_mmap_prot arch_validate_mmap_prot
> > +#endif
> > +
> >  /*
> >   * Optimisation macro.  It is equivalent to:
> >   *      (x & bit1) ? bit2 : 0
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 61e6135c54ef..4a585879937d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1437,6 +1437,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >               if (!(file && path_noexec(&file->f_path)))
> >                       prot |= PROT_EXEC;
> >
> > +     if (!arch_validate_mmap_prot(prot, addr))
> > +             return -EACCES;
>
> I assume yes, but just to be clear, the existing userspace programs that
> can switch modes are checking for EACCES? (Or are just just checking for
> failure generally?) It looks like, for example, SELinux returns EACCES
> too, so this looks correct. (Looking at the mmap man page, it seems the
> ship has sailed for this to be EPERM, which looks more correct to me,
> but so be it.)
>

Taking libffi for example, it will use the fallback on either EPERM or
EACCES, but only if it thinks selinux is enabled. If it thinks PaX is
enabled, it will not even try PROT_WRITE+PROT_EXEC, and use the
fallback unconditionally.

The only other occurrence I needed to fix in my user space was
libpcre2, but there, I had to rebuild it with --enable-jit-sealloc
(which, presumably, more selinux-minded distros are doing already).
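
For reference, a minimal sketch of that probe-and-fall-back pattern (not
libffi's actual code, just the general shape) looks something like this:

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  static void *alloc_code_buffer(size_t len)
  {
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p != MAP_FAILED)
                  return p;               /* W+X still permitted here */

          if (errno == EACCES || errno == EPERM) {
                  /*
                   * Policy (WXN, selinux, PaX, ...) rejected the W+X
                   * mapping: fall back to e.g. two mappings of the same
                   * temporary file, one PROT_READ|PROT_WRITE and one
                   * PROT_READ|PROT_EXEC.
                   */
                  fprintf(stderr, "W+X mapping rejected, using fallback\n");
          }
          return NULL;
  }

  int main(void)
  {
          return alloc_code_buffer(4096) ? 0 : 1;
  }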


> > +
> >       /* force arch specific MAP_FIXED handling in get_unmapped_area */
> >       if (flags & MAP_FIXED_NOREPLACE)
> >               flags |= MAP_FIXED;
> > --
> > 2.30.2
> >
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 25/26] arm64: mm: add support for WXN memory translation attribute
  2022-06-13 14:45 ` [PATCH v4 25/26] arm64: mm: add support for WXN memory translation attribute Ard Biesheuvel
@ 2022-06-13 16:51   ` Kees Cook
  0 siblings, 0 replies; 57+ messages in thread
From: Kees Cook @ 2022-06-13 16:51 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:49PM +0200, Ard Biesheuvel wrote:
> The AArch64 virtual memory system supports a global WXN control, which
> can be enabled to make all writable mappings implicitly no-exec. This is
> a useful hardening feature, as it prevents mistakes in managing page
> table permissions from being exploited to attack the system.
> 
> When enabled at EL1, the restrictions apply to both EL1 and EL0. EL1 is
> completely under our control, and has been cleaned up to allow WXN to be
> enabled from boot onwards. EL0 is not under our control, but given that
> widely deployed security features such as selinux or PaX already limit
> the ability of user space to create mappings that are writable and
> executable at the same time, the impact of enabling this for EL0 is
> expected to be limited. (For this reason, common user space libraries
> that have a legitimate need for manipulating executable code already
> carry fallbacks such as [0].)
> 
> If enabled at compile time, the feature can still be disabled at boot if
> needed, by passing arm64.nowxn on the kernel command line.
> 
> [0] https://github.com/libffi/libffi/blob/master/src/closures.c#L440
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/Kconfig                   | 11 ++++++
>  arch/arm64/include/asm/cpufeature.h  |  8 +++++
>  arch/arm64/include/asm/mman.h        | 36 ++++++++++++++++++++
>  arch/arm64/include/asm/mmu_context.h | 30 +++++++++++++++-
>  arch/arm64/kernel/head.S             | 28 ++++++++++++++-
>  arch/arm64/kernel/idreg-override.c   | 16 +++++++++
>  arch/arm64/mm/proc.S                 |  6 ++++
>  7 files changed, 133 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1652a9800ebe..d262d5ab4316 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1422,6 +1422,17 @@ config RODATA_FULL_DEFAULT_ENABLED
>  	  This requires the linear region to be mapped down to pages,
>  	  which may adversely affect performance in some cases.
>  
> +config ARM64_WXN
> +	bool "Enable WXN attribute so all writable mappings are non-exec"
> +	help
> +	  Set the WXN bit in the SCTLR system register so that all writable
> +	  mappings are treated as if the PXN/UXN bit is set as well.
> +	  If this is set to Y, it can still be disabled at runtime by
> +	  passing 'arm64.nowxn' on the kernel command line.
> +
> +	  This should only be set if no software needs to be supported that
> +	  relies on being able to execute from writable mappings.

Should this instead just be a "default value of arm64.nowxn" config? It
seems like it should be possible to just drop all the #ifdefs below, as
WXN is arguably the default state we would want systems to move to.

> +
>  config ARM64_SW_TTBR0_PAN
>  	bool "Emulate Privileged Access Never using TTBR0_EL1 switching"
>  	help
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 14a8f3d93add..fc364c4d31e2 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -911,10 +911,18 @@ extern struct arm64_ftr_override id_aa64mmfr1_override;
>  extern struct arm64_ftr_override id_aa64pfr1_override;
>  extern struct arm64_ftr_override id_aa64isar1_override;
>  extern struct arm64_ftr_override id_aa64isar2_override;
> +extern struct arm64_ftr_override sctlr_override;
>  
>  u32 get_kvm_ipa_limit(void);
>  void dump_cpu_features(void);
>  
> +static inline bool arm64_wxn_enabled(void)
> +{
> +	if (!IS_ENABLED(CONFIG_ARM64_WXN))
> +		return false;
> +	return (sctlr_override.val & sctlr_override.mask & 0xf) == 0;
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index 5966ee4a6154..6d4940342ba7 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -35,11 +35,40 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
>  }
>  #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
>  
> +static inline bool arm64_check_wx_prot(unsigned long prot,
> +				       struct task_struct *tsk)
> +{
> +	/*
> +	 * When we are running with SCTLR_ELx.WXN==1, writable mappings are
> +	 * implicitly non-executable. This means we should reject such mappings
> +	 * when user space attempts to create them using mmap() or mprotect().

If this series is respun, perhaps add to this comment a little to indicate
that this is basically a hint to userspace, and not an attempt to actually
provide a general W+X mapping protection:

	* Note that this is effectively just a hint (for things like
	* libffi noted below), as solving this for all mapping combinations
	* is a larger endeavor. (e.g. userspace setting an executable mapping
	* writable, changing it, and then making it read-only again.)

> +	 */
> +	if (arm64_wxn_enabled() &&
> +	    ((prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC))) {
> +		/*
> +		 * User space libraries such as libffi carry elaborate
> +		 * heuristics to decide whether it is worth it to even attempt
> +		 * to create writable executable mappings, as PaX or selinux
> +		 * enabled systems will outright reject it. They will usually
> +		 * fall back to something else (e.g., two separate shared
> +		 * mmap()s of a temporary file) on failure.
> +		 */
> +		pr_info_ratelimited(
> +			"process %s (%d) attempted to create PROT_WRITE+PROT_EXEC mapping\n",
> +			tsk->comm, tsk->pid);
> +		return false;
> +	}
> +	return true;
> +}

But regardless, with or without the changes above:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 26/26] arm64: kernel: move ID map out of .text mapping
  2022-06-13 14:45 ` [PATCH v4 26/26] arm64: kernel: move ID map out of .text mapping Ard Biesheuvel
@ 2022-06-13 16:52   ` Kees Cook
  0 siblings, 0 replies; 57+ messages in thread
From: Kees Cook @ 2022-06-13 16:52 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:50PM +0200, Ard Biesheuvel wrote:
> Reorganize the ID map slightly so that only code that is executed via
> the 1:1 mapping remains. This allows to move the ID map out of the .text
> segment, given that it no longer needs exec permissions via the kernel
> mapping.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

This could be done earlier in the series, yes?

Regardless:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 23/26] arm64: head: remap the kernel text/inittext region read-only
  2022-06-13 14:45 ` [PATCH v4 23/26] arm64: head: remap the kernel text/inittext region read-only Ard Biesheuvel
@ 2022-06-13 16:57   ` Kees Cook
  0 siblings, 0 replies; 57+ messages in thread
From: Kees Cook @ 2022-06-13 16:57 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:47PM +0200, Ard Biesheuvel wrote:
> In order to be able to run with WXN from boot (which could potentially
> be under a hypervisor regime that mandates this), update the temporary
> kernel page tables with read-only attributes for the text regions before
> attempting to execute from them.
> 
> This is rather straight-forward for 16k and 64k granule configurations,
> as the split between executable and writable regions is guaranteed to be
> aligned to the granule used for the early kernel page tables. For 4k, it
> involves installing a single table entry and populating it accordingly.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-13 14:45 ` [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment Ard Biesheuvel
@ 2022-06-13 17:00   ` Kees Cook
  2022-06-13 17:16     ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Kees Cook @ 2022-06-13 17:00 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:46PM +0200, Ard Biesheuvel wrote:
> Currently, the ro_after_init sections sits right in the middle of the
> text/rodata/inittext segment, making it difficult to map any of those
> non-writable during early boot. So instead, move it to the start of
> .data, and update the init sequences so that the section is remapped
> read-only once startup completes.
> 
> Note that this moves the entire HYP data section into .data as well -
> this likely needs to remain as a single block for now, but could perhaps
> split into a .rodata and .data..ro_after_init section later.

If I'm reading this correctly, this means that .data..ro_after_init now
lives between .data and .rodata?

Do the various LKDTM tests still pass after this change?

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-13 17:00   ` Kees Cook
@ 2022-06-13 17:16     ` Ard Biesheuvel
  2022-06-13 23:38       ` Kees Cook
  0 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-13 17:16 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, 13 Jun 2022 at 19:00, Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, Jun 13, 2022 at 04:45:46PM +0200, Ard Biesheuvel wrote:
> > Currently, the ro_after_init sections sits right in the middle of the
> > text/rodata/inittext segment, making it difficult to map any of those
> > non-writable during early boot. So instead, move it to the start of
> > .data, and update the init sequences so that the section is remapped
> > read-only once startup completes.
> >
> > Note that this moves the entire HYP data section into .data as well -
> > this likely needs to remain as a single block for now, but could perhaps
> > split into a .rodata and .data..ro_after_init section later.
>
> If I'm reading this correctly, this means that .data..ro_after_init now
> lives between .data and .rodata?
>

No, between .initdata and .data

> Do the various LKDTM tests still pass after this change?
>

Good question, I'll check.

> Reviewed-by: Kees Cook <keescook@chromium.org>
>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-13 17:16     ` Ard Biesheuvel
@ 2022-06-13 23:38       ` Kees Cook
  2022-06-16 11:31         ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Kees Cook @ 2022-06-13 23:38 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 07:16:15PM +0200, Ard Biesheuvel wrote:
> On Mon, 13 Jun 2022 at 19:00, Kees Cook <keescook@chromium.org> wrote:
> >
> > On Mon, Jun 13, 2022 at 04:45:46PM +0200, Ard Biesheuvel wrote:
> > > Currently, the ro_after_init sections sits right in the middle of the
> > > text/rodata/inittext segment, making it difficult to map any of those
> > > non-writable during early boot. So instead, move it to the start of
> > > .data, and update the init sequences so that the section is remapped
> > > read-only once startup completes.
> > >
> > > Note that this moves the entire HYP data section into .data as well -
> > > this likely needs to remain as a single block for now, but could perhaps
> > > split into a .rodata and .data..ro_after_init section later.
> >
> > If I'm reading this correctly, this means that .data..ro_after_init now
> > lives between .data and .rodata?
> >
> 
> No, between .initdata and .data

Ah, doesn't this mean more padding (for segment alignment) gets used? On other
architectures .data..ro_after_init tried to be near the writable/read-only
boundary so segment padding was only needed on one side (e.g. it could
live at the end of .rodata without segment alignment but before .data
which was segment aligned.) Then when .rodata was made read-only (after
__init), .data..ro_after_init would also get set read-only.

In this case, I think it ends up needing segment alignment both at the
front and the end, since the .initdata and .data are freed and left
writable, respectively?
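
I.e., something like this (not to scale; 'pad' meaning padding up to
SEGMENT_ALIGN):

  other arches:  [ .rodata | .data..ro_after_init | pad | .data ... ]
                                                single aligned boundary

  this patch:    [ .init.data | pad | .data..ro_after_init | pad | .data ... ]
                      freed                                       stays writable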

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible
  2022-06-13 14:45 ` [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible Ard Biesheuvel
@ 2022-06-14  8:25   ` Anshuman Khandual
  2022-06-14  8:34     ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Anshuman Khandual @ 2022-06-14  8:25 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-hardening, Marc Zyngier, Will Deacon, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown


On 6/13/22 20:15, Ard Biesheuvel wrote:
> Currently, we only support 52-bit virtual addressing on 64k pages

But going forward, 52-bit VA will also be supported on 4K/16K pages via FEAT_LPA2.

> configurations, and in all other cases, vabits_actual is guaranteed to
> equal VA_BITS (== VA_BITS_MIN). So get rid of the variable entirely in
> that case.

The change here does not really get rid of vabits_actual in those cases
either, it just makes it a build-time constant AFAICS.

--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -174,7 +174,11 @@
 #include <linux/types.h>
 #include <asm/bug.h>
 
+#if VA_BITS > 48
 extern u64                     vabits_actual;
+#else
+#define vabits_actual          ((u64)VA_BITS)
+#endif

> 
> While at it, move the assignment out of the asm entry code - it has no
> need to be there.

This also changes when vabits_actual gets evaluated? Then how would it
know that a CPU needs to be stuck in the kernel (CPU_STUCK_REASON_52_BIT_VA)
in case a secondary CPU does not support the large VA feature? Looking at
the sequence...

secondary_entry
OR
secondary_holding_pen
	secondary_startup
		__cpu_secondary_check52bitva

primary_entry
	__create_page_tables			<--- original position
	__primary_switch			 
		start_kernel
			setup_arch
				paging_init	<--- new position

It might still be possible for the secondary CPU startup sequence to
validate LVA support across the platform, but why defer the vabits_actual
evaluation all the way to paging_init()? Ideally, should it not be
evaluated as early as possible during boot? Hence, wondering - what is the
real benefit here?

> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/include/asm/memory.h |  4 ++++
>  arch/arm64/kernel/head.S        | 15 +--------------
>  arch/arm64/mm/mmu.c             | 15 ++++++++++++++-
>  3 files changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 0af70d9abede..c751cd9b94f8 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -174,7 +174,11 @@
>  #include <linux/types.h>
>  #include <asm/bug.h>
>  
> +#if VA_BITS > 48
>  extern u64			vabits_actual;
> +#else
> +#define vabits_actual		((u64)VA_BITS)
> +#endif
>  
>  extern s64			memstart_addr;
>  /* PHYS_OFFSET - the physical address of the start of memory. */
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 1cdecce552bb..dc07858eb673 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -293,19 +293,6 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
>  	adrp	x0, idmap_pg_dir
>  	adrp	x3, __idmap_text_start		// __pa(__idmap_text_start)
>  
> -#ifdef CONFIG_ARM64_VA_BITS_52
> -	mrs_s	x6, SYS_ID_AA64MMFR2_EL1
> -	and	x6, x6, #(0xf << ID_AA64MMFR2_LVA_SHIFT)
> -	mov	x5, #52
> -	cbnz	x6, 1f
> -#endif
> -	mov	x5, #VA_BITS_MIN
> -1:
> -	adr_l	x6, vabits_actual
> -	str	x5, [x6]
> -	dmb	sy
> -	dc	ivac, x6		// Invalidate potentially stale cache line
> -
>  	/*
>  	 * VA_BITS may be too small to allow for an ID mapping to be created
>  	 * that covers system RAM if that is located sufficiently high in the
> @@ -713,7 +700,7 @@ SYM_FUNC_START(__enable_mmu)
>  SYM_FUNC_END(__enable_mmu)
>  
>  SYM_FUNC_START(__cpu_secondary_check52bitva)
> -#ifdef CONFIG_ARM64_VA_BITS_52
> +#if VA_BITS > 48

Just curious - why is this any better, given that both (VA_BITS > 48)
and CONFIG_ARM64_VA_BITS_52 are build-time constants?

>  	ldr_l	x0, vabits_actual
>  	cmp	x0, #52
>  	b.ne	2f
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 7148928e3932..17b339c1a326 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -46,8 +46,10 @@
>  u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
>  u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
>  
> -u64 __section(".mmuoff.data.write") vabits_actual;
> +#if VA_BITS > 48
> +u64 vabits_actual __ro_after_init = VA_BITS_MIN;
>  EXPORT_SYMBOL(vabits_actual);
> +#endif
>  
>  u64 kimage_vaddr __ro_after_init = (u64)&_text;
>  EXPORT_SYMBOL(kimage_vaddr);
> @@ -772,6 +774,17 @@ void __init paging_init(void)
>  {
>  	pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
>  
> +#if VA_BITS > 48
> +	if (cpuid_feature_extract_unsigned_field(
> +				read_sysreg_s(SYS_ID_AA64MMFR2_EL1),
> +				ID_AA64MMFR2_LVA_SHIFT))
> +		vabits_actual = VA_BITS;
> +
> +	/* make the variable visible to secondaries with the MMU off */
> +	dcache_clean_inval_poc((u64)&vabits_actual,
> +			       (u64)&vabits_actual + sizeof(vabits_actual));
> +#endif
> +
>  	map_kernel(pgdp);
>  	map_mem(pgdp);
>  

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file
  2022-06-13 14:45 ` [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file Ard Biesheuvel
@ 2022-06-14  8:26   ` Anshuman Khandual
  0 siblings, 0 replies; 57+ messages in thread
From: Anshuman Khandual @ 2022-06-14  8:26 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-hardening, Marc Zyngier, Will Deacon, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown

On 6/13/22 20:15, Ard Biesheuvel wrote:
> This variable definition does not need to be in head.S so move it out.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

> ---
>  arch/arm64/kernel/head.S | 7 -------
>  arch/arm64/mm/mmu.c      | 3 +++
>  2 files changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 6a98f1a38c29..1cdecce552bb 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -469,13 +469,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>  	ASM_BUG()
>  SYM_FUNC_END(__primary_switched)
>  
> -	.pushsection ".rodata", "a"
> -SYM_DATA_START(kimage_vaddr)
> -	.quad		_text
> -SYM_DATA_END(kimage_vaddr)
> -EXPORT_SYMBOL(kimage_vaddr)
> -	.popsection
> -
>  /*
>   * end early head section, begin head code that is also used for
>   * hotplug and needs to have the same protections as the text region
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index c5563ff990da..7148928e3932 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -49,6 +49,9 @@ u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
>  u64 __section(".mmuoff.data.write") vabits_actual;
>  EXPORT_SYMBOL(vabits_actual);
>  
> +u64 kimage_vaddr __ro_after_init = (u64)&_text;
> +EXPORT_SYMBOL(kimage_vaddr);
> +
>  u64 kimage_voffset __ro_after_init;
>  EXPORT_SYMBOL(kimage_voffset);
>  

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible
  2022-06-14  8:25   ` Anshuman Khandual
@ 2022-06-14  8:34     ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-14  8:34 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown

On Tue, 14 Jun 2022 at 10:25, Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
> On 6/13/22 20:15, Ard Biesheuvel wrote:
> > Currently, we only support 52-bit virtual addressing on 64k pages
>
> But going forward, 52-bit VA will be supported on 4K/16K pages as well, via FEAT_LPA2.
>
> > configurations, and in all other cases, vabits_actual is guaranteed to
> > equal VA_BITS (== VA_BITS_MIN). So get rid of the variable entirely in
> > that case.
>
> The change here does not really get rid of vabits_actual in those cases
> either; it just makes it a build-time constant AFAICS.
>

Indeed, and so it ceases to be a variable.

> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -174,7 +174,11 @@
>  #include <linux/types.h>
>  #include <asm/bug.h>
>
> +#if VA_BITS > 48
>  extern u64                     vabits_actual;
> +#else
> +#define vabits_actual          ((u64)VA_BITS)
> +#endif
>
> >
> > While at it, move the assignment out of the asm entry code - it has no
> > need to be there.
>
> This also changes when vabits_actual gets evaluated? Then how would it
> know that a CPU needs to be stuck in the kernel (CPU_STUCK_REASON_52_BIT_VA)
> in case some secondary CPUs do not support the large VA feature? Looking at
> the sequence...
>
> secondary_entry
> OR
> secondary_holding_pen
>         secondary_startup
>                 __cpu_secondary_check52bitva
>
> primary_entry
>         __create_page_tables                    <--- original position
>         __primary_switch
>                 start_kernel
>                         setup_arch
>                                 paging_init     <--- new position
>
> It might still be possible for the secondary CPU start-up sequence to
> validate LVA support across the platform, but why push the vabits_actual
> evaluation all the way down to paging_init()? Ideally, should it not be
> evaluated as early as possible during boot? Hence, wondering - what is
> the real benefit here?
>

Why should it be evaluated as early as possible? The whole point is
deferring it so we don't have to do it from asm code.

But I suppose doing it as early as possible from C code (i.e., in
setup_arch() before arm64_memblock_init() or even before
early_fixmap_init()) might be better.
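
For illustration only, a minimal sketch of what that early C-side
evaluation could look like (hypothetical helper name and placement,
reusing the logic from the paging_init() hunk below):

#if VA_BITS > 48
/* hypothetical: called from setup_arch() before arm64_memblock_init() */
static void __init init_vabits_actual(void)
{
	u64 mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);

	if (cpuid_feature_extract_unsigned_field(mmfr2,
						  ID_AA64MMFR2_LVA_SHIFT))
		vabits_actual = VA_BITS;

	/* make the variable visible to secondaries with the MMU off */
	dcache_clean_inval_poc((u64)&vabits_actual,
			       (u64)&vabits_actual + sizeof(vabits_actual));
}
#endif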

> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/include/asm/memory.h |  4 ++++
> >  arch/arm64/kernel/head.S        | 15 +--------------
> >  arch/arm64/mm/mmu.c             | 15 ++++++++++++++-
> >  3 files changed, 19 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> > index 0af70d9abede..c751cd9b94f8 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -174,7 +174,11 @@
> >  #include <linux/types.h>
> >  #include <asm/bug.h>
> >
> > +#if VA_BITS > 48
> >  extern u64                   vabits_actual;
> > +#else
> > +#define vabits_actual                ((u64)VA_BITS)
> > +#endif
> >
> >  extern s64                   memstart_addr;
> >  /* PHYS_OFFSET - the physical address of the start of memory. */
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index 1cdecce552bb..dc07858eb673 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -293,19 +293,6 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
> >       adrp    x0, idmap_pg_dir
> >       adrp    x3, __idmap_text_start          // __pa(__idmap_text_start)
> >
> > -#ifdef CONFIG_ARM64_VA_BITS_52
> > -     mrs_s   x6, SYS_ID_AA64MMFR2_EL1
> > -     and     x6, x6, #(0xf << ID_AA64MMFR2_LVA_SHIFT)
> > -     mov     x5, #52
> > -     cbnz    x6, 1f
> > -#endif
> > -     mov     x5, #VA_BITS_MIN
> > -1:
> > -     adr_l   x6, vabits_actual
> > -     str     x5, [x6]
> > -     dmb     sy
> > -     dc      ivac, x6                // Invalidate potentially stale cache line
> > -
> >       /*
> >        * VA_BITS may be too small to allow for an ID mapping to be created
> >        * that covers system RAM if that is located sufficiently high in the
> > @@ -713,7 +700,7 @@ SYM_FUNC_START(__enable_mmu)
> >  SYM_FUNC_END(__enable_mmu)
> >
> >  SYM_FUNC_START(__cpu_secondary_check52bitva)
> > -#ifdef CONFIG_ARM64_VA_BITS_52
> > +#if VA_BITS > 48
>
> Just curious - why is this any better, given that both (VA_BITS > 48)
> and CONFIG_ARM64_VA_BITS_52 are build-time constants?
>

VA_BITS > 48 is a bit more readable, and more likely to remain accurate.

> >       ldr_l   x0, vabits_actual
> >       cmp     x0, #52
> >       b.ne    2f
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 7148928e3932..17b339c1a326 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -46,8 +46,10 @@
> >  u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> >  u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
> >
> > -u64 __section(".mmuoff.data.write") vabits_actual;
> > +#if VA_BITS > 48
> > +u64 vabits_actual __ro_after_init = VA_BITS_MIN;
> >  EXPORT_SYMBOL(vabits_actual);
> > +#endif
> >
> >  u64 kimage_vaddr __ro_after_init = (u64)&_text;
> >  EXPORT_SYMBOL(kimage_vaddr);
> > @@ -772,6 +774,17 @@ void __init paging_init(void)
> >  {
> >       pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
> >
> > +#if VA_BITS > 48
> > +     if (cpuid_feature_extract_unsigned_field(
> > +                             read_sysreg_s(SYS_ID_AA64MMFR2_EL1),
> > +                             ID_AA64MMFR2_LVA_SHIFT))
> > +             vabits_actual = VA_BITS;
> > +
> > +     /* make the variable visible to secondaries with the MMU off */
> > +     dcache_clean_inval_poc((u64)&vabits_actual,
> > +                            (u64)&vabits_actual + sizeof(vabits_actual));
> > +#endif
> > +
> >       map_kernel(pgdp);
> >       map_mem(pgdp);
> >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code
  2022-06-13 14:45 ` [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code Ard Biesheuvel
@ 2022-06-14  9:22   ` Anshuman Khandual
  2022-06-14  9:34     ` Ard Biesheuvel
  2022-06-24 12:36   ` Will Deacon
  1 sibling, 1 reply; 57+ messages in thread
From: Anshuman Khandual @ 2022-06-14  9:22 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-hardening, Marc Zyngier, Will Deacon, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown


On 6/13/22 20:15, Ard Biesheuvel wrote:
> Setting idmap_t0sz involves fiddling with the caches if done with the
> MMU off. Since we will be creating an initial ID map with the MMU and
> caches off, and the permanent ID map with the MMU and caches on, let's
> move this assignment of idmap_t0sz out of the startup code, and replace
> it with a macro that simply issues the three instructions needed to
> calculate the value wherever it is needed before the MMU is turned on.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/include/asm/assembler.h   | 14 ++++++++++++++
>  arch/arm64/include/asm/mmu_context.h |  2 +-
>  arch/arm64/kernel/head.S             | 13 +------------
>  arch/arm64/mm/mmu.c                  |  5 ++++-
>  arch/arm64/mm/proc.S                 |  2 +-
>  5 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 8c5a61aeaf8e..9468f45c07a6 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -359,6 +359,20 @@ alternative_cb_end
>  	bfi	\valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
>  	.endm
>  
> +/*
> + * idmap_get_t0sz - get the T0SZ value needed to cover the ID map
> + *
> + * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> + * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> + * this number conveniently equals the number of leading zeroes in
> + * the physical address of _end.
> + */
> +	.macro	idmap_get_t0sz, reg
> +	adrp	\reg, _end
> +	orr	\reg, \reg, #(1 << VA_BITS_MIN) - 1
> +	clz	\reg, \reg
> +	.endm

Is there any particular reason to evaluate the idmap T0SZ from '_end' and
VA_BITS_MIN, instead of '__idmap_text_end', as was the case previously?

> +
>  /*
>   * tcr_compute_pa_size - set TCR.(I)PS to the highest supported
>   * ID_AA64MMFR0_EL1.PARange value
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 6770667b34a3..6ac0086ebb1a 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -60,7 +60,7 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
>   * TCR_T0SZ(VA_BITS), unless system RAM is positioned very high in
>   * physical memory, in which case it will be smaller.
>   */
> -extern u64 idmap_t0sz;
> +extern int idmap_t0sz;
>  extern u64 idmap_ptrs_per_pgd;
>  
>  /*
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index dc07858eb673..7f361bc72d12 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -299,22 +299,11 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
>  	 * physical address space. So for the ID map, use an extended virtual
>  	 * range in that case, and configure an additional translation level
>  	 * if needed.
> -	 *
> -	 * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> -	 * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> -	 * this number conveniently equals the number of leading zeroes in
> -	 * the physical address of __idmap_text_end.
>  	 */
> -	adrp	x5, __idmap_text_end
> -	clz	x5, x5
> +	idmap_get_t0sz x5
>  	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
>  	b.ge	1f			// .. then skip VA range extension
>  
> -	adr_l	x6, idmap_t0sz
> -	str	x5, [x6]
> -	dmb	sy
> -	dc	ivac, x6		// Invalidate potentially stale cache line

Right - as there is no 'idmap_t0sz' variable to update, the cache maintenance
can be dropped.

> -
>  #if (VA_BITS < 48)
>  #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
>  #define EXTRA_PTRS	(1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 17b339c1a326..103bf4ae408d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -43,7 +43,7 @@
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>  
> -u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> +int idmap_t0sz __ro_after_init;

I guess this is just to reduce the 'idmap_t0sz' memory footprint.

>  u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
>  
>  #if VA_BITS > 48
> @@ -785,6 +785,9 @@ void __init paging_init(void)
>  			       (u64)&vabits_actual + sizeof(vabits_actual));
>  #endif
>  
> +	idmap_t0sz = min(63UL - __fls(__pa_symbol(_end)),
> +			 TCR_T0SZ(VA_BITS_MIN));
> +

Just curious - but doesn't this also need some synchronization for the
update to be visible across the system?

#define cpu_set_idmap_tcr_t0sz()        __cpu_set_tcr_t0sz(idmap_t0sz)

static inline void cpu_install_idmap(void)
{
        cpu_set_reserved_ttbr0();
        local_flush_tlb_all();
        cpu_set_idmap_tcr_t0sz();

        cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
}

>  	map_kernel(pgdp);
>  	map_mem(pgdp);
>  
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 972ce8d7f2c5..97cd67697212 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -470,7 +470,7 @@ SYM_FUNC_START(__cpu_setup)
>  	add		x9, x9, #64
>  	tcr_set_t1sz	tcr, x9
>  #else
> -	ldr_l		x9, idmap_t0sz
> +	idmap_get_t0sz	x9
>  #endif
>  	tcr_set_t0sz	tcr, x9
>  

Avoiding one cache maintenance operation in __create_page_tables() now makes
us evaluate idmap_t0sz again in __cpu_setup(), and also capture and update
idmap_t0sz in paging_init(). This change moves idmap_t0sz outside the asm
functions, but from a performance perspective, is there an improvement?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code
  2022-06-14  9:22   ` Anshuman Khandual
@ 2022-06-14  9:34     ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-14  9:34 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Kees Cook, Catalin Marinas, Mark Brown

On Tue, 14 Jun 2022 at 11:22, Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
> On 6/13/22 20:15, Ard Biesheuvel wrote:
> > Setting idmap_t0sz involves fiddling with the caches if done with the
> > MMU off. Since we will be creating an initial ID map with the MMU and
> > caches off, and the permanent ID map with the MMU and caches on, let's
> > move this assignment of idmap_t0sz out of the startup code, and replace
> > it with a macro that simply issues the three instructions needed to
> > calculate the value wherever it is needed before the MMU is turned on.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/include/asm/assembler.h   | 14 ++++++++++++++
> >  arch/arm64/include/asm/mmu_context.h |  2 +-
> >  arch/arm64/kernel/head.S             | 13 +------------
> >  arch/arm64/mm/mmu.c                  |  5 ++++-
> >  arch/arm64/mm/proc.S                 |  2 +-
> >  5 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > index 8c5a61aeaf8e..9468f45c07a6 100644
> > --- a/arch/arm64/include/asm/assembler.h
> > +++ b/arch/arm64/include/asm/assembler.h
> > @@ -359,6 +359,20 @@ alternative_cb_end
> >       bfi     \valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
> >       .endm
> >
> > +/*
> > + * idmap_get_t0sz - get the T0SZ value needed to cover the ID map
> > + *
> > + * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> > + * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> > + * this number conveniently equals the number of leading zeroes in
> > + * the physical address of _end.
> > + */
> > +     .macro  idmap_get_t0sz, reg
> > +     adrp    \reg, _end
> > +     orr     \reg, \reg, #(1 << VA_BITS_MIN) - 1
> > +     clz     \reg, \reg
> > +     .endm
>
> Is there any particular reason to evaluate the idmap T0SZ from '_end' and
> VA_BITS_MIN, instead of '__idmap_text_end', as was the case previously?
>

Ah yes, I failed to mention that. In a later patch, the ID map will
cover the entire image.

> > +
> >  /*
> >   * tcr_compute_pa_size - set TCR.(I)PS to the highest supported
> >   * ID_AA64MMFR0_EL1.PARange value
> > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> > index 6770667b34a3..6ac0086ebb1a 100644
> > --- a/arch/arm64/include/asm/mmu_context.h
> > +++ b/arch/arm64/include/asm/mmu_context.h
> > @@ -60,7 +60,7 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
> >   * TCR_T0SZ(VA_BITS), unless system RAM is positioned very high in
> >   * physical memory, in which case it will be smaller.
> >   */
> > -extern u64 idmap_t0sz;
> > +extern int idmap_t0sz;
> >  extern u64 idmap_ptrs_per_pgd;
> >
> >  /*
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index dc07858eb673..7f361bc72d12 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -299,22 +299,11 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
> >        * physical address space. So for the ID map, use an extended virtual
> >        * range in that case, and configure an additional translation level
> >        * if needed.
> > -      *
> > -      * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> > -      * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> > -      * this number conveniently equals the number of leading zeroes in
> > -      * the physical address of __idmap_text_end.
> >        */
> > -     adrp    x5, __idmap_text_end
> > -     clz     x5, x5
> > +     idmap_get_t0sz x5
> >       cmp     x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
> >       b.ge    1f                      // .. then skip VA range extension
> >
> > -     adr_l   x6, idmap_t0sz
> > -     str     x5, [x6]
> > -     dmb     sy
> > -     dc      ivac, x6                // Invalidate potentially stale cache line
>
> Right - as there is no 'idmap_t0sz' variable to update, the cache maintenance
> can be dropped.
>
> > -
> >  #if (VA_BITS < 48)
> >  #define EXTRA_SHIFT  (PGDIR_SHIFT + PAGE_SHIFT - 3)
> >  #define EXTRA_PTRS   (1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 17b339c1a326..103bf4ae408d 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -43,7 +43,7 @@
> >  #define NO_CONT_MAPPINGS     BIT(1)
> >  #define NO_EXEC_MAPPINGS     BIT(2)  /* assumes FEAT_HPDS is not used */
> >
> > -u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> > +int idmap_t0sz __ro_after_init;
>
> I guess this is just to reduce the 'idmap_t0sz' memory footprint.
>

It's essentially the base-2 log of a u64, so it doesn't have to be a u64
itself. The footprint doesn't really matter, of course.


> >  u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
> >
> >  #if VA_BITS > 48
> > @@ -785,6 +785,9 @@ void __init paging_init(void)
> >                              (u64)&vabits_actual + sizeof(vabits_actual));
> >  #endif
> >
> > +     idmap_t0sz = min(63UL - __fls(__pa_symbol(_end)),
> > +                      TCR_T0SZ(VA_BITS_MIN));
> > +
>
> Just curious - but doesn't this also need some synchronization for the
> update to be visible across the system?
>

No it does not, now that the asm macro no longer refers to the variable.

> >       map_kernel(pgdp);
> >       map_mem(pgdp);
> >
> > diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> > index 972ce8d7f2c5..97cd67697212 100644
> > --- a/arch/arm64/mm/proc.S
> > +++ b/arch/arm64/mm/proc.S
> > @@ -470,7 +470,7 @@ SYM_FUNC_START(__cpu_setup)
> >       add             x9, x9, #64
> >       tcr_set_t1sz    tcr, x9
> >  #else
> > -     ldr_l           x9, idmap_t0sz
> > +     idmap_get_t0sz  x9
> >  #endif
> >       tcr_set_t0sz    tcr, x9
> >
>
> Avoiding one cache maintenance operation in __create_page_tables() now makes
> us evaluate idmap_t0sz again in __cpu_setup(), and also capture and update
> idmap_t0sz in paging_init(). This change moves idmap_t0sz outside the asm
> functions, but from a performance perspective, is there an improvement?

No, the performance is not expected to be affected, and a ~10
instruction delta at boot is not going to be measurable anyway. The
point of this patch is to remove the need to reason about how/when
variables are accessed, and whether that requires cache cleaning,
invalidation, system-wide DMBs etc.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 04/26] arm64: head: drop idmap_ptrs_per_pgd
  2022-06-13 14:45 ` [PATCH v4 04/26] arm64: head: drop idmap_ptrs_per_pgd Ard Biesheuvel
@ 2022-06-15  4:07   ` Anshuman Khandual
  0 siblings, 0 replies; 57+ messages in thread
From: Anshuman Khandual @ 2022-06-15  4:07 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-hardening, Marc Zyngier, Will Deacon, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown



On 6/13/22 20:15, Ard Biesheuvel wrote:
> The assignment of idmap_ptrs_per_pgd lacks any cache invalidation, even
> though it is updated with the MMU and caches disabled. However, we never

Right, seems like an omission.

> bother to read the value again except in the very next instruction, and
> so we can just drop the variable entirely.

Right.

> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/include/asm/mmu_context.h | 1 -
>  arch/arm64/kernel/head.S             | 7 +++----
>  arch/arm64/mm/mmu.c                  | 1 -
>  3 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 6ac0086ebb1a..7b387c3b312a 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -61,7 +61,6 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
>   * physical memory, in which case it will be smaller.
>   */
>  extern int idmap_t0sz;
> -extern u64 idmap_ptrs_per_pgd;
>  
>  /*
>   * Ensure TCR.T0SZ is set to the provided value.
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 7f361bc72d12..53126a35d73c 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -300,6 +300,7 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
>  	 * range in that case, and configure an additional translation level
>  	 * if needed.
>  	 */
> +	mov	x4, #PTRS_PER_PGD
>  	idmap_get_t0sz x5
>  	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
>  	b.ge	1f			// .. then skip VA range extension
> @@ -319,18 +320,16 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
>  #error "Mismatch between VA_BITS and page size/number of translation levels"
>  #endif
>  
> -	mov	x4, EXTRA_PTRS
> -	create_table_entry x0, x3, EXTRA_SHIFT, x4, x5, x6
> +	mov	x2, EXTRA_PTRS
> +	create_table_entry x0, x3, EXTRA_SHIFT, x2, x5, x6

AFAICS it should be safe to use 'x2' here instead of 'x4'.

>  #else
>  	/*
>  	 * If VA_BITS == 48, we don't have to configure an additional
>  	 * translation level, but the top-level table has more entries.
>  	 */
>  	mov	x4, #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT)
> -	str_l	x4, idmap_ptrs_per_pgd, x5
>  #endif
>  1:
> -	ldr_l	x4, idmap_ptrs_per_pgd

'x4' will contain the default PTRS_PER_PGD if (VA_BITS == EXTRA_SHIFT);
otherwise it will contain #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT), but without
going via the erstwhile 'idmap_ptrs_per_pgd' variable.

>  	adr_l	x6, __idmap_text_end		// __pa(__idmap_text_end)
>  
>  	map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 103bf4ae408d..0f95c91e5a8e 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -44,7 +44,6 @@
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>  
>  int idmap_t0sz __ro_after_init;
> -u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
>  
>  #if VA_BITS > 48
>  u64 vabits_actual __ro_after_init = VA_BITS_MIN;

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 08/26] arm64: kernel: drop unnecessary PoC cache clean+invalidate
  2022-06-13 14:45 ` [PATCH v4 08/26] arm64: kernel: drop unnecessary PoC cache clean+invalidate Ard Biesheuvel
@ 2022-06-15  4:32   ` Anshuman Khandual
  0 siblings, 0 replies; 57+ messages in thread
From: Anshuman Khandual @ 2022-06-15  4:32 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel
  Cc: linux-hardening, Marc Zyngier, Will Deacon, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown



On 6/13/22 20:15, Ard Biesheuvel wrote:
> Some early boot code runs before the virtual placement of the kernel is
> finalized, and we used to go back to the very start and recreate the ID
> map along with the page tables describing the virtual kernel mapping,
> and this involved setting some global variables with the caches off.
> 
> In order to ensure that global state created by the KASLR code is not
> corrupted by the cache invalidation that occurs in that case, we needed
> to clean those global variables to the PoC explicitly.
> 
> This is no longer needed now that the ID map is created only once (and
> the associated global variable updates are no longer repeated). So drop
> the cache maintenance that is no longer necessary.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/kernel/kaslr.c | 11 -----------
>  1 file changed, 11 deletions(-)
> 
> diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
> index 418b2bba1521..d5542666182f 100644
> --- a/arch/arm64/kernel/kaslr.c
> +++ b/arch/arm64/kernel/kaslr.c
> @@ -13,7 +13,6 @@
>  #include <linux/pgtable.h>
>  #include <linux/random.h>
>  
> -#include <asm/cacheflush.h>
>  #include <asm/fixmap.h>
>  #include <asm/kernel-pgtable.h>
>  #include <asm/memory.h>
> @@ -72,9 +71,6 @@ u64 __init kaslr_early_init(void)
>  	 * we end up running with module randomization disabled.
>  	 */
>  	module_alloc_base = (u64)_etext - MODULES_VSIZE;
> -	dcache_clean_inval_poc((unsigned long)&module_alloc_base,
> -			    (unsigned long)&module_alloc_base +
> -				    sizeof(module_alloc_base));
>  
>  	/*
>  	 * Try to map the FDT early. If this fails, we simply bail,
> @@ -174,13 +170,6 @@ u64 __init kaslr_early_init(void)
>  	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
>  	module_alloc_base &= PAGE_MASK;
>  
> -	dcache_clean_inval_poc((unsigned long)&module_alloc_base,
> -			    (unsigned long)&module_alloc_base +
> -				    sizeof(module_alloc_base));
> -	dcache_clean_inval_poc((unsigned long)&memstart_offset_seed,
> -			    (unsigned long)&memstart_offset_seed +
> -				    sizeof(memstart_offset_seed));
> -
>  	return offset;
>  }
>  

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-13 23:38       ` Kees Cook
@ 2022-06-16 11:31         ` Ard Biesheuvel
  2022-06-16 16:18           ` Kees Cook
  0 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-16 11:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Tue, 14 Jun 2022 at 01:38, Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, Jun 13, 2022 at 07:16:15PM +0200, Ard Biesheuvel wrote:
> > On Mon, 13 Jun 2022 at 19:00, Kees Cook <keescook@chromium.org> wrote:
> > >
> > > On Mon, Jun 13, 2022 at 04:45:46PM +0200, Ard Biesheuvel wrote:
> > > > Currently, the ro_after_init section sits right in the middle of the
> > > > text/rodata/inittext segment, making it difficult to map any of those
> > > > non-writable during early boot. So instead, move it to the start of
> > > > .data, and update the init sequences so that the section is remapped
> > > > read-only once startup completes.
> > > >
> > > > Note that this moves the entire HYP data section into .data as well -
> > > > this likely needs to remain as a single block for now, but could perhaps
> > > > be split into a .rodata and .data..ro_after_init section later.
> > >
> > > If I'm reading this correctly, this means that .data..ro_after_init now
> > > lives between .data and .rodata?
> > >
> >
> > No, between .initdata and .data
>
> Ah, doesn't this mean more padding (for segment alignment) is used? On other
> architectures .data..ro_after_init tried to be near the writable/read-only
> boundary so that segment padding was only needed on one side (e.g. it could
> live at the end of .rodata without segment alignment but before .data,
> which was segment aligned). Then when .rodata was made read-only (after
> __init), .data..ro_after_init would also get set read-only.
>
> In this case, I think it ends up needing segment alignment both at the
> front and the end, since the .initdata and .data are freed and left
> writable, respectively?
>

We used to have

text
--
rodata
(ro_after_init)
--
inittext
--
initdata
--
data
bss

where -- are the segment boundaries, which are always aligned to 64k on arm64

After this patch, we get

text
--
rodata
--
inittext
--
initdata
--
(ro_after_init)
data
bss

so in terms of padding due to alignment, there is not a lot of difference.

The main difference here is the fact that we lose the ability to use
block mappings, but if anyone cares about that, we could work around
this by creating a separate segment for ro_after_init.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-16 11:31         ` Ard Biesheuvel
@ 2022-06-16 16:18           ` Kees Cook
  2022-06-16 16:31             ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Kees Cook @ 2022-06-16 16:18 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Thu, Jun 16, 2022 at 01:31:23PM +0200, Ard Biesheuvel wrote:
> We used to have
> 
> text
> --
> rodata
> (ro_after_init)
> --
> inittext
> --
> initdata
> --
> data
> bss
> 
> where -- are the segment boundaries, which are always aligned to 64k on arm64
> 
> After this patch, we get
> 
> text
> --
> rodata
> --
> inittext
> --
> initdata
> --
> (ro_after_init)
> data
> bss
> 
> so in terms of padding due to alignment, there is not a lot of difference.

But how is ro_after_init read-only and data isn't, if there isn't a
segment alignment to make that work out?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment
  2022-06-16 16:18           ` Kees Cook
@ 2022-06-16 16:31             ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-16 16:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Will Deacon,
	Mark Rutland, Catalin Marinas, Mark Brown, Anshuman Khandual

On Thu, 16 Jun 2022 at 18:18, Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Jun 16, 2022 at 01:31:23PM +0200, Ard Biesheuvel wrote:
> > We used to have
> >
> > text
> > --
> > rodata
> > (ro_after_init)
> > --
> > inittext
> > --
> > initdata
> > --
> > data
> > bss
> >
> > where -- are the segment boundaries, which are always aligned to 64k on arm64
> >
> > After this patch, we get
> >
> > text
> > --
> > rodata
> > --
> > inittext
> > --
> > initdata
> > --
> > (ro_after_init)
> > data
> > bss
> >
> > so in terms of padding due to alignment, there is not a lot of difference.
>
> But how is ro_after_init read-only and data isn't, if there isn't a
> segment alignment to make that work out?
>

Actually, there is a segment alignment between ro_after_init and data -
my diagram is inaccurate. But we don't actually need that alignment to
remap this slice of memory r/o.
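
So the tail end of the layout above is really more like:

initdata
--
(ro_after_init)
--
data
bss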

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code
  2022-06-13 14:45 ` [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code Ard Biesheuvel
  2022-06-14  9:22   ` Anshuman Khandual
@ 2022-06-24 12:36   ` Will Deacon
  2022-06-24 12:57     ` Ard Biesheuvel
  1 sibling, 1 reply; 57+ messages in thread
From: Will Deacon @ 2022-06-24 12:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:27PM +0200, Ard Biesheuvel wrote:
> Setting idmap_t0sz involves fiddling with the caches if done with the
> MMU off. Since we will be creating an initial ID map with the MMU and
> caches off, and the permanent ID map with the MMU and caches on, let's
> move this assignment of idmap_t0sz out of the startup code, and replace
> it with a macro that simply issues the three instructions needed to
> calculate the value wherever it is needed before the MMU is turned on.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/include/asm/assembler.h   | 14 ++++++++++++++
>  arch/arm64/include/asm/mmu_context.h |  2 +-
>  arch/arm64/kernel/head.S             | 13 +------------
>  arch/arm64/mm/mmu.c                  |  5 ++++-
>  arch/arm64/mm/proc.S                 |  2 +-
>  5 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 8c5a61aeaf8e..9468f45c07a6 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -359,6 +359,20 @@ alternative_cb_end
>  	bfi	\valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
>  	.endm
>  
> +/*
> + * idmap_get_t0sz - get the T0SZ value needed to cover the ID map
> + *
> + * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> + * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> + * this number conveniently equals the number of leading zeroes in
> + * the physical address of _end.
> + */
> +	.macro	idmap_get_t0sz, reg
> +	adrp	\reg, _end
> +	orr	\reg, \reg, #(1 << VA_BITS_MIN) - 1
> +	clz	\reg, \reg
> +	.endm
> +
>  /*
>   * tcr_compute_pa_size - set TCR.(I)PS to the highest supported
>   * ID_AA64MMFR0_EL1.PARange value
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 6770667b34a3..6ac0086ebb1a 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -60,7 +60,7 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
>   * TCR_T0SZ(VA_BITS), unless system RAM is positioned very high in
>   * physical memory, in which case it will be smaller.
>   */
> -extern u64 idmap_t0sz;
> +extern int idmap_t0sz;
>  extern u64 idmap_ptrs_per_pgd;
>  
>  /*
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index dc07858eb673..7f361bc72d12 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -299,22 +299,11 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
>  	 * physical address space. So for the ID map, use an extended virtual
>  	 * range in that case, and configure an additional translation level
>  	 * if needed.
> -	 *
> -	 * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> -	 * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> -	 * this number conveniently equals the number of leading zeroes in
> -	 * the physical address of __idmap_text_end.
>  	 */
> -	adrp	x5, __idmap_text_end
> -	clz	x5, x5
> +	idmap_get_t0sz x5
>  	cmp	x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
>  	b.ge	1f			// .. then skip VA range extension
>  
> -	adr_l	x6, idmap_t0sz
> -	str	x5, [x6]
> -	dmb	sy
> -	dc	ivac, x6		// Invalidate potentially stale cache line
> -
>  #if (VA_BITS < 48)
>  #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
>  #define EXTRA_PTRS	(1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 17b339c1a326..103bf4ae408d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -43,7 +43,7 @@
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>  
> -u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> +int idmap_t0sz __ro_after_init;
>  u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
>  
>  #if VA_BITS > 48
> @@ -785,6 +785,9 @@ void __init paging_init(void)
>  			       (u64)&vabits_actual + sizeof(vabits_actual));
>  #endif
>  
> +	idmap_t0sz = min(63UL - __fls(__pa_symbol(_end)),
> +			 TCR_T0SZ(VA_BITS_MIN));

nit: TCR_T0SZ shifts by TCR_T0SZ_OFFSET, so this is a bit confusing and
works out because the register offset happens to be zero. Maybe it would
be clearer to calculate the maximum of fls(__pa_symbol(_end)) and
VA_BITS_MIN, then subtract that from 64?

Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on
  2022-06-13 14:45 ` [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on Ard Biesheuvel
@ 2022-06-24 12:56   ` Will Deacon
  2022-06-24 13:07     ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2022-06-24 12:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:41PM +0200, Ard Biesheuvel wrote:
> Now that we can access the entire kernel image via the ID map, we can
> execute the page table population code with the MMU and caches enabled.
> The only thing we need to ensure is that translations via TTBR1 remain
> disabled while we are updating the page tables the second time around,
> in case KASLR wants them to be randomized.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/kernel/head.S | 62 +++++---------------
>  1 file changed, 16 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index d704d0bd8ffc..583cbea865e1 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -85,8 +85,6 @@
>  	 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
>  	 *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
>  	 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
> -	 *  x28        clear_page_tables()                      callee preserved temp register
> -	 *  x19/x20    __primary_switch()                       callee preserved temp registers
>  	 *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
>  	 *  x28        create_idmap()                           callee preserved temp register
>  	 */
> @@ -96,9 +94,7 @@ SYM_CODE_START(primary_entry)
>  	adrp	x23, __PHYS_OFFSET
>  	and	x23, x23, MIN_KIMG_ALIGN - 1	// KASLR offset, defaults to 0
>  	bl	set_cpu_boot_mode_flag
> -	bl	clear_page_tables
>  	bl	create_idmap
> -	bl	create_kernel_mapping
>  
>  	/*
>  	 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
> @@ -128,32 +124,14 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
>  SYM_CODE_END(preserve_boot_args)
>  
>  SYM_FUNC_START_LOCAL(clear_page_tables)
> -	mov	x28, lr
> -
> -	/*
> -	 * Invalidate the init page tables to avoid potential dirty cache lines
> -	 * being evicted. Other page tables are allocated in rodata as part of
> -	 * the kernel image, and thus are clean to the PoC per the boot
> -	 * protocol.
> -	 */
> -	adrp	x0, init_pg_dir
> -	adrp	x1, init_pg_end
> -	bl	dcache_inval_poc
> -
>  	/*
>  	 * Clear the init page tables.
>  	 */
>  	adrp	x0, init_pg_dir
>  	adrp	x1, init_pg_end
> -	sub	x1, x1, x0
> -1:	stp	xzr, xzr, [x0], #16
> -	stp	xzr, xzr, [x0], #16
> -	stp	xzr, xzr, [x0], #16
> -	stp	xzr, xzr, [x0], #16
> -	subs	x1, x1, #64
> -	b.ne	1b
> -
> -	ret	x28
> +	sub	x2, x1, x0
> +	mov	x1, xzr
> +	b	__pi_memset			// tail call
>  SYM_FUNC_END(clear_page_tables)
>  
>  /*
> @@ -399,16 +377,8 @@ SYM_FUNC_START_LOCAL(create_kernel_mapping)
>  
>  	map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
>  
> -	/*
> -	 * Since the page tables have been populated with non-cacheable
> -	 * accesses (MMU disabled), invalidate those tables again to
> -	 * remove any speculatively loaded cache lines.
> -	 */
> -	dmb	sy
> -
> -	adrp	x0, init_pg_dir
> -	adrp	x1, init_pg_end
> -	b	dcache_inval_poc		// tail call
> +	dsb	ishst				// sync with page table walker
> +	ret
>  SYM_FUNC_END(create_kernel_mapping)
>  
>  	/*
> @@ -863,14 +833,15 @@ SYM_FUNC_END(__relocate_kernel)
>  #endif
>  
>  SYM_FUNC_START_LOCAL(__primary_switch)
> -#ifdef CONFIG_RANDOMIZE_BASE
> -	mov	x19, x0				// preserve new SCTLR_EL1 value
> -	mrs	x20, sctlr_el1			// preserve old SCTLR_EL1 value
> -#endif
> -
> -	adrp	x1, init_pg_dir
> +	adrp	x1, reserved_pg_dir
>  	adrp	x2, init_idmap_pg_dir
>  	bl	__enable_mmu
> +
> +	bl	clear_page_tables
> +	bl	create_kernel_mapping
> +
> +	adrp	x1, init_pg_dir
> +	load_ttbr1 x1, x1, x2
>  #ifdef CONFIG_RELOCATABLE
>  #ifdef CONFIG_RELR
>  	mov	x24, #0				// no RELR displacement yet
> @@ -886,9 +857,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
>  	 * to take into account by discarding the current kernel mapping and
>  	 * creating a new one.
>  	 */
> -	pre_disable_mmu_workaround
> -	msr	sctlr_el1, x20			// disable the MMU
> -	isb
> +	adrp	x1, reserved_pg_dir		// Disable translations via TTBR1
> +	load_ttbr1 x1, x1, x2

I'd have thought we'd need some TLB maintenance here... is that not the
case?

Also, it might be a tiny bit easier to clear EPD1 instead of using the
reserved_pg_dir.

Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code
  2022-06-24 12:36   ` Will Deacon
@ 2022-06-24 12:57     ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-24 12:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, 24 Jun 2022 at 14:36, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 04:45:27PM +0200, Ard Biesheuvel wrote:
> > Setting idmap_t0sz involves fiddling with the caches if done with the
> > MMU off. Since we will be creating an initial ID map with the MMU and
> > caches off, and the permanent ID map with the MMU and caches on, let's
> > move this assignment of idmap_t0sz out of the startup code, and replace
> > it with a macro that simply issues the three instructions needed to
> > calculate the value wherever it is needed before the MMU is turned on.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/include/asm/assembler.h   | 14 ++++++++++++++
> >  arch/arm64/include/asm/mmu_context.h |  2 +-
> >  arch/arm64/kernel/head.S             | 13 +------------
> >  arch/arm64/mm/mmu.c                  |  5 ++++-
> >  arch/arm64/mm/proc.S                 |  2 +-
> >  5 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > index 8c5a61aeaf8e..9468f45c07a6 100644
> > --- a/arch/arm64/include/asm/assembler.h
> > +++ b/arch/arm64/include/asm/assembler.h
> > @@ -359,6 +359,20 @@ alternative_cb_end
> >       bfi     \valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
> >       .endm
> >
> > +/*
> > + * idmap_get_t0sz - get the T0SZ value needed to cover the ID map
> > + *
> > + * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> > + * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> > + * this number conveniently equals the number of leading zeroes in
> > + * the physical address of _end.
> > + */
> > +     .macro  idmap_get_t0sz, reg
> > +     adrp    \reg, _end
> > +     orr     \reg, \reg, #(1 << VA_BITS_MIN) - 1
> > +     clz     \reg, \reg
> > +     .endm
> > +
> >  /*
> >   * tcr_compute_pa_size - set TCR.(I)PS to the highest supported
> >   * ID_AA64MMFR0_EL1.PARange value
> > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> > index 6770667b34a3..6ac0086ebb1a 100644
> > --- a/arch/arm64/include/asm/mmu_context.h
> > +++ b/arch/arm64/include/asm/mmu_context.h
> > @@ -60,7 +60,7 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
> >   * TCR_T0SZ(VA_BITS), unless system RAM is positioned very high in
> >   * physical memory, in which case it will be smaller.
> >   */
> > -extern u64 idmap_t0sz;
> > +extern int idmap_t0sz;
> >  extern u64 idmap_ptrs_per_pgd;
> >
> >  /*
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index dc07858eb673..7f361bc72d12 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -299,22 +299,11 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
> >        * physical address space. So for the ID map, use an extended virtual
> >        * range in that case, and configure an additional translation level
> >        * if needed.
> > -      *
> > -      * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
> > -      * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
> > -      * this number conveniently equals the number of leading zeroes in
> > -      * the physical address of __idmap_text_end.
> >        */
> > -     adrp    x5, __idmap_text_end
> > -     clz     x5, x5
> > +     idmap_get_t0sz x5
> >       cmp     x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
> >       b.ge    1f                      // .. then skip VA range extension
> >
> > -     adr_l   x6, idmap_t0sz
> > -     str     x5, [x6]
> > -     dmb     sy
> > -     dc      ivac, x6                // Invalidate potentially stale cache line
> > -
> >  #if (VA_BITS < 48)
> >  #define EXTRA_SHIFT  (PGDIR_SHIFT + PAGE_SHIFT - 3)
> >  #define EXTRA_PTRS   (1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 17b339c1a326..103bf4ae408d 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -43,7 +43,7 @@
> >  #define NO_CONT_MAPPINGS     BIT(1)
> >  #define NO_EXEC_MAPPINGS     BIT(2)  /* assumes FEAT_HPDS is not used */
> >
> > -u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN);
> > +int idmap_t0sz __ro_after_init;
> >  u64 idmap_ptrs_per_pgd = PTRS_PER_PGD;
> >
> >  #if VA_BITS > 48
> > @@ -785,6 +785,9 @@ void __init paging_init(void)
> >                              (u64)&vabits_actual + sizeof(vabits_actual));
> >  #endif
> >
> > +     idmap_t0sz = min(63UL - __fls(__pa_symbol(_end)),
> > +                      TCR_T0SZ(VA_BITS_MIN));
>
> nit: TCR_T0SZ shifts by TCR_T0SZ_OFFSET, so this is a bit confusing and
> works out because the register offset happens to be zero. Maybe it would
> be clearer to calculate the maximum of fls(__pa_symbol(_end)) and
> VA_BITS_MIN, then subtract that from 64?
>

I just noticed there are other inconsistencies with TCR_T0SZ(), e.g.,
in create_safe_exec_page(), which receives the 'shifted' value of
t0sz, but then shifts it again in cpu_install_ttbr0(). So this is
definitely something that deserves a cleanup of its own.

Let's just use the same expression as in the idmap_get_t0sz macro I am adding:

idmap_t0sz = 63UL - __fls(__pa_symbol(_end) | GENMASK(VA_BITS_MIN - 1, 0));
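
To make the equivalence with '64 - max(#PA bits of _end, VA_BITS_MIN)'
concrete, a worked example with made-up numbers, assuming VA_BITS_MIN == 48:

  __pa_symbol(_end) == 0x80a00000 (i.e. below 1 << 48):
      the OR with GENMASK(47, 0) makes bit 47 the top set bit, so __fls() == 47
      idmap_t0sz == 63 - 47 == 16 == 64 - VA_BITS_MIN

  __pa_symbol(_end) == 1UL << 50 (RAM located above 1 << 48):
      the top set bit is bit 50, so __fls() == 50
      idmap_t0sz == 63 - 50 == 13, i.e. T0SZ shrinks so the ID map covers _end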

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on
  2022-06-24 12:56   ` Will Deacon
@ 2022-06-24 13:07     ` Ard Biesheuvel
  2022-06-24 13:29       ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-24 13:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, 24 Jun 2022 at 14:56, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 04:45:41PM +0200, Ard Biesheuvel wrote:
> > Now that we can access the entire kernel image via the ID map, we can
> > execute the page table population code with the MMU and caches enabled.
> > The only thing we need to ensure is that translations via TTBR1 remain
> > disabled while we are updating the page tables the second time around,
> > in case KASLR wants them to be randomized.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/kernel/head.S | 62 +++++---------------
> >  1 file changed, 16 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index d704d0bd8ffc..583cbea865e1 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -85,8 +85,6 @@
> >        *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
> >        *  x22        create_idmap() .. start_kernel()         ID map VA of the DT blob
> >        *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
> > -      *  x28        clear_page_tables()                      callee preserved temp register
> > -      *  x19/x20    __primary_switch()                       callee preserved temp registers
> >        *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
> >        *  x28        create_idmap()                           callee preserved temp register
> >        */
> > @@ -96,9 +94,7 @@ SYM_CODE_START(primary_entry)
> >       adrp    x23, __PHYS_OFFSET
> >       and     x23, x23, MIN_KIMG_ALIGN - 1    // KASLR offset, defaults to 0
> >       bl      set_cpu_boot_mode_flag
> > -     bl      clear_page_tables
> >       bl      create_idmap
> > -     bl      create_kernel_mapping
> >
> >       /*
> >        * The following calls CPU setup code, see arch/arm64/mm/proc.S for
> > @@ -128,32 +124,14 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
> >  SYM_CODE_END(preserve_boot_args)
> >
> >  SYM_FUNC_START_LOCAL(clear_page_tables)
> > -     mov     x28, lr
> > -
> > -     /*
> > -      * Invalidate the init page tables to avoid potential dirty cache lines
> > -      * being evicted. Other page tables are allocated in rodata as part of
> > -      * the kernel image, and thus are clean to the PoC per the boot
> > -      * protocol.
> > -      */
> > -     adrp    x0, init_pg_dir
> > -     adrp    x1, init_pg_end
> > -     bl      dcache_inval_poc
> > -
> >       /*
> >        * Clear the init page tables.
> >        */
> >       adrp    x0, init_pg_dir
> >       adrp    x1, init_pg_end
> > -     sub     x1, x1, x0
> > -1:   stp     xzr, xzr, [x0], #16
> > -     stp     xzr, xzr, [x0], #16
> > -     stp     xzr, xzr, [x0], #16
> > -     stp     xzr, xzr, [x0], #16
> > -     subs    x1, x1, #64
> > -     b.ne    1b
> > -
> > -     ret     x28
> > +     sub     x2, x1, x0
> > +     mov     x1, xzr
> > +     b       __pi_memset                     // tail call
> >  SYM_FUNC_END(clear_page_tables)
> >
> >  /*
> > @@ -399,16 +377,8 @@ SYM_FUNC_START_LOCAL(create_kernel_mapping)
> >
> >       map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
> >
> > -     /*
> > -      * Since the page tables have been populated with non-cacheable
> > -      * accesses (MMU disabled), invalidate those tables again to
> > -      * remove any speculatively loaded cache lines.
> > -      */
> > -     dmb     sy
> > -
> > -     adrp    x0, init_pg_dir
> > -     adrp    x1, init_pg_end
> > -     b       dcache_inval_poc                // tail call
> > +     dsb     ishst                           // sync with page table walker
> > +     ret
> >  SYM_FUNC_END(create_kernel_mapping)
> >
> >       /*
> > @@ -863,14 +833,15 @@ SYM_FUNC_END(__relocate_kernel)
> >  #endif
> >
> >  SYM_FUNC_START_LOCAL(__primary_switch)
> > -#ifdef CONFIG_RANDOMIZE_BASE
> > -     mov     x19, x0                         // preserve new SCTLR_EL1 value
> > -     mrs     x20, sctlr_el1                  // preserve old SCTLR_EL1 value
> > -#endif
> > -
> > -     adrp    x1, init_pg_dir
> > +     adrp    x1, reserved_pg_dir
> >       adrp    x2, init_idmap_pg_dir
> >       bl      __enable_mmu
> > +
> > +     bl      clear_page_tables
> > +     bl      create_kernel_mapping
> > +
> > +     adrp    x1, init_pg_dir
> > +     load_ttbr1 x1, x1, x2
> >  #ifdef CONFIG_RELOCATABLE
> >  #ifdef CONFIG_RELR
> >       mov     x24, #0                         // no RELR displacement yet
> > @@ -886,9 +857,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
> >        * to take into account by discarding the current kernel mapping and
> >        * creating a new one.
> >        */
> > -     pre_disable_mmu_workaround
> > -     msr     sctlr_el1, x20                  // disable the MMU
> > -     isb
> > +     adrp    x1, reserved_pg_dir             // Disable translations via TTBR1
> > +     load_ttbr1 x1, x1, x2
>
> I'd have thought we'd need some TLB maintenance here... is that not the
> case?
>

You mean at this particular point? We are running from the ID map with
TTBR1 translations disabled. We clear the page tables, repopulate
them, and perform a TLBI VMALLE1.

So are you saying repopulating the page tables while translations are
disabled needs to occur only after doing TLB maintenance?

> Also, it might be a tiny bit easier to clear EPD1 instead of using the
> reserved_pg_dir.
>

Right. So is there any reason in particular why it would be
appropriate here but not anywhere else? IOW, why do we have
reserved_pg_dir in the first place if we can just flick EPD1 on and
off?
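
(For reference, a rough C-level sketch of that TCR_EL1.EPD1 toggle, assuming
a TCR_EPD1_MASK definition for the EPD1 bit - the code in this series does
the equivalent in head.S assembly, so this only illustrates the idea:)

	/* set EPD1 so walks via TTBR1 fault instead of being performed */
	write_sysreg(read_sysreg(tcr_el1) | TCR_EPD1_MASK, tcr_el1);
	isb();

	/* ... clear and repopulate init_pg_dir here ... */

	/* re-enable TTBR1 walks and discard any stale TLB entries */
	write_sysreg(read_sysreg(tcr_el1) & ~TCR_EPD1_MASK, tcr_el1);
	local_flush_tlb_all();
	isb();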

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted
  2022-06-13 14:45 ` [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted Ard Biesheuvel
@ 2022-06-24 13:08   ` Will Deacon
  2022-06-24 13:09     ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2022-06-24 13:08 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:43PM +0200, Ard Biesheuvel wrote:
> The early KASLR init code runs extremely early, and anything that could
> be deferred until later should be. So let's defer the randomization of
> the module region until much later - this also simplifies the
> arithmetic, given that we no longer have to reason about the link time
> vs load time placement of the core kernel explicitly. Also get rid of
> the global status variable, and infer the status reported by the
> diagnostic print from other KASLR related context.
> 
> While at it, get rid of the special case for KASAN without
> KASAN_VMALLOC, which never occurs in practice.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/kernel/kaslr.c | 95 +++++++++-----------
>  1 file changed, 40 insertions(+), 55 deletions(-)

[...]

> @@ -163,33 +169,12 @@ u64 __init kaslr_early_init(void)
>  		 * when ARM64_MODULE_PLTS is enabled.
>  		 */
>  		module_range = MODULES_VSIZE - (u64)(_etext - _stext);
> -		module_alloc_base = (u64)_etext + offset - MODULES_VSIZE;
>  	}
>  
>  	/* use the lower 21 bits to randomize the base of the module region */
>  	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
>  	module_alloc_base &= PAGE_MASK;
>  
> -	return offset;
> -}
> -
> -static int __init kaslr_init(void)
> -{
> -	switch (kaslr_status) {
> -	case KASLR_ENABLED:
> -		pr_info("KASLR enabled\n");
> -		break;
> -	case KASLR_DISABLED_CMDLINE:
> -		pr_info("KASLR disabled on command line\n");
> -		break;
> -	case KASLR_DISABLED_NO_SEED:
> -		pr_warn("KASLR disabled due to lack of seed\n");
> -		break;
> -	case KASLR_DISABLED_FDT_REMAP:
> -		pr_warn("KASLR disabled due to FDT remapping failure\n");
> -		break;
> -	}
> -
>  	return 0;
>  }
> -core_initcall(kaslr_init)
> +late_initcall(kaslr_init)

Are you sure this isn't too late? I'm nervous that we might have called
request_module() off the back of all the other initcalls that we've run by
this point.

Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted
  2022-06-24 13:08   ` Will Deacon
@ 2022-06-24 13:09     ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-24 13:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, 24 Jun 2022 at 15:08, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 04:45:43PM +0200, Ard Biesheuvel wrote:
> > The early KASLR init code runs extremely early, and anything that could
> > be deferred until later should be. So let's defer the randomization of
> > the module region until much later - this also simplifies the
> > arithmetic, given that we no longer have to reason about the link time
> > vs load time placement of the core kernel explicitly. Also get rid of
> > the global status variable, and infer the status reported by the
> > diagnostic print from other KASLR related context.
> >
> > While at it, get rid of the special case for KASAN without
> > KASAN_VMALLOC, which never occurs in practice.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/kernel/kaslr.c | 95 +++++++++-----------
> >  1 file changed, 40 insertions(+), 55 deletions(-)
>
> [...]
>
> > @@ -163,33 +169,12 @@ u64 __init kaslr_early_init(void)
> >                * when ARM64_MODULE_PLTS is enabled.
> >                */
> >               module_range = MODULES_VSIZE - (u64)(_etext - _stext);
> > -             module_alloc_base = (u64)_etext + offset - MODULES_VSIZE;
> >       }
> >
> >       /* use the lower 21 bits to randomize the base of the module region */
> >       module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
> >       module_alloc_base &= PAGE_MASK;
> >
> > -     return offset;
> > -}
> > -
> > -static int __init kaslr_init(void)
> > -{
> > -     switch (kaslr_status) {
> > -     case KASLR_ENABLED:
> > -             pr_info("KASLR enabled\n");
> > -             break;
> > -     case KASLR_DISABLED_CMDLINE:
> > -             pr_info("KASLR disabled on command line\n");
> > -             break;
> > -     case KASLR_DISABLED_NO_SEED:
> > -             pr_warn("KASLR disabled due to lack of seed\n");
> > -             break;
> > -     case KASLR_DISABLED_FDT_REMAP:
> > -             pr_warn("KASLR disabled due to FDT remapping failure\n");
> > -             break;
> > -     }
> > -
> >       return 0;
> >  }
> > -core_initcall(kaslr_init)
> > +late_initcall(kaslr_init)
>
> Are you sure this isn't too late? I'm nervous that we might have called
> request_module() off the back of all the other initcalls that we've run by
> this point.
>

Yeah, I just realized the other day that this is probably too late.
subsys_initcall() might be more suitable here.
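
(For reference, the built-in initcall levels run in the order core,
postcore, arch, subsys, fs, device, late, per include/linux/init.h, so a
subsys_initcall() would still run before the bulk of the driver probing
at device/late time that is likely to end up in request_module(). A
minimal sketch of what that registration would look like; the body is the
same deferred initializer as in the patch, elided here.)

  /* Sketch only: the deferred initializer from the patch, registered at
   * subsys_initcall level instead of late_initcall.
   */
  static int __init kaslr_init(void)
  {
          /* randomize module_alloc_base and print the KASLR status,
           * exactly as in the patch (body elided)
           */
          return 0;
  }
  subsys_initcall(kaslr_init);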

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR
  2022-06-13 14:45 ` [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR Ard Biesheuvel
@ 2022-06-24 13:16   ` Will Deacon
  2022-06-24 13:17     ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2022-06-24 13:16 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Mon, Jun 13, 2022 at 04:45:44PM +0200, Ard Biesheuvel wrote:
> Currently, when KASLR is in effect, we set up the kernel virtual address
> space twice: the first time, the KASLR seed is looked up in the device
> tree, and the kernel virtual mapping is torn down and recreated again,
> after which the relocations are applied a second time. The latter step
> means that statically initialized global pointer variables will be reset
> to their initial values, and to ensure that BSS variables are not set to
> values based on the initial translation, they are cleared again as well.
> 
> All of this is needed because we need the command line (taken from the
> DT) to tell us whether or not to randomize the virtual address space
> before entering the kernel proper. However, this code has expanded
> little by little and now creates global state unrelated to the virtual
> randomization of the kernel before the mapping is torn down and set up
> again, and the BSS cleared for a second time. This has created some
> issues in the past, and it would be better to avoid this little dance if
> possible.
> 
> So instead, let's use the temporary mapping of the device tree, and
> execute the bare minimum of code to decide whether or not KASLR should
> be enabled, and what the seed is. Only then, create the virtual kernel
> mapping, clear BSS, etc and proceed as normal.  This avoids the issues
> around inconsistent global state due to BSS being cleared twice, and is
> generally more maintainable, as it permits us to defer all the remaining
> DT parsing and KASLR initialization to a later time.
> 
> This means the relocation fixup code runs only a single time as well,
> allowing us to simplify the RELR handling code too, which is not
> idempotent and was therefore required to keep track of the offset that
> was applied the first time around.
> 
> Note that this means we have to clone a pair of FDT library objects, so
> that we can control how they are built - we need the stack protector
> and other instrumentation disabled so that the code can tolerate being
> called this early. Note that only the kernel page tables and the
> temporary stack are mapped read-write at this point, which ensures that
> the early code does not modify any global state inadvertently.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/kernel/Makefile         |   2 +-
>  arch/arm64/kernel/head.S           |  73 ++++---------
>  arch/arm64/kernel/image-vars.h     |   4 +
>  arch/arm64/kernel/kaslr.c          |  87 ---------------
>  arch/arm64/kernel/pi/Makefile      |  33 ++++++
>  arch/arm64/kernel/pi/kaslr_early.c | 112 ++++++++++++++++++++

Heh, how long before we get a decompressor in here too?

Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR
  2022-06-24 13:16   ` Will Deacon
@ 2022-06-24 13:17     ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-24 13:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, 24 Jun 2022 at 15:16, Will Deacon <will@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 04:45:44PM +0200, Ard Biesheuvel wrote:
> > Currently, when KASLR is in effect, we set up the kernel virtual address
> > space twice: the first time, the KASLR seed is looked up in the device
> > tree, and the kernel virtual mapping is torn down and recreated again,
> > after which the relocations are applied a second time. The latter step
> > means that statically initialized global pointer variables will be reset
> > to their initial values, and to ensure that BSS variables are not set to
> > values based on the initial translation, they are cleared again as well.
> >
> > All of this is needed because we need the command line (taken from the
> > DT) to tell us whether or not to randomize the virtual address space
> > before entering the kernel proper. However, this code has expanded
> > little by little and now creates global state unrelated to the virtual
> > randomization of the kernel before the mapping is torn down and set up
> > again, and the BSS cleared for a second time. This has created some
> > issues in the past, and it would be better to avoid this little dance if
> > possible.
> >
> > So instead, let's use the temporary mapping of the device tree, and
> > execute the bare minimum of code to decide whether or not KASLR should
> > be enabled, and what the seed is. Only then, create the virtual kernel
> > mapping, clear BSS, etc and proceed as normal.  This avoids the issues
> > around inconsistent global state due to BSS being cleared twice, and is
> > generally more maintainable, as it permits us to defer all the remaining
> > DT parsing and KASLR initialization to a later time.
> >
> > This means the relocation fixup code runs only a single time as well,
> > allowing us to simplify the RELR handling code too, which is not
> > idempotent and was therefore required to keep track of the offset that
> > was applied the first time around.
> >
> > Note that this means we have to clone a pair of FDT library objects, so
> > that we can control how they are built - we need the stack protector
> > and other instrumentation disabled so that the code can tolerate being
> > called this early. Note that only the kernel page tables and the
> > temporary stack are mapped read-write at this point, which ensures that
> > the early code does not modify any global state inadvertently.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/arm64/kernel/Makefile         |   2 +-
> >  arch/arm64/kernel/head.S           |  73 ++++---------
> >  arch/arm64/kernel/image-vars.h     |   4 +
> >  arch/arm64/kernel/kaslr.c          |  87 ---------------
> >  arch/arm64/kernel/pi/Makefile      |  33 ++++++
> >  arch/arm64/kernel/pi/kaslr_early.c | 112 ++++++++++++++++++++
>
> Heh, how long before we get a decompressor in here too?
>

Right after BPF support :-)
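
(More seriously, on the RELR remark in the commit message above: RELR
entries carry no explicit addend, the addend is whatever value is already
stored at the target, so applying them means adding the load offset in
place. That is why a second pass had to know the offset applied by the
first one. A throwaway standalone sketch with made-up numbers:)

  #include <stdint.h>
  #include <stdio.h>

  /* RELR-style relative relocation: in-place add of the load offset. */
  static void apply_relr(uint64_t *place, uint64_t offset)
  {
          *place += offset;
  }

  int main(void)
  {
          uint64_t ptr = 0xffff800008123456ULL;   /* link-time value         */
          uint64_t first = 0x200000;              /* offset used on pass 1   */
          uint64_t second = 0x580000;             /* offset wanted on pass 2 */

          apply_relr(&ptr, first);
          /* A naive second pass, apply_relr(&ptr, second), would add on top
           * of the already-relocated value; the old code therefore had to
           * remember `first` and apply only the difference.
           */
          apply_relr(&ptr, second - first);

          printf("%#llx\n", (unsigned long long)ptr); /* link-time value + second */
          return 0;
  }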

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN
  2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
                   ` (25 preceding siblings ...)
  2022-06-13 14:45 ` [PATCH v4 26/26] arm64: kernel: move ID map out of .text mapping Ard Biesheuvel
@ 2022-06-24 13:19 ` Will Deacon
  2022-06-24 14:40   ` Ard Biesheuvel
  26 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2022-06-24 13:19 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

Hi Ard,

On Mon, Jun 13, 2022 at 04:45:24PM +0200, Ard Biesheuvel wrote:
> [ TL;DR this series does the following:
>   - move variable definitions and assignments out of early asm code
>     where possible, and get rid of explicit cache maintenance;
>   - convert initial ID map so it covers the entire loaded image as well
>     as the DT blob;
>   - create the kernel mapping only once instead of twice (for KASLR),
>     and do it with the MMU and caches on;
>   - avoid mappings that are both writable and executable entirely;
>   - avoid parsing the DT while the kernel text and rodata are still
>     mapped writable;
>   - allow WXN to be enabled (with an opt-out) so writable mappings are
>     never executable. ]

I really like this series -- it removes quite a few ugly warts from our
boot assembly that we've collected over the years and, while functional,
they have never been particularly satisfactory. Thank you for putting it
together.

I've left a handful of minor comments on some of the patches and if you
can address those then I'd like to queue the first 21 patches ASAP to
give them some more exposure before the next merge window.

The remaining patches are the WXN pieces, which I'd like to give others
a chance to chime in on first.

Cheers,

Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on
  2022-06-24 13:07     ` Ard Biesheuvel
@ 2022-06-24 13:29       ` Will Deacon
  2022-06-24 14:07         ` Ard Biesheuvel
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2022-06-24 13:29 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, Jun 24, 2022 at 03:07:44PM +0200, Ard Biesheuvel wrote:
> On Fri, 24 Jun 2022 at 14:56, Will Deacon <will@kernel.org> wrote:
> >
> > On Mon, Jun 13, 2022 at 04:45:41PM +0200, Ard Biesheuvel wrote:
> > > Now that we can access the entire kernel image via the ID map, we can
> > > execute the page table population code with the MMU and caches enabled.
> > > The only thing we need to ensure is that translations via TTBR1 remain
> > > disabled while we are updating the page tables the second time around,
> > > in case KASLR wants them to be randomized.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > >  arch/arm64/kernel/head.S | 62 +++++---------------
> > >  1 file changed, 16 insertions(+), 46 deletions(-)

[...]

> > > @@ -886,9 +857,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
> > >        * to take into account by discarding the current kernel mapping and
> > >        * creating a new one.
> > >        */
> > > -     pre_disable_mmu_workaround
> > > -     msr     sctlr_el1, x20                  // disable the MMU
> > > -     isb
> > > +     adrp    x1, reserved_pg_dir             // Disable translations via TTBR1
> > > +     load_ttbr1 x1, x1, x2
> >
> > I'd have thought we'd need some TLB maintenance here... is that not the
> > case?
> >
> 
> You mean at this particular point? We are running from the ID map with
> TTBR1 translations disabled. We clear the page tables, repopulate
> them, and perform a TLBI VMALLE1.
> 
> So are you saying that, even with translations disabled, repopulating
> the page tables still needs to be preceded by TLB maintenance?

I'm thinking about walk cache entries from the previous page-table, which
would make the reserved_pg_dir ineffective. However, if we're clearing the
page-table anyway, I'm not even sure why we need reserved_pg_dir at all!

> > Also, it might be a tiny bit easier to clear EPD1 instead of using the
> > reserved_pg_dir.
> >
> 
> Right. So is there any reason in particular why it would be
> appropriate here but not anywhere else? IOW, why do we have
> reserved_pg_dir in the first place if we can just flick EPD1 on and
> off?

I think using a reserved (all zeroes) page-table makes sense when it
has its own ASID, as you can switch to/from it without TLB invalidation,
but that doesn't seem to be the case here. Anyway, no strong preference,
I just thought it might simplify things a bit.

Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on
  2022-06-24 13:29       ` Will Deacon
@ 2022-06-24 14:07         ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-24 14:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, 24 Jun 2022 at 15:29, Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jun 24, 2022 at 03:07:44PM +0200, Ard Biesheuvel wrote:
> > On Fri, 24 Jun 2022 at 14:56, Will Deacon <will@kernel.org> wrote:
> > >
> > > On Mon, Jun 13, 2022 at 04:45:41PM +0200, Ard Biesheuvel wrote:
> > > > Now that we can access the entire kernel image via the ID map, we can
> > > > execute the page table population code with the MMU and caches enabled.
> > > > The only thing we need to ensure is that translations via TTBR1 remain
> > > > disabled while we are updating the page tables the second time around,
> > > > in case KASLR wants them to be randomized.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > > ---
> > > >  arch/arm64/kernel/head.S | 62 +++++---------------
> > > >  1 file changed, 16 insertions(+), 46 deletions(-)
>
> [...]
>
> > > > @@ -886,9 +857,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
> > > >        * to take into account by discarding the current kernel mapping and
> > > >        * creating a new one.
> > > >        */
> > > > -     pre_disable_mmu_workaround
> > > > -     msr     sctlr_el1, x20                  // disable the MMU
> > > > -     isb
> > > > +     adrp    x1, reserved_pg_dir             // Disable translations via TTBR1
> > > > +     load_ttbr1 x1, x1, x2
> > >
> > > I'd have thought we'd need some TLB maintenance here... is that not the
> > > case?
> > >
> >
> > You mean at this particular point? We are running from the ID map with
> > TTBR1 translations disabled. We clear the page tables, repopulate
> > them, and perform a TLBI VMALLE1.
> >
> > So are you saying that, even with translations disabled, repopulating
> > the page tables still needs to be preceded by TLB maintenance?
>
> I'm thinking about walk cache entries from the previous page-table, which
> would make the reserved_pg_dir ineffective. However, if we're clearing the
> page-table anyway, I'm not even sure why we need reserved_pg_dir at all!
>

Perhaps not. But this code is removed again two patches later so it
doesn't matter that much to begin with.

> > > Also, it might be a tiny bit easier to clear EPD1 instead of using the
> > > reserved_pg_dir.
> > >
> >
> > Right. So is there any reason in particular why it would be
> > appropriate here but not anywhere else? IOW, why do we have
> > reserved_pg_dir in the first place if we can just flick EPD1 on and
> > off?
>
> I think using a reserved (all zeroes) page-table makes sense when it
> has its own ASID, as you can switch to/from it without TLB invalidation,
> but that doesn't seem to be the case here. Anyway, no strong preference,
> I just thought it might simplify things a bit.
>

Ah right, I hadn't considered ASIDs.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN
  2022-06-24 13:19 ` [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Will Deacon
@ 2022-06-24 14:40   ` Ard Biesheuvel
  0 siblings, 0 replies; 57+ messages in thread
From: Ard Biesheuvel @ 2022-06-24 14:40 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux ARM, linux-hardening, Marc Zyngier, Mark Rutland,
	Kees Cook, Catalin Marinas, Mark Brown, Anshuman Khandual

On Fri, 24 Jun 2022 at 15:20, Will Deacon <will@kernel.org> wrote:
>
> Hi Ard,
>
> On Mon, Jun 13, 2022 at 04:45:24PM +0200, Ard Biesheuvel wrote:
> > [ TL;DR this series does the following:
> >   - move variable definitions and assignments out of early asm code
> >     where possible, and get rid of explicit cache maintenance;
> >   - convert initial ID map so it covers the entire loaded image as well
> >     as the DT blob;
> >   - create the kernel mapping only once instead of twice (for KASLR),
> >     and do it with the MMU and caches on;
> >   - avoid mappings that are both writable and executable entirely;
> >   - avoid parsing the DT while the kernel text and rodata are still
> >     mapped writable;
> >   - allow WXN to be enabled (with an opt-out) so writable mappings are
> >     never executable. ]
>
> I really like this series -- it removes quite a few ugly warts from our
> boot assembly that we've collected over the years and, while functional,
> they have never been particularly satisfactory. Thank you for putting it
> together.
>
> I've left a handful of minor comments on some of the patches and if you
> can address those then I'd like to queue the first 21 patches ASAP to
> give them some more exposure before the next merge window.
>

I'll spin a v5 with just those patches, and we can revisit the
remaining work at a later time.

> The remaining patches are the WXN pieces, which I'd like to give others
> a chance to chime in on first.
>
> Cheers,
>
> Will

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-06-24 14:40 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-13 14:45 [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 01/26] arm64: head: move kimage_vaddr variable into C file Ard Biesheuvel
2022-06-14  8:26   ` Anshuman Khandual
2022-06-13 14:45 ` [PATCH v4 02/26] arm64: mm: make vabits_actual a build time constant if possible Ard Biesheuvel
2022-06-14  8:25   ` Anshuman Khandual
2022-06-14  8:34     ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 03/26] arm64: head: move assignment of idmap_t0sz to C code Ard Biesheuvel
2022-06-14  9:22   ` Anshuman Khandual
2022-06-14  9:34     ` Ard Biesheuvel
2022-06-24 12:36   ` Will Deacon
2022-06-24 12:57     ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 04/26] arm64: head: drop idmap_ptrs_per_pgd Ard Biesheuvel
2022-06-15  4:07   ` Anshuman Khandual
2022-06-13 14:45 ` [PATCH v4 05/26] arm64: head: simplify page table mapping macros (slightly) Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 06/26] arm64: head: switch to map_memory macro for the extended ID map Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 07/26] arm64: head: split off idmap creation code Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 08/26] arm64: kernel: drop unnecessary PoC cache clean+invalidate Ard Biesheuvel
2022-06-15  4:32   ` Anshuman Khandual
2022-06-13 14:45 ` [PATCH v4 09/26] arm64: head: pass ID map root table address to __enable_mmu() Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 10/26] arm64: mm: provide idmap pointer to cpu_replace_ttbr1() Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 11/26] arm64: head: add helper function to remap regions in early page tables Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 12/26] arm64: head: cover entire kernel image in initial ID map Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 13/26] arm64: head: use relative references to the RELA and RELR tables Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 14/26] arm64: head: create a temporary FDT mapping in the initial ID map Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 15/26] arm64: idreg-override: use early FDT mapping in " Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 16/26] arm64: head: factor out TTBR1 assignment into a macro Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 17/26] arm64: head: populate kernel page tables with MMU and caches on Ard Biesheuvel
2022-06-24 12:56   ` Will Deacon
2022-06-24 13:07     ` Ard Biesheuvel
2022-06-24 13:29       ` Will Deacon
2022-06-24 14:07         ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 18/26] arm64: head: record CPU boot mode after enabling the MMU Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 19/26] arm64: kaslr: defer initialization to late initcall where permitted Ard Biesheuvel
2022-06-24 13:08   ` Will Deacon
2022-06-24 13:09     ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 20/26] arm64: head: avoid relocating the kernel twice for KASLR Ard Biesheuvel
2022-06-24 13:16   ` Will Deacon
2022-06-24 13:17     ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 21/26] arm64: setup: drop early FDT pointer helpers Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 22/26] arm64: mm: move ro_after_init section into the data segment Ard Biesheuvel
2022-06-13 17:00   ` Kees Cook
2022-06-13 17:16     ` Ard Biesheuvel
2022-06-13 23:38       ` Kees Cook
2022-06-16 11:31         ` Ard Biesheuvel
2022-06-16 16:18           ` Kees Cook
2022-06-16 16:31             ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 23/26] arm64: head: remap the kernel text/inittext region read-only Ard Biesheuvel
2022-06-13 16:57   ` Kees Cook
2022-06-13 14:45 ` [PATCH v4 24/26] mm: add arch hook to validate mmap() prot flags Ard Biesheuvel
2022-06-13 16:37   ` Kees Cook
2022-06-13 16:44     ` Ard Biesheuvel
2022-06-13 14:45 ` [PATCH v4 25/26] arm64: mm: add support for WXN memory translation attribute Ard Biesheuvel
2022-06-13 16:51   ` Kees Cook
2022-06-13 14:45 ` [PATCH v4 26/26] arm64: kernel: move ID map out of .text mapping Ard Biesheuvel
2022-06-13 16:52   ` Kees Cook
2022-06-24 13:19 ` [PATCH v4 00/26] arm64: refactor boot flow and add support for WXN Will Deacon
2022-06-24 14:40   ` Ard Biesheuvel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).