* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
@ 2015-03-16 15:23 Ard Biesheuvel
  2015-03-16 15:23 ` [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates Ard Biesheuvel
                   ` (4 more replies)
  0 siblings, 5 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-16 15:23 UTC (permalink / raw)
  To: linux-arm-kernel

These patches are a proof of concept of how we could potentially relocate
the kernel at runtime. The code is rough around the edges, and there are
a few unresolved issues, hence this RFC.

With these patches, the kernel can essentially execute at any virtual offset.
There are a couple of reasons why we would want this:
- performance: we can align PHYS_OFFSET so that most of the linear mapping can
  be done using 512 MB or 1 GB blocks (depending on page size), instead of
  the more granular level that is currently unavoidable if Image cannot be
  loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
  Image).
- security: with these changes, we can put the kernel Image anywhere in physical
  memory, and we can put the physical memory anywhere in the upper half of the
  virtual address range (modulo alignment). This gives us quite a number of
  bits to play with if we were to randomize the kernel virtual address space.
  Also, this is entirely under the control of the boot loader, which is probably
  in a better position to get its hands on some entropy than the early kernel
  boot code.
- convenience: fewer constraints when loading the kernel into memory, as it
  can execute from anywhere.

How it works:
- an additional boot argument 'image offset' is passed in x1 by the boot loader,
  which should contain a value that is at least the offset of Image into physical
  memory. Higher values are also possible, and may be used to randomize the
  kernel VA space.
- the kernel binary is runtime relocated to PAGE_OFFSET + image offset

Issues:
- Since AArch64 uses the ELF RELA format (where the addends are in the
  relocation table and not in the code), the relocations need to be applied even
  if the Image runs from the same offset it was linked at. It also means that
  some values that are produced by the linker (_kernel_size_le, etc) are missing
  from the binary. This will probably need a fixup step (see the sketch after
  this list).
- The module area may be out of range, which needs to be worked around with
  module PLTs. This is straightforward but I haven't implemented it yet for
  arm64.
- The core extable is most likely broken, and would need to be changed to use
  relative offsets instead of absolute addresses.
- Probably lots and lots of other roadblocks, hence this RFC.
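
To make the RELA point concrete, the sketch below is a rough C equivalent of
what the boot code in patch 2 ends up doing (purely illustrative; the helper
and its arguments are not part of the patches):

#include <elf.h>
#include <stdint.h>

/*
 * Walk the .rela entries emitted by the linker. 'va_to_pa' converts the
 * link-time virtual address in r_offset into the address the Image actually
 * occupies, and 'offset' is the image offset passed in x1. With RELA the
 * entries must be applied even when 'offset' is zero, since the addends
 * live here and not in the code.
 */
static void apply_relocs(Elf64_Rela *rela, Elf64_Rela *rela_end,
			 Elf64_Sym *dynsym, uint64_t va_to_pa, uint64_t offset)
{
	for (; rela < rela_end; rela++) {
		uint64_t *place = (uint64_t *)(rela->r_offset + va_to_pa);

		switch (ELF64_R_TYPE(rela->r_info)) {
		case R_AARCH64_RELATIVE:
			/* new value is simply the link-time addend + offset */
			*place = rela->r_addend + offset;
			break;
		case R_AARCH64_ABS64: {
			Elf64_Sym *sym = &dynsym[ELF64_R_SYM(rela->r_info)];
			uint64_t val = sym->st_value + rela->r_addend;

			/* SHN_ABS symbols (_kernel_size_le etc) do not move */
			if (sym->st_shndx != SHN_ABS)
				val += offset;
			*place = val;
			break;
		}
		}
	}
}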

Output from QEMU/efi with 2 GB of memory:

Base of RAM:
  0x000040000000-0x00004000ffff [Loader Data        |   |  |  |  |   |WB|WT|WC|UC]

Physical location of Image:
  0x00007f400000-0x00007fe6ffff [Loader Data        |   |  |  |  |   |WB|WT|WC|UC]

Virtual kernel memory layout:
    vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
    vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
              0xffffffbdc1000000 - 0xffffffbdc3000000   (    32 MB actual)
    fixed   : 0xffffffbffa9fe000 - 0xffffffbffac00000   (  2056 KB)
    PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000   (    16 MB)
    modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
    memory  : 0xffffffc000000000 - 0xffffffc080000000   (  2048 MB)
      .init : 0xffffffc03fb82000 - 0xffffffc03fdbe000   (  2288 KB)
      .text : 0xffffffc03f480000 - 0xffffffc03fb81344   (  7173 KB)
      .data : 0xffffffc03fdc2000 - 0xffffffc03fe28c00   (   411 KB)
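
(As a sanity check on these numbers: the Image region sits at physical
0x7f400000, so the image offset passed in x1 by the stub in patch 3 works out
to 0x7f400000 & (SZ_1G - 1) = 0x3f400000, and with the default TEXT_OFFSET of
0x80000 the kernel text lands at PAGE_OFFSET + 0x3f400000 + 0x80000 =
0xffffffc03f480000, matching the .text line above.)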


Ard Biesheuvel (3):
  arm64: head.S: replace early literals with constant immediates
  arm64: add support for relocatable kernel
  arm64/efi: use relocated kernel

 arch/arm64/Kconfig              |  3 ++
 arch/arm64/Makefile             |  4 ++
 arch/arm64/kernel/efi-entry.S   |  6 ++-
 arch/arm64/kernel/efi-stub.c    |  5 ++-
 arch/arm64/kernel/head.S        | 94 +++++++++++++++++++++++++++++++----------
 arch/arm64/kernel/image.h       |  8 +++-
 arch/arm64/kernel/vmlinux.lds.S | 12 ++++++
 scripts/sortextable.c           |  4 +-
 8 files changed, 107 insertions(+), 29 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates
  2015-03-16 15:23 [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Ard Biesheuvel
@ 2015-03-16 15:23 ` Ard Biesheuvel
  2015-03-16 17:14   ` Mark Rutland
  2015-03-16 15:23 ` [RFC PATCH 2/3] arm64: add support for relocatable kernel Ard Biesheuvel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-16 15:23 UTC (permalink / raw)
  To: linux-arm-kernel

In preparation for making the kernel relocatable, replace literal
symbol references with immediate constants. This is necessary, as
the literals will not be populated with meaningful values until
after the relocation code has executed.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/kernel/efi-entry.S |  2 +-
 arch/arm64/kernel/head.S      | 36 +++++++++++++++---------------------
 2 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
index 8ce9b0577442..f78e6a1de825 100644
--- a/arch/arm64/kernel/efi-entry.S
+++ b/arch/arm64/kernel/efi-entry.S
@@ -61,7 +61,7 @@ ENTRY(efi_stub_entry)
 	 */
 	mov	x20, x0		// DTB address
 	ldr	x0, [sp, #16]	// relocated _text address
-	ldr	x21, =stext_offset
+	movz	x21, #:abs_g0:stext_offset
 	add	x21, x0, x21
 
 	/*
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 9c856f2aa7a5..1ea3cd2aba34 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -46,11 +46,9 @@
 #error TEXT_OFFSET must be less than 2MB
 #endif
 
-	.macro	pgtbl, ttb0, ttb1, virt_to_phys
-	ldr	\ttb1, =swapper_pg_dir
-	ldr	\ttb0, =idmap_pg_dir
-	add	\ttb1, \ttb1, \virt_to_phys
-	add	\ttb0, \ttb0, \virt_to_phys
+	.macro	pgtbl, ttb0, ttb1
+	adrp	\ttb1, swapper_pg_dir
+	adrp	\ttb0, idmap_pg_dir
 	.endm
 
 #ifdef CONFIG_ARM64_64K_PAGES
@@ -63,7 +61,7 @@
 #define TABLE_SHIFT	PUD_SHIFT
 #endif
 
-#define KERNEL_START	KERNEL_RAM_VADDR
+#define KERNEL_START	_text
 #define KERNEL_END	_end
 
 /*
@@ -322,7 +320,7 @@ ENDPROC(stext)
  *     been enabled
  */
 __create_page_tables:
-	pgtbl	x25, x26, x28			// idmap_pg_dir and swapper_pg_dir addresses
+	pgtbl	x25, x26			// idmap_pg_dir and swapper_pg_dir addresses
 	mov	x27, lr
 
 	/*
@@ -351,8 +349,7 @@ __create_page_tables:
 	 * Create the identity mapping.
 	 */
 	mov	x0, x25				// idmap_pg_dir
-	ldr	x3, =KERNEL_START
-	add	x3, x3, x28			// __pa(KERNEL_START)
+	adr_l	x3, KERNEL_START		// __pa(KERNEL_START)
 
 #ifndef CONFIG_ARM64_VA_BITS_48
 #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
@@ -391,9 +388,8 @@ __create_page_tables:
 #endif
 
 	create_pgd_entry x0, x3, x5, x6
-	ldr	x6, =KERNEL_END
 	mov	x5, x3				// __pa(KERNEL_START)
-	add	x6, x6, x28			// __pa(KERNEL_END)
+	adr_l	x6, KERNEL_END			// __pa(KERNEL_END)
 	create_block_map x0, x7, x3, x5, x6
 
 	/*
@@ -402,8 +398,10 @@ __create_page_tables:
 	mov	x0, x26				// swapper_pg_dir
 	mov	x5, #PAGE_OFFSET
 	create_pgd_entry x0, x5, x3, x6
-	ldr	x6, =KERNEL_END
+	adr_l	x6, KERNEL_END
 	mov	x3, x24				// phys offset
+	sub	x6, x6, x3			// kernel memsize
+	add	x6, x6, x5			// __va(KERNEL_END)
 	create_block_map x0, x7, x3, x5, x6
 
 	/*
@@ -538,8 +536,7 @@ ENDPROC(el2_setup)
  * in x20. See arch/arm64/include/asm/virt.h for more info.
  */
 ENTRY(set_cpu_boot_mode_flag)
-	ldr	x1, =__boot_cpu_mode		// Compute __boot_cpu_mode
-	add	x1, x1, x28
+	adr_l	x1, __boot_cpu_mode		// Compute __boot_cpu_mode
 	cmp	w20, #BOOT_CPU_MODE_EL2
 	b.ne	1f
 	add	x1, x1, #4
@@ -598,7 +595,7 @@ ENTRY(secondary_startup)
 	/*
 	 * Common entry point for secondary CPUs.
 	 */
-	pgtbl	x25, x26, x28			// x25=TTBR0, x26=TTBR1
+	pgtbl	x25, x26			// x25=TTBR0, x26=TTBR1
 	bl	__cpu_setup			// initialise processor
 
 	ldr	x21, =secondary_data
@@ -655,17 +652,14 @@ ENDPROC(__turn_mmu_on)
  * Calculate the start of physical memory.
  */
 __calc_phys_offset:
-	adr	x0, 1f
-	ldp	x1, x2, [x0]
+	adrp	x0, KERNEL_START		// __pa(KERNEL_START)
+	ldr	x1, =KERNEL_RAM_VADDR		// __va(KERNEL_START)
+	mov	x2, PAGE_OFFSET
 	sub	x28, x0, x1			// x28 = PHYS_OFFSET - PAGE_OFFSET
 	add	x24, x2, x28			// x24 = PHYS_OFFSET
 	ret
 ENDPROC(__calc_phys_offset)
 
-	.align 3
-1:	.quad	.
-	.quad	PAGE_OFFSET
-
 /*
  * Exception handling. Something went wrong and we can't proceed. We ought to
  * tell the user, but since we don't have any guarantee that we're even
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 2/3] arm64: add support for relocatable kernel
  2015-03-16 15:23 [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Ard Biesheuvel
  2015-03-16 15:23 ` [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates Ard Biesheuvel
@ 2015-03-16 15:23 ` Ard Biesheuvel
  2015-03-16 15:23 ` [RFC PATCH 3/3] arm64/efi: use relocated kernel Ard Biesheuvel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-16 15:23 UTC (permalink / raw)
  To: linux-arm-kernel

This adds support for runtime relocation of the kernel Image, by
building it as a PIE (ET_DYN) executable and applying the relocations
in the early boot code.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/Kconfig              |  3 +++
 arch/arm64/Makefile             |  4 +++
 arch/arm64/kernel/head.S        | 58 ++++++++++++++++++++++++++++++++++++++++-
 arch/arm64/kernel/image.h       |  8 +++++-
 arch/arm64/kernel/vmlinux.lds.S | 12 +++++++++
 scripts/sortextable.c           |  4 +--
 6 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1b8e97331ffb..cc6504998f2c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -143,6 +143,9 @@ config KERNEL_MODE_NEON
 config FIX_EARLYCON_MEM
 	def_bool y
 
+config RELOCATABLE_KERNEL
+	def_bool y
+
 source "init/Kconfig"
 
 source "kernel/Kconfig.freezer"
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 69ceedc982a5..e3914049c389 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -15,6 +15,10 @@ CPPFLAGS_vmlinux.lds = -DTEXT_OFFSET=$(TEXT_OFFSET)
 OBJCOPYFLAGS	:=-O binary -R .note -R .note.gnu.build-id -R .comment -S
 GZFLAGS		:=-9
 
+ifneq ($(CONFIG_RELOCATABLE_KERNEL),)
+LDFLAGS_vmlinux		+= -pie
+endif
+
 KBUILD_DEFCONFIG := defconfig
 
 KBUILD_CFLAGS	+= -mgeneral-regs-only
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 1ea3cd2aba34..874754794b25 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -239,6 +239,7 @@ section_table:
 
 ENTRY(stext)
 	mov	x21, x0				// x21=FDT
+	mov	x23, x1				// x23=image offset
 	bl	el2_setup			// Drop to EL1, w20=cpu_boot_mode
 	bl	__calc_phys_offset		// x24=PHYS_OFFSET, x28=PHYS_OFFSET-PAGE_OFFSET
 	bl	set_cpu_boot_mode_flag
@@ -249,8 +250,12 @@ ENTRY(stext)
 	 * On return, the CPU will be ready for the MMU to be turned on and
 	 * the TCR will have been set.
 	 */
+#ifndef CONFIG_RELOCATABLE_KERNEL
 	ldr	x27, =__mmap_switched		// address to jump to after
 						// MMU has been enabled
+#else
+	adr	x27, __relocate_kernel
+#endif
 	adrp	lr, __enable_mmu		// return (PIC) address
 	add	lr, lr, #:lo12:__enable_mmu
 	b	 __cpu_setup			// initialise processor
@@ -397,9 +402,10 @@ __create_page_tables:
 	 */
 	mov	x0, x26				// swapper_pg_dir
 	mov	x5, #PAGE_OFFSET
+	add	x5, x5, x23			// __va(KERNEL_START)
 	create_pgd_entry x0, x5, x3, x6
 	adr_l	x6, KERNEL_END
-	mov	x3, x24				// phys offset
+	add	x3, x23, x24			// phys offset + image offset
 	sub	x6, x6, x3			// kernel memsize
 	add	x6, x6, x5			// __va(KERNEL_END)
 	create_block_map x0, x7, x3, x5, x6
@@ -438,6 +444,55 @@ __mmap_switched:
 	b	start_kernel
 ENDPROC(__mmap_switched)
 
+#ifdef CONFIG_RELOCATABLE_KERNEL
+
+#define R_AARCH64_RELATIVE	0x403
+#define R_AARCH64_ABS64		0x101
+
+	/*
+	 * Iterate over each entry in the relocation table, and apply the
+	 * relocations in place.
+	 */
+__relocate_kernel:
+	adr_l	x8, __dynsym_start		// start of symbol table
+	adr_l	x9, __reloc_start		// start of reloc table
+	adr_l	x10, __reloc_end		// end of reloc table
+0:	cmp	x9, x10
+	b.hs	2f
+	ldp	x11, x12, [x9], #24
+	cmp	x12, #R_AARCH64_RELATIVE
+	b.ne	1f
+	ldr	x12, [x9, #-8]
+	add	x12, x12, x23			// relocate
+	str	x12, [x11, x28]
+	b	0b
+
+1:	ubfx	x13, x12, #0, #32
+	cmp	x13, #R_AARCH64_ABS64
+	b.ne	0b
+	lsr	x13, x12, #32			// symbol index
+	ldr	x12, [x9, #-8]
+	add	x13, x13, x13, lsl #1		// x 3
+	add	x13, x8, x13, lsl #3		// x 8
+	ldrsh	w14, [x13, #6]			// Elf64_Sym::st_shndx
+	ldr	x15, [x13, #8]			// Elf64_Sym::st_value
+	cmp	w14, #-0xf			// SHN_ABS (0xfff1) ?
+	add	x14, x15, x23			// relocate
+	csel	x15, x14, x15, ne
+	add	x15, x12, x15
+	str	x15, [x11, x28]
+	b	0b
+
+2:	ldr	x8, =vectors			// reload VBAR_EL1 with
+	msr	vbar_el1, x8			// relocated address
+	isb
+
+	ldr	x9, =__mmap_switched
+	br	x9
+ENDPROC(__relocate_kernel)
+
+#endif
+
 /*
  * end early head section, begin head code that is also used for
  * hotplug and needs to have the same protections as the text region
@@ -657,6 +712,7 @@ __calc_phys_offset:
 	mov	x2, PAGE_OFFSET
 	sub	x28, x0, x1			// x28 = PHYS_OFFSET - PAGE_OFFSET
 	add	x24, x2, x28			// x24 = PHYS_OFFSET
+	sub	x24, x24, x23			// subtract image offset
 	ret
 ENDPROC(__calc_phys_offset)
 
diff --git a/arch/arm64/kernel/image.h b/arch/arm64/kernel/image.h
index 8fae0756e175..8b1e9fa8fc8c 100644
--- a/arch/arm64/kernel/image.h
+++ b/arch/arm64/kernel/image.h
@@ -47,7 +47,13 @@
 #define __HEAD_FLAG_BE	0
 #endif
 
-#define __HEAD_FLAGS	(__HEAD_FLAG_BE << 0)
+#ifdef CONFIG_RELOCATABLE_KERNEL
+#define __HEAD_FLAG_RELOC	1
+#else
+#define __HEAD_FLAG_RELOC	0
+#endif
+
+#define __HEAD_FLAGS	(__HEAD_FLAG_BE << 0) | (__HEAD_FLAG_RELOC << 1)
 
 /*
  * These will output as part of the Image header, which should be little-endian
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index a2c29865c3fe..df706a8c22f1 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -131,6 +131,18 @@ SECTIONS
 
 	PERCPU_SECTION(64)
 
+	.rela : {
+		. = ALIGN(8);
+		__reloc_start = .;
+		*(.rela .rela*)
+		__reloc_end = .;
+	}
+	.dynsym : {
+		. = ALIGN(8);
+		__dynsym_start = .;
+		*(.dynsym)
+	}
+
 	. = ALIGN(PAGE_SIZE);
 	__init_end = .;
 
diff --git a/scripts/sortextable.c b/scripts/sortextable.c
index 1052d4834a44..77fcc1a80011 100644
--- a/scripts/sortextable.c
+++ b/scripts/sortextable.c
@@ -262,9 +262,9 @@ do_file(char const *const fname)
 		break;
 	}  /* end switch */
 	if (memcmp(ELFMAG, ehdr->e_ident, SELFMAG) != 0
-	||  r2(&ehdr->e_type) != ET_EXEC
+	|| (r2(&ehdr->e_type) != ET_EXEC && r2(&ehdr->e_type) != ET_DYN)
 	||  ehdr->e_ident[EI_VERSION] != EV_CURRENT) {
-		fprintf(stderr, "unrecognized ET_EXEC file %s\n", fname);
+		fprintf(stderr, "unrecognized ET_EXEC/ET_DYN file %s\n", fname);
 		fail_file();
 	}
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 3/3] arm64/efi: use relocated kernel
  2015-03-16 15:23 [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Ard Biesheuvel
  2015-03-16 15:23 ` [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates Ard Biesheuvel
  2015-03-16 15:23 ` [RFC PATCH 2/3] arm64: add support for relocatable kernel Ard Biesheuvel
@ 2015-03-16 15:23 ` Ard Biesheuvel
  2015-03-16 16:09 ` [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Mark Rutland
  2015-03-16 23:19 ` Kees Cook
  4 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-16 15:23 UTC (permalink / raw)
  To: linux-arm-kernel

PoC for relocated kernel code. This puts the kernel at the top of the
lowest naturally aligned 1 GB region of memory, and relocates the
kernel so that the relative alignment of physical and virtual memory
is at least 1 GB as well.
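
(With the QEMU layout from the cover letter, dram_base is 0x40000000, so the
allocation ceiling (dram_base | (SZ_1G - 1)) + 1 used below is 0x80000000,
i.e. the top of the first 1 GB region; the Image indeed ends up just below
it, at physical 0x7f400000.)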

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/kernel/efi-entry.S | 4 +++-
 arch/arm64/kernel/efi-stub.c  | 5 +++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
index f78e6a1de825..89ba42191e59 100644
--- a/arch/arm64/kernel/efi-entry.S
+++ b/arch/arm64/kernel/efi-entry.S
@@ -11,6 +11,7 @@
  */
 #include <linux/linkage.h>
 #include <linux/init.h>
+#include <linux/sizes.h>
 
 #include <asm/assembler.h>
 
@@ -110,7 +111,8 @@ ENTRY(efi_stub_entry)
 2:
 	/* Jump to kernel entry point */
 	mov	x0, x20
-	mov	x1, xzr
+	and	x1, x21, #~(SZ_2M - 1)
+	and	x1, x1, #(SZ_1G - 1)
 	mov	x2, xzr
 	mov	x3, xzr
 	br	x21
diff --git a/arch/arm64/kernel/efi-stub.c b/arch/arm64/kernel/efi-stub.c
index 3b67ca4e2f2e..ba95d9d69884 100644
--- a/arch/arm64/kernel/efi-stub.c
+++ b/arch/arm64/kernel/efi-stub.c
@@ -38,8 +38,9 @@ efi_status_t __init handle_kernel_image(efi_system_table_t *sys_table,
 	if (*image_addr != preferred_offset) {
 		const unsigned long alloc_size = kernel_memsize + TEXT_OFFSET;
 
-		status = efi_low_alloc(sys_table, alloc_size, SZ_2M,
-				       reserve_addr);
+		status = efi_high_alloc(sys_table, alloc_size, SZ_2M,
+				       reserve_addr,
+				       (dram_base | (SZ_1G - 1)) + 1);
 
 		/*
 		 * Check whether the new allocation crosses a 512 MB alignment
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 15:23 [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2015-03-16 15:23 ` [RFC PATCH 3/3] arm64/efi: use relocated kernel Ard Biesheuvel
@ 2015-03-16 16:09 ` Mark Rutland
  2015-03-16 16:45   ` Ard Biesheuvel
  2015-03-16 23:19 ` Kees Cook
  4 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-16 16:09 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

I agree that we want to be able to load the kernel anywhere in memory
(modulo alignment restrictions and the like). However, I'm not keen on
the approach taken; I'd rather see the linear mapping split from the
text mapping. More on that below.

On Mon, Mar 16, 2015 at 03:23:40PM +0000, Ard Biesheuvel wrote:
> These patches is a proof of concept of how we could potentially relocate
> the kernel at runtime. This code is rough around the edges, and there are
> a few unresolved issues, hence this RFC.
> 
> With these patches, the kernel can essentially execute at any virtual offset.
> There are a couple of reasons why we would want this:
> - performance: we can align PHYS_OFFSET so that most of the linear mapping can
>   be done using 512 MB or 1 GB blocks (depending on page size), instead of
>   the more granular level that is currently unavoidable if Image cannot be
>   loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
>   Image).

Isn't this gain somewhat offset by having to build the kernel as a PIE?
If we're doing this for performance it would be good to see numbers.

> - security: with these changes, we can put the kernel Image anywhere in physical
>   memory, and we can put the physical memory anywhere in the upper half of the
>   virtual address range (modulo alignment). This gives us quite a number of
>   bits of to play with if we were to randomize the kernel virtual address space.
>   Also, this is entirely under the control of the boot loader, which is probably
>   in better shape to get its hands on some entropy than the early kernel boot
>   code.
> - convenience: fewer constraints when loading the kernel into memory, as it
>   can execute from anywhere.

I agree that making things easier for loaders is for the best.

> How it works:
> - an additional boot argument 'image offset' is passed in x1 by the boot loader,
>   which should contain a value that is at least the offset of Image into physical
>   memory. Higher values are also possible, and may be used to randomize the
>   kernel VA space.

I have a very strong suspicion that bootloaders in the wild don't zero
x1-x3, and given that, we might not have a reliable mechanism for
acquiring the offset.

> - the kernel binary is runtime relocated to PAGE_OFFSET + image offset
> 
> Issues:
> - Since AArch64 uses the ELF RELA format (where the addends are in the
>   relocation table and not in the code), the relocations need to be applied even
>   if the Image runs from the same offset it was linked at. It also means that
>   some values that are produced by the linker (_kernel_size_le, etc) are missing
>   from the binary. This will probably need a fixup step.
> - The module area may be out of range, which needs to be worked around with
>   module PLTs. This is straight forward but I haven't implemented it yet for
>   arm64.
> - The core extable is most likely broken, and would need to be changed to use
>   relative offsets instead of absolute addresses.

This sounds like it's going to be a big headache.

I'd rather see that we decouple the kernel (text/data) mapping from the
linear mapping, with the former given a fixed VA independent of the PA
of the kernel Image (which would still need to be at a 2M-aligned
address + text_offset, and not straddling a 512M boundary).

That would allow us to place the kernel anywhere in memory (modulo those
constraints), enable us to address memory below the kernel when we do
so, and would still allow the kernel to be built with absolute
addressing, which keeps things simple and fast.

That doesn't give us VA randomisation, but that could be built atop by
reserving a larger VA range than necessary for the kernel, and have the
kernel pick a window from within that (assuming we can find some entropy
early on) to relocate itself to. That would also be independent of the
physical layout, which is nice -- we could have randomised VAs even with
a trivial loader that always placed the kernel at the same address
(which is likely to be the common case).

When I looked at this a while back it seemed like the majority of the
changes were fairly mechanical (introducing and using
text_to_phys/phys_to_text and leaving virt_to_x for the linear mapping),
and the big pain points seemed to be the page table init (where we rely
on memory at the end of the kernel mapping) and KVM.
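
Purely for illustration (the names and the window value below are arbitrary,
not an actual proposal, and assume the usual memory.h definitions), such a
split could boil down to something like:

	/* fixed VA window for the kernel text/data mapping */
	#define KIMAGE_VADDR		(PAGE_OFFSET - SZ_128M)	/* example value only */
	extern u64 kimage_phys_base;	/* PA of the Image, recorded by early boot code */

	#define __text_to_phys(x)	(((u64)(x) - KIMAGE_VADDR) + kimage_phys_base)
	#define __phys_to_text(x)	(((u64)(x) - kimage_phys_base) + KIMAGE_VADDR)

	/* the linear mapping keeps the existing conversion */
	#define __virt_to_phys(x)	(((u64)(x) - PAGE_OFFSET) + PHYS_OFFSET)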

Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 16:09 ` [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Mark Rutland
@ 2015-03-16 16:45   ` Ard Biesheuvel
  2015-03-16 17:33     ` Mark Rutland
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-16 16:45 UTC (permalink / raw)
  To: linux-arm-kernel

On 16 March 2015 at 17:09, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi Ard,
>
> I agree that we want to be able to load the kernel anywhwere in memory
> (modulo alignment restrictions and the like). However, I'm not keen on
> the approach taken; I'd rather see the linear mapping split from the
> text mapping. More on that below.
>
> On Mon, Mar 16, 2015 at 03:23:40PM +0000, Ard Biesheuvel wrote:
>> These patches is a proof of concept of how we could potentially relocate
>> the kernel at runtime. This code is rough around the edges, and there are
>> a few unresolved issues, hence this RFC.
>>
>> With these patches, the kernel can essentially execute at any virtual offset.
>> There are a couple of reasons why we would want this:
>> - performance: we can align PHYS_OFFSET so that most of the linear mapping can
>>   be done using 512 MB or 1 GB blocks (depending on page size), instead of
>>   the more granular level that is currently unavoidable if Image cannot be
>>   loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
>>   Image).
>
> Isn't this gain somewhat offset by having to build the kernel as a PIE?

I don't think so. Note that this is not -fpic code; it's just the ld
option that dumps the reloc and dynsym tables into the output image.
The reloc penalty is boot-time only.

> If we're doing this for performance it would be good to see numbers.
>

Ack

>> - security: with these changes, we can put the kernel Image anywhere in physical
>>   memory, and we can put the physical memory anywhere in the upper half of the
>>   virtual address range (modulo alignment). This gives us quite a number of
>>   bits of to play with if we were to randomize the kernel virtual address space.
>>   Also, this is entirely under the control of the boot loader, which is probably
>>   in better shape to get its hands on some entropy than the early kernel boot
>>   code.
>> - convenience: fewer constraints when loading the kernel into memory, as it
>>   can execute from anywhere.
>
> I agree that making things easier for loaders is for the best.
>
>> How it works:
>> - an additional boot argument 'image offset' is passed in x1 by the boot loader,
>>   which should contain a value that is at least the offset of Image into physical
>>   memory. Higher values are also possible, and may be used to randomize the
>>   kernel VA space.
>
> I have a very strong suspicion that bootloaders in the wild don't zero
> x1-x3, and that given that we might not have a reliable mechanism for
> acquiring the offset.
>

OK, sounds about time to start complaining about that then.

>> - the kernel binary is runtime relocated to PAGE_OFFSET + image offset
>>
>> Issues:
>> - Since AArch64 uses the ELF RELA format (where the addends are in the
>>   relocation table and not in the code), the relocations need to be applied even
>>   if the Image runs from the same offset it was linked at. It also means that
>>   some values that are produced by the linker (_kernel_size_le, etc) are missing
>>   from the binary. This will probably need a fixup step.
>> - The module area may be out of range, which needs to be worked around with
>>   module PLTs. This is straight forward but I haven't implemented it yet for
>>   arm64.
>> - The core extable is most likely broken, and would need to be changed to use
>>   relative offsets instead of absolute addresses.
>
> This sounds like it's going to be a big headache.
>

It's all manageable, really. The module PLT thing is something I
already implemented for 32-bit ARM here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/305539.html
(only Russell couldn't be bothered to merge it)

The extable is already relative on x86, and the fixup step is some
straightforward ELF mangling on vmlinux before performing the
objcopy.
But yes, it's rather ugly.
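
For reference, the relative extable on x86 boils down to roughly this (going
from memory, so treat the details as approximate):

	struct exception_table_entry {
		int insn, fixup;	/* offsets relative to the fields themselves */
	};

	static inline unsigned long ex_insn_addr(const struct exception_table_entry *x)
	{
		return (unsigned long)&x->insn + x->insn;
	}

	static inline unsigned long ex_fixup_addr(const struct exception_table_entry *x)
	{
		return (unsigned long)&x->fixup + x->fixup;
	}

Each entry only stores 32-bit offsets from its own address, so the table needs
no relocations at all (and shrinks by half on 64-bit as a bonus).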

> I'd rather see that we decouple the kernel (text/data) mapping from the
> linear mapping, with the former given a fixed VA independent of the PA
> of the kernel Image (which would still need to be at a 2M-aligned
> address + text_offset, and not straddling a 512M boundary).
>

Hmm, that's quite nice, actually. It also fixes the module range
problem, and for VA randomization we could move both regions together.

> That would allow us to place the kernel anywhere in memory (modulo those
> constraints), enable us to address memory below the kernel when we do
> so, and would still allow the kernel to be built with absolute
> addressing, which keeps things simple and fast.
>
> That doesn't give us VA randomisation, but that could be built atop by
> reserving a larger VA range than necessary for the kernel, and have the
> kernel pick a window from within that (assuming we can find some entropy
> early on) to relocate itself to. That would also be independent of the
> physical layout, which is nice -- we could have randomised VAs even with
> a trivial loader that always placed the kernel at the same address
> (which is likely to be the common case).
>

vmlinux.ko ? That would be very cool :-)

> When I looked at this a while back it seemed like the majority of the
> changes were fairly mechanical (introducing and using
> text_to_phys/phys_to_text and leaving virt_to_x for the linear mapping),
> and the big pain points seemed to be the page table init (where we rely
> on memory at the end of the kernel mapping) and KVM.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates
  2015-03-16 15:23 ` [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates Ard Biesheuvel
@ 2015-03-16 17:14   ` Mark Rutland
  2015-03-17  7:01     ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-16 17:14 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On Mon, Mar 16, 2015 at 03:23:41PM +0000, Ard Biesheuvel wrote:
> In preparation of making the kernel relocatable, replace literal
> symbol references with immediate constants. This is necessary, as
> the literals will not be populated with meaningful values until
> after the relocation code has executed.

The majority of these changes look like nice cleanups/simplifications
regardless of whether the kernel is relocatable, so I like most of the
patch.

> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/kernel/efi-entry.S |  2 +-
>  arch/arm64/kernel/head.S      | 36 +++++++++++++++---------------------
>  2 files changed, 16 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
> index 8ce9b0577442..f78e6a1de825 100644
> --- a/arch/arm64/kernel/efi-entry.S
> +++ b/arch/arm64/kernel/efi-entry.S
> @@ -61,7 +61,7 @@ ENTRY(efi_stub_entry)
>  	 */
>  	mov	x20, x0		// DTB address
>  	ldr	x0, [sp, #16]	// relocated _text address
> -	ldr	x21, =stext_offset
> +	movz	x21, #:abs_g0:stext_offset

This looks a little scary. Why do we need to use a movz with a special
relocation rather than a simple mov? As far as I can see mov has
sufficient range.

>  	add	x21, x0, x21
>  
>  	/*
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 9c856f2aa7a5..1ea3cd2aba34 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -46,11 +46,9 @@
>  #error TEXT_OFFSET must be less than 2MB
>  #endif
>  
> -	.macro	pgtbl, ttb0, ttb1, virt_to_phys
> -	ldr	\ttb1, =swapper_pg_dir
> -	ldr	\ttb0, =idmap_pg_dir
> -	add	\ttb1, \ttb1, \virt_to_phys
> -	add	\ttb0, \ttb0, \virt_to_phys
> +	.macro	pgtbl, ttb0, ttb1
> +	adrp	\ttb1, swapper_pg_dir
> +	adrp	\ttb0, idmap_pg_dir
>  	.endm

I reckon we should just kill pgtbl and use adrp inline in both of the
callsites. That way we can also get rid of the comment in each case,
because the assembly itself will document which register gets which
table address.

>  #ifdef CONFIG_ARM64_64K_PAGES
> @@ -63,7 +61,7 @@
>  #define TABLE_SHIFT	PUD_SHIFT
>  #endif
>  
> -#define KERNEL_START	KERNEL_RAM_VADDR
> +#define KERNEL_START	_text

We can kill the KERNEL_RAM_VADDR definition too, I believe. It's only
used here.

>  #define KERNEL_END	_end
>  
>  /*
> @@ -322,7 +320,7 @@ ENDPROC(stext)
>   *     been enabled
>   */
>  __create_page_tables:
> -	pgtbl	x25, x26, x28			// idmap_pg_dir and swapper_pg_dir addresses
> +	pgtbl	x25, x26			// idmap_pg_dir and swapper_pg_dir addresses

Here we could just have:

	adrp	x25, idmap_pg_dir
	adrp	x26, swapper_pg_dir

>  	mov	x27, lr
>  
>  	/*
> @@ -351,8 +349,7 @@ __create_page_tables:
>  	 * Create the identity mapping.
>  	 */
>  	mov	x0, x25				// idmap_pg_dir
> -	ldr	x3, =KERNEL_START
> -	add	x3, x3, x28			// __pa(KERNEL_START)
> +	adr_l	x3, KERNEL_START		// __pa(KERNEL_START)
>  
>  #ifndef CONFIG_ARM64_VA_BITS_48
>  #define EXTRA_SHIFT	(PGDIR_SHIFT + PAGE_SHIFT - 3)
> @@ -391,9 +388,8 @@ __create_page_tables:
>  #endif
>  
>  	create_pgd_entry x0, x3, x5, x6
> -	ldr	x6, =KERNEL_END
>  	mov	x5, x3				// __pa(KERNEL_START)
> -	add	x6, x6, x28			// __pa(KERNEL_END)
> +	adr_l	x6, KERNEL_END			// __pa(KERNEL_END)
>  	create_block_map x0, x7, x3, x5, x6
>  
>  	/*
> @@ -402,8 +398,10 @@ __create_page_tables:
>  	mov	x0, x26				// swapper_pg_dir
>  	mov	x5, #PAGE_OFFSET
>  	create_pgd_entry x0, x5, x3, x6
> -	ldr	x6, =KERNEL_END
> +	adr_l	x6, KERNEL_END
>  	mov	x3, x24				// phys offset
> +	sub	x6, x6, x3			// kernel memsize
> +	add	x6, x6, x5			// __va(KERNEL_END)

I'm not sure on this; it makes it consistent with the rest w.r.t. using
adr* but it's now a little harder to read :/

>  	create_block_map x0, x7, x3, x5, x6
>  
>  	/*
> @@ -538,8 +536,7 @@ ENDPROC(el2_setup)
>   * in x20. See arch/arm64/include/asm/virt.h for more info.
>   */
>  ENTRY(set_cpu_boot_mode_flag)
> -	ldr	x1, =__boot_cpu_mode		// Compute __boot_cpu_mode
> -	add	x1, x1, x28
> +	adr_l	x1, __boot_cpu_mode		// Compute __boot_cpu_mode

The comment seems redundant now (and arguably was before). I think it's
a distraction we can drop.

>  	cmp	w20, #BOOT_CPU_MODE_EL2
>  	b.ne	1f
>  	add	x1, x1, #4
> @@ -598,7 +595,7 @@ ENTRY(secondary_startup)
>  	/*
>  	 * Common entry point for secondary CPUs.
>  	 */
> -	pgtbl	x25, x26, x28			// x25=TTBR0, x26=TTBR1
> +	pgtbl	x25, x26			// x25=TTBR0, x26=TTBR1

As mentioned above, I think this is clearer as:

	adrp	x25, idmap_pg_dir
	adrp	x26, swapper_pg_dir

>  	bl	__cpu_setup			// initialise processor
>  
>  	ldr	x21, =secondary_data
> @@ -655,17 +652,14 @@ ENDPROC(__turn_mmu_on)
>   * Calculate the start of physical memory.
>   */
>  __calc_phys_offset:
> -	adr	x0, 1f
> -	ldp	x1, x2, [x0]
> +	adrp	x0, KERNEL_START		// __pa(KERNEL_START)
> +	ldr	x1, =KERNEL_RAM_VADDR		// __va(KERNEL_START)
> +	mov	x2, PAGE_OFFSET

Missing '#' on the PAGE_OFFSET immediate.

Thanks,
Mark.

>  	sub	x28, x0, x1			// x28 = PHYS_OFFSET - PAGE_OFFSET
>  	add	x24, x2, x28			// x24 = PHYS_OFFSET
>  	ret
>  ENDPROC(__calc_phys_offset)
>  
> -	.align 3
> -1:	.quad	.
> -	.quad	PAGE_OFFSET
> -
>  /*
>   * Exception handling. Something went wrong and we can't proceed. We ought to
>   * tell the user, but since we don't have any guarantee that we're even
> -- 
> 1.8.3.2
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 16:45   ` Ard Biesheuvel
@ 2015-03-16 17:33     ` Mark Rutland
  2015-03-16 17:43       ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-16 17:33 UTC (permalink / raw)
  To: linux-arm-kernel

> >> - performance: we can align PHYS_OFFSET so that most of the linear mapping can
> >>   be done using 512 MB or 1 GB blocks (depending on page size), instead of
> >>   the more granular level that is currently unavoidable if Image cannot be
> >>   loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
> >>   Image).
> >
> > Isn't this gain somewhat offset by having to build the kernel as a PIE?
> 
> I don't think so. Note that this is not -fpic code, it's just the ld
> option that dumps the reloc and dynsym tables into the output image.
> The reloc penalty is boottime only.

Ah, ok.

> > I have a very strong suspicion that bootloaders in the wild don't zero
> > x1-x3, and that given that we might not have a reliable mechanism for
> > acquiring the offset.
> >
> 
> OK, sounds about time to start complaining about that then.

I guess so.

> >> Issues:
> >> - Since AArch64 uses the ELF RELA format (where the addends are in the
> >>   relocation table and not in the code), the relocations need to be applied even
> >>   if the Image runs from the same offset it was linked at. It also means that
> >>   some values that are produced by the linker (_kernel_size_le, etc) are missing
> >>   from the binary. This will probably need a fixup step.
> >> - The module area may be out of range, which needs to be worked around with
> >>   module PLTs. This is straight forward but I haven't implemented it yet for
> >>   arm64.
> >> - The core extable is most likely broken, and would need to be changed to use
> >>   relative offsets instead of absolute addresses.
> >
> > This sounds like it's going to be a big headache.
> >
> 
> It's all manageable, really. The module PLT thing is something I
> already implemented for 32-bit ARM here:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/305539.html
> (only Russell couldn't be bothered to merge it)
> 
> The extable is already relative on x86, and the fixup step is some
> straight forward ELF mangling on vmlinux before performing the
> objcopy.
> But yes, it's rather ugly.

Hmm. I'd be rather worried about the fixup step; I suspect that'll be
fragile and rarely tested. Perhaps we could verify them at boot time?

> > I'd rather see that we decouple the kernel (text/data) mapping from the
> > linear mapping, with the former given a fixed VA independent of the PA
> > of the kernel Image (which would still need to be at a 2M-aligned
> > address + text_offset, and not straddling a 512M boundary).
> >
> 
> Hmm, that's quite nice, actually, It also fixes the module range
> problem, and for VA randomization we could move both regions together.

Ah, good point. I hadn't considered modules all that much, but it sounds
like it could work.

Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 17:33     ` Mark Rutland
@ 2015-03-16 17:43       ` Ard Biesheuvel
  2015-03-17 16:20         ` Mark Rutland
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-16 17:43 UTC (permalink / raw)
  To: linux-arm-kernel


On 16 mrt. 2015, at 18:33, Mark Rutland <mark.rutland@arm.com> wrote:

>>>> - performance: we can align PHYS_OFFSET so that most of the linear mapping can
>>>>  be done using 512 MB or 1 GB blocks (depending on page size), instead of
>>>>  the more granular level that is currently unavoidable if Image cannot be
>>>>  loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
>>>>  Image).
>>> 
>>> Isn't this gain somewhat offset by having to build the kernel as a PIE?
>> 
>> I don't think so. Note that this is not -fpic code, it's just the ld
>> option that dumps the reloc and dynsym tables into the output image.
>> The reloc penalty is boottime only.
> 
> Ah, ok.
> 
>>> I have a very strong suspicion that bootloaders in the wild don't zero
>>> x1-x3, and that given that we might not have a reliable mechanism for
>>> acquiring the offset.
>> 
>> OK, sounds about time to start complaining about that then.
> 
> I guess so.
> 
>>>> Issues:
>>>> - Since AArch64 uses the ELF RELA format (where the addends are in the
>>>>  relocation table and not in the code), the relocations need to be applied even
>>>>  if the Image runs from the same offset it was linked at. It also means that
>>>>  some values that are produced by the linker (_kernel_size_le, etc) are missing
>>>>  from the binary. This will probably need a fixup step.
>>>> - The module area may be out of range, which needs to be worked around with
>>>>  module PLTs. This is straight forward but I haven't implemented it yet for
>>>>  arm64.
>>>> - The core extable is most likely broken, and would need to be changed to use
>>>>  relative offsets instead of absolute addresses.
>>> 
>>> This sounds like it's going to be a big headache.
>> 
>> It's all manageable, really. The module PLT thing is something I
>> already implemented for 32-bit ARM here:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/305539.html
>> (only Russell couldn't be bothered to merge it)
>> 
>> The extable is already relative on x86, and the fixup step is some
>> straight forward ELF mangling on vmlinux before performing the
>> objcopy.
>> But yes, it's rather ugly.
> 
> Hmm. I'd be rather worried about the fixup step; I suspect that'll be
> fragile and rarely tested. Perhaps we could verify them at boot time?

It would essentially be __relocate_kernel() but executed at build time against vmlinux with a reloc offset of #0

> 
>>> I'd rather see that we decouple the kernel (text/data) mapping from the
>>> linear mapping, with the former given a fixed VA independent of the PA
>>> of the kernel Image (which would still need to be at a 2M-aligned
>>> address + text_offset, and not straddling a 512M boundary).
>> 
>> Hmm, that's quite nice, actually, It also fixes the module range
>> problem, and for VA randomization we could move both regions together.
> 
> Ah, good point. I hadn't consdiered modules all that much, but it sounds
> like it could work.
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 15:23 [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2015-03-16 16:09 ` [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Mark Rutland
@ 2015-03-16 23:19 ` Kees Cook
  2015-03-17  7:38   ` Ard Biesheuvel
  4 siblings, 1 reply; 42+ messages in thread
From: Kees Cook @ 2015-03-16 23:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 16, 2015 at 8:23 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> These patches is a proof of concept of how we could potentially relocate
> the kernel at runtime. This code is rough around the edges, and there are
> a few unresolved issues, hence this RFC.
>
> With these patches, the kernel can essentially execute at any virtual offset.
> There are a couple of reasons why we would want this:
> - performance: we can align PHYS_OFFSET so that most of the linear mapping can
>   be done using 512 MB or 1 GB blocks (depending on page size), instead of
>   the more granular level that is currently unavoidable if Image cannot be
>   loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
>   Image).
> - security: with these changes, we can put the kernel Image anywhere in physical
>   memory, and we can put the physical memory anywhere in the upper half of the
>   virtual address range (modulo alignment). This gives us quite a number of
>   bits of to play with if we were to randomize the kernel virtual address space.
>   Also, this is entirely under the control of the boot loader, which is probably
>   in better shape to get its hands on some entropy than the early kernel boot
>   code.

This is great! Thank you for working on this. ARM kernel text ASLR has
been on my TODO list for a while now, but I kept looking at the
missing relocation support and then would start crying. :) (Do you
have any plans for 32-bit ARM relocation support?)

Some random brain-dump from my experiences with kASLR on x86:

I don't want to exclusively depend on the bootloader for kASLR. There
is, I think, already a way to pass entropy into the kernel from the
bootloader, so perhaps that could be used on top of chipset-specific
entropy sources? I would like to try to keep the way kASLR gets
enabled/used/whatever at least compatible with x86 (which probably
means adding some additional knobs to x86 to gain what ARM will have).

Possibly related to running with/detecting an offset, we need a way to
communicate that kASLR is active through the compressed kernel to the
uncompressed kernel. x86 is going to be using x86's setup_data, but we
may need to generalize this. (The reasoning here is that when kASLR is
disabled at runtime, we should turn off other kernel ASLR, like module
offset ASLR, without duplicating kernel command line parameter parsing
-- which is what x86 is doing now.) Just examining the
offset isn't sufficient because perhaps we got randomized to offset 0.
:)

For note, here's kernel module offset ASLR for ARM:
https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/commit/?h=kaslr/arm&id=773234f719a5cccb662840b90f41fad2930b2460

You mention the linear mappings in "performance", which I worry may be
at odds with kASLR? Can large mappings still be used even when we've
got smaller alignment goals? Since you mention the "upper half of the
virtual address range", I assume ARM is built using the -2GB
addressing as used by x86, is that right? So it sounds like it would
be similar entropy to x86.

We should keep in mind the idea of splitting physical ASLR from
virtual ASLR. The virtual addresses remain bound to the -2GB
addressing, but the kernel could be anywhere in physical RAM when
there is >4GB available. This is currently being worked on for x86.

> - convenience: fewer constraints when loading the kernel into memory, as it
>   can execute from anywhere.
>
> How it works:
> - an additional boot argument 'image offset' is passed in x1 by the boot loader,
>   which should contain a value that is at least the offset of Image into physical
>   memory. Higher values are also possible, and may be used to randomize the
>   kernel VA space.
> - the kernel binary is runtime relocated to PAGE_OFFSET + image offset
>
> Issues:
> - Since AArch64 uses the ELF RELA format (where the addends are in the
>   relocation table and not in the code), the relocations need to be applied even
>   if the Image runs from the same offset it was linked at. It also means that
>   some values that are produced by the linker (_kernel_size_le, etc) are missing
>   from the binary. This will probably need a fixup step.

I had no end of trouble with linkers changing their behavior. :) Have
you looked at the arch/x86/tools/relocs* tool, or the reloc mechanism in
arch/x86/boot/compressed/misc.c handle_relocations()? It looks like
you're using real ELF relocations in patch 2, which seems more
sensible than what x86 is doing (reinventing relocation tables). I'm
not sure why it's done that way, though.

> - The module area may be out of range, which needs to be worked around with
>   module PLTs. This is straight forward but I haven't implemented it yet for
>   arm64.
> - The core extable is most likely broken, and would need to be changed to use
>   relative offsets instead of absolute addresses.
> - Probably lots and lots of other roadblocks, hence this RFC
>
> Output from QEMU/efi with 2 GB of memory:
>
> Base of RAM:
>   0x000040000000-0x00004000ffff [Loader Data        |   |  |  |  |   |WB|WT|WC|UC]
>
> Physical location of Image:
>   0x00007f400000-0x00007fe6ffff [Loader Data        |   |  |  |  |   |WB|WT|WC|UC]
>
> Virtual kernel memory layout:
>     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
>     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
>               0xffffffbdc1000000 - 0xffffffbdc3000000   (    32 MB actual)
>     fixed   : 0xffffffbffa9fe000 - 0xffffffbffac00000   (  2056 KB)
>     PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000   (    16 MB)
>     modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
>     memory  : 0xffffffc000000000 - 0xffffffc080000000   (  2048 MB)
>       .init : 0xffffffc03fb82000 - 0xffffffc03fdbe000   (  2288 KB)
>       .text : 0xffffffc03f480000 - 0xffffffc03fb81344   (  7173 KB)
>       .data : 0xffffffc03fdc2000 - 0xffffffc03fe28c00   (   411 KB)
>
>
> Ard Biesheuvel (3):
>   arm64: head.S: replace early literals with constant immediates
>   arm64: add support for relocatable kernel
>   arm64/efi: use relocated kernel
>
>  arch/arm64/Kconfig              |  3 ++
>  arch/arm64/Makefile             |  4 ++
>  arch/arm64/kernel/efi-entry.S   |  6 ++-
>  arch/arm64/kernel/efi-stub.c    |  5 ++-
>  arch/arm64/kernel/head.S        | 94 +++++++++++++++++++++++++++++++----------
>  arch/arm64/kernel/image.h       |  8 +++-
>  arch/arm64/kernel/vmlinux.lds.S | 12 ++++++
>  scripts/sortextable.c           |  4 +-
>  8 files changed, 107 insertions(+), 29 deletions(-)
>
> --
> 1.8.3.2
>

I'll try to get this series into a state I can test on my hardware.
Thanks again!

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates
  2015-03-16 17:14   ` Mark Rutland
@ 2015-03-17  7:01     ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-17  7:01 UTC (permalink / raw)
  To: linux-arm-kernel

On 16 March 2015 at 18:14, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi Ard,
>
> On Mon, Mar 16, 2015 at 03:23:41PM +0000, Ard Biesheuvel wrote:
>> In preparation of making the kernel relocatable, replace literal
>> symbol references with immediate constants. This is necessary, as
>> the literals will not be populated with meaningful values until
>> after the relocation code has executed.
>
> The majority of these changes look like nice cleanups/simplifications
> regardless of whether the kernel is relocatable, so I like most of the
> patch.
>

OK. I can clean it up and add it to the head.S spring cleaning series.

>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>  arch/arm64/kernel/efi-entry.S |  2 +-
>>  arch/arm64/kernel/head.S      | 36 +++++++++++++++---------------------
>>  2 files changed, 16 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
>> index 8ce9b0577442..f78e6a1de825 100644
>> --- a/arch/arm64/kernel/efi-entry.S
>> +++ b/arch/arm64/kernel/efi-entry.S
>> @@ -61,7 +61,7 @@ ENTRY(efi_stub_entry)
>>        */
>>       mov     x20, x0         // DTB address
>>       ldr     x0, [sp, #16]   // relocated _text address
>> -     ldr     x21, =stext_offset
>> +     movz    x21, #:abs_g0:stext_offset
>
> This looks a little scary. Why do we need to use a movz with a special
> relocation rather than a simple mov? As far as I can see mov has
> sufficient range.
>

stext_offset is an external symbol supplied by the linker, so you need
a relocation one way or the other, and
plain mov doesn't have those.

>>       add     x21, x0, x21
>>
>>       /*
>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>> index 9c856f2aa7a5..1ea3cd2aba34 100644
>> --- a/arch/arm64/kernel/head.S
>> +++ b/arch/arm64/kernel/head.S
>> @@ -46,11 +46,9 @@
>>  #error TEXT_OFFSET must be less than 2MB
>>  #endif
>>
>> -     .macro  pgtbl, ttb0, ttb1, virt_to_phys
>> -     ldr     \ttb1, =swapper_pg_dir
>> -     ldr     \ttb0, =idmap_pg_dir
>> -     add     \ttb1, \ttb1, \virt_to_phys
>> -     add     \ttb0, \ttb0, \virt_to_phys
>> +     .macro  pgtbl, ttb0, ttb1
>> +     adrp    \ttb1, swapper_pg_dir
>> +     adrp    \ttb0, idmap_pg_dir
>>       .endm
>
> I reckon we should just kill pgtbl and use adrp inline in both of the
> callsites. That way we can also get rid of the comment in each case,
> beacuse the assembly itself will document which register gets which
> table address.
>

Agreed.

>>  #ifdef CONFIG_ARM64_64K_PAGES
>> @@ -63,7 +61,7 @@
>>  #define TABLE_SHIFT  PUD_SHIFT
>>  #endif
>>
>> -#define KERNEL_START KERNEL_RAM_VADDR
>> +#define KERNEL_START _text
>
> We can kill the KERNEL_RAM_VADDR definition too, I believe. It's only
> used here.
>

I am using it in __calc_phys_offset, but there it is probably better
to open-code it as (PAGE_OFFSET + TEXT_OFFSET) for clarity.

>>  #define KERNEL_END   _end
>>
>>  /*
>> @@ -322,7 +320,7 @@ ENDPROC(stext)
>>   *     been enabled
>>   */
>>  __create_page_tables:
>> -     pgtbl   x25, x26, x28                   // idmap_pg_dir and swapper_pg_dir addresses
>> +     pgtbl   x25, x26                        // idmap_pg_dir and swapper_pg_dir addresses
>
> Here we could just have:
>
>         adrp    x25, idmap_pg_dir
>         adrp    x26, swapper_pg_dir
>
>>       mov     x27, lr
>>
>>       /*
>> @@ -351,8 +349,7 @@ __create_page_tables:
>>        * Create the identity mapping.
>>        */
>>       mov     x0, x25                         // idmap_pg_dir
>> -     ldr     x3, =KERNEL_START
>> -     add     x3, x3, x28                     // __pa(KERNEL_START)
>> +     adr_l   x3, KERNEL_START                // __pa(KERNEL_START)
>>
>>  #ifndef CONFIG_ARM64_VA_BITS_48
>>  #define EXTRA_SHIFT  (PGDIR_SHIFT + PAGE_SHIFT - 3)
>> @@ -391,9 +388,8 @@ __create_page_tables:
>>  #endif
>>
>>       create_pgd_entry x0, x3, x5, x6
>> -     ldr     x6, =KERNEL_END
>>       mov     x5, x3                          // __pa(KERNEL_START)
>> -     add     x6, x6, x28                     // __pa(KERNEL_END)
>> +     adr_l   x6, KERNEL_END                  // __pa(KERNEL_END)
>>       create_block_map x0, x7, x3, x5, x6
>>
>>       /*
>> @@ -402,8 +398,10 @@ __create_page_tables:
>>       mov     x0, x26                         // swapper_pg_dir
>>       mov     x5, #PAGE_OFFSET
>>       create_pgd_entry x0, x5, x3, x6
>> -     ldr     x6, =KERNEL_END
>> +     adr_l   x6, KERNEL_END
>>       mov     x3, x24                         // phys offset
>> +     sub     x6, x6, x3                      // kernel memsize
>> +     add     x6, x6, x5                      // __va(KERNEL_END)
>
> I'm not sure on this; it makes it consistent with the rest w.r.t. using
> adr* but it's now a little harder to read :/
>

Yes, but 'ldr x6, =XX' is going to return 0 until the relocation code
has executed.

>>       create_block_map x0, x7, x3, x5, x6
>>
>>       /*
>> @@ -538,8 +536,7 @@ ENDPROC(el2_setup)
>>   * in x20. See arch/arm64/include/asm/virt.h for more info.
>>   */
>>  ENTRY(set_cpu_boot_mode_flag)
>> -     ldr     x1, =__boot_cpu_mode            // Compute __boot_cpu_mode
>> -     add     x1, x1, x28
>> +     adr_l   x1, __boot_cpu_mode             // Compute __boot_cpu_mode
>
> The comment seems redundant now (and arguably was before). I think it's
> a distraction we can drop.
>

ok

>>       cmp     w20, #BOOT_CPU_MODE_EL2
>>       b.ne    1f
>>       add     x1, x1, #4
>> @@ -598,7 +595,7 @@ ENTRY(secondary_startup)
>>       /*
>>        * Common entry point for secondary CPUs.
>>        */
>> -     pgtbl   x25, x26, x28                   // x25=TTBR0, x26=TTBR1
>> +     pgtbl   x25, x26                        // x25=TTBR0, x26=TTBR1
>
> As mentioned above, I think this is clearer as:
>
>         adrp    x25, idmap_pg_dir
>         adrp    x26, swapper_pg_dir
>
>>       bl      __cpu_setup                     // initialise processor
>>
>>       ldr     x21, =secondary_data
>> @@ -655,17 +652,14 @@ ENDPROC(__turn_mmu_on)
>>   * Calculate the start of physical memory.
>>   */
>>  __calc_phys_offset:
>> -     adr     x0, 1f
>> -     ldp     x1, x2, [x0]
>> +     adrp    x0, KERNEL_START                // __pa(KERNEL_START)
>> +     ldr     x1, =KERNEL_RAM_VADDR           // __va(KERNEL_START)
>> +     mov     x2, PAGE_OFFSET
>
> Missing '#' on the PAGE_OFFSET immediate.
>

yep


>>       sub     x28, x0, x1                     // x28 = PHYS_OFFSET - PAGE_OFFSET
>>       add     x24, x2, x28                    // x24 = PHYS_OFFSET
>>       ret
>>  ENDPROC(__calc_phys_offset)
>>
>> -     .align 3
>> -1:   .quad   .
>> -     .quad   PAGE_OFFSET
>> -
>>  /*
>>   * Exception handling. Something went wrong and we can't proceed. We ought to
>>   * tell the user, but since we don't have any guarantee that we're even
>> --
>> 1.8.3.2
>>
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 23:19 ` Kees Cook
@ 2015-03-17  7:38   ` Ard Biesheuvel
  2015-03-17 16:35     ` Mark Rutland
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-17  7:38 UTC (permalink / raw)
  To: linux-arm-kernel

On 17 March 2015 at 00:19, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Mar 16, 2015 at 8:23 AM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> These patches is a proof of concept of how we could potentially relocate
>> the kernel at runtime. This code is rough around the edges, and there are
>> a few unresolved issues, hence this RFC.
>>
>> With these patches, the kernel can essentially execute at any virtual offset.
>> There are a couple of reasons why we would want this:
>> - performance: we can align PHYS_OFFSET so that most of the linear mapping can
>>   be done using 512 MB or 1 GB blocks (depending on page size), instead of
>>   the more granular level that is currently unavoidable if Image cannot be
>>   loaded at base of RAM (since PHYS_OFFSET is tied to the start of the kernel
>>   Image).
>> - security: with these changes, we can put the kernel Image anywhere in physical
>>   memory, and we can put the physical memory anywhere in the upper half of the
>>   virtual address range (modulo alignment). This gives us quite a number of
>>   bits of to play with if we were to randomize the kernel virtual address space.
>>   Also, this is entirely under the control of the boot loader, which is probably
>>   in better shape to get its hands on some entropy than the early kernel boot
>>   code.
>
> This is great! Thank you for working on this. ARM kernel text ASLR has
> been on my TODO list for a while now, but I kept looking at the
> missing relocation support and then would start crying. :) (Do you
> have any plans for 32-bit ARM relocation support?)
>

Well, to be honest, I have no real plans to devote substantial
time to this; it is really just a rainy Sunday afternoon hack.
But I will check internally (Joakim?) whether this is on anyone's radar at Linaro.

> Some random brain-dump from my experiences with kASLR on x86:
>
> I don't want to exclusively depend on the bootloader for kASLR. There
> is, I think, already a way to pass entropy into the kernel from the
> bootloader, so perhaps that could be used on top of chipset-specific
> entropy sources? I would like to try to keep the way kASLR gets
> enabled/used/whatever at least compatible with x86 (which probably
> means adding some additional knobs to x86 to gain what ARM will have).
>

Yes, that makes sense.

> Possibly related to running with/detecting an offset, we need a way to
> communicate that kASLR is active through the compressed kernel to
> uncompressed kernel. x86 is going to be using x86's setup_data, but we
> may need to generalize this. (The reasoning here is that when kaslr is
> disabled at runtime, we should turn off other kernel ASLR, like module
> offset ASLR, without duplicating kernel command line parameter parsing
> -- which is what x86 is currently doing now.) Just examining the
> offset isn't sufficient because perhaps we got randomized to offset 0.
> :)
>

There is no decompressor on arm64, just the core kernel Image. So if
an offset needs to be chosen before branching into the kernel proper,
it needs to be the bootloader that chooses it.

> For note, here's kernel module offset ASLR for ARM:
> https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/commit/?h=kaslr/arm&id=773234f719a5cccb662840b90f41fad2930b2460
>

Interesting. I have a patch for module PLTs (which Russell hasn't
merged yet) that allows modules to reside anywhere in the vmalloc
space. It was mainly intended to let large kernels/modules work around
the limited range of the branch instructions, but it makes sense for
ASLR as well.
http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/305539.html
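
(For the curious: the idea is simply that an out-of-range bl/b gets
redirected through a tiny veneer allocated alongside the module, which
materialises the full 64-bit target address and branches through a
register. A rough sketch of the shape of such an entry -- illustrative
only, not the code from that series, and assuming the usual kernel
types:)

struct plt_entry {
	u32 mov0;	/* movz x16, #target[15:0]           */
	u32 mov1;	/* movk x16, #target[31:16], lsl #16 */
	u32 mov2;	/* movk x16, #target[47:32], lsl #32 */
	u32 mov3;	/* movk x16, #target[63:48], lsl #48 */
	u32 br;		/* br   x16                          */
};

Since x16 (IP0) is an intra-procedure-call scratch register in the
AAPCS64, clobbering it in such a veneer is fine.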

> You mention the linear mappings in "performance", which I worry may be
> at odds with kASLR? Can large mappings still be used even when we've
> got smaller alignment goals? Since you mention the "upper half of the
> virtual address range", I assume ARM is built using the -2GB
> addressing as used by x86, is that right? So it sounds like it would
> be similar entropy to x86.
>

I haven't quantified the performance gain, but it is arguably more
efficient to map RAM using 1 GB blocks than using 2 MB sections.
On the other part of the question, I really need to do more research
on what x86 implements in the first place before even trying to answer
it.

> We should keep in mind the idea of splitting physical ASLR from
> virtual ASLR. The virtual addresses remain bound to the -2GB
> addressing, but the kernel could be anywhere in physical RAM when
> there is >4GB available. This is currently being worked on for x86.
>

On arm64, the only issue here (again) is that there is no decompressor,
so it is up to the bootloader or the very early boot code to move the
image around, and getting any randomness in that case is problematic.

>> - convenience: fewer constraints when loading the kernel into memory, as it
>>   can execute from anywhere.
>>
>> How it works:
>> - an additional boot argument 'image offset' is passed in x1 by the boot loader,
>>   which should contain a value that is at least the offset of Image into physical
>>   memory. Higher values are also possible, and may be used to randomize the
>>   kernel VA space.
>> - the kernel binary is runtime relocated to PAGE_OFFSET + image offset
>>
>> Issues:
>> - Since AArch64 uses the ELF RELA format (where the addends are in the
>>   relocation table and not in the code), the relocations need to be applied even
>>   if the Image runs from the same offset it was linked at. It also means that
>>   some values that are produced by the linker (_kernel_size_le, etc) are missing
>>   from the binary. This will probably need a fixup step.
>
> I had no end of troubles with linkers changing their behavior. :) Have
> you looked at arch/x86/tools/relocs* tool, or the reloc mechanisms in
> arch/x86/boot/compressed/misc.c handle_relocations()? It looks like
> you're using real ELF relocations in patch 2, which seems more
> sensible than what x86 is doing (reinventing relocation tables). I'm
> not sure why it's done that way, though.
>

As I said, I need to educate myself a bit before attempting to rework
this into a proper kaslr implementation.

>> - The module area may be out of range, which needs to be worked around with
>>   module PLTs. This is straight forward but I haven't implemented it yet for
>>   arm64.
>> - The core extable is most likely broken, and would need to be changed to use
>>   relative offsets instead of absolute addresses.
>> - Probably lots and lots of other roadblocks, hence this RFC
>>
>> Output from QEMU/efi with 2 GB of memory:
>>
>> Base of RAM:
>>   0x000040000000-0x00004000ffff [Loader Data        |   |  |  |  |   |WB|WT|WC|UC]
>>
>> Physical location of Image:
>>   0x00007f400000-0x00007fe6ffff [Loader Data        |   |  |  |  |   |WB|WT|WC|UC]
>>
>> Virtual kernel memory layout:
>>     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
>>     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
>>               0xffffffbdc1000000 - 0xffffffbdc3000000   (    32 MB actual)
>>     fixed   : 0xffffffbffa9fe000 - 0xffffffbffac00000   (  2056 KB)
>>     PCI I/O : 0xffffffbffae00000 - 0xffffffbffbe00000   (    16 MB)
>>     modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
>>     memory  : 0xffffffc000000000 - 0xffffffc080000000   (  2048 MB)
>>       .init : 0xffffffc03fb82000 - 0xffffffc03fdbe000   (  2288 KB)
>>       .text : 0xffffffc03f480000 - 0xffffffc03fb81344   (  7173 KB)
>>       .data : 0xffffffc03fdc2000 - 0xffffffc03fe28c00   (   411 KB)
>>
>>
>> Ard Biesheuvel (3):
>>   arm64: head.S: replace early literals with constant immediates
>>   arm64: add support for relocatable kernel
>>   arm64/efi: use relocated kernel
>>
>>  arch/arm64/Kconfig              |  3 ++
>>  arch/arm64/Makefile             |  4 ++
>>  arch/arm64/kernel/efi-entry.S   |  6 ++-
>>  arch/arm64/kernel/efi-stub.c    |  5 ++-
>>  arch/arm64/kernel/head.S        | 94 +++++++++++++++++++++++++++++++----------
>>  arch/arm64/kernel/image.h       |  8 +++-
>>  arch/arm64/kernel/vmlinux.lds.S | 12 ++++++
>>  scripts/sortextable.c           |  4 +-
>>  8 files changed, 107 insertions(+), 29 deletions(-)
>>
>> --
>> 1.8.3.2
>>
>
> I'll try to get this series into a state I can test on my hardware.
> Thanks again!
>

Thanks Kees.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-16 17:43       ` Ard Biesheuvel
@ 2015-03-17 16:20         ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2015-03-17 16:20 UTC (permalink / raw)
  To: linux-arm-kernel

> > Hmm. I'd be rather worried about the fixup step; I suspect that'll be
> > fragile and rarely tested. Perhaps we could verify them at boot time?
> 
> It would essentially be __relocate_kernel() but executed at build time against vmlinux with a reloc offset of #0

Sure, the logic is simple enough.
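
(For the boot-time verification idea, something like the sketch below
would probably do -- purely illustrative, assuming the RELA table is
kept in the Image and bounded by linker-provided symbols; these names
are made up, not code from the series:)

extern Elf64_Rela __rela_start[], __rela_end[];

static void __init verify_relocations(u64 offset)
{
	Elf64_Rela *rela;

	for (rela = __rela_start; rela < __rela_end; rela++) {
		u64 *place = (u64 *)(rela->r_offset + offset);

		if (ELF64_R_TYPE(rela->r_info) != R_AARCH64_RELATIVE)
			continue;
		/* after relocation, each slot should hold addend + offset */
		WARN_ONCE(*place != rela->r_addend + offset,
			  "stale relocation at %p\n", place);
	}
}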

What I'm worried about is something changing in the build system
somewhere such that we accidentally miss the fixup in some cases,
apply it to the wrong field, or something like that.

At that stage there's very little sanity checking that we can do a la
BUILD_BUG_ON, ASSERT, etc.

Which isn't to say that we can't do that, just that I feel a little
uneasy about it.

Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-17  7:38   ` Ard Biesheuvel
@ 2015-03-17 16:35     ` Mark Rutland
  2015-03-17 16:40       ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-17 16:35 UTC (permalink / raw)
  To: linux-arm-kernel

> > Possibly related to running with/detecting an offset, we need a way to
> > communicate that kASLR is active through the compressed kernel to
> > uncompressed kernel. x86 is going to be using x86's setup_data, but we
> > may need to generalize this. (The reasoning here is that when kaslr is
> > disabled at runtime, we should turn off other kernel ASLR, like module
> > offset ASLR, without duplicating kernel command line parameter parsing
> > -- which is what x86 is currently doing now.) Just examining the
> > offset isn't sufficient because perhaps we got randomized to offset 0.
> > :)
> >
> 
> There is no decompressor on arm64, just the core kernel Image. So if
> an offset needs to be chosen before branching into the kernel proper,
> it needs to be the bootloader that chooses it.

Agreed.

Our equivalent to setup_data is the DT /chosen node, and I don't think
we want to try parsing that before we've turned on the MMU.

However, for the UEFI boot case we could have the stub do something more
intelligent and choose a random physical offset itself.
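
(Roughly along these lines, as a sketch -- efi_get_random() is a
stand-in for whatever entropy source the stub would end up using,
e.g. the EFI_RNG_PROTOCOL; none of this is existing code:)

static unsigned long choose_image_offset(unsigned long ram_size,
					 unsigned long image_size)
{
	u64 seed;

	if (efi_get_random(&seed, sizeof(seed)) != EFI_SUCCESS)
		return 0;	/* no entropy: fall back to no randomisation */

	/* pick a 2 MB aligned slot somewhere within the available RAM */
	return round_down(seed % (ram_size - image_size), SZ_2M);
}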

> > You mention the linear mappings in "performance", which I worry may be
> > at odds with kASLR? Can large mappings still be used even when we've
> > got smaller alignment goals? Since you mention the "upper half of the
> > virtual address range", I assume ARM is built using the -2GB
> > addressing as used by x86, is that right? So it sounds like it would
> > be similar entropy to x86.
> >
> 
> I haven't quantified the performance gain, but it is arguably more
> efficient to map RAM using 1 GB blocks than using 2 MB sections.
> On the other part of the question, I really need to do more research
> on what x86 implements in the first place before even trying to answer
> it.

That might not always be true, depending on the TLB implementation
(though it's better to assume that it is, as it shouldn't result in a
performance loss).

Also, if you use DEBUG_RODATA the kernel won't be mapped with 1GB
mappings after early boot anyway.

Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-17 16:35     ` Mark Rutland
@ 2015-03-17 16:40       ` Ard Biesheuvel
  2015-03-17 16:43         ` Mark Rutland
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-17 16:40 UTC (permalink / raw)
  To: linux-arm-kernel

On 17 March 2015 at 17:35, Mark Rutland <mark.rutland@arm.com> wrote:
>> > Possibly related to running with/detecting an offset, we need a way to
>> > communicate that kASLR is active through the compressed kernel to
>> > uncompressed kernel. x86 is going to be using x86's setup_data, but we
>> > may need to generalize this. (The reasoning here is that when kaslr is
>> > disabled at runtime, we should turn off other kernel ASLR, like module
>> > offset ASLR, without duplicating kernel command line parameter parsing
>> > -- which is what x86 is currently doing now.) Just examining the
>> > offset isn't sufficient because perhaps we got randomized to offset 0.
>> > :)
>> >
>>
>> There is no decompressor on arm64, just the core kernel Image. So if
>> an offset needs to be chosen before branching into the kernel proper,
>> it needs to be the bootloader that chooses it.
>
> Agreed.
>
> Our equivalent to setup_data is the DT /chosen node, and I don't think
> we want to try parsing that before we've turned on the MMU.
>
> However, for the UEFI boot case we could have the stub do something more
> intelligent and choose a random physical offset itself.
>
>> > You mention the linear mappings in "performance", which I worry may be
>> > at odds with kASLR? Can large mappings still be used even when we've
>> > got smaller alignment goals? Since you mention the "upper half of the
>> > virtual address range", I assume ARM is built using the -2GB
>> > addressing as used by x86, is that right? So it sounds like it would
>> > be similar entropy to x86.
>> >
>>
>> I haven't quantified the performance gain, but it is arguably more
>> efficient to map RAM using 1 GB blocks than using 2 MB sections.
>> On the other part of the question, I really need to do more research
>> on what x86 implements in the first place before even trying to answer
>> it.
>
> That might not always be true, depending on the TLB implementation
> (though it's better to assume that it is, as it shouldn't result in a
> performance loss).
>

Exactly, but I didn't do any measurements at all.

> Also, if you use DEBUG_RODATA the kernel won't be mapped with 1GB
> mappings after early boot anyway.
>

That's not the point. The point is that, since physical memory starts
at the base of the kernel image, as far as the kernel is concerned,
the relative alignment of the virtual and physical address spaces may
result in *all* of memory being mapped using 2 MB sections rather than
1 GB blocks.
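
To put numbers on it (illustrative arithmetic only, using the values
from the cover letter):

#include <stdio.h>

#define SZ_2M	0x200000UL
#define SZ_1G	0x40000000UL

int main(void)
{
	unsigned long page_offset = 0xffffffc000000000UL; /* 1 GB aligned     */
	unsigned long phys_offset = 0x7f400000UL;          /* Image base, 2 MB */
	unsigned long delta = page_offset - phys_offset;   /* linear map offset */

	/* a block mapping needs both VA and PA aligned to the block size */
	printf("1 GB blocks usable: %d\n", (delta % SZ_1G) == 0);	/* 0 */
	printf("2 MB blocks usable: %d\n", (delta % SZ_2M) == 0);	/* 1 */
	return 0;
}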

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 0/3] arm64: relocatable kernel proof of concept
  2015-03-17 16:40       ` Ard Biesheuvel
@ 2015-03-17 16:43         ` Mark Rutland
  2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-17 16:43 UTC (permalink / raw)
  To: linux-arm-kernel

> > Also, if you use DEBUG_RODATA the kernel won't be mapped with 1GB
> > mappings after early boot anyway.
> >
> 
> That's not the point. The point is that, since physical memory starts
> at the base of the kernel image, as far as the kernel is concerned,
> the relative alignment of the virtual and physical address spaces may
> result in *all* of memory being mapped using 2 MB sections rather than
> 1 GB blocks.

Ah. Good point.

Another reason to split the text and linear mappings ;)

Mark.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 0/4] RFC: split text and linear mappings using tagged pointers
  2015-03-17 16:43         ` Mark Rutland
@ 2015-03-23 15:36           ` Ard Biesheuvel
  2015-03-23 15:36             ` [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping Ard Biesheuvel
                               ` (4 more replies)
  0 siblings, 5 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-23 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

This implements a split mapping of the kernel Image using the AArch64
tagged pointers feature. The kernel text is emitted with bit 56 cleared,
which can be used by the virt to phys translation to choose one translation
regime or another.

The purpose is to allow PHYS_OFFSET to start at an arbitrary offset below
the kernel image, so that memory below the kernel Image can be used, allowing
the Image to be loaded anywhere in physical memory.

This series moves the kernel text right below PAGE_OFFSET, next to the modules
area. PHYS_OFFSET is chosen to be a suitable aligned boundary somewhere below
the kernel Image (1 GB or 512 MB depending on chosen page size).

Output from a QEMU/EFI boot:

System RAM:
 memory[0x0]	[0x00000040000000-0x000000bfffffff], 0x80000000 bytes flags: 0x0

Kernel image:
 reserved[0x0]	[0x0000005f680000-0x0000005fe68fff], 0x7e9000 bytes flags: 0x0

Virtual kernel memory layout:
    vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
    vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
              0xffffffbdc1000000 - 0xffffffbdc3000000   (    32 MB actual)
    fixed   : 0xffffffbff69fd000 - 0xffffffbff6c00000   (  2060 KB)
    PCI I/O : 0xffffffbff6e00000 - 0xffffffbff7e00000   (    16 MB)
    modules : 0xffffffbff8000000 - 0xffffffbffc000000   (    64 MB)
    memory  : 0xffffffc000000000 - 0xffffffc080000000   (  2048 MB)
      .init : 0xfeffffbffc783000 - 0xfeffffbffc7c6000   (   268 KB)
      .text : 0xfeffffbffc080000 - 0xfeffffbffc782344   (  7177 KB)
      .data : 0xfeffffbffc7ca000 - 0xfeffffbffc830c00   (   411 KB)

Note that this time, no relocations were harmed in the making of these
patches. With added relocation support, it should be possible to move
the combined modules and kernel text anywhere in the vmalloc area (for
kaslr)

There are probably some places where the cleared bit 56 in the virtual
address may cause trouble. A known problem is the module loader, but there
are surely others.


Ard Biesheuvel (4):
  arm64: use tagged pointers to distinguish kernel text from the linear
    mapping
  arm64: fixmap: move translation tables to dedicated region
  arm64: move kernel text below PAGE_OFFSET
  arm64: align PHYS_OFFSET to block size

 arch/arm64/include/asm/linkage.h       |  2 ++
 arch/arm64/include/asm/memory.h        | 27 +++++++++++++++++--
 arch/arm64/include/asm/pgtable-hwdef.h |  1 +
 arch/arm64/kernel/head.S               | 48 +++++++++++++++++++++++++---------
 arch/arm64/kernel/vmlinux.lds.S        | 19 +++++++++-----
 arch/arm64/mm/mmu.c                    | 21 +++++++++------
 arch/arm64/mm/proc.S                   |  3 ++-
 7 files changed, 91 insertions(+), 30 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping
  2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
@ 2015-03-23 15:36             ` Ard Biesheuvel
  2015-03-25 14:04               ` Catalin Marinas
  2015-03-26  1:27               ` Mark Rutland
  2015-03-23 15:36             ` [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region Ard Biesheuvel
                               ` (3 subsequent siblings)
  4 siblings, 2 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-23 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

This enables tagged pointers for kernel addresses, and uses them to
tag statically allocated kernel objects. This allows us to use a
separate translation regime for kernel text in the next patch.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/include/asm/memory.h        | 20 +++++++++++++++++++-
 arch/arm64/include/asm/pgtable-hwdef.h |  1 +
 arch/arm64/kernel/vmlinux.lds.S        |  4 ++--
 arch/arm64/mm/mmu.c                    |  4 ++--
 arch/arm64/mm/proc.S                   |  3 ++-
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index f800d45ea226..7dfe1b0c9c01 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -107,6 +107,10 @@
 #define MT_S2_NORMAL		0xf
 #define MT_S2_DEVICE_nGnRE	0x1
 
+#define __TEXT(x)	((x) & ~(UL(1) << 56))
+#define __VIRT(x)	((x) | (UL(1) << 56))
+#define __IS_TEXT(x)	(!((x) & (UL(1) << 56)))
+
 #ifndef __ASSEMBLY__
 
 extern phys_addr_t		memstart_addr;
@@ -141,9 +145,23 @@ static inline void *phys_to_virt(phys_addr_t x)
 }
 
 /*
+ * Return the physical address of a statically allocated object that
+ * is covered by the kernel Image mapping. We use tagged pointers to
+ * distinguish between the virtual linear and the virtual kimage range.
+ */
+static inline phys_addr_t __text_to_phys(unsigned long x)
+{
+	return __virt_to_phys(__VIRT(x));
+}
+
+/*
  * Drivers should NOT use these either.
  */
-#define __pa(x)			__virt_to_phys((unsigned long)(x))
+#define __pa(x)	({					\
+	unsigned long __x = (unsigned long)(x);		\
+	__IS_TEXT(__x) ? __text_to_phys(__x) :		\
+			 __virt_to_phys(__x); })
+
 #define __va(x)			((void *)__phys_to_virt((phys_addr_t)(x)))
 #define pfn_to_kaddr(pfn)	__va((pfn) << PAGE_SHIFT)
 #define virt_to_pfn(x)      __phys_to_pfn(__virt_to_phys(x))
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index 5f930cc9ea83..8bcec4e626b4 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -163,5 +163,6 @@
 #define TCR_TG1_64K		(UL(3) << 30)
 #define TCR_ASID16		(UL(1) << 36)
 #define TCR_TBI0		(UL(1) << 37)
+#define TCR_TBI1		(UL(1) << 38)
 
 #endif
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 5d9d2dca530d..434ef407ef0f 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -74,7 +74,7 @@ SECTIONS
 		*(.discard.*)
 	}
 
-	. = PAGE_OFFSET + TEXT_OFFSET;
+	. = __TEXT(PAGE_OFFSET) + TEXT_OFFSET;
 
 	.head.text : {
 		_text = .;
@@ -171,4 +171,4 @@ ASSERT(((__hyp_idmap_text_start + PAGE_SIZE) > __hyp_idmap_text_end),
 /*
  * If padding is applied before .head.text, virt<->phys conversions will fail.
  */
-ASSERT(_text == (PAGE_OFFSET + TEXT_OFFSET), "HEAD is misaligned")
+ASSERT(_text == (__TEXT(PAGE_OFFSET) + TEXT_OFFSET), "HEAD is misaligned")
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index c9267acb699c..43496748e3d9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -267,7 +267,7 @@ static void *late_alloc(unsigned long size)
 static void __ref create_mapping(phys_addr_t phys, unsigned long virt,
 				  phys_addr_t size, pgprot_t prot)
 {
-	if (virt < VMALLOC_START) {
+	if (__VIRT(virt) < VMALLOC_START) {
 		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
 			&phys, virt);
 		return;
@@ -287,7 +287,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 static void create_mapping_late(phys_addr_t phys, unsigned long virt,
 				  phys_addr_t size, pgprot_t prot)
 {
-	if (virt < VMALLOC_START) {
+	if (__VIRT(virt) < VMALLOC_START) {
 		pr_warn("BUG: not creating mapping for %pa@0x%016lx - outside kernel range\n",
 			&phys, virt);
 		return;
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 28eebfb6af76..7f2d7f73bc93 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -232,7 +232,8 @@ ENTRY(__cpu_setup)
 	 * both user and kernel.
 	 */
 	ldr	x10, =TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
-			TCR_TG_FLAGS | TCR_ASID16 | TCR_TBI0
+			TCR_TG_FLAGS | TCR_ASID16 | TCR_TBI0 | TCR_TBI1
+
 	/*
 	 * Read the PARange bits from ID_AA64MMFR0_EL1 and set the IPS bits in
 	 * TCR_EL1.
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region
  2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
  2015-03-23 15:36             ` [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping Ard Biesheuvel
@ 2015-03-23 15:36             ` Ard Biesheuvel
  2015-03-26  1:28               ` Mark Rutland
  2015-03-23 15:36             ` [PATCH 3/4] arm64: move kernel text below PAGE_OFFSET Ard Biesheuvel
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-23 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

This moves the fixmap translation tables to a dedicated region
in the linker map. This is needed for splitting the kernel text from
the linear mapping, to ensure that the contents of the tables
rely on a single translation regime.

Also make the population of the translation levels conditional,
so that the kernel text can share some levels of translation if
needed.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/include/asm/linkage.h |  2 ++
 arch/arm64/kernel/vmlinux.lds.S  | 15 ++++++++++-----
 arch/arm64/mm/mmu.c              | 15 +++++++++------
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
index 636c1bced7d4..dc9354de6f52 100644
--- a/arch/arm64/include/asm/linkage.h
+++ b/arch/arm64/include/asm/linkage.h
@@ -4,4 +4,6 @@
 #define __ALIGN		.align 4
 #define __ALIGN_STR	".align 4"
 
+#define __pgdir		__attribute__((__section__(".pgdir")))
+
 #endif
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 434ef407ef0f..d3885043f0b7 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -149,11 +149,16 @@ SECTIONS
 
 	BSS_SECTION(0, 0, 0)
 
-	. = ALIGN(PAGE_SIZE);
-	idmap_pg_dir = .;
-	. += IDMAP_DIR_SIZE;
-	swapper_pg_dir = .;
-	. += SWAPPER_DIR_SIZE;
+	.pgdir : {
+		. = ALIGN(PAGE_SIZE);
+		__pgdir_start = .;
+		idmap_pg_dir = .;
+		. += IDMAP_DIR_SIZE;
+		*(.pgdir)
+		swapper_pg_dir = .;
+		. += SWAPPER_DIR_SIZE;
+		__pgdir_stop = .;
+	}
 
 	_end = .;
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 43496748e3d9..bb3ce41130f3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -549,12 +549,12 @@ void vmemmap_free(unsigned long start, unsigned long end)
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
-static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
+static pte_t bm_pte[PTRS_PER_PTE] __pgdir;
 #if CONFIG_ARM64_PGTABLE_LEVELS > 2
-static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss;
+static pmd_t bm_pmd[PTRS_PER_PMD] __pgdir;
 #endif
 #if CONFIG_ARM64_PGTABLE_LEVELS > 3
-static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss;
+static pud_t bm_pud[PTRS_PER_PUD] __pgdir;
 #endif
 
 static inline pud_t * fixmap_pud(unsigned long addr)
@@ -592,11 +592,14 @@ void __init early_fixmap_init(void)
 	unsigned long addr = FIXADDR_START;
 
 	pgd = pgd_offset_k(addr);
-	pgd_populate(&init_mm, pgd, bm_pud);
+	if (pgd_none(*pgd))
+		pgd_populate(&init_mm, pgd, bm_pud);
 	pud = pud_offset(pgd, addr);
-	pud_populate(&init_mm, pud, bm_pmd);
+	if (pud_none(*pud))
+		pud_populate(&init_mm, pud, bm_pmd);
 	pmd = pmd_offset(pud, addr);
-	pmd_populate_kernel(&init_mm, pmd, bm_pte);
+	if (pmd_none(*pmd))
+		pmd_populate_kernel(&init_mm, pmd, bm_pte);
 
 	/*
 	 * The boot-ioremap range spans multiple pmds, for which
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/4] arm64: move kernel text below PAGE_OFFSET
  2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
  2015-03-23 15:36             ` [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping Ard Biesheuvel
  2015-03-23 15:36             ` [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region Ard Biesheuvel
@ 2015-03-23 15:36             ` Ard Biesheuvel
  2015-03-25 14:10               ` Catalin Marinas
  2015-03-23 15:36             ` [PATCH 4/4] arm64: align PHYS_OFFSET to block size Ard Biesheuvel
  2015-03-26  1:26             ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Mark Rutland
  4 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-23 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

This moves the virtual mapping of the kernel Image down into
the lower half of the kernel virtual memory range, moving it
out of the linear mapping.

An exception is made for the statically allocated translation
tables: these are so entangled with the translation regime that
they need to be accessed via the linear mapping exclusively.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/include/asm/memory.h |  9 +++++++--
 arch/arm64/kernel/head.S        | 39 +++++++++++++++++++++++++++++----------
 arch/arm64/kernel/vmlinux.lds.S |  4 ++--
 arch/arm64/mm/mmu.c             |  2 ++
 4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 7dfe1b0c9c01..2b2d2fccfee3 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -38,6 +38,8 @@
  */
 #define PCI_IO_SIZE		SZ_16M
 
+#define KIMAGE_OFFSET		SZ_64M
+
 /*
  * PAGE_OFFSET - the virtual address of the start of the kernel image (top
  *		 (VA_BITS - 1))
@@ -49,7 +51,8 @@
  */
 #define VA_BITS			(CONFIG_ARM64_VA_BITS)
 #define PAGE_OFFSET		(UL(0xffffffffffffffff) << (VA_BITS - 1))
-#define MODULES_END		(PAGE_OFFSET)
+#define KIMAGE_VADDR		(PAGE_OFFSET - KIMAGE_OFFSET)
+#define MODULES_END		KIMAGE_VADDR
 #define MODULES_VADDR		(MODULES_END - SZ_64M)
 #define PCI_IO_END		(MODULES_VADDR - SZ_2M)
 #define PCI_IO_START		(PCI_IO_END - PCI_IO_SIZE)
@@ -117,6 +120,8 @@ extern phys_addr_t		memstart_addr;
 /* PHYS_OFFSET - the physical address of the start of memory. */
 #define PHYS_OFFSET		({ memstart_addr; })
 
+extern u64 image_offset;
+
 /*
  * PFNs are used to describe any physical page; this means
  * PFN 0 == physical address 0.
@@ -151,7 +156,7 @@ static inline void *phys_to_virt(phys_addr_t x)
  */
 static inline phys_addr_t __text_to_phys(unsigned long x)
 {
-	return __virt_to_phys(__VIRT(x));
+	return __virt_to_phys(__VIRT(x)) + image_offset;
 }
 
 /*
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 504423422e20..16134608eecf 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -235,7 +235,10 @@ section_table:
 ENTRY(stext)
 	bl	preserve_boot_args
 	bl	el2_setup			// Drop to EL1, w20=cpu_boot_mode
+
 	adrp	x24, __PHYS_OFFSET
+	mov	x23, #KIMAGE_OFFSET
+
 	bl	set_cpu_boot_mode_flag
 	bl	__create_page_tables		// x25=TTBR0, x26=TTBR1
 	/*
@@ -279,13 +282,15 @@ ENDPROC(preserve_boot_args)
  * Corrupts:	tmp1, tmp2
  * Returns:	tbl -> next level table page address
  */
-	.macro	create_table_entry, tbl, virt, shift, ptrs, tmp1, tmp2
+	.macro	create_table_entry, tbl, virt, shift, ptrs, tmp1, tmp2, off=0
 	lsr	\tmp1, \virt, #\shift
+	.if	\off
+	add	\tmp1, \tmp1, #\off
+	.endif
 	and	\tmp1, \tmp1, #\ptrs - 1	// table index
-	add	\tmp2, \tbl, #PAGE_SIZE
+	add	\tmp2, \tbl, #(\off + 1) * PAGE_SIZE
 	orr	\tmp2, \tmp2, #PMD_TYPE_TABLE	// address of next table and entry type
 	str	\tmp2, [\tbl, \tmp1, lsl #3]
-	add	\tbl, \tbl, #PAGE_SIZE		// next level table page
 	.endm
 
 /*
@@ -298,8 +303,13 @@ ENDPROC(preserve_boot_args)
 	.macro	create_pgd_entry, tbl, virt, tmp1, tmp2
 	create_table_entry \tbl, \virt, PGDIR_SHIFT, PTRS_PER_PGD, \tmp1, \tmp2
 #if SWAPPER_PGTABLE_LEVELS == 3
+	add	\tbl, \tbl, #PAGE_SIZE		// next level table page
 	create_table_entry \tbl, \virt, TABLE_SHIFT, PTRS_PER_PTE, \tmp1, \tmp2
+	create_table_entry \tbl, \virt, TABLE_SHIFT, PTRS_PER_PTE, \tmp1, \tmp2, 1
+#else
+	create_table_entry \tbl, \virt, PGDIR_SHIFT, PTRS_PER_PGD, \tmp1, \tmp2, 1
 #endif
+	add	\tbl, \tbl, #PAGE_SIZE		// next level table page
 	.endm
 
 /*
@@ -312,15 +322,15 @@ ENDPROC(preserve_boot_args)
 	.macro	create_block_map, tbl, flags, phys, start, end
 	lsr	\phys, \phys, #BLOCK_SHIFT
 	lsr	\start, \start, #BLOCK_SHIFT
-	and	\start, \start, #PTRS_PER_PTE - 1	// table index
 	orr	\phys, \flags, \phys, lsl #BLOCK_SHIFT	// table entry
 	lsr	\end, \end, #BLOCK_SHIFT
-	and	\end, \end, #PTRS_PER_PTE - 1		// table end index
+	sub	\end, \end, \start
+	and	\start, \start, #PTRS_PER_PTE - 1	// table index
 9999:	str	\phys, [\tbl, \start, lsl #3]		// store the entry
 	add	\start, \start, #1			// next entry
 	add	\phys, \phys, #BLOCK_SIZE		// next block
-	cmp	\start, \end
-	b.ls	9999b
+	subs	\end, \end, #1
+	b.pl	9999b
 	.endm
 
 /*
@@ -371,10 +381,18 @@ __create_page_tables:
 	 * Map the kernel image (starting with PHYS_OFFSET).
 	 */
 	mov	x0, x26				// swapper_pg_dir
-	mov	x5, #PAGE_OFFSET
+	ldr	x5, =KERNEL_START		// VA of __PHYS_OFFSET
 	create_pgd_entry x0, x5, x3, x6
-	ldr	x6, =KERNEL_END			// __va(KERNEL_END)
-	mov	x3, x24				// phys offset
+	ldr	x6, =__pgdir_start		// VA of KERNEL_END
+	adrp	x3, KERNEL_START		// phys offset
+	create_block_map x0, x7, x3, x5, x6
+
+	ldr	x5, =__pgdir_start
+	add	x5, x5, x23
+	adrp	x3, idmap_pg_dir
+	add	x0, x0, #PAGE_SIZE
+	ldr	x6, =__pgdir_stop
+	add	x6, x6, x23
 	create_block_map x0, x7, x3, x5, x6
 
 	/*
@@ -406,6 +424,7 @@ __mmap_switched:
 2:
 	adr_l	sp, initial_sp, x4
 	str_l	x21, __fdt_pointer, x5		// Save FDT pointer
+	str_l	x23, image_offset, x5
 	str_l	x24, memstart_addr, x6		// Save PHYS_OFFSET
 	mov	x29, #0
 	b	start_kernel
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index d3885043f0b7..9f514a0a38ad 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -74,7 +74,7 @@ SECTIONS
 		*(.discard.*)
 	}
 
-	. = __TEXT(PAGE_OFFSET) + TEXT_OFFSET;
+	. = __TEXT(KIMAGE_VADDR) + TEXT_OFFSET;
 
 	.head.text : {
 		_text = .;
@@ -176,4 +176,4 @@ ASSERT(((__hyp_idmap_text_start + PAGE_SIZE) > __hyp_idmap_text_end),
 /*
  * If padding is applied before .head.text, virt<->phys conversions will fail.
  */
-ASSERT(_text == (__TEXT(PAGE_OFFSET) + TEXT_OFFSET), "HEAD is misaligned")
+ASSERT(_text == (__TEXT(KIMAGE_VADDR) + TEXT_OFFSET), "HEAD is misaligned")
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index bb3ce41130f3..14ba1dd80932 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -47,6 +47,8 @@
 struct page *empty_zero_page;
 EXPORT_SYMBOL(empty_zero_page);
 
+u64 image_offset;
+
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 			      unsigned long size, pgprot_t vma_prot)
 {
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
                               ` (2 preceding siblings ...)
  2015-03-23 15:36             ` [PATCH 3/4] arm64: move kernel text below PAGE_OFFSET Ard Biesheuvel
@ 2015-03-23 15:36             ` Ard Biesheuvel
  2015-03-25 14:14               ` Catalin Marinas
  2015-03-25 14:59               ` Catalin Marinas
  2015-03-26  1:26             ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Mark Rutland
  4 siblings, 2 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-23 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

This aligns PHYS_OFFSET down to an alignment that allows system
RAM to be mapped using the largest blocks available, i.e., 1 GB
blocks on 4 KB pages or 512 MB blocks on 64 KB pages.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/kernel/head.S | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 16134608eecf..fd8434753372 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -49,13 +49,15 @@
 #ifdef CONFIG_ARM64_64K_PAGES
 #define BLOCK_SHIFT	PAGE_SHIFT
 #define BLOCK_SIZE	PAGE_SIZE
-#define TABLE_SHIFT	PMD_SHIFT
 #else
 #define BLOCK_SHIFT	SECTION_SHIFT
 #define BLOCK_SIZE	SECTION_SIZE
-#define TABLE_SHIFT	PUD_SHIFT
 #endif
 
+#define TABLE_SHIFT	(BLOCK_SHIFT + PAGE_SHIFT - 3)
+#define TABLE_SIZE	(1 << TABLE_SHIFT)
+#define TABLE_MASK	(~(TABLE_SIZE - 1))
+
 #define KERNEL_START	_text
 #define KERNEL_END	_end
 
@@ -237,7 +239,10 @@ ENTRY(stext)
 	bl	el2_setup			// Drop to EL1, w20=cpu_boot_mode
 
 	adrp	x24, __PHYS_OFFSET
-	mov	x23, #KIMAGE_OFFSET
+	and	x23, x24, #~TABLE_MASK		// image offset
+	and	x24, x24, #TABLE_MASK		// PHYS_OFFSET
+	mov	x0, #KIMAGE_OFFSET
+	add	x23, x23, x0
 
 	bl	set_cpu_boot_mode_flag
 	bl	__create_page_tables		// x25=TTBR0, x26=TTBR1
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping
  2015-03-23 15:36             ` [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping Ard Biesheuvel
@ 2015-03-25 14:04               ` Catalin Marinas
  2015-03-26  1:27               ` Mark Rutland
  1 sibling, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2015-03-25 14:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 23, 2015 at 04:36:53PM +0100, Ard Biesheuvel wrote:
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index f800d45ea226..7dfe1b0c9c01 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -107,6 +107,10 @@
>  #define MT_S2_NORMAL		0xf
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
> +#define __TEXT(x)	((x) & ~(UL(1) << 56))
> +#define __VIRT(x)	((x) | (UL(1) << 56))
> +#define __IS_TEXT(x)	(!((x) & (UL(1) << 56)))
> +
>  #ifndef __ASSEMBLY__
>  
>  extern phys_addr_t		memstart_addr;
> @@ -141,9 +145,23 @@ static inline void *phys_to_virt(phys_addr_t x)
>  }
>  
>  /*
> + * Return the physical address of a statically allocated object that
> + * is covered by the kernel Image mapping. We use tagged pointers to
> + * distinguish between the virtual linear and the virtual kimage range.
> + */
> +static inline phys_addr_t __text_to_phys(unsigned long x)
> +{
> +	return __virt_to_phys(__VIRT(x));
> +}

If PAGE_OFFSET is not an immediate value for SUB, you could define a
TEXT_PAGE_OFFSET as __TEXT(PAGE_OFFSET) and avoid the extra "or".
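
I.e. something like this (untested sketch of the suggestion):

#define TEXT_PAGE_OFFSET	__TEXT(PAGE_OFFSET)

static inline phys_addr_t __text_to_phys(unsigned long x)
{
	/* x already has bit 56 clear, so no need to set it first */
	return (phys_addr_t)(x - TEXT_PAGE_OFFSET + PHYS_OFFSET);
}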

> +
> +/*
>   * Drivers should NOT use these either.
>   */

This existing comment doesn't seem to have any effect. I can see plenty
of drivers using __pa().

> -#define __pa(x)			__virt_to_phys((unsigned long)(x))
> +#define __pa(x)	({					\
> +	unsigned long __x = (unsigned long)(x);		\
> +	__IS_TEXT(__x) ? __text_to_phys(__x) :		\
> +			 __virt_to_phys(__x); })

Could we check where __pa() is actually used on a kernel text address?
If there are only a few such cases, we could avoid this check and create
a specific __kernel_pa(). Same for virt_to_phys(), there are some places
like setting the idmap_pg_dir.

Anyway, if the performance impact is not significant, we can live with
the check here. But I really think we should avoid tagged pointers by
simply splitting the VA space and checking one of the bits, which would
be 1 with the kernel text mapping and 0 with the linear mapping (move
the kernel high up).

>  #endif
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 5d9d2dca530d..434ef407ef0f 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -74,7 +74,7 @@ SECTIONS
>  		*(.discard.*)
>  	}
>  
> -	. = PAGE_OFFSET + TEXT_OFFSET;
> +	. = __TEXT(PAGE_OFFSET) + TEXT_OFFSET;

And without tagged pointers, just define something like
KERNEL_PAGE_OFFSET or TEXT_PAGE_OFFSET (I prefer to avoid TEXT since we
have data as well but I'm not really bothered).

> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index c9267acb699c..43496748e3d9 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -267,7 +267,7 @@ static void *late_alloc(unsigned long size)
>  static void __ref create_mapping(phys_addr_t phys, unsigned long virt,
>  				  phys_addr_t size, pgprot_t prot)
>  {
> -	if (virt < VMALLOC_START) {
> +	if (__VIRT(virt) < VMALLOC_START) {

I don't think we would need __VIRT() without tagged pointers.

-- 
Catalin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/4] arm64: move kernel text below PAGE_OFFSET
  2015-03-23 15:36             ` [PATCH 3/4] arm64: move kernel text below PAGE_OFFSET Ard Biesheuvel
@ 2015-03-25 14:10               ` Catalin Marinas
  0 siblings, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2015-03-25 14:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 23, 2015 at 04:36:55PM +0100, Ard Biesheuvel wrote:
> This moves the virtual mapping of the kernel Image down into
> the lower half of the kernel virtual memory range, moving it
> out of the linear mapping.

I wonder whether there would be better code generation with the kernel
address very high, like the top 64MB. Anyway, having it higher allows us
to check some top bits of the address and detect whether it's kernel
text or linear mapping.

-- 
Catalin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-23 15:36             ` [PATCH 4/4] arm64: align PHYS_OFFSET to block size Ard Biesheuvel
@ 2015-03-25 14:14               ` Catalin Marinas
  2015-03-26  6:23                 ` Ard Biesheuvel
  2015-03-25 14:59               ` Catalin Marinas
  1 sibling, 1 reply; 42+ messages in thread
From: Catalin Marinas @ 2015-03-25 14:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
> This aligns PHYS_OFFSET down to an alignment that allows system
> RAM to be mapped using the largest blocks available, i.e., 1 GB
> blocks on 4 KB pages or 512 MB blocks on 64 KB pages.

Does this assume that the platform RAM starts at such block aligned
addresses? If not, is the unaligned range ignored?

-- 
Catalin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-23 15:36             ` [PATCH 4/4] arm64: align PHYS_OFFSET to block size Ard Biesheuvel
  2015-03-25 14:14               ` Catalin Marinas
@ 2015-03-25 14:59               ` Catalin Marinas
  2015-03-26  6:22                 ` Ard Biesheuvel
  1 sibling, 1 reply; 42+ messages in thread
From: Catalin Marinas @ 2015-03-25 14:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 16134608eecf..fd8434753372 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -49,13 +49,15 @@
>  #ifdef CONFIG_ARM64_64K_PAGES
>  #define BLOCK_SHIFT	PAGE_SHIFT
>  #define BLOCK_SIZE	PAGE_SIZE
> -#define TABLE_SHIFT	PMD_SHIFT
>  #else
>  #define BLOCK_SHIFT	SECTION_SHIFT
>  #define BLOCK_SIZE	SECTION_SIZE
> -#define TABLE_SHIFT	PUD_SHIFT
>  #endif
>  
> +#define TABLE_SHIFT	(BLOCK_SHIFT + PAGE_SHIFT - 3)
> +#define TABLE_SIZE	(1 << TABLE_SHIFT)
> +#define TABLE_MASK	(~(TABLE_SIZE - 1))
> +
>  #define KERNEL_START	_text
>  #define KERNEL_END	_end
>  
> @@ -237,7 +239,10 @@ ENTRY(stext)
>  	bl	el2_setup			// Drop to EL1, w20=cpu_boot_mode
>  
>  	adrp	x24, __PHYS_OFFSET
> -	mov	x23, #KIMAGE_OFFSET
> +	and	x23, x24, #~TABLE_MASK		// image offset
> +	and	x24, x24, #TABLE_MASK		// PHYS_OFFSET
> +	mov	x0, #KIMAGE_OFFSET
> +	add	x23, x23, x0

I'm still trying to figure out how this works. Does the code imply that
the kernel image can only be loaded within a block size of the
PHYS_OFFSET? If that's the case, it's not too flexible.

My thoughts were to make PHYS_OFFSET permanently 0 (i.e. get rid of it
entirely) or at least compute it at DT parsing time. I'm more inclined
for making it 0 assuming that it doesn't break anything else (vmemmap
virtual range may get slightly bigger but still not significant,
basically max_phys_addr / sizeof(struct page)).

-- 
Catalin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 0/4] RFC: split text and linear mappings using tagged pointers
  2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
                               ` (3 preceding siblings ...)
  2015-03-23 15:36             ` [PATCH 4/4] arm64: align PHYS_OFFSET to block size Ard Biesheuvel
@ 2015-03-26  1:26             ` Mark Rutland
  2015-03-26  6:09               ` Ard Biesheuvel
  4 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-26  1:26 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On Mon, Mar 23, 2015 at 03:36:52PM +0000, Ard Biesheuvel wrote:
> This implements a split mapping of the kernel Image using the AArch64
> tagged pointers feature. The kernel text is emitted with bit 56 cleared,
> which can be used by the virt to phys translation to choose one translation
> regime or another.

This looks neat!

I definitely want to see the linear mapping decoupled from the text mapping,
but I'm rather worried by the prospect of tagged pointers in the kernel (there
are a lot of edge cases for those w.r.t. the preservation of the tag bits),
and if possible I'd like to avoid their use. I think that we can achieve the
same effect by reorganising the VA layout within the 56 bits we have to play
with.

Nit: please s/translation regime/mapping/ (or some other wording to that
effect) for these patches; the architectural definition of "translation regime"
differs from what you're referring to here, and we should avoid overloading
that terminology.

Thanks,
Mark.

> The purpose is to allow PHYS_OFFSET to start at an arbitrary offset below
> the kernel image, so that memory below the kernel Image can be used, allowing
> the Image to be loaded anywhere in physical memory.
>
> This series moves the kernel text right below PAGE_OFFSET, next to the modules
> area. PHYS_OFFSET is chosen to be a suitable aligned boundary somewhere below
> the kernel Image (1 GB or 512 MB depending on chosen page size).
> 
> Output from a QEMU/EFI boot:
> 
> System RAM:
>  memory[0x0]	[0x00000040000000-0x000000bfffffff], 0x80000000 bytes flags: 0x0
> 
> Kernel image:
>  reserved[0x0]	[0x0000005f680000-0x0000005fe68fff], 0x7e9000 bytes flags: 0x0
> 
> Virtual kernel memory layout:
>     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
>     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
>               0xffffffbdc1000000 - 0xffffffbdc3000000   (    32 MB actual)
>     fixed   : 0xffffffbff69fd000 - 0xffffffbff6c00000   (  2060 KB)
>     PCI I/O : 0xffffffbff6e00000 - 0xffffffbff7e00000   (    16 MB)
>     modules : 0xffffffbff8000000 - 0xffffffbffc000000   (    64 MB)
>     memory  : 0xffffffc000000000 - 0xffffffc080000000   (  2048 MB)
>       .init : 0xfeffffbffc783000 - 0xfeffffbffc7c6000   (   268 KB)
>       .text : 0xfeffffbffc080000 - 0xfeffffbffc782344   (  7177 KB)
>       .data : 0xfeffffbffc7ca000 - 0xfeffffbffc830c00   (   411 KB)
> 
> Note that this time, no relocations were harmed in the making of these
> patches. With added relocation support, it should be possible to move
> the combined modules and kernel text anywhere in the vmalloc area (for
> kaslr)
> 
> There are probably some places where the cleared bit 56 in the virtual
> address may cause trouble. Known problem is the module loader, but there
> are surely others.
> 
> 
> Ard Biesheuvel (4):
>   arm64: use tagged pointers to distinguish kernel text from the linear
>     mapping
>   arm64: fixmap: move translation tables to dedicated region
>   arm64: move kernel text below PAGE_OFFSET
>   arm64: align PHYS_OFFSET to block size
> 
>  arch/arm64/include/asm/linkage.h       |  2 ++
>  arch/arm64/include/asm/memory.h        | 27 +++++++++++++++++--
>  arch/arm64/include/asm/pgtable-hwdef.h |  1 +
>  arch/arm64/kernel/head.S               | 48 +++++++++++++++++++++++++---------
>  arch/arm64/kernel/vmlinux.lds.S        | 19 +++++++++-----
>  arch/arm64/mm/mmu.c                    | 21 +++++++++------
>  arch/arm64/mm/proc.S                   |  3 ++-
>  7 files changed, 91 insertions(+), 30 deletions(-)
> 
> -- 
> 1.8.3.2
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping
  2015-03-23 15:36             ` [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping Ard Biesheuvel
  2015-03-25 14:04               ` Catalin Marinas
@ 2015-03-26  1:27               ` Mark Rutland
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2015-03-26  1:27 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On Mon, Mar 23, 2015 at 03:36:53PM +0000, Ard Biesheuvel wrote:
> This enables tagged pointers for kernel addresses, and uses it to
> tag statically allocated kernel objects. This allows us to use a
> separate translation regime for kernel text in the next patch.

If I've understood correctly, the tag is just to optimize virt_to_phys (using a
bit test rather than VA boundary comparisons), no? Or does the tag allow for
something else that I've missed?

If that is the case, can we not reorganise the kernel VA layout such that we
can use (1 << (VA_BITS - 1)) instead (i.e. check if we're above or below
PAGE_OFFSET) to distinguish the linear mapping from the text mapping?
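
I.e. roughly (sketch only, assuming the text mapping is moved below
PAGE_OFFSET and __text_to_phys() loses its tag handling):

#define __is_linear_addr(x)	(!!((x) & (UL(1) << (VA_BITS - 1))))

#define __pa(x)	({					\
	unsigned long __x = (unsigned long)(x);		\
	__is_linear_addr(__x) ? __virt_to_phys(__x) :	\
				__text_to_phys(__x); })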

> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/include/asm/memory.h        | 20 +++++++++++++++++++-
>  arch/arm64/include/asm/pgtable-hwdef.h |  1 +
>  arch/arm64/kernel/vmlinux.lds.S        |  4 ++--
>  arch/arm64/mm/mmu.c                    |  4 ++--
>  arch/arm64/mm/proc.S                   |  3 ++-
>  5 files changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index f800d45ea226..7dfe1b0c9c01 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -107,6 +107,10 @@
>  #define MT_S2_NORMAL		0xf
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
> +#define __TEXT(x)	((x) & ~(UL(1) << 56))
> +#define __VIRT(x)	((x) | (UL(1) << 56))
> +#define __IS_TEXT(x)	(!((x) & (UL(1) << 56)))
> +
>  #ifndef __ASSEMBLY__
>  
>  extern phys_addr_t		memstart_addr;
> @@ -141,9 +145,23 @@ static inline void *phys_to_virt(phys_addr_t x)
>  }
>  
>  /*
> + * Return the physical address of a statically allocated object that
> + * is covered by the kernel Image mapping. We use tagged pointers to
> + * distinguish between the virtual linear and the virtual kimage range.
> + */
> +static inline phys_addr_t __text_to_phys(unsigned long x)
> +{
> +	return __virt_to_phys(__VIRT(x));
> +}
> +
> +/*
>   * Drivers should NOT use these either.
>   */
> -#define __pa(x)			__virt_to_phys((unsigned long)(x))
> +#define __pa(x)	({					\
> +	unsigned long __x = (unsigned long)(x);		\
> +	__IS_TEXT(__x) ? __text_to_phys(__x) :		\
> +			 __virt_to_phys(__x); })

Do we have an idea what the overhead of this is over a plain __virt_to_phys?

If it's noticeable, we might want to convert some users (eg arm64/mm) to use
__virt_to_phys / __text_to_phys exclusively. For mm that would require some
init code to figure out linear map pointers for the idmap and swapper from the
text map pointers.

Mark.

> +
>  #define __va(x)			((void *)__phys_to_virt((phys_addr_t)(x)))
>  #define pfn_to_kaddr(pfn)	__va((pfn) << PAGE_SHIFT)
>  #define virt_to_pfn(x)      __phys_to_pfn(__virt_to_phys(x))
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
> index 5f930cc9ea83..8bcec4e626b4 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -163,5 +163,6 @@
>  #define TCR_TG1_64K		(UL(3) << 30)
>  #define TCR_ASID16		(UL(1) << 36)
>  #define TCR_TBI0		(UL(1) << 37)
> +#define TCR_TBI1		(UL(1) << 38)
>  
>  #endif
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 5d9d2dca530d..434ef407ef0f 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -74,7 +74,7 @@ SECTIONS
>  		*(.discard.*)
>  	}
>  
> -	. = PAGE_OFFSET + TEXT_OFFSET;
> +	. = __TEXT(PAGE_OFFSET) + TEXT_OFFSET;
>  
>  	.head.text : {
>  		_text = .;
> @@ -171,4 +171,4 @@ ASSERT(((__hyp_idmap_text_start + PAGE_SIZE) > __hyp_idmap_text_end),
>  /*
>   * If padding is applied before .head.text, virt<->phys conversions will fail.
>   */
> -ASSERT(_text == (PAGE_OFFSET + TEXT_OFFSET), "HEAD is misaligned")
> +ASSERT(_text == (__TEXT(PAGE_OFFSET) + TEXT_OFFSET), "HEAD is misaligned")
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index c9267acb699c..43496748e3d9 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -267,7 +267,7 @@ static void *late_alloc(unsigned long size)
>  static void __ref create_mapping(phys_addr_t phys, unsigned long virt,
>  				  phys_addr_t size, pgprot_t prot)
>  {
> -	if (virt < VMALLOC_START) {
> +	if (__VIRT(virt) < VMALLOC_START) {
>  		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
>  			&phys, virt);
>  		return;
> @@ -287,7 +287,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>  static void create_mapping_late(phys_addr_t phys, unsigned long virt,
>  				  phys_addr_t size, pgprot_t prot)
>  {
> -	if (virt < VMALLOC_START) {
> +	if (__VIRT(virt) < VMALLOC_START) {
>  		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
>  			&phys, virt);
>  		return;
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 28eebfb6af76..7f2d7f73bc93 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -232,7 +232,8 @@ ENTRY(__cpu_setup)
>  	 * both user and kernel.
>  	 */
>  	ldr	x10, =TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
> -			TCR_TG_FLAGS | TCR_ASID16 | TCR_TBI0
> +			TCR_TG_FLAGS | TCR_ASID16 | TCR_TBI0 | TCR_TBI1
> +
>  	/*
>  	 * Read the PARange bits from ID_AA64MMFR0_EL1 and set the IPS bits in
>  	 * TCR_EL1.
> -- 
> 1.8.3.2
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region
  2015-03-23 15:36             ` [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region Ard Biesheuvel
@ 2015-03-26  1:28               ` Mark Rutland
  2015-03-26  6:20                 ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Mark Rutland @ 2015-03-26  1:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On Mon, Mar 23, 2015 at 03:36:54PM +0000, Ard Biesheuvel wrote:
> This moves the fixmap translation tables to a dedicated region
> in the linker map. This is needed for a split kernel text from
> the linear mapping, to ensure that the contents of the tables
> rely on a single translation regime.

I'm not sure what you mean by this. Could you elaborate?

What problem would we have if we didn't have this section, and how does this
section solve that?

Regardless, I have some comments on the implementation below.

> Also make the population of the translation levels conditional,
> so that the kernel text can share some levels of translation if
> needed.

I guess you mean shared with the tables for the text mapping?

Please word this to be explicit w.r.t. what is shared between whom, and when.

> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/include/asm/linkage.h |  2 ++
>  arch/arm64/kernel/vmlinux.lds.S  | 15 ++++++++++-----
>  arch/arm64/mm/mmu.c              | 15 +++++++++------
>  3 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> index 636c1bced7d4..dc9354de6f52 100644
> --- a/arch/arm64/include/asm/linkage.h
> +++ b/arch/arm64/include/asm/linkage.h
> @@ -4,4 +4,6 @@
>  #define __ALIGN		.align 4
>  #define __ALIGN_STR	".align 4"
>  
> +#define __pgdir		__attribute__((__section__(".pgdir")))

It would be nice for this to also provide page alignment, like the
__page_aligned_bss attribute whose uses it replaces.

> +
>  #endif
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 434ef407ef0f..d3885043f0b7 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -149,11 +149,16 @@ SECTIONS
>  
>  	BSS_SECTION(0, 0, 0)
>  
> -	. = ALIGN(PAGE_SIZE);
> -	idmap_pg_dir = .;
> -	. += IDMAP_DIR_SIZE;
> -	swapper_pg_dir = .;
> -	. += SWAPPER_DIR_SIZE;
> +	.pgdir : {
> +		. = ALIGN(PAGE_SIZE);
> +		__pgdir_start = .;
> +		idmap_pg_dir = .;
> +		. += IDMAP_DIR_SIZE;
> +		*(.pgdir)
> +		swapper_pg_dir = .;
> +		. += SWAPPER_DIR_SIZE;
> +		__pgdir_stop = .;

Nit: __pgdir_end would be consistent with our other vmlinux.lds.S symbols.

> +	}
>  
>  	_end = .;
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 43496748e3d9..bb3ce41130f3 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -549,12 +549,12 @@ void vmemmap_free(unsigned long start, unsigned long end)
>  }
>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
>  
> -static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
> +static pte_t bm_pte[PTRS_PER_PTE] __pgdir;
>  #if CONFIG_ARM64_PGTABLE_LEVELS > 2
> -static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss;
> +static pmd_t bm_pmd[PTRS_PER_PMD] __pgdir;
>  #endif
>  #if CONFIG_ARM64_PGTABLE_LEVELS > 3
> -static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss;
> +static pud_t bm_pud[PTRS_PER_PUD] __pgdir;
>  #endif

These used to be zeroed (by the bss init code in head.S), but now they won't
be, and as they live after all the initialized data they could contain garbage
if we're unlucky. So I suspect this is broken.

If we zero these before use then that's fine. We could define the swapper and
idmap similarly in this C file for consistency (we zero those in head.S). That
would bring all the page table allocations into the same file, and .pgdir could
be simpler:

	.pgdir : {
		. = ALIGN(PAGE_SIZE);
		__pgdir_start = .;
		*(.pgdir)
		__pgdir_end = .;
	}

>  
>  static inline pud_t * fixmap_pud(unsigned long addr)
> @@ -592,11 +592,14 @@ void __init early_fixmap_init(void)
>  	unsigned long addr = FIXADDR_START;
>  
>  	pgd = pgd_offset_k(addr);
> -	pgd_populate(&init_mm, pgd, bm_pud);
> +	if (pgd_none(*pgd))
> +		pgd_populate(&init_mm, pgd, bm_pud);
>  	pud = pud_offset(pgd, addr);
> -	pud_populate(&init_mm, pud, bm_pmd);
> +	if (pud_none(*pud))
> +		pud_populate(&init_mm, pud, bm_pmd);
>  	pmd = pmd_offset(pud, addr);
> -	pmd_populate_kernel(&init_mm, pmd, bm_pte);
> +	if (pmd_none(*pmd))
> +		pmd_populate_kernel(&init_mm, pmd, bm_pte);

It's a bit confusing to have this change in the same patch as the other
changes, given that this shouldn't be required yet. It might make more sense
for this to live in the patch moving the text mapping.

Mark.


* [PATCH 0/4] RFC: split text and linear mappings using tagged pointers
  2015-03-26  1:26             ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Mark Rutland
@ 2015-03-26  6:09               ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-26  6:09 UTC (permalink / raw)
  To: linux-arm-kernel

On 26 March 2015 at 02:26, Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Ard,
>
> On Mon, Mar 23, 2015 at 03:36:52PM +0000, Ard Biesheuvel wrote:
> > This implements a split mapping of the kernel Image using the AArch64
> > tagged pointers feature. The kernel text is emitted with bit 56 cleared,
> > which can be used by the virt to phys translation to choose one translation
> > regime or another.
>
> This looks neat!
>
> I definitely want to see the linear mapping decoupled from the text mapping,
> but I'm rather worried by the prospect of tagged pointers in the kernel (there
> are a lot of edge cases for those w.r.t. the preservations of the tag bits),
> and if possible I'd like to avoid their use. I think that we can achieve the
> same effect by reorganising the VA layout within the 56 bits we have to play
> with.
>

Yes, it was nice as an experiment, but it breaks down in plenty of
unexpected places. It appears that this

-#define __virt_to_phys(x)      (((phys_addr_t)(x) - PAGE_OFFSET + PHYS_OFFSET))
+#define __virt_to_phys(x) ({                                           \
+       long __x = (long)(x) - PAGE_OFFSET;                             \
+       __x >= 0 ? (phys_addr_t)(__x + PHYS_OFFSET) :                   \
+                  (phys_addr_t)(__x + PHYS_OFFSET + image_offset); })
+

is really all we need.

>
> Nit: please s/translation regime/mapping/ (or some other wording to that
> effect) for these patches; the architectural definition of "translation regime"
> differs from what you're referring to here, and we should avoid overloading
> that terminology.
>

Ah ok. I thought it had a nice ring to it :-)


> > The purpose is to allow PHYS_OFFSET to start at an arbitrary offset below
> > the kernel image, so that memory below the kernel Image can be used, allowing
> > the Image to be loaded anywhere in physical memory.
> >
> > This series moves the kernel text right below PAGE_OFFSET, next to the modules
> > area. PHYS_OFFSET is chosen to be a suitable aligned boundary somewhere below
> > the kernel Image (1 GB or 512 MB depending on chosen page size).
> >
> > Output from a QEMU/EFI boot:
> >
> > System RAM:
> >  memory[0x0]  [0x00000040000000-0x000000bfffffff], 0x80000000 bytes flags: 0x0
> >
> > Kernel image:
> >  reserved[0x0]        [0x0000005f680000-0x0000005fe68fff], 0x7e9000 bytes flags: 0x0
> >
> > Virtual kernel memory layout:
> >     vmalloc : 0xffffff8000000000 - 0xffffffbdbfff0000   (   246 GB)
> >     vmemmap : 0xffffffbdc0000000 - 0xffffffbfc0000000   (     8 GB maximum)
> >               0xffffffbdc1000000 - 0xffffffbdc3000000   (    32 MB actual)
> >     fixed   : 0xffffffbff69fd000 - 0xffffffbff6c00000   (  2060 KB)
> >     PCI I/O : 0xffffffbff6e00000 - 0xffffffbff7e00000   (    16 MB)
> >     modules : 0xffffffbff8000000 - 0xffffffbffc000000   (    64 MB)
> >     memory  : 0xffffffc000000000 - 0xffffffc080000000   (  2048 MB)
> >       .init : 0xfeffffbffc783000 - 0xfeffffbffc7c6000   (   268 KB)
> >       .text : 0xfeffffbffc080000 - 0xfeffffbffc782344   (  7177 KB)
> >       .data : 0xfeffffbffc7ca000 - 0xfeffffbffc830c00   (   411 KB)
> >
> > Note that this time, no relocations were harmed in the making of these
> > patches. With added relocation support, it should be possible to move
> > the combined modules and kernel text anywhere in the vmalloc area (for
> > kaslr)
> >
> > There are probably some places where the cleared bit 56 in the virtual
> > address may cause trouble. Known problem is the module loader, but there
> > are surely others.
> >
> >
> > Ard Biesheuvel (4):
> >   arm64: use tagged pointers to distinguish kernel text from the linear
> >     mapping
> >   arm64: fixmap: move translation tables to dedicated region
> >   arm64: move kernel text below PAGE_OFFSET
> >   arm64: align PHYS_OFFSET to block size
> >
> >  arch/arm64/include/asm/linkage.h       |  2 ++
> >  arch/arm64/include/asm/memory.h        | 27 +++++++++++++++++--
> >  arch/arm64/include/asm/pgtable-hwdef.h |  1 +
> >  arch/arm64/kernel/head.S               | 48 +++++++++++++++++++++++++---------
> >  arch/arm64/kernel/vmlinux.lds.S        | 19 +++++++++-----
> >  arch/arm64/mm/mmu.c                    | 21 +++++++++------
> >  arch/arm64/mm/proc.S                   |  3 ++-
> >  7 files changed, 91 insertions(+), 30 deletions(-)
> >
> > --
> > 1.8.3.2
> >
> >


* [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region
  2015-03-26  1:28               ` Mark Rutland
@ 2015-03-26  6:20                 ` Ard Biesheuvel
  2015-03-30 14:34                   ` Mark Rutland
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-26  6:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 26 March 2015 at 02:28, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi Ard,
>
> On Mon, Mar 23, 2015 at 03:36:54PM +0000, Ard Biesheuvel wrote:
>> This moves the fixmap translation tables to a dedicated region
>> in the linker map. This is needed for a split kernel text from
>> the linear mapping, to ensure that the contents of the tables
>> rely on a single translation regime.
>
> I'm not sure what you mean by this. Could you elaborate?
>
> What problem would we have if we didn't have this section, and how does this
> section solve that?
>

The pgd manipulation code is riddled with va/pa and pa/va
translations, and uses both statically and dynamically allocated
pages. Untangling all of that is not so easy, and it is simpler just
to refer to those regions through the linear mapping exclusively.

> Regardless, I have some comments on the implementation below.
>
>> Also make the population of the translation levels conditional,
>> so that the kernel text can share some levels of translation if
>> needed.
>
> I guess you mean shared with the tables for the text mapping?
>
> Please word this to be explicit w.r.t. what is shared between whom, and when.
>

Yes. Currently, the address space is split down the middle, so fixmap
and linear always live in regions that are disjoint all the way up to
different root pgdir entries. Once that changes, the fixmap code needs
to be prepared for any of the levels it needs to populate having
already been populated.

>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>  arch/arm64/include/asm/linkage.h |  2 ++
>>  arch/arm64/kernel/vmlinux.lds.S  | 15 ++++++++++-----
>>  arch/arm64/mm/mmu.c              | 15 +++++++++------
>>  3 files changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
>> index 636c1bced7d4..dc9354de6f52 100644
>> --- a/arch/arm64/include/asm/linkage.h
>> +++ b/arch/arm64/include/asm/linkage.h
>> @@ -4,4 +4,6 @@
>>  #define __ALIGN              .align 4
>>  #define __ALIGN_STR  ".align 4"
>>
>> +#define __pgdir              __attribute__((__section__(".pgdir")))
>
> It would be nice for this to also provide page alignment, like
> __page_aligned_bss that this replaces uses of.
>

I thought it wasn't necessary, because we allocate a full page for the
highest level, but we should probably not rely on that.

>> +
>>  #endif
>> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
>> index 434ef407ef0f..d3885043f0b7 100644
>> --- a/arch/arm64/kernel/vmlinux.lds.S
>> +++ b/arch/arm64/kernel/vmlinux.lds.S
>> @@ -149,11 +149,16 @@ SECTIONS
>>
>>       BSS_SECTION(0, 0, 0)
>>
>> -     . = ALIGN(PAGE_SIZE);
>> -     idmap_pg_dir = .;
>> -     . += IDMAP_DIR_SIZE;
>> -     swapper_pg_dir = .;
>> -     . += SWAPPER_DIR_SIZE;
>> +     .pgdir : {
>> +             . = ALIGN(PAGE_SIZE);
>> +             __pgdir_start = .;
>> +             idmap_pg_dir = .;
>> +             . += IDMAP_DIR_SIZE;
>> +             *(.pgdir)
>> +             swapper_pg_dir = .;
>> +             . += SWAPPER_DIR_SIZE;
>> +             __pgdir_stop = .;
>
> Nit: __pgdir_end would be consistent with our other vmlinux.lds.S symbols.
>

Right

>> +     }
>>
>>       _end = .;
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 43496748e3d9..bb3ce41130f3 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -549,12 +549,12 @@ void vmemmap_free(unsigned long start, unsigned long end)
>>  }
>>  #endif       /* CONFIG_SPARSEMEM_VMEMMAP */
>>
>> -static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
>> +static pte_t bm_pte[PTRS_PER_PTE] __pgdir;
>>  #if CONFIG_ARM64_PGTABLE_LEVELS > 2
>> -static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss;
>> +static pmd_t bm_pmd[PTRS_PER_PMD] __pgdir;
>>  #endif
>>  #if CONFIG_ARM64_PGTABLE_LEVELS > 3
>> -static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss;
>> +static pud_t bm_pud[PTRS_PER_PUD] __pgdir;
>>  #endif
>
> These used to be zeroed (by the bss init code in head.S), but now they won't
> be, and as they live after all the initialized data they could contain garbage
> if we're unlucky. So I suspect this is broken.
>

Actually, they are zeroed in this case, as the init code zeroes from the
beginning of idmap to the end of swapper :-)
But that is something that should be made more explicit, I guess.

I was also wondering whether we should worry about doing the zeroing with
the caches off, which is not needed for the fixmap tables.
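
Making the zeroing explicit could look roughly like this (untested
sketch; the helper name is made up, and it would be called from
early_fixmap_init() before anything gets populated):

	static void __init early_fixmap_clear_tables(void)
	{
		memset(bm_pte, 0, sizeof(bm_pte));
	#if CONFIG_ARM64_PGTABLE_LEVELS > 2
		memset(bm_pmd, 0, sizeof(bm_pmd));
	#endif
	#if CONFIG_ARM64_PGTABLE_LEVELS > 3
		memset(bm_pud, 0, sizeof(bm_pud));
	#endif
	}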

> If we zero these before use then that's fine. We could define the swapper and
> idmap similarly in this C file for consistency (we zero those in head.S). That
> would bring all the page table allocations into the same file, and .pgdir could
> be simpler:
>
>         .pgdir {
>                 . = ALIGN(PAGE_SIZE);
>                 __pgdir_start = .;
>                 *(.pgdir)
>                 __pgdir_end = .;
>         }
>
>>
>>  static inline pud_t * fixmap_pud(unsigned long addr)
>> @@ -592,11 +592,14 @@ void __init early_fixmap_init(void)
>>       unsigned long addr = FIXADDR_START;
>>
>>       pgd = pgd_offset_k(addr);
>> -     pgd_populate(&init_mm, pgd, bm_pud);
>> +     if (pgd_none(*pgd))
>> +             pgd_populate(&init_mm, pgd, bm_pud);
>>       pud = pud_offset(pgd, addr);
>> -     pud_populate(&init_mm, pud, bm_pmd);
>> +     if (pud_none(*pud))
>> +             pud_populate(&init_mm, pud, bm_pmd);
>>       pmd = pmd_offset(pud, addr);
>> -     pmd_populate_kernel(&init_mm, pmd, bm_pte);
>> +     if (pmd_none(*pmd))
>> +             pmd_populate_kernel(&init_mm, pmd, bm_pte);
>
> It's a bit confusing to have this change in the same patch as the other
> changes, given that this shouldn't be required yet. It might make more sense
> for this to live in the patch moving the text mapping.
>


ok


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-25 14:59               ` Catalin Marinas
@ 2015-03-26  6:22                 ` Ard Biesheuvel
  2015-03-27 13:16                   ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-26  6:22 UTC (permalink / raw)
  To: linux-arm-kernel

On 25 March 2015 at 15:59, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>> index 16134608eecf..fd8434753372 100644
>> --- a/arch/arm64/kernel/head.S
>> +++ b/arch/arm64/kernel/head.S
>> @@ -49,13 +49,15 @@
>>  #ifdef CONFIG_ARM64_64K_PAGES
>>  #define BLOCK_SHIFT  PAGE_SHIFT
>>  #define BLOCK_SIZE   PAGE_SIZE
>> -#define TABLE_SHIFT  PMD_SHIFT
>>  #else
>>  #define BLOCK_SHIFT  SECTION_SHIFT
>>  #define BLOCK_SIZE   SECTION_SIZE
>> -#define TABLE_SHIFT  PUD_SHIFT
>>  #endif
>>
>> +#define TABLE_SHIFT  (BLOCK_SHIFT + PAGE_SHIFT - 3)
>> +#define TABLE_SIZE   (1 << TABLE_SHIFT)
>> +#define TABLE_MASK   (~(TABLE_SIZE - 1))
>> +
>>  #define KERNEL_START _text
>>  #define KERNEL_END   _end
>>
>> @@ -237,7 +239,10 @@ ENTRY(stext)
>>       bl      el2_setup                       // Drop to EL1, w20=cpu_boot_mode
>>
>>       adrp    x24, __PHYS_OFFSET
>> -     mov     x23, #KIMAGE_OFFSET
>> +     and     x23, x24, #~TABLE_MASK          // image offset
>> +     and     x24, x24, #TABLE_MASK           // PHYS_OFFSET
>> +     mov     x0, #KIMAGE_OFFSET
>> +     add     x23, x23, x0
>
> I'm still trying to figure out how this works. Does the code imply that
> the kernel image can only be loaded within a block size of the
> PHYS_OFFSET? If that's the case, it's not too flexible.
>

For now, yes.

> My thoughts were to make PHYS_OFFSET permanently 0 (i.e. get rid of it
> entirely) or at least compute it at DT parsing time. I'm more inclined
> for making it 0 assuming that it doesn't break anything else (vmemmap
> virtual range may get slightly bigger but still not significant,
> basically max_phys_addr / sizeof(struct page)).
>

Making it zero would be an improvement, I suppose


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-25 14:14               ` Catalin Marinas
@ 2015-03-26  6:23                 ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-26  6:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 25 March 2015 at 15:14, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
>> This aligns PHYS_OFFSET down to an alignment that allows system
>> RAM to be mapped using the largest blocks available, i.e., 1 GB
>> blocks on 4 KB pages or 512 MB blocks on 64 KB pages.
>
> Does this assume that the platform RAM starts at such block aligned
> addresses? If not, is the unaligned range ignored?
>

No, it doesn't. The unaligned range is not mapped by the early code
(which only maps from the start of the Image).
Later on, it just maps whatever is in the DT memory nodes.


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-26  6:22                 ` Ard Biesheuvel
@ 2015-03-27 13:16                   ` Ard Biesheuvel
  2015-03-30 13:49                     ` Catalin Marinas
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-27 13:16 UTC (permalink / raw)
  To: linux-arm-kernel

On 26 March 2015 at 07:22, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 25 March 2015 at 15:59, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
>>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>>> index 16134608eecf..fd8434753372 100644
>>> --- a/arch/arm64/kernel/head.S
>>> +++ b/arch/arm64/kernel/head.S
>>> @@ -49,13 +49,15 @@
>>>  #ifdef CONFIG_ARM64_64K_PAGES
>>>  #define BLOCK_SHIFT  PAGE_SHIFT
>>>  #define BLOCK_SIZE   PAGE_SIZE
>>> -#define TABLE_SHIFT  PMD_SHIFT
>>>  #else
>>>  #define BLOCK_SHIFT  SECTION_SHIFT
>>>  #define BLOCK_SIZE   SECTION_SIZE
>>> -#define TABLE_SHIFT  PUD_SHIFT
>>>  #endif
>>>
>>> +#define TABLE_SHIFT  (BLOCK_SHIFT + PAGE_SHIFT - 3)
>>> +#define TABLE_SIZE   (1 << TABLE_SHIFT)
>>> +#define TABLE_MASK   (~(TABLE_SIZE - 1))
>>> +
>>>  #define KERNEL_START _text
>>>  #define KERNEL_END   _end
>>>
>>> @@ -237,7 +239,10 @@ ENTRY(stext)
>>>       bl      el2_setup                       // Drop to EL1, w20=cpu_boot_mode
>>>
>>>       adrp    x24, __PHYS_OFFSET
>>> -     mov     x23, #KIMAGE_OFFSET
>>> +     and     x23, x24, #~TABLE_MASK          // image offset
>>> +     and     x24, x24, #TABLE_MASK           // PHYS_OFFSET
>>> +     mov     x0, #KIMAGE_OFFSET
>>> +     add     x23, x23, x0
>>
>> I'm still trying to figure out how this works. Does the code imply that
>> the kernel image can only be loaded within a block size of the
>> PHYS_OFFSET? If that's the case, it's not too flexible.
>>
>
> For now, yes.
>
>> My thoughts were to make PHYS_OFFSET permanently 0 (i.e. get rid of it
>> entirely) or at least compute it at DT parsing time. I'm more inclined
>> for making it 0 assuming that it doesn't break anything else (vmemmap
>> virtual range may get slightly bigger but still not significant,
>> basically max_phys_addr / sizeof(struct page)).
>>
>
> Making it zero would be an improvement, I suppose

Actually, wouldn't that reintroduce a similar VA range problem to the
one I fixed the other day?

On Seattle, with its DRAM at 0x80_0000_0000, you wouldn't have enough
space after PAGE_OFFSET.
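
For illustration, with the 39-bit VA layout from the dumps earlier in
this thread:

	PAGE_OFFSET        = 0xffffffc000000000  ->  256 GB linear window
	Seattle DRAM base  = 0x0000008000000000  ->  512 GB physical

so with PHYS_OFFSET == 0, __va() of that RAM would land well past the
top of the address space.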


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-27 13:16                   ` Ard Biesheuvel
@ 2015-03-30 13:49                     ` Catalin Marinas
  2015-03-30 14:00                       ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Catalin Marinas @ 2015-03-30 13:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 27, 2015 at 02:16:19PM +0100, Ard Biesheuvel wrote:
> On 26 March 2015 at 07:22, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > On 25 March 2015 at 15:59, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
> >>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> >>> index 16134608eecf..fd8434753372 100644
> >>> --- a/arch/arm64/kernel/head.S
> >>> +++ b/arch/arm64/kernel/head.S
> >>> @@ -49,13 +49,15 @@
> >>>  #ifdef CONFIG_ARM64_64K_PAGES
> >>>  #define BLOCK_SHIFT  PAGE_SHIFT
> >>>  #define BLOCK_SIZE   PAGE_SIZE
> >>> -#define TABLE_SHIFT  PMD_SHIFT
> >>>  #else
> >>>  #define BLOCK_SHIFT  SECTION_SHIFT
> >>>  #define BLOCK_SIZE   SECTION_SIZE
> >>> -#define TABLE_SHIFT  PUD_SHIFT
> >>>  #endif
> >>>
> >>> +#define TABLE_SHIFT  (BLOCK_SHIFT + PAGE_SHIFT - 3)
> >>> +#define TABLE_SIZE   (1 << TABLE_SHIFT)
> >>> +#define TABLE_MASK   (~(TABLE_SIZE - 1))
> >>> +
> >>>  #define KERNEL_START _text
> >>>  #define KERNEL_END   _end
> >>>
> >>> @@ -237,7 +239,10 @@ ENTRY(stext)
> >>>       bl      el2_setup                       // Drop to EL1, w20=cpu_boot_mode
> >>>
> >>>       adrp    x24, __PHYS_OFFSET
> >>> -     mov     x23, #KIMAGE_OFFSET
> >>> +     and     x23, x24, #~TABLE_MASK          // image offset
> >>> +     and     x24, x24, #TABLE_MASK           // PHYS_OFFSET
> >>> +     mov     x0, #KIMAGE_OFFSET
> >>> +     add     x23, x23, x0
> >>
> >> I'm still trying to figure out how this works. Does the code imply that
> >> the kernel image can only be loaded within a block size of the
> >> PHYS_OFFSET? If that's the case, it's not too flexible.
> >
> > For now, yes.

Can we defer the setting of PHYS_OFFSET until we parse the DT memory
nodes?

> >> My thoughts were to make PHYS_OFFSET permanently 0 (i.e. get rid of it
> >> entirely) or at least compute it at DT parsing time. I'm more inclined
> >> for making it 0 assuming that it doesn't break anything else (vmemmap
> >> virtual range may get slightly bigger but still not significant,
> >> basically max_phys_addr / sizeof(struct page)).
> >
> > Making it zero would be an improvement, I suppose
> 
> Actually, wouldn't that reintroduce a similar VA range problem to the
> one I fixed the other day?
> 
> On Seattle, with its DRAM at 0x80_0000_0000, you wouldn't have enough
> space after PAGE_OFFSET

Ah, yes. When I thought about this PHYS_OFFSET == 0 in the past, we
didn't have any patches and the VA range had to be sufficiently large.

-- 
Catalin


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-30 13:49                     ` Catalin Marinas
@ 2015-03-30 14:00                       ` Ard Biesheuvel
  2015-03-30 14:55                         ` Mark Rutland
  2015-03-30 15:00                         ` Catalin Marinas
  0 siblings, 2 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-30 14:00 UTC (permalink / raw)
  To: linux-arm-kernel

On 30 March 2015 at 15:49, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Mar 27, 2015 at 02:16:19PM +0100, Ard Biesheuvel wrote:
>> On 26 March 2015 at 07:22, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> > On 25 March 2015 at 15:59, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> >> On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
>> >>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>> >>> index 16134608eecf..fd8434753372 100644
>> >>> --- a/arch/arm64/kernel/head.S
>> >>> +++ b/arch/arm64/kernel/head.S
>> >>> @@ -49,13 +49,15 @@
>> >>>  #ifdef CONFIG_ARM64_64K_PAGES
>> >>>  #define BLOCK_SHIFT  PAGE_SHIFT
>> >>>  #define BLOCK_SIZE   PAGE_SIZE
>> >>> -#define TABLE_SHIFT  PMD_SHIFT
>> >>>  #else
>> >>>  #define BLOCK_SHIFT  SECTION_SHIFT
>> >>>  #define BLOCK_SIZE   SECTION_SIZE
>> >>> -#define TABLE_SHIFT  PUD_SHIFT
>> >>>  #endif
>> >>>
>> >>> +#define TABLE_SHIFT  (BLOCK_SHIFT + PAGE_SHIFT - 3)
>> >>> +#define TABLE_SIZE   (1 << TABLE_SHIFT)
>> >>> +#define TABLE_MASK   (~(TABLE_SIZE - 1))
>> >>> +
>> >>>  #define KERNEL_START _text
>> >>>  #define KERNEL_END   _end
>> >>>
>> >>> @@ -237,7 +239,10 @@ ENTRY(stext)
>> >>>       bl      el2_setup                       // Drop to EL1, w20=cpu_boot_mode
>> >>>
>> >>>       adrp    x24, __PHYS_OFFSET
>> >>> -     mov     x23, #KIMAGE_OFFSET
>> >>> +     and     x23, x24, #~TABLE_MASK          // image offset
>> >>> +     and     x24, x24, #TABLE_MASK           // PHYS_OFFSET
>> >>> +     mov     x0, #KIMAGE_OFFSET
>> >>> +     add     x23, x23, x0
>> >>
>> >> I'm still trying to figure out how this works. Does the code imply that
>> >> the kernel image can only be loaded within a block size of the
>> >> PHYS_OFFSET? If that's the case, it's not too flexible.
>> >
>> > For now, yes.
>
> Can we defer the setting of PHYS_OFFSET until we parse the DT memory
> nodes?
>

I experimented a bit with that, but it is quite hairy. Any
manipulation of the page tables goes through __va/__pa, so you need a
valid PHYS_OFFSET there to ensure they point at the right physical
region. But PHYS_OFFSET also needs to be small enough for the DT
parsing code not to disregard regions that are below it. And then
there is the memblock limit to ensure that early dynamically allocated
page tables come from a region that is already mapped.

I think it may be doable, but it would require some significant
hacking, e.g., call early_init_scan_dt() at its physical address with
only the ID map loaded and the MMU and caches on, and only after that
start populating the virtual address space. Or at least only populate
the lower half, i.e., mappings below PAGE_OFFSET for the kernel and
the FDT

>> >> My thoughts were to make PHYS_OFFSET permanently 0 (i.e. get rid of it
>> >> entirely) or at least compute it at DT parsing time. I'm more inclined
>> >> for making it 0 assuming that it doesn't break anything else (vmemmap
>> >> virtual range may get slightly bigger but still not significant,
>> >> basically max_phys_addr / sizeof(struct page)).
>> >
>> > Making it zero would be an improvement, I suppose
>>
>> Actually, wouldn't that reintroduce a similar VA range problem to the
>> one I fixed the other day?
>>
>> On Seattle, with its DRAM at 0x80_0000_0000, you wouldn't have enough
>> space after PAGE_OFFSET
>
> Ah, yes. When I thought about this PHYS_OFFSET == 0 in the past, we
> didn't have any patches and the VA range had to be sufficiently large.
>
> --
> Catalin


* [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region
  2015-03-26  6:20                 ` Ard Biesheuvel
@ 2015-03-30 14:34                   ` Mark Rutland
  0 siblings, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2015-03-30 14:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 26, 2015 at 06:20:38AM +0000, Ard Biesheuvel wrote:
> On 26 March 2015 at 02:28, Mark Rutland <mark.rutland@arm.com> wrote:
> > Hi Ard,
> >
> > On Mon, Mar 23, 2015 at 03:36:54PM +0000, Ard Biesheuvel wrote:
> >> This moves the fixmap translation tables to a dedicated region
> >> in the linker map. This is needed for a split kernel text from
> >> the linear mapping, to ensure that the contents of the tables
> >> rely on a single translation regime.
> >
> > I'm not sure what you mean by this. Could you elaborate?
> >
> > What problem would we have if we didn't have this section, and how does this
> > section solve that?
> >
> 
> The pgd manipulation code is riddled with va/pa and pa/va
> translations, and uses both statically and dynamically allocated
> pages. Untangling all of that is not so easy, and it is simpler just
> to refer to those regions through the linear mapping exclusively.

I'm missing the leap as to how the .pgdir section helps with that,
unfortunately ;)

> > Regardless, I have some comments on the implementation below.
> >
> >> Also make the population of the translation levels conditional,
> >> so that the kernel text can share some levels of translation if
> >> needed.
> >
> > I guess you mean shared with the tables for the text mapping?
> >
> > Please word this to be explicit w.r.t. what is shared between whom, and when.
> >
> 
> Yes. Currently, the address space is split down the middle, so fixmap
> and linear always live in regions that are disjoint all the way up to
> different root pgdir entries. Once that changes, the fixmap code needs
> to be prepared for any of the levels it needs to populate having
> already been populated.

Ok.

> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >> ---
> >>  arch/arm64/include/asm/linkage.h |  2 ++
> >>  arch/arm64/kernel/vmlinux.lds.S  | 15 ++++++++++-----
> >>  arch/arm64/mm/mmu.c              | 15 +++++++++------
> >>  3 files changed, 21 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> >> index 636c1bced7d4..dc9354de6f52 100644
> >> --- a/arch/arm64/include/asm/linkage.h
> >> +++ b/arch/arm64/include/asm/linkage.h
> >> @@ -4,4 +4,6 @@
> >>  #define __ALIGN              .align 4
> >>  #define __ALIGN_STR  ".align 4"
> >>
> >> +#define __pgdir              __attribute__((__section__(".pgdir")))
> >
> > It would be nice for this to also provide page alignment, like
> > __page_aligned_bss that this replaces uses of.
> >
> 
> I thought it wasn't necessary, because we allocate a full page for the
> highest level, but we should probably not rely on that.

While it shouldn't be necessary I'd feel more comfortable knowing that
the alignment is definitely applied to each object.

[...]

> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index 43496748e3d9..bb3ce41130f3 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -549,12 +549,12 @@ void vmemmap_free(unsigned long start, unsigned long end)
> >>  }
> >>  #endif       /* CONFIG_SPARSEMEM_VMEMMAP */
> >>
> >> -static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
> >> +static pte_t bm_pte[PTRS_PER_PTE] __pgdir;
> >>  #if CONFIG_ARM64_PGTABLE_LEVELS > 2
> >> -static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >> +static pmd_t bm_pmd[PTRS_PER_PMD] __pgdir;
> >>  #endif
> >>  #if CONFIG_ARM64_PGTABLE_LEVELS > 3
> >> -static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss;
> >> +static pud_t bm_pud[PTRS_PER_PUD] __pgdir;
> >>  #endif
> >
> > These used to be zeroed (by the bss init code in head.S), but now they won't
> > be, and as they live after all the initialized data they could contain garbage
> > if we're unlucky. So I suspect this is broken.
> >
> 
> Actually, they are in this case, as the init code zeroes from the
> beginning of idmap to the end of swapper :-)
> But that is something that should be more explicit I guess
> 
> I was wondering if we should worry about doing the zeroing with the
> caches off, which is not needed for fixmap

It's unfortunate to do more than we have to while the caches are off,
so I guess we need to be able to determine the region(s) corresponding
to the idmap + swapper, independent of the other page tables.
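
One way to keep that distinction would be to keep the idmap + swapper
grouped at the start of the section with their own end marker, e.g.
(sketch only; __idmap_swapper_end is a made-up symbol):

	.pgdir : {
		. = ALIGN(PAGE_SIZE);
		__pgdir_start = .;
		idmap_pg_dir = .;
		. += IDMAP_DIR_SIZE;
		swapper_pg_dir = .;
		. += SWAPPER_DIR_SIZE;
		__idmap_swapper_end = .;	/* zeroed/flushed with caches off */
		*(.pgdir)			/* e.g. bm_pud/bm_pmd/bm_pte, zeroed later */
		__pgdir_end = .;
	}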

> > If we zero these before use then that's fine. We could define the swapper and
> > idmap similarly in this C file for consistency (we zero those in head.S). That
> > would bring all the page table allocations into the same file, and .pgdir could
> > be simpler:
> >
> >         .pgdir {
> >                 . = ALIGN(PAGE_SIZE);
> >                 __pgdir_start = .;
> >                 *(.pgdir)
> >                 __pgdir_end = .;
> >         }

Given that, the simplified .pgdir suggestion wouldn't work unless we
modified head.S to handle the idmap and swapper independently (rather
than treating them as a single entity for cache maintenance and
zeroing).

Mark.


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-30 14:00                       ` Ard Biesheuvel
@ 2015-03-30 14:55                         ` Mark Rutland
  2015-03-30 15:00                         ` Catalin Marinas
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Rutland @ 2015-03-30 14:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 30, 2015 at 03:00:31PM +0100, Ard Biesheuvel wrote:
> On 30 March 2015 at 15:49, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, Mar 27, 2015 at 02:16:19PM +0100, Ard Biesheuvel wrote:
> >> On 26 March 2015 at 07:22, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> > On 25 March 2015 at 15:59, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> >> On Mon, Mar 23, 2015 at 04:36:56PM +0100, Ard Biesheuvel wrote:
> >> >>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> >> >>> index 16134608eecf..fd8434753372 100644
> >> >>> --- a/arch/arm64/kernel/head.S
> >> >>> +++ b/arch/arm64/kernel/head.S
> >> >>> @@ -49,13 +49,15 @@
> >> >>>  #ifdef CONFIG_ARM64_64K_PAGES
> >> >>>  #define BLOCK_SHIFT  PAGE_SHIFT
> >> >>>  #define BLOCK_SIZE   PAGE_SIZE
> >> >>> -#define TABLE_SHIFT  PMD_SHIFT
> >> >>>  #else
> >> >>>  #define BLOCK_SHIFT  SECTION_SHIFT
> >> >>>  #define BLOCK_SIZE   SECTION_SIZE
> >> >>> -#define TABLE_SHIFT  PUD_SHIFT
> >> >>>  #endif
> >> >>>
> >> >>> +#define TABLE_SHIFT  (BLOCK_SHIFT + PAGE_SHIFT - 3)
> >> >>> +#define TABLE_SIZE   (1 << TABLE_SHIFT)
> >> >>> +#define TABLE_MASK   (~(TABLE_SIZE - 1))
> >> >>> +
> >> >>>  #define KERNEL_START _text
> >> >>>  #define KERNEL_END   _end
> >> >>>
> >> >>> @@ -237,7 +239,10 @@ ENTRY(stext)
> >> >>>       bl      el2_setup                       // Drop to EL1, w20=cpu_boot_mode
> >> >>>
> >> >>>       adrp    x24, __PHYS_OFFSET
> >> >>> -     mov     x23, #KIMAGE_OFFSET
> >> >>> +     and     x23, x24, #~TABLE_MASK          // image offset
> >> >>> +     and     x24, x24, #TABLE_MASK           // PHYS_OFFSET
> >> >>> +     mov     x0, #KIMAGE_OFFSET
> >> >>> +     add     x23, x23, x0
> >> >>
> >> >> I'm still trying to figure out how this works. Does the code imply that
> >> >> the kernel image can only be loaded within a block size of the
> >> >> PHYS_OFFSET? If that's the case, it's not too flexible.
> >> >
> >> > For now, yes.
> >
> > Can we defer the setting of PHYS_OFFSET until we parse the DT memory
> > nodes?
> >
> 
> I experimented a bit with that, but it is quite hairy. Any
> manipulation of the page tables goes through __va/__pa, so you need a
> valid PHYS_OFFSET there to ensure they point at the right physical
> region. But PHYS_OFFSET also needs to be small enough for the DT
> parsing code not to disregard regions that are below it. And then
> there is the memblock limit to ensure that early dynamically allocated
> page tables come from a region that is already mapped.

Could we not use fixmaps to bootstrap the page tables to a state where
we can address them via the linear mapping? That would allow us to
forget about the memblock limit.

If we can do that, then we won't care about PHYS_OFFSET until later, and
we should be able to leave it zero for parsing the DT, then move it up
if necessary (dropping memory we can't address) once we know where
memory is.

This is of course a vast simplification of the work involved ;)

> I think it may be doable, but it would require some significant
> hacking, e.g., call early_init_scan_dt() at its physical address with
> only the ID map loaded and the MMU and caches on, and only after that
> start populating the virtual address space. Or at least only populate
> the lower half, i.e., mappings below PAGE_OFFSET for the kernel and
> the FDT

I thought we were going to use the fixmap for the DT, which would avoid
messiness there, no?

Mark.


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-30 14:00                       ` Ard Biesheuvel
  2015-03-30 14:55                         ` Mark Rutland
@ 2015-03-30 15:00                         ` Catalin Marinas
  2015-03-30 18:08                           ` Ard Biesheuvel
  1 sibling, 1 reply; 42+ messages in thread
From: Catalin Marinas @ 2015-03-30 15:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 30, 2015 at 04:00:31PM +0200, Ard Biesheuvel wrote:
> On 30 March 2015 at 15:49, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Can we defer the setting of PHYS_OFFSET until we parse the DT memory
> > nodes?
> 
> I experimented a bit with that, but it is quite hairy. Any
> manipulation of the page tables goes through __va/__pa, so you need a
> valid PHYS_OFFSET there to ensure they point at the right physical
> region.

Yes, so we need to set PHYS_OFFSET before we start using __va/__pa.

> But PHYS_OFFSET also needs to be small enough for the DT
> parsing code not to disregard regions that are below it.

With PHYS_OFFSET as 0 initially, no regions would be dropped. But we
could write a simplified early_init_dt_add_memory_arch() which avoids
the PHYS_OFFSET check while setting the actual PHYS_OFFSET to the
minimal address detected. I can see the generic function only uses
__pa() to get the PHYS_OFFSET.
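
Roughly something like this (untested sketch; min_mem_base is just an
illustrative variable that we would derive PHYS_OFFSET from once all
the memory nodes have been seen):

	static phys_addr_t min_mem_base __initdata = ~(phys_addr_t)0;

	void __init early_init_dt_add_memory_arch(u64 base, u64 size)
	{
		/*
		 * Don't drop regions below the current PHYS_OFFSET, just
		 * remember the lowest base we have seen.
		 */
		if (base < min_mem_base)
			min_mem_base = base;
		memblock_add(base, size);
	}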

> And then
> there is the memblock limit to ensure that early dynamically allocated
> page tables come from a region that is already mapped.

By the time we start using memblock allocations, we have a PHYS_OFFSET
set. The early DT parsing does not require any memory allocations AFAIK.
We need to make sure that setup_machine_fdt creates a VA mapping of the
DT and does not require the __va() macro (I thought I've seen some
patches to use fixmap for DT).

> I think it may be doable, but it would require some significant
> hacking, e.g., call early_init_scan_dt() at its physical address with
> only the ID map loaded and the MMU and caches on, and only after that
> start populating the virtual address space. Or at least only populate
> the lower half, i.e., mappings below PAGE_OFFSET for the kernel and
> the FDT

With some form of your patches, we already decouple the PAGE_OFFSET from
the kernel text mapping. We map the latter at some very high
KERNEL_PAGE_OFFSET, the DT via fixmap (which is part of the kernel data
section, so mapped at the high KERNEL_PAGE_OFFSET). Once we start
calling early_init_dt_add_memory_arch(), we set the real PHYS_OFFSET and
are free to use __pa/__va after setup_machine_fdt(). The actual linear
mappings will be created later via paging_init().

-- 
Catalin


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-30 15:00                         ` Catalin Marinas
@ 2015-03-30 18:08                           ` Ard Biesheuvel
  2015-03-31 14:49                             ` Catalin Marinas
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2015-03-30 18:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 30 March 2015 at 17:00, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Mon, Mar 30, 2015 at 04:00:31PM +0200, Ard Biesheuvel wrote:
>> On 30 March 2015 at 15:49, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > Can we defer the setting of PHYS_OFFSET until we parse the DT memory
>> > nodes?
>>
>> I experimented a bit with that, but it is quite hairy. Any
>> manipulation of the page tables goes through __va/__pa, so you need a
>> valid PHYS_OFFSET there to ensure they point at the right physical
>> region.
>
> Yes, so we need to set PHYS_OFFSET before we start using __va/__pa.
>
>> But PHYS_OFFSET also needs to be small enough for the DT
>> parsing code not to disregard regions that are below it.
>
> With PHYS_OFFSET as 0 initially, no regions would be dropped. But we
> could write a simplified early_init_dt_add_memory_arch() which avoids
> the PHYS_OFFSET check while setting the actual PHYS_OFFSET to the
> minimal address detected. I can see the generic function only uses
> __pa() to get the PHYS_OFFSET.
>

Excellent idea.

>> And then
>> there is the memblock limit to ensure that early dynamically allocated
>> page tables come from a region that is already mapped.
>
> By the time we start using memblock allocations, we have a PHYS_OFFSET
> set. The early DT parsing does not require any memory allocations AFAIK.
> We need to make sure that setup_machine_fdt creates a VA mapping of the
> DT and does not require the __va() macro (I thought I've seen some
> patches to use fixmap for DT).
>

Yes, so did I :-)

But using early_fixmap() implies using the ordinary page table
manipulation code, which uses __pa/__va.
So instead, I should refactor those patches to simply pick a VA offset
and map the FDT there from head.S.

>> I think it may be doable, but it would require some significant
>> hacking, e.g., call early_init_scan_dt() at its physical address with
>> only the ID map loaded and the MMU and caches on, and only after that
>> start populating the virtual address space. Or at least only populate
>> the lower half, i.e., mappings below PAGE_OFFSET for the kernel and
>> the FDT
>
> With some form of your patches, we already decouple the PAGE_OFFSET from
> the kernel text mapping. We map the latter at some very high
> KERNEL_PAGE_OFFSET, the DT via fixmap (which is part of the kernel data
> section, so mapped at the high KERNEL_PAGE_OFFSET). Once we start
> calling early_init_dt_add_memory_arch(), we set the real PAGE_OFFSET and
> are free to use __pa/__va after setup_machine_fdt(). The actual linear
> mappings will be created later via paging_init().
>

OK, that sounds like it could work.

The only thing to note (and this should answer Mark's question in
the other thread) is that the pgdir region should be mapped via the
linear mapping as well.
The alternative is to make __va() check whether the input physical
address falls within the pgdir region, and subtract the PAGE_OFFSET -
KERNEL_PAGE_OFFSET from the return value, but this is less trivial
than the __pa() counterpart which only needs to check a single bit.

So I propose we define KERNEL_PAGE_OFFSET as (PAGE_OFFSET -
SZ_64M) and FDT_PAGE_OFFSET (or whatever you want to call it) as
(PAGE_OFFSET - SZ_2M).
That way, the kernel, fdt and module space always share the same 128
MB window, which we can move around in the vmalloc space later if we
want to implement kASLR.
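
In other words, something along the lines of:

	#define KERNEL_PAGE_OFFSET	(PAGE_OFFSET - SZ_64M)
	#define FDT_PAGE_OFFSET		(PAGE_OFFSET - SZ_2M)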


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-30 18:08                           ` Ard Biesheuvel
@ 2015-03-31 14:49                             ` Catalin Marinas
  2015-03-31 16:19                               ` Catalin Marinas
  0 siblings, 1 reply; 42+ messages in thread
From: Catalin Marinas @ 2015-03-31 14:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 30, 2015 at 08:08:52PM +0200, Ard Biesheuvel wrote:
> On 30 March 2015 at 17:00, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Mar 30, 2015 at 04:00:31PM +0200, Ard Biesheuvel wrote:
> >> On 30 March 2015 at 15:49, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > Can we defer the setting of PHYS_OFFSET until we parse the DT memory
> >> > nodes?
> >>
> >> I experimented a bit with that, but it is quite hairy. Any
> >> manipulation of the page tables goes through __va/__pa, so you need a
> >> valid PHYS_OFFSET there to ensure they point at the right physical
> >> region.
> >
> > Yes, so we need to set PHYS_OFFSET before we start using __va/__pa.
> >
> >> But PHYS_OFFSET also needs to be small enough for the DT
> >> parsing code not to disregard regions that are below it.
> >
> > With PHYS_OFFSET as 0 initially, no regions would be dropped. But we
> > could write a simplified early_init_dt_add_memory_arch() which avoids
> > the PHYS_OFFSET check while setting the actual PHYS_OFFSET to the
> > minimal address detected. I can see the generic function only uses
> > __pa() to get the PHYS_OFFSET.
> >
> 
> Excellent idea.
> 
> >> And then
> >> there is the memblock limit to ensure that early dynamically allocated
> >> page tables come from a region that is already mapped.
> >
> > By the time we start using memblock allocations, we have a PHYS_OFFSET
> > set. The early DT parsing does not require any memory allocations AFAIK.
> > We need to make sure that setup_machine_fdt creates a VA mapping of the
> > DT and does not require the __va() macro (I thought I've seen some
> > patches to use fixmap for DT).
> 
> Yes, so did I :-)
> 
> But using early_fixmap() implies using the ordinary page tables
> manipulation code, which uses __pa/__va
> So instead, I should refactor those patches to simply pick a VA offset
> and map the FDT there from head.S

I haven't dug out those patches yet but in principle you should not care
about __va, just __pa for populating the pmd/pud/pgd. Since with fixmap
we pre-allocate the page tables in the kernel data section
(bm_pud/pmd/pte), a __pa implementation that takes KERNEL_PAGE_OFFSET
into account (as per these patches) is enough; you don't really care
about the linear PAGE_OFFSET at this stage, since you would not provide
__pa with such a virtual address until after setup_machine_fdt().

> >> I think it may be doable, but it would require some significant
> >> hacking, e.g., call early_init_scan_dt() at its physical address with
> >> only the ID map loaded and the MMU and caches on, and only after that
> >> start populating the virtual address space. Or at least only populate
> >> the lower half, i.e., mappings below PAGE_OFFSET for the kernel and
> >> the FDT
> >
> > With some form of your patches, we already decouple the PAGE_OFFSET from
> > the kernel text mapping. We map the latter at some very high
> > KERNEL_PAGE_OFFSET, the DT via fixmap (which is part of the kernel data
> > section, so mapped at the high KERNEL_PAGE_OFFSET). Once we start
> > calling early_init_dt_add_memory_arch(), we set the real PAGE_OFFSET and
> > are free to use __pa/__va after setup_machine_fdt(). The actual linear
> > mappings will be created later via paging_init().
> 
> OK, that sounds like it could work.
> 
> The only thing to note (and this should answer Mark's question in
> the other thread) is that the pgdir region should be mapped via the
> linear mapping as well.

I'm not sure I followed Mark's comments fully, but why can't we just
access swapper_pg_dir only via the KERNEL_PAGE_OFFSET mapping? Such a
kernel text/data mapping is not going away. There are some weird things
at some point, as functions like pmd_page_vaddr() would return an
address in the linear mapping. I don't think anything would break, but I
cannot guarantee it (I don't think such vaddr functions would be called
before we have the memory mapped and PHYS_OFFSET set anyway).

> The alternative is to make __va() check whether the input physical
> address falls within the pgdir region, and subtract the PAGE_OFFSET -
> KERNEL_PAGE_OFFSET from the return value, but this is less trivial
> than the __pa() counterpart which only needs to check a single bit.

It looks ugly.

> So I propose we define KERNEL_PAGE_OFFSET as (PAGE_OFFSET -
> SZ_64M) and FDT_PAGE_OFFSET (or whatever you want to call it) as
> (PAGE_OFFSET - SZ_2M).
> That way, the kernel, fdt and module space always share the same 128
> MB window, which we can move around in the vmalloc space later if we
> want to implement kASLR.

The only downside is that to make __pa only a bit test requires
PAGE_OFFSET to be always aligned to a power of two. Right now we kind of
use (waste) half of the VA space on vmalloc + I/O + modules but there
are no restrictions on PAGE_OFFSET placement. If we decouple the
KERNEL_PAGE_OFFSET from PAGE_OFFSET, we could make PAGE_OFFSET start at
the beginning of the VA space and the kernel at the top 64MB (+ 64MB
below it for modules). The vmalloc + I/O regions can live somewhere in between,
but this way we allow the PAGE_OFFSET to grow beyond half the VA space
easily and we still have a single bit check in __pa (only that it's a
higher bit).
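
Just to illustrate the single-bit check (the names here are made up,
not taken from any existing patch):

	/*
	 * With the split point at a power-of-two boundary, one VA bit is
	 * enough to tell which mapping a kernel virtual address belongs to.
	 */
	#define __is_linear_addr(addr)	(!!((u64)(addr) & BIT(LINEAR_SPLIT_BIT)))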

-- 
Catalin


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-31 14:49                             ` Catalin Marinas
@ 2015-03-31 16:19                               ` Catalin Marinas
  2015-03-31 16:46                                 ` Catalin Marinas
  0 siblings, 1 reply; 42+ messages in thread
From: Catalin Marinas @ 2015-03-31 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 31, 2015 at 03:49:32PM +0100, Catalin Marinas wrote:
> On Mon, Mar 30, 2015 at 08:08:52PM +0200, Ard Biesheuvel wrote:
> > But using early_fixmap() implies using the ordinary page tables
> > manipulation code, which uses __pa/__va
> > So instead, I should refactor those patches to simply pick a VA offset
> > and map the FDT there from head.S
> 
> I haven't dug out those patches yet but in principle you should not care
> about __va, just __pa for populating the pmd/pud/pgd. Since with
> fixmap we pre-allocate the page tables in the kernel data section
> (bm_pud/pmd/pte), a __pa implementation that takes KERNEL_PAGE_OFFSET
> into account (as per these patches) is enough, you don't really care
> about the linear PAGE_OFFSET at this stage since you would not provide
> __pa with such virtual address until after setup_machine_fdt().

Actually, that's wrong: early_fixmap_init() uses pud_offset/pmd_offset,
which in turn use pud_page_vaddr()/pmd_page_vaddr() etc. These would not
be mapped in the linear mapping at this stage. create_mapping() also uses
the *_offset() functions, so we have the same issue.

So I think we still need to calculate the default initial PHYS_OFFSET in
head.S as we currently do, based on the kernel load address so that
__va() returns a mapping in the initial KERNEL_PAGE_OFFSET range. During
our own early_init_dt_add_memory_arch(), we calculate the real
PHYS_OFFSET but don't set it yet. We should defer the setting of the
real PHYS_OFFSET until we map the first block of memory via
create_mapping(). We know that we can't cope with allocations before the
first block anyway, so this approach may work.

-- 
Catalin


* [PATCH 4/4] arm64: align PHYS_OFFSET to block size
  2015-03-31 16:19                               ` Catalin Marinas
@ 2015-03-31 16:46                                 ` Catalin Marinas
  0 siblings, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2015-03-31 16:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 31, 2015 at 05:19:03PM +0100, Catalin Marinas wrote:
> On Tue, Mar 31, 2015 at 03:49:32PM +0100, Catalin Marinas wrote:
> > On Mon, Mar 30, 2015 at 08:08:52PM +0200, Ard Biesheuvel wrote:
> > > But using early_fixmap() implies using the ordinary page tables
> > > manipulation code, which uses __pa/__va
> > > So instead, I should refactor those patches to simply pick a VA offset
> > > and map the FDT there from head.S
> > 
> > I haven't dug out those patches yet but in principle you should not care
> > about __va, just __pa for populating the pmd/pud/pgd. Since with
> > fixmap we pre-allocate the page tables in the kernel data section
> > (bm_pud/pmd/pte), a __pa implementation that takes KERNEL_PAGE_OFFSET
> > into account (as per these patches) is enough, you don't really care
> > about the linear PAGE_OFFSET at this stage since you would not provide
> > __pa with such virtual address until after setup_machine_fdt().
> 
> Actually, that's wrong, early_fixmap_init() uses pud/pmd_offset which in
> turn use the pmd_offset etc. and pmd_page_vaddr. These would not be
> mapped in the linear mapping at this stage. create_mapping() also uses
> the *_offset() functions, so we have the same issue.
> 
> So I think we still need to calculate the default initial PHYS_OFFSET in
> head.S as we currently do, based on the kernel load address so that
> __va() returns a mapping in the initial KERNEL_PAGE_OFFSET range. During
> our own early_init_dt_add_memory_arch(), we calculate the real
> PHYS_OFFSET but don't set it yet. We should defer the setting of the
> real PHYS_OFFSET until we map the first block of memory via
> create_mapping(). We know that we can't cope with allocations before the
> first block anyway, so this approach may work.

After some more brainstorming with Mark, I think we can set the real
PHYS_OFFSET immediately after we map the kernel text into the linear
mapping (and before mapping the RAM with create_mapping) so that we have
access to swapper_pg_dir via this mapping. But we need the initial
swapper to be pre-populated with pud/pmd for (1) the kernel text at
KERNEL_PAGE_OFFSET, (2) kernel text in the linear mapping and (3) the
1st GB of PAGE_OFFSET (to avoid memblock allocations until we mapped
some RAM). This way we avoid mandating that the kernel image is only
loaded in the first GB of RAM.

-- 
Catalin


end of thread

Thread overview: 42+ messages
2015-03-16 15:23 [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Ard Biesheuvel
2015-03-16 15:23 ` [RFC PATCH 1/3] arm64: head.S: replace early literals with constant immediates Ard Biesheuvel
2015-03-16 17:14   ` Mark Rutland
2015-03-17  7:01     ` Ard Biesheuvel
2015-03-16 15:23 ` [RFC PATCH 2/3] arm64: add support for relocatable kernel Ard Biesheuvel
2015-03-16 15:23 ` [RFC PATCH 3/3] arm64/efi: use relocated kernel Ard Biesheuvel
2015-03-16 16:09 ` [RFC PATCH 0/3] arm64: relocatable kernel proof of concept Mark Rutland
2015-03-16 16:45   ` Ard Biesheuvel
2015-03-16 17:33     ` Mark Rutland
2015-03-16 17:43       ` Ard Biesheuvel
2015-03-17 16:20         ` Mark Rutland
2015-03-16 23:19 ` Kees Cook
2015-03-17  7:38   ` Ard Biesheuvel
2015-03-17 16:35     ` Mark Rutland
2015-03-17 16:40       ` Ard Biesheuvel
2015-03-17 16:43         ` Mark Rutland
2015-03-23 15:36           ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Ard Biesheuvel
2015-03-23 15:36             ` [PATCH 1/4] arm64: use tagged pointers to distinguish kernel text from the linear mapping Ard Biesheuvel
2015-03-25 14:04               ` Catalin Marinas
2015-03-26  1:27               ` Mark Rutland
2015-03-23 15:36             ` [PATCH 2/4] arm64: fixmap: move translation tables to dedicated region Ard Biesheuvel
2015-03-26  1:28               ` Mark Rutland
2015-03-26  6:20                 ` Ard Biesheuvel
2015-03-30 14:34                   ` Mark Rutland
2015-03-23 15:36             ` [PATCH 3/4] arm64: move kernel text below PAGE_OFFSET Ard Biesheuvel
2015-03-25 14:10               ` Catalin Marinas
2015-03-23 15:36             ` [PATCH 4/4] arm64: align PHYS_OFFSET to block size Ard Biesheuvel
2015-03-25 14:14               ` Catalin Marinas
2015-03-26  6:23                 ` Ard Biesheuvel
2015-03-25 14:59               ` Catalin Marinas
2015-03-26  6:22                 ` Ard Biesheuvel
2015-03-27 13:16                   ` Ard Biesheuvel
2015-03-30 13:49                     ` Catalin Marinas
2015-03-30 14:00                       ` Ard Biesheuvel
2015-03-30 14:55                         ` Mark Rutland
2015-03-30 15:00                         ` Catalin Marinas
2015-03-30 18:08                           ` Ard Biesheuvel
2015-03-31 14:49                             ` Catalin Marinas
2015-03-31 16:19                               ` Catalin Marinas
2015-03-31 16:46                                 ` Catalin Marinas
2015-03-26  1:26             ` [PATCH 0/4] RFC: split text and linear mappings using tagged pointers Mark Rutland
2015-03-26  6:09               ` Ard Biesheuvel
