* [PATCH v3 0/3] vmalloc kernel mapping and relocatable kernel
@ 2020-05-24  8:52 Alexandre Ghiti
  2020-05-24  8:52 ` [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone Alexandre Ghiti
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Alexandre Ghiti @ 2020-05-24  8:52 UTC
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Alexandre Ghiti

This patchset originally implemented relocatable kernel support but now
also moves the kernel mapping into the vmalloc zone.

The first patch explains why we need to move the kernel into the vmalloc
zone (instead of memcpying it around). That patch should ease KASLR
implementation a lot.

The second patch makes it possible to build relocatable kernels, but the
option is not selected by default.

The third patch takes advantage of an already existing powerpc script
that checks relocations at compile-time, and uses it for riscv.

Alexandre Ghiti (3):
  riscv: Move kernel mapping to vmalloc zone
  riscv: Introduce CONFIG_RELOCATABLE
  arch, scripts: Add script to check relocations at compile time

 arch/powerpc/tools/relocs_check.sh |  18 +----
 arch/riscv/Kconfig                 |  12 +++
 arch/riscv/Makefile                |   5 +-
 arch/riscv/Makefile.postlink       |  36 +++++++++
 arch/riscv/boot/loader.lds.S       |   3 +-
 arch/riscv/include/asm/page.h      |  10 ++-
 arch/riscv/include/asm/pgtable.h   |  37 ++++++---
 arch/riscv/kernel/head.S           |   3 +-
 arch/riscv/kernel/module.c         |   4 +-
 arch/riscv/kernel/vmlinux.lds.S    |   9 ++-
 arch/riscv/mm/Makefile             |   4 +
 arch/riscv/mm/init.c               | 121 +++++++++++++++++++++++++----
 arch/riscv/mm/physaddr.c           |   2 +-
 arch/riscv/tools/relocs_check.sh   |  26 +++++++
 scripts/relocs_check.sh            |  20 +++++
 15 files changed, 258 insertions(+), 52 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh

-- 
2.20.1



* [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-24  8:52 [PATCH v3 0/3] vmalloc kernel mapping and relocatable kernel Alexandre Ghiti
@ 2020-05-24  8:52 ` Alexandre Ghiti
  2020-05-26  9:43     ` Zong Li
  2020-05-27  7:33     ` kbuild test robot
  2020-05-24  8:52 ` [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE Alexandre Ghiti
  2020-05-24  8:52 ` [PATCH v3 3/3] arch, scripts: Add script to check relocations at compile time Alexandre Ghiti
  2 siblings, 2 replies; 30+ messages in thread
From: Alexandre Ghiti @ 2020-05-24  8:52 UTC
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Alexandre Ghiti

This is a preparatory patch for relocatable kernel support.

The kernel used to be linked at the PAGE_OFFSET address and loaded
physically at the beginning of main memory. Therefore, we could use
the linear mapping for the kernel mapping.

But the relocated kernel base address will differ from PAGE_OFFSET, and
since two different virtual addresses cannot point to the same physical
address within the linear mapping, the kernel mapping needs to lie
outside the linear mapping.

In addition, because modules and BPF must be close to the kernel (inside
a +-2GB window), the kernel is placed at the end of the vmalloc zone
minus 2GB, which leaves room for modules and BPF. The kernel could not
be placed at the beginning of the vmalloc zone, since other vmalloc
allocations from the kernel could take up the whole +-2GB window around
the kernel, which would prevent new modules and BPF programs from being
loaded.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
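As a rough illustration of the commit message above, the two mappings
and the resulting virtual-to-physical translation can be sketched in
plain user-space C. This is only a sketch under stated assumptions: all
sketch_/SKETCH_ names are hypothetical, and the addresses (a
PAGE_OFFSET of 0xffffffe000000000 and a load address of 0x80200000) are
illustrative values, not something this patch guarantees.

```c
#include <stdint.h>

/* Assumed, illustrative layout; real values depend on the configuration. */
#define SKETCH_PAGE_OFFSET	UINT64_C(0xffffffe000000000)
#define SKETCH_SZ_2G		UINT64_C(0x80000000)
#define SKETCH_VMALLOC_END	(SKETCH_PAGE_OFFSET - 1)
/* Kernel sits 2GB below the end of the vmalloc zone. */
#define SKETCH_KERNEL_VIRT	(SKETCH_VMALLOC_END - SKETCH_SZ_2G + 1)
/* Assumed physical load address of the kernel image. */
#define SKETCH_LOAD_PA		UINT64_C(0x80200000)

/* Offset of the linear mapping: PAGE_OFFSET maps the load address. */
static const uint64_t sketch_va_pa_offset =
	SKETCH_PAGE_OFFSET - SKETCH_LOAD_PA;
/* Offset of the kernel mapping: the kernel VA maps the load address. */
static const uint64_t sketch_va_kernel_pa_offset =
	SKETCH_KERNEL_VIRT - SKETCH_LOAD_PA;

/*
 * Addresses at or above PAGE_OFFSET belong to the linear mapping;
 * anything below is assumed to belong to the kernel mapping.
 */
static uint64_t sketch_va_to_pa(uint64_t va)
{
	if (va >= SKETCH_PAGE_OFFSET)
		return va - sketch_va_pa_offset;
	return va - sketch_va_kernel_pa_offset;
}
```

With these assumed values, both SKETCH_PAGE_OFFSET and
SKETCH_KERNEL_VIRT translate to SKETCH_LOAD_PA, which is the "two
virtual addresses, one physical address" situation that forces the
kernel mapping out of the linear mapping.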
 arch/riscv/boot/loader.lds.S     |  3 +-
 arch/riscv/include/asm/page.h    | 10 +++++-
 arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
 arch/riscv/kernel/head.S         |  3 +-
 arch/riscv/kernel/module.c       |  4 +--
 arch/riscv/kernel/vmlinux.lds.S  |  3 +-
 arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
 arch/riscv/mm/physaddr.c         |  2 +-
 8 files changed, 87 insertions(+), 33 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include <asm/page.h>
+#include <asm/pgtable.h>
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-	. = PAGE_OFFSET;
+	. = KERNEL_LINK_ADDR;
 
 	.payload : {
 		*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..48bb09b6a9b7 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET		(pfn_base)
 #else
 #define va_pa_offset		0
+#define va_kernel_pa_offset	0
 #define ARCH_PFN_OFFSET		(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;
 
 #define __pa_to_va_nodebug(x)	((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)	((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x)	((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x)	\
+	((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)		\
+	(((x) >= PAGE_OFFSET) ?		\
+		linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 35b60035b6b0..25213cfaf680 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,29 @@
 
 #include <asm/pgtable-bits.h>
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
-#include <asm/page.h>
-#include <asm/tlbflush.h>
-#include <linux/mm_types.h>
-
-#ifdef CONFIG_MMU
+#ifndef CONFIG_MMU
+#define KERNEL_VIRT_ADDR	PAGE_OFFSET
+#define KERNEL_LINK_ADDR	PAGE_OFFSET
+#else
+/*
+ * Leave 2GB for modules and BPF that must lie within a 2GB range around
+ * the kernel.
+ */
+#define KERNEL_VIRT_ADDR	(VMALLOC_END - SZ_2G + 1)
+#define KERNEL_LINK_ADDR	KERNEL_VIRT_ADDR
 
 #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END      (PAGE_OFFSET - 1)
 #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
 
 #define BPF_JIT_REGION_SIZE	(SZ_128M)
-#define BPF_JIT_REGION_START	(PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END	(VMALLOC_END)
+#define BPF_JIT_REGION_START	(kernel_virt_addr)
+#define BPF_JIT_REGION_END	(kernel_virt_addr + BPF_JIT_REGION_SIZE)
+
+#ifdef CONFIG_64BIT
+#define VMALLOC_MODULE_START	BPF_JIT_REGION_END
+#define VMALLOC_MODULE_END	VMALLOC_END
+#endif
 
 /*
  * Roughly size the vmemmap space to be large enough to fit enough
@@ -57,9 +63,16 @@
 #define FIXADDR_SIZE     PGDIR_SIZE
 #endif
 #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
-
 #endif
 
+#ifndef __ASSEMBLY__
+
+/* Page Upper Directory not used in RISC-V */
+#include <asm-generic/pgtable-nopud.h>
+#include <asm/page.h>
+#include <asm/tlbflush.h>
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_64BIT
 #include <asm/pgtable-64.h>
 #else
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 98a406474e7d..8f5bb7731327 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -49,7 +49,8 @@ ENTRY(_start)
 #ifdef CONFIG_MMU
 relocate:
 	/* Relocate return address */
-	li a1, PAGE_OFFSET
+	la a1, kernel_virt_addr
+	REG_L a1, 0(a1)
 	la a2, _start
 	sub a1, a1, a2
 	add ra, ra, a1
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 8bbe5dbe1341..1a8fbe05accf 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 }
 
 #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-#define VMALLOC_MODULE_START \
-	 max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
 void *module_alloc(unsigned long size)
 {
 	return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
-				    VMALLOC_END, GFP_KERNEL,
+				    VMALLOC_MODULE_END, GFP_KERNEL,
 				    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
 				    __builtin_return_address(0));
 }
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 0339b6bbe11a..a9abde62909f 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -4,7 +4,8 @@
  * Copyright (C) 2017 SiFive
  */
 
-#define LOAD_OFFSET PAGE_OFFSET
+#include <asm/pgtable.h>
+#define LOAD_OFFSET KERNEL_LINK_ADDR
 #include <asm/vmlinux.lds.h>
 #include <asm/page.h>
 #include <asm/cache.h>
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 27a334106708..17f108baec4f 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -22,6 +22,9 @@
 
 #include "../kernel/head.h"
 
+unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
+EXPORT_SYMBOL(kernel_virt_addr);
+
 unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 							__page_aligned_bss;
 EXPORT_SYMBOL(empty_zero_page);
@@ -178,8 +181,12 @@ void __init setup_bootmem(void)
 }
 
 #ifdef CONFIG_MMU
+/* Offset between linear mapping virtual address and kernel load address */
 unsigned long va_pa_offset;
 EXPORT_SYMBOL(va_pa_offset);
+/* Offset between kernel mapping virtual address and kernel load address */
+unsigned long va_kernel_pa_offset;
+EXPORT_SYMBOL(va_kernel_pa_offset);
 unsigned long pfn_base;
 EXPORT_SYMBOL(pfn_base);
 
@@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
 	if (mmu_enabled)
 		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 
-	pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
+	pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
 	BUG_ON(pmd_num >= NUM_EARLY_PMDS);
 	return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
 }
@@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
 #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
 #endif
 
+static uintptr_t load_pa, load_sz;
+
+void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
+{
+	uintptr_t va, end_va;
+
+	end_va = kernel_virt_addr + load_sz;
+	for (va = kernel_virt_addr; va < end_va; va += map_size)
+		create_pgd_mapping(pgdir, va,
+				   load_pa + (va - kernel_virt_addr),
+				   map_size, PAGE_KERNEL_EXEC);
+}
+
 asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
 	uintptr_t va, end_va;
-	uintptr_t load_pa = (uintptr_t)(&_start);
-	uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
 	uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
 
+	load_pa = (uintptr_t)(&_start);
+	load_sz = (uintptr_t)(&_end) - load_pa;
+
 	va_pa_offset = PAGE_OFFSET - load_pa;
+	va_kernel_pa_offset = kernel_virt_addr - load_pa;
+
 	pfn_base = PFN_DOWN(load_pa);
 
 	/*
@@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
 			   (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
 	/* Setup trampoline PGD and PMD */
-	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+	create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
 			   (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
-	create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
+	create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
 			   load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
 #else
 	/* Setup trampoline PGD */
-	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+	create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
 			   load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
 #endif
 
 	/*
-	 * Setup early PGD covering entire kernel which will allows
+	 * Setup early PGD covering entire kernel which will allow
 	 * us to reach paging_init(). We map all memory banks later
 	 * in setup_vm_final() below.
 	 */
-	end_va = PAGE_OFFSET + load_sz;
-	for (va = PAGE_OFFSET; va < end_va; va += map_size)
-		create_pgd_mapping(early_pg_dir, va,
-				   load_pa + (va - PAGE_OFFSET),
-				   map_size, PAGE_KERNEL_EXEC);
+	create_kernel_page_table(early_pg_dir, map_size);
 
 	/* Create fixed mapping for early FDT parsing */
 	end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
@@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
 	uintptr_t va, map_size;
 	phys_addr_t pa, start, end;
 	struct memblock_region *reg;
+	static struct vm_struct vm_kernel = { 0 };
 
 	/* Set mmu_enabled flag */
 	mmu_enabled = true;
@@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
 		for (pa = start; pa < end; pa += map_size) {
 			va = (uintptr_t)__va(pa);
 			create_pgd_mapping(swapper_pg_dir, va, pa,
-					   map_size, PAGE_KERNEL_EXEC);
+					   map_size, PAGE_KERNEL);
 		}
 	}
 
+	/* Map the kernel */
+	create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
+
+	/* Reserve the vmalloc area occupied by the kernel */
+	vm_kernel.addr = (void *)kernel_virt_addr;
+	vm_kernel.phys_addr = load_pa;
+	vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
+	vm_kernel.flags = VM_MAP | VM_NO_GUARD;
+	vm_kernel.caller = __builtin_return_address(0);
+
+	vm_area_add_early(&vm_kernel);
+
 	/* Clear fixmap PTE and PMD mappings */
 	clear_fixmap(FIX_PTE);
 	clear_fixmap(FIX_PMD);
diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
index e8e4dcd39fed..35703d5ef5fd 100644
--- a/arch/riscv/mm/physaddr.c
+++ b/arch/riscv/mm/physaddr.c
@@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
 
 phys_addr_t __phys_addr_symbol(unsigned long x)
 {
-	unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
+	unsigned long kernel_start = (unsigned long)kernel_virt_addr;
 	unsigned long kernel_end = (unsigned long)_end;
 
 	/*
-- 
2.20.1



* [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE
  2020-05-24  8:52 [PATCH v3 0/3] vmalloc kernel mapping and relocatable kernel Alexandre Ghiti
  2020-05-24  8:52 ` [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone Alexandre Ghiti
@ 2020-05-24  8:52 ` Alexandre Ghiti
  2020-05-26  9:05     ` Zong Li
  2020-05-29 12:04     ` Anup Patel
  2020-05-24  8:52 ` [PATCH v3 3/3] arch, scripts: Add script to check relocations at compile time Alexandre Ghiti
  2 siblings, 2 replies; 30+ messages in thread
From: Alexandre Ghiti @ 2020-05-24  8:52 UTC
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Alexandre Ghiti

This config allows the kernel to be compiled as PIE and relocated to
any virtual address at runtime: this paves the way to KASLR and to
4-level page table folding at runtime. Runtime relocation is possible
since relocation metadata are embedded into the kernel.

Note that relocating at runtime introduces an overhead even if the
kernel is loaded at the same address it was linked at, and that the
compiler options are those used on arm64, which uses the same RELA
relocation format.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
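To make the runtime relocation pass described above easier to follow,
here is a minimal user-space sketch of applying RELA "relative"
relocations to an in-memory image. It is an approximation, not the
code from this patch: struct sketch_rela, sketch_relocate() and the
buffer-based image are hypothetical, the entry layout mirrors
Elf64_Rela, and the type value 3 is R_RISCV_RELATIVE as defined by the
RISC-V ELF psABI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_R_RELATIVE 3u	/* R_RISCV_RELATIVE in the RISC-V psABI */

/* Same field layout as Elf64_Rela. */
struct sketch_rela {
	uint64_t r_offset;	/* link-time VA of the slot to patch */
	uint64_t r_info;	/* relocation type (symbol index is 0) */
	int64_t  r_addend;	/* link-time value to store in the slot */
};

/*
 * Patch every relative relocation in 'image', which was linked at
 * 'link_addr' and is going to run at 'run_addr'. Returns the number
 * of slots written.
 */
static size_t sketch_relocate(uint8_t *image, size_t image_size,
			      const struct sketch_rela *rela, size_t count,
			      uint64_t link_addr, uint64_t run_addr)
{
	uint64_t reloc_offset = run_addr - link_addr;
	size_t applied = 0;

	for (size_t i = 0; i < count; i++) {
		uint64_t slot = rela[i].r_offset - link_addr;
		uint64_t value = (uint64_t)rela[i].r_addend;

		if (rela[i].r_info != SKETCH_R_RELATIVE)
			continue;
		/* Skip slots that do not fit inside the image. */
		if (image_size < sizeof(value) ||
		    slot > image_size - sizeof(value))
			continue;
		/*
		 * Values below the link address (e.g. offsets linked
		 * from address 0) are stored unchanged.
		 */
		if (value >= link_addr)
			value += reloc_offset;
		memcpy(image + slot, &value, sizeof(value));
		applied++;
	}
	return applied;
}
```

Note that the slot is written even when run_addr equals link_addr,
which is the overhead the commit message mentions: with RELA the value
lives in the relocation entry, so the pass cannot simply be skipped.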
 arch/riscv/Kconfig              | 12 +++++++
 arch/riscv/Makefile             |  5 ++-
 arch/riscv/kernel/vmlinux.lds.S |  6 ++--
 arch/riscv/mm/Makefile          |  4 +++
 arch/riscv/mm/init.c            | 63 +++++++++++++++++++++++++++++++++
 5 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a31e1a41913a..93127d5913fe 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -170,6 +170,18 @@ config PGTABLE_LEVELS
 	default 3 if 64BIT
 	default 2
 
+config RELOCATABLE
+	bool
+	depends on MMU
+	help
+          This builds a kernel as a Position Independent Executable (PIE),
+          which retains all relocation metadata required to relocate the
+          kernel binary at runtime to a different virtual address than the
+          address it was linked at.
+          Since RISCV uses the RELA relocation format, this requires a
+          relocation pass at runtime even if the kernel is loaded at the
+          same address it was linked at.
+
 source "arch/riscv/Kconfig.socs"
 
 menu "Platform type"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index fb6e37db836d..1406416ea743 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -9,7 +9,10 @@
 #
 
 OBJCOPYFLAGS    := -O binary
-LDFLAGS_vmlinux :=
+ifeq ($(CONFIG_RELOCATABLE),y)
+LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro
+KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
 	LDFLAGS_vmlinux := --no-relax
 endif
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index a9abde62909f..e8ffba8c2044 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -85,8 +85,10 @@ SECTIONS
 
 	BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
 
-	.rel.dyn : {
-		*(.rel.dyn*)
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
 	}
 
 	_end = .;
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 363ef01c30b1..dc5cdaa80bc1 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 CFLAGS_init.o := -mcmodel=medany
+ifdef CONFIG_RELOCATABLE
+CFLAGS_init.o += -fno-pie
+endif
+
 ifdef CONFIG_FTRACE
 CFLAGS_REMOVE_init.o = -pg
 endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 17f108baec4f..7074522d40c6 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -13,6 +13,9 @@
 #include <linux/of_fdt.h>
 #include <linux/libfdt.h>
 #include <linux/set_memory.h>
+#ifdef CONFIG_RELOCATABLE
+#include <linux/elf.h>
+#endif
 
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
@@ -379,6 +382,53 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
 #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __rela_dyn_start, __rela_dyn_end;
+
+#ifdef CONFIG_64BIT
+#define Elf_Rela Elf64_Rela
+#define Elf_Addr Elf64_Addr
+#else
+#define Elf_Rela Elf32_Rela
+#define Elf_Addr Elf32_Addr
+#endif
+
+void __init relocate_kernel(uintptr_t load_pa)
+{
+	Elf_Rela *rela = (Elf_Rela *)&__rela_dyn_start;
+	/*
+	 * This holds the offset between the linked virtual address and the
+	 * relocated virtual address.
+	 */
+	uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR;
+	/*
+	 * This holds the offset between kernel linked virtual address and
+	 * physical address.
+	 */
+	uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa;
+
+	for ( ; rela < (Elf_Rela *)&__rela_dyn_end; rela++) {
+		Elf_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
+		Elf_Addr relocated_addr = rela->r_addend;
+
+		if (rela->r_info != R_RISCV_RELATIVE)
+			continue;
+
+		/*
+		 * Make sure to not relocate vdso symbols like rt_sigreturn
+		 * which are linked from the address 0 in vmlinux since
+		 * vdso symbol addresses are actually used as an offset from
+		 * mm->context.vdso in VDSO_OFFSET macro.
+		 */
+		if (relocated_addr >= KERNEL_LINK_ADDR)
+			relocated_addr += reloc_offset;
+
+		*(Elf_Addr *)addr = relocated_addr;
+	}
+}
+
+#endif
+
 static uintptr_t load_pa, load_sz;
 
 void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
@@ -405,6 +455,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 
 	pfn_base = PFN_DOWN(load_pa);
 
+#ifdef CONFIG_RELOCATABLE
+#ifdef CONFIG_64BIT
+	/*
+	 * Early page table uses only one PGDIR, which makes it possible
+	 * to map PGDIR_SIZE aligned on PGDIR_SIZE: if the relocation offset
+	 * makes the kernel cross over a PGDIR_SIZE boundary, raise a bug
+	 * since a part of the kernel would not get mapped.
+	 * This cannot happen on rv32 as we use the entire page directory level.
+	 */
+	BUG_ON(PGDIR_SIZE - (kernel_virt_addr & (PGDIR_SIZE - 1)) < load_sz);
+#endif
+	relocate_kernel(load_pa);
+#endif
 	/*
 	 * Enforce boot alignment requirements of RV32 and
 	 * RV64 by only allowing PMD or PGD mappings.
-- 
2.20.1



* [PATCH v3 3/3] arch, scripts: Add script to check relocations at compile time
  2020-05-24  8:52 [PATCH v3 0/3] vmalloc kernel mapping and relocatable kernel Alexandre Ghiti
  2020-05-24  8:52 ` [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone Alexandre Ghiti
  2020-05-24  8:52 ` [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE Alexandre Ghiti
@ 2020-05-24  8:52 ` Alexandre Ghiti
  2020-05-29 12:08     ` Anup Patel
  2 siblings, 1 reply; 30+ messages in thread
From: Alexandre Ghiti @ 2020-05-24  8:52 UTC
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, Zong Li, linux-kernel, linuxppc-dev, linux-riscv
  Cc: Alexandre Ghiti

Relocating the kernel at runtime is done very early in the boot process,
so it is not convenient to check for relocations there and react in case
a relocation was not expected.

The powerpc architecture has a script that checks for such unexpected
relocations at compile time: extract the common logic to scripts/ and
add arch-specific scripts triggered at postlink.

At the moment, the powerpc and riscv architectures take advantage of
this compile-time check.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
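The allow-list idea behind the check described above can be sketched in
C as a per-line classifier. This is only an illustration of the
filtering logic, not a replacement for the shell scripts in this patch:
sketch_is_bad_reloc() is a hypothetical name, the input is assumed to
be one line of "objdump -R" output with whitespace-separated columns,
and splitting on whitespace only approximates the word matching done by
"grep -w".

```c
#include <string.h>

/*
 * Return 1 if 'line' names a relocation type other than
 * R_RISCV_RELATIVE, 0 if it names R_RISCV_RELATIVE or carries no
 * relocation type at all (header lines, blank lines).
 */
static int sketch_is_bad_reloc(const char *line)
{
	static const char allowed[] = "R_RISCV_RELATIVE";
	const char *p = line;

	while (*p) {
		size_t len;

		/* Move to the next whitespace-separated token. */
		p += strspn(p, " \t\n");
		len = strcspn(p, " \t\n");

		/* The first token starting with "R_" is the type. */
		if (len >= 2 && p[0] == 'R' && p[1] == '_')
			return !(len == sizeof(allowed) - 1 &&
				 memcmp(p, allowed, len) == 0);
		p += len;
	}
	return 0;
}
```

Unlike the scripts, this sketch does not filter out relocations against
undefined weak symbols; that step would need the symbol list from nm.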
 arch/powerpc/tools/relocs_check.sh | 18 ++-------------
 arch/riscv/Makefile.postlink       | 36 ++++++++++++++++++++++++++++++
 arch/riscv/tools/relocs_check.sh   | 26 +++++++++++++++++++++
 scripts/relocs_check.sh            | 20 +++++++++++++++++
 4 files changed, 84 insertions(+), 16 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh

diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh
index 014e00e74d2b..e367895941ae 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
 	exit 1
 fi
 
-# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
-objdump="$1"
-nm="$2"
-vmlinux="$3"
-
-# Remove from the bad relocations those that match an undefined weak symbol
-# which will result in an absolute relocation to 0.
-# Weak unresolved symbols are of that form in nm output:
-# "                  w _binary__btf_vmlinux_bin_end"
-undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
-
 bad_relocs=$(
-$objdump -R "$vmlinux" |
-	# Only look at relocation lines.
-	grep -E '\<R_' |
+${srctree}/scripts/relocs_check.sh "$@" |
 	# These relocations are okay
 	# On PPC64:
 	#	R_PPC64_RELATIVE, R_PPC64_NONE
@@ -43,8 +30,7 @@ R_PPC_ADDR16_LO
 R_PPC_ADDR16_HI
 R_PPC_ADDR16_HA
 R_PPC_RELATIVE
-R_PPC_NONE' |
-	([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat)
+R_PPC_NONE'
 )
 
 if [ -z "$bad_relocs" ]; then
diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
new file mode 100644
index 000000000000..bf2b2bca1845
--- /dev/null
+++ b/arch/riscv/Makefile.postlink
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===========================================================================
+# Post-link riscv pass
+# ===========================================================================
+#
+# Check that vmlinux relocations look sane
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include scripts/Kbuild.include
+
+quiet_cmd_relocs_check = CHKREL  $@
+cmd_relocs_check = 							\
+	$(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+	@true
+ifdef CONFIG_RELOCATABLE
+	$(call if_changed,relocs_check)
+endif
+
+%.ko: FORCE
+	@true
+
+clean:
+	@true
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
new file mode 100755
index 000000000000..baeb2e7b2290
--- /dev/null
+++ b/arch/riscv/tools/relocs_check.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Based on powerpc relocs_check.sh
+
+# This script checks the relocations of a vmlinux for "suspicious"
+# relocations.
+
+if [ $# -lt 3 ]; then
+        echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
+        exit 1
+fi
+
+bad_relocs=$(
+${srctree}/scripts/relocs_check.sh "$@" |
+	# These relocations are okay
+	#	R_RISCV_RELATIVE
+	grep -F -w -v 'R_RISCV_RELATIVE'
+)
+
+if [ -z "$bad_relocs" ]; then
+	exit 0
+fi
+
+num_bad=$(echo "$bad_relocs" | wc -l)
+echo "WARNING: $num_bad bad relocations"
+echo "$bad_relocs"
diff --git a/scripts/relocs_check.sh b/scripts/relocs_check.sh
new file mode 100755
index 000000000000..137c660499f3
--- /dev/null
+++ b/scripts/relocs_check.sh
@@ -0,0 +1,20 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+# Get a list of all the relocations, remove from it the relocations
+# that are known to be legitimate and return this list to arch specific
+# script that will look for suspicious relocations.
+
+objdump="$1"
+nm="$2"
+vmlinux="$3"
+
+# Remove from the possible bad relocations those that match an undefined
+# weak symbol which will result in an absolute relocation to 0.
+# Weak unresolved symbols are of that form in nm output:
+# "                  w _binary__btf_vmlinux_bin_end"
+undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
+
+$objdump -R "$vmlinux" |
+	grep -E '\<R_' |
+	([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat)
-- 
2.20.1



* Re: [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE
  2020-05-24  8:52 ` [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE Alexandre Ghiti
  2020-05-26  9:05     ` Zong Li
@ 2020-05-26  9:05     ` Zong Li
  1 sibling, 0 replies; 30+ messages in thread
From: Zong Li @ 2020-05-26  9:05 UTC
  To: Alexandre Ghiti
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, linux-kernel@vger.kernel.org List, linuxppc-dev,
	linux-riscv

On Sun, May 24, 2020 at 4:55 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> This config allows to compile the kernel as PIE and to relocate it at
> any virtual address at runtime: this paves the way to KASLR and to 4-level
> page table folding at runtime. Runtime relocation is possible since
> relocation metadata are embedded into the kernel.
>
> Note that relocating at runtime introduces an overhead even if the
> kernel is loaded at the same address it was linked at and that the compiler
> options are those used in arm64 which uses the same RELA relocation
> format.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig              | 12 +++++++
>  arch/riscv/Makefile             |  5 ++-
>  arch/riscv/kernel/vmlinux.lds.S |  6 ++--
>  arch/riscv/mm/Makefile          |  4 +++
>  arch/riscv/mm/init.c            | 63 +++++++++++++++++++++++++++++++++
>  5 files changed, 87 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index a31e1a41913a..93127d5913fe 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -170,6 +170,18 @@ config PGTABLE_LEVELS
>         default 3 if 64BIT
>         default 2
>
> +config RELOCATABLE
> +       bool
> +       depends on MMU
> +       help
> +          This builds a kernel as a Position Independent Executable (PIE),
> +          which retains all relocation metadata required to relocate the
> +          kernel binary at runtime to a different virtual address than the
> +          address it was linked at.
> +          Since RISCV uses the RELA relocation format, this requires a
> +          relocation pass at runtime even if the kernel is loaded at the
> +          same address it was linked at.
> +
>  source "arch/riscv/Kconfig.socs"
>
>  menu "Platform type"
> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
> index fb6e37db836d..1406416ea743 100644
> --- a/arch/riscv/Makefile
> +++ b/arch/riscv/Makefile
> @@ -9,7 +9,10 @@
>  #
>
>  OBJCOPYFLAGS    := -O binary
> -LDFLAGS_vmlinux :=
> +ifeq ($(CONFIG_RELOCATABLE),y)
> +LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro
> +KBUILD_CFLAGS += -fPIE
> +endif
>  ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>         LDFLAGS_vmlinux := --no-relax
>  endif
> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> index a9abde62909f..e8ffba8c2044 100644
> --- a/arch/riscv/kernel/vmlinux.lds.S
> +++ b/arch/riscv/kernel/vmlinux.lds.S
> @@ -85,8 +85,10 @@ SECTIONS
>
>         BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
>
> -       .rel.dyn : {
> -               *(.rel.dyn*)
> +       .rela.dyn : ALIGN(8) {
> +               __rela_dyn_start = .;
> +               *(.rela .rela*)
> +               __rela_dyn_end = .;
>         }
>
>         _end = .;
> diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
> index 363ef01c30b1..dc5cdaa80bc1 100644
> --- a/arch/riscv/mm/Makefile
> +++ b/arch/riscv/mm/Makefile
> @@ -1,6 +1,10 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>
>  CFLAGS_init.o := -mcmodel=medany
> +ifdef CONFIG_RELOCATABLE
> +CFLAGS_init.o += -fno-pie
> +endif
> +
>  ifdef CONFIG_FTRACE
>  CFLAGS_REMOVE_init.o = -pg
>  endif
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 17f108baec4f..7074522d40c6 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -13,6 +13,9 @@
>  #include <linux/of_fdt.h>
>  #include <linux/libfdt.h>
>  #include <linux/set_memory.h>
> +#ifdef CONFIG_RELOCATABLE
> +#include <linux/elf.h>
> +#endif
>
>  #include <asm/fixmap.h>
>  #include <asm/tlbflush.h>
> @@ -379,6 +382,53 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>  #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>  #endif
>
> +#ifdef CONFIG_RELOCATABLE
> +extern unsigned long __rela_dyn_start, __rela_dyn_end;
> +
> +#ifdef CONFIG_64BIT
> +#define Elf_Rela Elf64_Rela
> +#define Elf_Addr Elf64_Addr
> +#else
> +#define Elf_Rela Elf32_Rela
> +#define Elf_Addr Elf32_Addr
> +#endif
> +
> +void __init relocate_kernel(uintptr_t load_pa)
> +{
> +       Elf_Rela *rela = (Elf_Rela *)&__rela_dyn_start;
> +       /*
> +        * This holds the offset between the linked virtual address and the
> +        * relocated virtual address.
> +        */
> +       uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR;
> +       /*
> +        * This holds the offset between kernel linked virtual address and
> +        * physical address.
> +        */
> +       uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa;
> +
> +       for ( ; rela < (Elf_Rela *)&__rela_dyn_end; rela++) {
> +               Elf_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
> +               Elf_Addr relocated_addr = rela->r_addend;
> +
> +               if (rela->r_info != R_RISCV_RELATIVE)
> +                       continue;
> +
> +               /*
> +                * Make sure to not relocate vdso symbols like rt_sigreturn
> +                * which are linked from the address 0 in vmlinux since
> +                * vdso symbol addresses are actually used as an offset from
> +                * mm->context.vdso in VDSO_OFFSET macro.
> +                */
> +               if (relocated_addr >= KERNEL_LINK_ADDR)
> +                       relocated_addr += reloc_offset;
> +
> +               *(Elf_Addr *)addr = relocated_addr;
> +       }
> +}
> +
> +#endif
> +
>  static uintptr_t load_pa, load_sz;
>
>  void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> @@ -405,6 +455,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
>         pfn_base = PFN_DOWN(load_pa);
>
> +#ifdef CONFIG_RELOCATABLE
> +#ifdef CONFIG_64BIT
> +       /*
> +        * Early page table uses only one PGDIR, which makes it possible
> +        * to map PGDIR_SIZE aligned on PGDIR_SIZE: if the relocation offset
> +        * makes the kernel cross over a PGDIR_SIZE boundary, raise a bug
> +        * since a part of the kernel would not get mapped.
> +        * This cannot happen on rv32 as we use the entire page directory level.
> +        */
> +       BUG_ON(PGDIR_SIZE - (kernel_virt_addr & (PGDIR_SIZE - 1)) < load_sz);
> +#endif
> +       relocate_kernel(load_pa);
> +#endif
>         /*
>          * Enforce boot alignment requirements of RV32 and
>          * RV64 by only allowing PMD or PGD mappings.
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Zong Li <zong.li@sifive.com>
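
For anyone following the thread, the relocation loop quoted above can
be approximated in userspace. This is only a sketch under stated
assumptions, not the kernel code: struct rela_entry,
apply_relative_relocs() and the SKETCH_* constants are hypothetical
stand-ins for Elf64_Rela, the loop in relocate_kernel(),
KERNEL_LINK_ADDR and R_RISCV_RELATIVE. The image is a plain array
rather than memory reached through load_pa, and the bounds checks are
an addition the patch does not have.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the kernel definitions. */
#define SKETCH_LINK_ADDR   0xffffffdf80000000ULL  /* assumed link address */
#define SKETCH_R_RELATIVE  3                      /* R_RISCV_RELATIVE */

struct rela_entry {
	uint64_t r_offset;   /* linked VA of the slot to patch */
	uint64_t r_info;     /* relocation type (symbol index is 0) */
	int64_t  r_addend;   /* linked VA the slot should hold */
};

/*
 * Apply RELATIVE relocations to an image held in `image`, an array of
 * 64-bit slots whose first slot is linked at SKETCH_LINK_ADDR.
 * new_base is the virtual address the image is meant to run at.
 * Returns the number of slots written.
 */
size_t apply_relative_relocs(uint64_t *image, size_t nslots,
			     const struct rela_entry *rela, size_t nrela,
			     uint64_t new_base)
{
	uint64_t reloc_offset = new_base - SKETCH_LINK_ADDR;
	size_t written = 0;

	for (size_t i = 0; i < nrela; i++) {
		uint64_t value = (uint64_t)rela[i].r_addend;
		uint64_t slot;

		if (rela[i].r_info != SKETCH_R_RELATIVE)
			continue;

		/* Skip entries that do not land inside the image copy. */
		if (rela[i].r_offset < SKETCH_LINK_ADDR)
			continue;
		slot = (rela[i].r_offset - SKETCH_LINK_ADDR) / sizeof(uint64_t);
		if (slot >= nslots)
			continue;

		/* Values below the link address (vdso offsets) stay as-is. */
		if (value >= SKETCH_LINK_ADDR)
			value += reloc_offset;

		image[slot] = value;
		written++;
	}
	return written;
}
```

As in the patch, only addends at or above the link address are shifted,
so values linked from address 0 are copied through unchanged.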

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-24  8:52 ` [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone Alexandre Ghiti
  2020-05-26  9:43     ` Zong Li
@ 2020-05-26  9:43     ` Zong Li
  1 sibling, 0 replies; 30+ messages in thread
From: Zong Li @ 2020-05-26  9:43 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, linux-kernel@vger.kernel.org List, linuxppc-dev,
	linux-riscv

On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> This is a preparatory patch for relocatable kernel.
>
> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
> physically at the beginning of the main memory. Therefore, we could use
> the linear mapping for the kernel mapping.
>
> But the relocated kernel base address will be different from PAGE_OFFSET
> and since in the linear mapping, two different virtual addresses cannot
> point to the same physical address, the kernel mapping needs to lie outside
> the linear mapping.
>
> In addition, because modules and BPF must be close to the kernel (inside
> a +-2GB window), the kernel is placed at the end of the vmalloc zone minus
> 2GB, which leaves room for modules and BPF. The kernel could not be
> placed at the beginning of the vmalloc zone since other vmalloc
> allocations from the kernel could take up the whole +-2GB window around
> the kernel, which would prevent new modules and BPF programs from being
> loaded.
>
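
A small userspace sketch may help picture the two mappings described in
the commit message. It is not kernel code: sketch_va_to_pa() and the
SKETCH_* values are hypothetical, and the PAGE_OFFSET value is an
assumed example rather than something stated in the patch. It mirrors
the choice made by the new __va_to_pa_nodebug(): addresses at or above
PAGE_OFFSET use the linear-mapping offset, anything below uses the
kernel-mapping offset.

```c
#include <stdint.h>

/* Assumed example values, not taken from the patch. */
#define SKETCH_PAGE_OFFSET 0xffffffe000000000ULL
#define SKETCH_SZ_2G       (2ULL << 30)
/* KERNEL_VIRT_ADDR = VMALLOC_END - SZ_2G + 1, VMALLOC_END = PAGE_OFFSET - 1 */
#define SKETCH_KERNEL_VA   (SKETCH_PAGE_OFFSET - 1 - SKETCH_SZ_2G + 1)

/*
 * Translate a virtual address to a physical one, given the physical
 * address the kernel was loaded at. Two offsets exist because the same
 * physical memory is reachable through both mappings.
 */
uint64_t sketch_va_to_pa(uint64_t va, uint64_t load_pa)
{
	uint64_t va_pa_offset = SKETCH_PAGE_OFFSET - load_pa;
	uint64_t va_kernel_pa_offset = SKETCH_KERNEL_VA - load_pa;

	if (va >= SKETCH_PAGE_OFFSET)
		return va - va_pa_offset;        /* linear mapping */
	return va - va_kernel_pa_offset;         /* kernel mapping */
}
```

With these assumed values the kernel mapping starts exactly 2GB below
PAGE_OFFSET, and an address in either mapping at the same distance from
its base resolves to the same physical address.
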
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/boot/loader.lds.S     |  3 +-
>  arch/riscv/include/asm/page.h    | 10 +++++-
>  arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>  arch/riscv/kernel/head.S         |  3 +-
>  arch/riscv/kernel/module.c       |  4 +--
>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>  arch/riscv/mm/physaddr.c         |  2 +-
>  8 files changed, 87 insertions(+), 33 deletions(-)
>
> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> index 47a5003c2e28..62d94696a19c 100644
> --- a/arch/riscv/boot/loader.lds.S
> +++ b/arch/riscv/boot/loader.lds.S
> @@ -1,13 +1,14 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>
>  #include <asm/page.h>
> +#include <asm/pgtable.h>
>
>  OUTPUT_ARCH(riscv)
>  ENTRY(_start)
>
>  SECTIONS
>  {
> -       . = PAGE_OFFSET;
> +       . = KERNEL_LINK_ADDR;
>
>         .payload : {
>                 *(.payload)
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 2d50f76efe48..48bb09b6a9b7 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>
>  #ifdef CONFIG_MMU
>  extern unsigned long va_pa_offset;
> +extern unsigned long va_kernel_pa_offset;
>  extern unsigned long pfn_base;
>  #define ARCH_PFN_OFFSET                (pfn_base)
>  #else
>  #define va_pa_offset           0
> +#define va_kernel_pa_offset    0
>  #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>  #endif /* CONFIG_MMU */
>
>  extern unsigned long max_low_pfn;
>  extern unsigned long min_low_pfn;
> +extern unsigned long kernel_virt_addr;
>
>  #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
> +#define kernel_mapping_va_to_pa(x)     \
> +       ((unsigned long)(x) - va_kernel_pa_offset)
> +#define __va_to_pa_nodebug(x)          \
> +       (((x) >= PAGE_OFFSET) ?         \
> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>
>  #ifdef CONFIG_DEBUG_VIRTUAL
>  extern phys_addr_t __virt_to_phys(unsigned long x);
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 35b60035b6b0..25213cfaf680 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -11,23 +11,29 @@
>
>  #include <asm/pgtable-bits.h>
>
> -#ifndef __ASSEMBLY__
> -
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> -#include <asm/page.h>
> -#include <asm/tlbflush.h>
> -#include <linux/mm_types.h>
> -
> -#ifdef CONFIG_MMU
> +#ifndef CONFIG_MMU
> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
> +#else
> +/*
> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
> + * the kernel.
> + */
> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>
>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>
>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> -#define BPF_JIT_REGION_END     (VMALLOC_END)
> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)

There seems to be a potential risk here: the BPF region overlaps the
kernel mapping, so if the kernel is bigger than 128MB, the kernel
mapping would occupy the whole BPF region and leave no room for BPF.
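
To put numbers on this, here is a tiny userspace sketch (not kernel
code) of how much of the BPF JIT region would remain once the kernel
image is mapped from the same start address. bpf_region_left() and
SKETCH_SZ_128M are hypothetical names; only the 128MB size comes from
BPF_JIT_REGION_SIZE in the patch.

```c
#include <stdint.h>

#define SKETCH_SZ_128M (128ULL << 20)  /* BPF_JIT_REGION_SIZE in the patch */

/*
 * With BPF_JIT_REGION_START == kernel_virt_addr, the kernel image and
 * the BPF JIT region start at the same address. Return how many bytes
 * of the region lie past the end of the image, or 0 if the image
 * covers the whole region.
 */
uint64_t bpf_region_left(uint64_t kernel_size)
{
	if (kernel_size >= SKETCH_SZ_128M)
		return 0;
	return SKETCH_SZ_128M - kernel_size;
}
```

Under that assumption a 16MB image would leave 112MB for BPF, and any
image of 128MB or more would leave nothing.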

> +
> +#ifdef CONFIG_64BIT
> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
> +#define VMALLOC_MODULE_END     VMALLOC_END
> +#endif
>

Although kernel_virt_addr is a fixed address now, I think it could
change for relocation or KASLR. If kernel_virt_addr is moved more than
2G away from VMALLOC_END, the module region would be too big. In
addition, the module region can be +-2G around the kernel, so we are no
longer limited to one direction as before. It seems to me that the
module region could be decided at runtime, for example with
VMALLOC_MODULE_START as "&_end - 2G" and VMALLOC_MODULE_END as
"&_start + 2G". I'm not sure whether the BPF region has to be 128MB
for some particular reason; if not, maybe the BPF region could be the
same as the module region, to avoid being used up by modules.
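
The runtime bounds suggested above could look roughly like the
following userspace sketch. It is only an illustration: struct
va_window and module_window() are hypothetical names, and clamping to
the vmalloc zone is an assumption modelled on the max(...,
VMALLOC_START) in the code the patch removes.

```c
#include <stdint.h>

#define SKETCH_SZ_2G (2ULL << 30)

struct va_window {
	uint64_t start;
	uint64_t end;
};

/*
 * Window in which every address is within 2GB of every address of the
 * kernel image [kernel_start, kernel_end), clamped to the vmalloc zone.
 * Assumes the image is smaller than 2GB and that neither bound wraps.
 */
struct va_window module_window(uint64_t kernel_start, uint64_t kernel_end,
			       uint64_t vmalloc_start, uint64_t vmalloc_end)
{
	struct va_window w;

	w.start = kernel_end - SKETCH_SZ_2G;    /* "&_end - 2G" */
	w.end = kernel_start + SKETCH_SZ_2G;    /* "&_start + 2G" */

	if (w.start < vmalloc_start)
		w.start = vmalloc_start;
	if (w.end > vmalloc_end)
		w.end = vmalloc_end;
	return w;
}
```

Because the window is derived from the image bounds, it would follow
the kernel wherever relocation or KASLR places it.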

>  /*
>   * Roughly size the vmemmap space to be large enough to fit enough
> @@ -57,9 +63,16 @@
>  #define FIXADDR_SIZE     PGDIR_SIZE
>  #endif
>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
> -
>  #endif
>
> +#ifndef __ASSEMBLY__
> +
> +/* Page Upper Directory not used in RISC-V */
> +#include <asm-generic/pgtable-nopud.h>
> +#include <asm/page.h>
> +#include <asm/tlbflush.h>
> +#include <linux/mm_types.h>
> +
>  #ifdef CONFIG_64BIT
>  #include <asm/pgtable-64.h>
>  #else
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 98a406474e7d..8f5bb7731327 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -49,7 +49,8 @@ ENTRY(_start)
>  #ifdef CONFIG_MMU
>  relocate:
>         /* Relocate return address */
> -       li a1, PAGE_OFFSET
> +       la a1, kernel_virt_addr
> +       REG_L a1, 0(a1)
>         la a2, _start
>         sub a1, a1, a2
>         add ra, ra, a1
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index 8bbe5dbe1341..1a8fbe05accf 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>  }
>
>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> -#define VMALLOC_MODULE_START \
> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>  void *module_alloc(unsigned long size)
>  {
>         return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> -                                   VMALLOC_END, GFP_KERNEL,
> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>                                     PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>                                     __builtin_return_address(0));
>  }
> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> index 0339b6bbe11a..a9abde62909f 100644
> --- a/arch/riscv/kernel/vmlinux.lds.S
> +++ b/arch/riscv/kernel/vmlinux.lds.S
> @@ -4,7 +4,8 @@
>   * Copyright (C) 2017 SiFive
>   */
>
> -#define LOAD_OFFSET PAGE_OFFSET
> +#include <asm/pgtable.h>
> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>  #include <asm/vmlinux.lds.h>
>  #include <asm/page.h>
>  #include <asm/cache.h>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 27a334106708..17f108baec4f 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -22,6 +22,9 @@
>
>  #include "../kernel/head.h"
>
> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
> +EXPORT_SYMBOL(kernel_virt_addr);
> +
>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>                                                         __page_aligned_bss;
>  EXPORT_SYMBOL(empty_zero_page);
> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>  }
>
>  #ifdef CONFIG_MMU
> +/* Offset between linear mapping virtual address and kernel load address */
>  unsigned long va_pa_offset;
>  EXPORT_SYMBOL(va_pa_offset);
> +/* Offset between kernel mapping virtual address and kernel load address */
> +unsigned long va_kernel_pa_offset;
> +EXPORT_SYMBOL(va_kernel_pa_offset);
>  unsigned long pfn_base;
>  EXPORT_SYMBOL(pfn_base);
>
> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>         if (mmu_enabled)
>                 return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>
> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>         BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>         return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>  }
> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>  #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>  #endif
>
> +static uintptr_t load_pa, load_sz;
> +
> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> +{
> +       uintptr_t va, end_va;
> +
> +       end_va = kernel_virt_addr + load_sz;
> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
> +               create_pgd_mapping(pgdir, va,
> +                                  load_pa + (va - kernel_virt_addr),
> +                                  map_size, PAGE_KERNEL_EXEC);
> +}
> +
>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  {
>         uintptr_t va, end_va;
> -       uintptr_t load_pa = (uintptr_t)(&_start);
> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>         uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>
> +       load_pa = (uintptr_t)(&_start);
> +       load_sz = (uintptr_t)(&_end) - load_pa;
> +
>         va_pa_offset = PAGE_OFFSET - load_pa;
> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
> +
>         pfn_base = PFN_DOWN(load_pa);
>
>         /*
> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>         create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>                            (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>         /* Setup trampoline PGD and PMD */
> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>                            (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>                            load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>  #else
>         /* Setup trampoline PGD */
> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>                            load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>  #endif
>
>         /*
> -        * Setup early PGD covering entire kernel which will allows
> +        * Setup early PGD covering entire kernel which will allow
>          * us to reach paging_init(). We map all memory banks later
>          * in setup_vm_final() below.
>          */
> -       end_va = PAGE_OFFSET + load_sz;
> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
> -               create_pgd_mapping(early_pg_dir, va,
> -                                  load_pa + (va - PAGE_OFFSET),
> -                                  map_size, PAGE_KERNEL_EXEC);
> +       create_kernel_page_table(early_pg_dir, map_size);
>
>         /* Create fixed mapping for early FDT parsing */
>         end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>         uintptr_t va, map_size;
>         phys_addr_t pa, start, end;
>         struct memblock_region *reg;
> +       static struct vm_struct vm_kernel = { 0 };
>
>         /* Set mmu_enabled flag */
>         mmu_enabled = true;
> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>                 for (pa = start; pa < end; pa += map_size) {
>                         va = (uintptr_t)__va(pa);
>                         create_pgd_mapping(swapper_pg_dir, va, pa,
> -                                          map_size, PAGE_KERNEL_EXEC);
> +                                          map_size, PAGE_KERNEL);
>                 }
>         }
>
> +       /* Map the kernel */
> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
> +
> +       /* Reserve the vmalloc area occupied by the kernel */
> +       vm_kernel.addr = (void *)kernel_virt_addr;
> +       vm_kernel.phys_addr = load_pa;
> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
> +       vm_kernel.caller = __builtin_return_address(0);
> +
> +       vm_area_add_early(&vm_kernel);
> +
>         /* Clear fixmap PTE and PMD mappings */
>         clear_fixmap(FIX_PTE);
>         clear_fixmap(FIX_PMD);
> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
> index e8e4dcd39fed..35703d5ef5fd 100644
> --- a/arch/riscv/mm/physaddr.c
> +++ b/arch/riscv/mm/physaddr.c
> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>
>  phys_addr_t __phys_addr_symbol(unsigned long x)
>  {
> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>         unsigned long kernel_end = (unsigned long)_end;
>
>         /*
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
@ 2020-05-26  9:43     ` Zong Li
  0 siblings, 0 replies; 30+ messages in thread
From: Zong Li @ 2020-05-26  9:43 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Albert Ou, Benjamin Herrenschmidt, Michael Ellerman, Anup Patel,
	linux-kernel@vger.kernel.org List, Atish Patra, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, linux-riscv, linuxppc-dev

On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> This is a preparatory patch for relocatable kernel.
>
> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
> physically at the beginning of the main memory. Therefore, we could use
> the linear mapping for the kernel mapping.
>
> But the relocated kernel base address will be different from PAGE_OFFSET
> and since in the linear mapping, two different virtual addresses cannot
> point to the same physical address, the kernel mapping needs to lie outside
> the linear mapping.
>
> In addition, because modules and BPF must be close to the kernel (inside
> a +-2GB window), the kernel is placed at the end of the vmalloc zone minus
> 2GB, which leaves room for modules and BPF. The kernel could not be
> placed at the beginning of the vmalloc zone since other vmalloc
> allocations from the kernel could take up the whole +-2GB window around
> the kernel, which would prevent new modules and BPF programs from being
> loaded.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/boot/loader.lds.S     |  3 +-
>  arch/riscv/include/asm/page.h    | 10 +++++-
>  arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>  arch/riscv/kernel/head.S         |  3 +-
>  arch/riscv/kernel/module.c       |  4 +--
>  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>  arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>  arch/riscv/mm/physaddr.c         |  2 +-
>  8 files changed, 87 insertions(+), 33 deletions(-)
>
> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> index 47a5003c2e28..62d94696a19c 100644
> --- a/arch/riscv/boot/loader.lds.S
> +++ b/arch/riscv/boot/loader.lds.S
> @@ -1,13 +1,14 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>
>  #include <asm/page.h>
> +#include <asm/pgtable.h>
>
>  OUTPUT_ARCH(riscv)
>  ENTRY(_start)
>
>  SECTIONS
>  {
> -       . = PAGE_OFFSET;
> +       . = KERNEL_LINK_ADDR;
>
>         .payload : {
>                 *(.payload)
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 2d50f76efe48..48bb09b6a9b7 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>
>  #ifdef CONFIG_MMU
>  extern unsigned long va_pa_offset;
> +extern unsigned long va_kernel_pa_offset;
>  extern unsigned long pfn_base;
>  #define ARCH_PFN_OFFSET                (pfn_base)
>  #else
>  #define va_pa_offset           0
> +#define va_kernel_pa_offset    0
>  #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>  #endif /* CONFIG_MMU */
>
>  extern unsigned long max_low_pfn;
>  extern unsigned long min_low_pfn;
> +extern unsigned long kernel_virt_addr;
>
>  #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
> +#define kernel_mapping_va_to_pa(x)     \
> +       ((unsigned long)(x) - va_kernel_pa_offset)
> +#define __va_to_pa_nodebug(x)          \
> +       (((x) >= PAGE_OFFSET) ?         \
> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>
>  #ifdef CONFIG_DEBUG_VIRTUAL
>  extern phys_addr_t __virt_to_phys(unsigned long x);
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 35b60035b6b0..25213cfaf680 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -11,23 +11,29 @@
>
>  #include <asm/pgtable-bits.h>
>
> -#ifndef __ASSEMBLY__
> -
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> -#include <asm/page.h>
> -#include <asm/tlbflush.h>
> -#include <linux/mm_types.h>
> -
> -#ifdef CONFIG_MMU
> +#ifndef CONFIG_MMU
> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
> +#else
> +/*
> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
> + * the kernel.
> + */
> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>
>  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>  #define VMALLOC_END      (PAGE_OFFSET - 1)
>  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>
>  #define BPF_JIT_REGION_SIZE    (SZ_128M)
> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> -#define BPF_JIT_REGION_END     (VMALLOC_END)
> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)

There seems to be a potential risk here: the BPF region overlaps the
kernel mapping, so if the kernel is larger than 128MB, the kernel
mapping would occupy the entire BPF region and leave no room for BPF
programs.
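
To make the overlap concrete, here is a minimal sketch, assuming the BPF
JIT region starts at the same virtual address as the kernel image, as
BPF_JIT_REGION_START does in this patch. The helper name bpf_region_free
is hypothetical and only illustrates the arithmetic; it ignores the
PMD_SIZE rounding applied when the kernel area is reserved.

```c
#include <stdint.h>

#define SZ_128M (UINT64_C(128) << 20)

/*
 * Hypothetical helper: bytes of the BPF JIT region that remain free once
 * the kernel image, which starts at the same virtual address, is mapped.
 */
static uint64_t bpf_region_free(uint64_t kernel_size)
{
	const uint64_t bpf_region_size = SZ_128M; /* BPF_JIT_REGION_SIZE */

	/* A kernel of 128MB or more covers the whole region. */
	if (kernel_size >= bpf_region_size)
		return 0;

	return bpf_region_size - kernel_size;
}
```

A 16MB kernel would leave 112MB for BPF under these assumptions, while
any kernel of 128MB or more would leave nothing.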

> +
> +#ifdef CONFIG_64BIT
> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
> +#define VMALLOC_MODULE_END     VMALLOC_END
> +#endif
>

Although kernel_virt_addr is a fixed address now, I think it could
change for a relocatable kernel or KASLR. If kernel_virt_addr ends up
more than 2G below VMALLOC_END, the module region would be too big. In
addition, modules can lie anywhere within +-2G of the kernel, so we are
no longer limited to one direction as before. It seems to me that the
module region could be decided at runtime, for example with
VMALLOC_MODULE_START as "&_end - 2G" and VMALLOC_MODULE_END as
"&_start + 2G". I'm not sure whether the BPF region has to be 128MB for
some particular reason; if not, maybe the BPF region could be the same
as the module region, to avoid being exhausted by modules.
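
The runtime computation suggested above could be sketched as follows.
This is only an illustration under stated assumptions: the struct
va_range type and the module_window helper are hypothetical, the bounds
are clamped to the vmalloc zone, and the addresses are assumed to be far
enough from 0 and from the top of the address space that the
subtraction and addition cannot wrap.

```c
#include <stdint.h>

#define SZ_2G (UINT64_C(1) << 31)

/* Hypothetical type: lower and upper bound of a virtual address range. */
struct va_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: window in which a module can still reach every
 * byte of the kernel image, i.e. "&_end - 2G" up to "&_start + 2G",
 * clamped to the vmalloc zone.
 */
static struct va_range module_window(uint64_t kernel_start, uint64_t kernel_end,
				     uint64_t vmalloc_start, uint64_t vmalloc_end)
{
	struct va_range r;

	r.start = kernel_end - SZ_2G;   /* lowest address that reaches _end */
	r.end = kernel_start + SZ_2G;   /* highest bound that reaches _start */

	if (r.start < vmalloc_start)
		r.start = vmalloc_start;
	if (r.end > vmalloc_end)
		r.end = vmalloc_end;

	return r;
}
```

With the kernel at VMALLOC_END - 2G + 1 the upper bound is simply
clamped to VMALLOC_END; with a kernel relocated into the middle of the
vmalloc zone the window extends on both sides of the image.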

>  /*
>   * Roughly size the vmemmap space to be large enough to fit enough
> @@ -57,9 +63,16 @@
>  #define FIXADDR_SIZE     PGDIR_SIZE
>  #endif
>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
> -
>  #endif
>
> +#ifndef __ASSEMBLY__
> +
> +/* Page Upper Directory not used in RISC-V */
> +#include <asm-generic/pgtable-nopud.h>
> +#include <asm/page.h>
> +#include <asm/tlbflush.h>
> +#include <linux/mm_types.h>
> +
>  #ifdef CONFIG_64BIT
>  #include <asm/pgtable-64.h>
>  #else
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 98a406474e7d..8f5bb7731327 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -49,7 +49,8 @@ ENTRY(_start)
>  #ifdef CONFIG_MMU
>  relocate:
>         /* Relocate return address */
> -       li a1, PAGE_OFFSET
> +       la a1, kernel_virt_addr
> +       REG_L a1, 0(a1)
>         la a2, _start
>         sub a1, a1, a2
>         add ra, ra, a1
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index 8bbe5dbe1341..1a8fbe05accf 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>  }
>
>  #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> -#define VMALLOC_MODULE_START \
> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>  void *module_alloc(unsigned long size)
>  {
>         return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> -                                   VMALLOC_END, GFP_KERNEL,
> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>                                     PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>                                     __builtin_return_address(0));
>  }
> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> index 0339b6bbe11a..a9abde62909f 100644
> --- a/arch/riscv/kernel/vmlinux.lds.S
> +++ b/arch/riscv/kernel/vmlinux.lds.S
> @@ -4,7 +4,8 @@
>   * Copyright (C) 2017 SiFive
>   */
>
> -#define LOAD_OFFSET PAGE_OFFSET
> +#include <asm/pgtable.h>
> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>  #include <asm/vmlinux.lds.h>
>  #include <asm/page.h>
>  #include <asm/cache.h>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 27a334106708..17f108baec4f 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -22,6 +22,9 @@
>
>  #include "../kernel/head.h"
>
> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
> +EXPORT_SYMBOL(kernel_virt_addr);
> +
>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>                                                         __page_aligned_bss;
>  EXPORT_SYMBOL(empty_zero_page);
> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>  }
>
>  #ifdef CONFIG_MMU
> +/* Offset between linear mapping virtual address and kernel load address */
>  unsigned long va_pa_offset;
>  EXPORT_SYMBOL(va_pa_offset);
> +/* Offset between kernel mapping virtual address and kernel load address */
> +unsigned long va_kernel_pa_offset;
> +EXPORT_SYMBOL(va_kernel_pa_offset);
>  unsigned long pfn_base;
>  EXPORT_SYMBOL(pfn_base);
>
> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>         if (mmu_enabled)
>                 return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>
> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>         BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>         return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>  }
> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>  #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>  #endif
>
> +static uintptr_t load_pa, load_sz;
> +
> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> +{
> +       uintptr_t va, end_va;
> +
> +       end_va = kernel_virt_addr + load_sz;
> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
> +               create_pgd_mapping(pgdir, va,
> +                                  load_pa + (va - kernel_virt_addr),
> +                                  map_size, PAGE_KERNEL_EXEC);
> +}
> +
>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  {
>         uintptr_t va, end_va;
> -       uintptr_t load_pa = (uintptr_t)(&_start);
> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>         uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>
> +       load_pa = (uintptr_t)(&_start);
> +       load_sz = (uintptr_t)(&_end) - load_pa;
> +
>         va_pa_offset = PAGE_OFFSET - load_pa;
> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
> +
>         pfn_base = PFN_DOWN(load_pa);
>
>         /*
> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>         create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>                            (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>         /* Setup trampoline PGD and PMD */
> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>                            (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>                            load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>  #else
>         /* Setup trampoline PGD */
> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>                            load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>  #endif
>
>         /*
> -        * Setup early PGD covering entire kernel which will allows
> +        * Setup early PGD covering entire kernel which will allow
>          * us to reach paging_init(). We map all memory banks later
>          * in setup_vm_final() below.
>          */
> -       end_va = PAGE_OFFSET + load_sz;
> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
> -               create_pgd_mapping(early_pg_dir, va,
> -                                  load_pa + (va - PAGE_OFFSET),
> -                                  map_size, PAGE_KERNEL_EXEC);
> +       create_kernel_page_table(early_pg_dir, map_size);
>
>         /* Create fixed mapping for early FDT parsing */
>         end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>         uintptr_t va, map_size;
>         phys_addr_t pa, start, end;
>         struct memblock_region *reg;
> +       static struct vm_struct vm_kernel = { 0 };
>
>         /* Set mmu_enabled flag */
>         mmu_enabled = true;
> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>                 for (pa = start; pa < end; pa += map_size) {
>                         va = (uintptr_t)__va(pa);
>                         create_pgd_mapping(swapper_pg_dir, va, pa,
> -                                          map_size, PAGE_KERNEL_EXEC);
> +                                          map_size, PAGE_KERNEL);
>                 }
>         }
>
> +       /* Map the kernel */
> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
> +
> +       /* Reserve the vmalloc area occupied by the kernel */
> +       vm_kernel.addr = (void *)kernel_virt_addr;
> +       vm_kernel.phys_addr = load_pa;
> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
> +       vm_kernel.caller = __builtin_return_address(0);
> +
> +       vm_area_add_early(&vm_kernel);
> +
>         /* Clear fixmap PTE and PMD mappings */
>         clear_fixmap(FIX_PTE);
>         clear_fixmap(FIX_PMD);
> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
> index e8e4dcd39fed..35703d5ef5fd 100644
> --- a/arch/riscv/mm/physaddr.c
> +++ b/arch/riscv/mm/physaddr.c
> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>
>  phys_addr_t __phys_addr_symbol(unsigned long x)
>  {
> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>         unsigned long kernel_end = (unsigned long)_end;
>
>         /*
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-26  9:43     ` Zong Li
  (?)
@ 2020-05-26 17:06       ` Alex Ghiti
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-26 17:06 UTC (permalink / raw)
  To: Zong Li
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, linux-kernel@vger.kernel.org List, linuxppc-dev,
	linux-riscv

Hi Zong,

On 5/26/20 at 5:43 AM, Zong Li wrote:
> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> This is a preparatory patch for relocatable kernel.
>>
>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>> physically at the beginning of the main memory. Therefore, we could use
>> the linear mapping for the kernel mapping.
>>
>> But the relocated kernel base address will be different from PAGE_OFFSET
>> and since in the linear mapping, two different virtual addresses cannot
>> point to the same physical address, the kernel mapping needs to lie outside
>> the linear mapping.
>>
>> In addition, because modules and BPF must be close to the kernel (inside
>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>> 2GB, which leaves room for modules and BPF. The kernel could not be
>> placed at the beginning of the vmalloc zone since other vmalloc
>> allocations from the kernel could get all the +-2GB window around the
>> kernel which would prevent new modules and BPF programs to be loaded.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>   arch/riscv/boot/loader.lds.S     |  3 +-
>>   arch/riscv/include/asm/page.h    | 10 +++++-
>>   arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>   arch/riscv/kernel/head.S         |  3 +-
>>   arch/riscv/kernel/module.c       |  4 +--
>>   arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>   arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>   arch/riscv/mm/physaddr.c         |  2 +-
>>   8 files changed, 87 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>> index 47a5003c2e28..62d94696a19c 100644
>> --- a/arch/riscv/boot/loader.lds.S
>> +++ b/arch/riscv/boot/loader.lds.S
>> @@ -1,13 +1,14 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>
>>   #include <asm/page.h>
>> +#include <asm/pgtable.h>
>>
>>   OUTPUT_ARCH(riscv)
>>   ENTRY(_start)
>>
>>   SECTIONS
>>   {
>> -       . = PAGE_OFFSET;
>> +       . = KERNEL_LINK_ADDR;
>>
>>          .payload : {
>>                  *(.payload)
>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>> index 2d50f76efe48..48bb09b6a9b7 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>
>>   #ifdef CONFIG_MMU
>>   extern unsigned long va_pa_offset;
>> +extern unsigned long va_kernel_pa_offset;
>>   extern unsigned long pfn_base;
>>   #define ARCH_PFN_OFFSET                (pfn_base)
>>   #else
>>   #define va_pa_offset           0
>> +#define va_kernel_pa_offset    0
>>   #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>   #endif /* CONFIG_MMU */
>>
>>   extern unsigned long max_low_pfn;
>>   extern unsigned long min_low_pfn;
>> +extern unsigned long kernel_virt_addr;
>>
>>   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
>> +#define kernel_mapping_va_to_pa(x)     \
>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>> +#define __va_to_pa_nodebug(x)          \
>> +       (((x) >= PAGE_OFFSET) ?         \
>> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>
>>   #ifdef CONFIG_DEBUG_VIRTUAL
>>   extern phys_addr_t __virt_to_phys(unsigned long x);
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 35b60035b6b0..25213cfaf680 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -11,23 +11,29 @@
>>
>>   #include <asm/pgtable-bits.h>
>>
>> -#ifndef __ASSEMBLY__
>> -
>> -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> -#include <asm/page.h>
>> -#include <asm/tlbflush.h>
>> -#include <linux/mm_types.h>
>> -
>> -#ifdef CONFIG_MMU
>> +#ifndef CONFIG_MMU
>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>> +#else
>> +/*
>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>> + * the kernel.
>> + */
>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>
>>   #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>   #define VMALLOC_END      (PAGE_OFFSET - 1)
>>   #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>
>>   #define BPF_JIT_REGION_SIZE    (SZ_128M)
>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
> There seems to be a potential risk here: the BPF region overlaps the
> kernel mapping, so if the kernel is larger than 128MB, the kernel
> mapping would occupy the entire BPF region and leave no room for BPF
> programs.
>
>> +
>> +#ifdef CONFIG_64BIT
>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>> +#define VMALLOC_MODULE_END     VMALLOC_END
>> +#endif
>>
> Although kernel_virt_addr is a fixed address now, I think it could
> change for a relocatable kernel or KASLR. If kernel_virt_addr ends up
> more than 2G below VMALLOC_END, the module region would be too big.


Yes, you're right: it is wrong to allow modules to lie outside
the 2G window. Thanks for noticing.
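
For illustration only (this is not code from the patch): the 2G window
discussed here corresponds to pc-relative references that carry a signed
32-bit offset. A minimal sketch of such a reach check follows; the helper
name within_pcrel32() and the addresses are hypothetical, and the exact
reach of a given instruction pair may differ slightly from +-2GB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: true when the distance
 * from `from` to `to` fits in a signed 32-bit offset (about +-2GB).
 */
static bool within_pcrel32(uint64_t from, uint64_t to)
{
	if (to >= from)
		return to - from <= (uint64_t)INT32_MAX;
	return from - to <= (uint64_t)INT32_MAX + 1;
}

/* Usage with a made-up kernel base address. */
static void within_pcrel32_example(void)
{
	uint64_t kernel = 0xffffffdf80000000ULL;

	/* One byte short of 2GB above the base: reachable. */
	assert(within_pcrel32(kernel, kernel + 0x7fffffffULL));
	/* Exactly 2GB above the base: out of reach. */
	assert(!within_pcrel32(kernel, kernel + 0x80000000ULL));
	/* Exactly 2GB below the base: still reachable. */
	assert(within_pcrel32(kernel, kernel - 0x80000000ULL));
}
```

A module placed where this check fails for some kernel symbol could not
reference that symbol with a 32-bit pc-relative offset.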


> In addition, the module region could be
> +-2G around the kernel, so we would not be limited to one direction as
> before. It seems to me that the module region could be decided
> at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
> VMALLOC_MODULE_END is "&_start + 2G".


I had tried that, but since we need to make sure the BPF region is
distinct from the module region, the macro definitions become really
cumbersome. I'll give it another try anyway. I also tried to use _end
and _start here but it failed; I have to debug this.
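
As a rough sketch of the runtime computation suggested above (an
illustration under stated assumptions, not the code that was tried): the
module range is [&_end - 2G, &_start + 2G), clamped to the vmalloc zone.
The struct va_range type, the module_window() helper and all addresses
are hypothetical, and the vmalloc end is treated as exclusive here,
unlike VMALLOC_END in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2G 0x80000000ULL

/* Half-open virtual address range [start, end). */
struct va_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: candidate module range
 * [kernel_end - 2G, kernel_start + 2G), clamped to the vmalloc zone.
 */
static struct va_range module_window(uint64_t kernel_start,
				     uint64_t kernel_end,
				     struct va_range vmalloc)
{
	struct va_range r;
	/* Guard against wrap-around at both ends of the address space. */
	uint64_t lo = kernel_end > SZ_2G ? kernel_end - SZ_2G : 0;
	uint64_t hi = kernel_start > UINT64_MAX - SZ_2G ?
		      UINT64_MAX : kernel_start + SZ_2G;

	r.start = lo > vmalloc.start ? lo : vmalloc.start;
	r.end = hi < vmalloc.end ? hi : vmalloc.end;
	return r;
}

/* Usage with made-up addresses: kernel 2G below the end of vmalloc. */
static void module_window_example(void)
{
	struct va_range vmalloc = {
		0xffffffd000000000ULL, 0xffffffe000000000ULL
	};
	uint64_t kstart = 0xffffffdf80000000ULL;
	uint64_t kend = kstart + 0x1000000ULL;	/* pretend 16MB image */
	struct va_range w = module_window(kstart, kend, vmalloc);

	assert(w.start == 0xffffffdf01000000ULL);
	assert(w.end == 0xffffffe000000000ULL);
}
```

With the kernel 2G below the end of the vmalloc zone, the upper bound is
set by the vmalloc end and the lower bound by &_end - 2G, so the window
extends in both directions around the image.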


>   I'm not sure whether the BPF region
> has to be 128MB for some particular reason. If not,
> maybe the BPF region could be the same as the module region, to avoid
> it being exhausted by modules.


On the contrary, the BPF region must not be the same as the module
region, since in that case modules could take all the space and
cause BPF to fail.


Thanks for your review, Zong,


Alex


>
>>   /*
>>    * Roughly size the vmemmap space to be large enough to fit enough
>> @@ -57,9 +63,16 @@
>>   #define FIXADDR_SIZE     PGDIR_SIZE
>>   #endif
>>   #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>> -
>>   #endif
>>
>> +#ifndef __ASSEMBLY__
>> +
>> +/* Page Upper Directory not used in RISC-V */
>> +#include <asm-generic/pgtable-nopud.h>
>> +#include <asm/page.h>
>> +#include <asm/tlbflush.h>
>> +#include <linux/mm_types.h>
>> +
>>   #ifdef CONFIG_64BIT
>>   #include <asm/pgtable-64.h>
>>   #else
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 98a406474e7d..8f5bb7731327 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>   #ifdef CONFIG_MMU
>>   relocate:
>>          /* Relocate return address */
>> -       li a1, PAGE_OFFSET
>> +       la a1, kernel_virt_addr
>> +       REG_L a1, 0(a1)
>>          la a2, _start
>>          sub a1, a1, a2
>>          add ra, ra, a1
>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>> index 8bbe5dbe1341..1a8fbe05accf 100644
>> --- a/arch/riscv/kernel/module.c
>> +++ b/arch/riscv/kernel/module.c
>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>>   }
>>
>>   #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>> -#define VMALLOC_MODULE_START \
>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>   void *module_alloc(unsigned long size)
>>   {
>>          return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>> -                                   VMALLOC_END, GFP_KERNEL,
>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>                                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>                                      __builtin_return_address(0));
>>   }
>> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
>> index 0339b6bbe11a..a9abde62909f 100644
>> --- a/arch/riscv/kernel/vmlinux.lds.S
>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>> @@ -4,7 +4,8 @@
>>    * Copyright (C) 2017 SiFive
>>    */
>>
>> -#define LOAD_OFFSET PAGE_OFFSET
>> +#include <asm/pgtable.h>
>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>   #include <asm/vmlinux.lds.h>
>>   #include <asm/page.h>
>>   #include <asm/cache.h>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 27a334106708..17f108baec4f 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -22,6 +22,9 @@
>>
>>   #include "../kernel/head.h"
>>
>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>> +EXPORT_SYMBOL(kernel_virt_addr);
>> +
>>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>                                                          __page_aligned_bss;
>>   EXPORT_SYMBOL(empty_zero_page);
>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>   }
>>
>>   #ifdef CONFIG_MMU
>> +/* Offset between linear mapping virtual address and kernel load address */
>>   unsigned long va_pa_offset;
>>   EXPORT_SYMBOL(va_pa_offset);
>> +/* Offset between kernel mapping virtual address and kernel load address */
>> +unsigned long va_kernel_pa_offset;
>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>   unsigned long pfn_base;
>>   EXPORT_SYMBOL(pfn_base);
>>
>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>          if (mmu_enabled)
>>                  return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>
>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>          BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>          return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>   }
>> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>>   #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>>   #endif
>>
>> +static uintptr_t load_pa, load_sz;
>> +
>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>> +{
>> +       uintptr_t va, end_va;
>> +
>> +       end_va = kernel_virt_addr + load_sz;
>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>> +               create_pgd_mapping(pgdir, va,
>> +                                  load_pa + (va - kernel_virt_addr),
>> +                                  map_size, PAGE_KERNEL_EXEC);
>> +}
>> +
>>   asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>   {
>>          uintptr_t va, end_va;
>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>          uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>
>> +       load_pa = (uintptr_t)(&_start);
>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>> +
>>          va_pa_offset = PAGE_OFFSET - load_pa;
>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>> +
>>          pfn_base = PFN_DOWN(load_pa);
>>
>>          /*
>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>          create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>                             (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>          /* Setup trampoline PGD and PMD */
>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>                             (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>                             load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>   #else
>>          /* Setup trampoline PGD */
>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>                             load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>   #endif
>>
>>          /*
>> -        * Setup early PGD covering entire kernel which will allows
>> +        * Setup early PGD covering entire kernel which will allow
>>           * us to reach paging_init(). We map all memory banks later
>>           * in setup_vm_final() below.
>>           */
>> -       end_va = PAGE_OFFSET + load_sz;
>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>> -               create_pgd_mapping(early_pg_dir, va,
>> -                                  load_pa + (va - PAGE_OFFSET),
>> -                                  map_size, PAGE_KERNEL_EXEC);
>> +       create_kernel_page_table(early_pg_dir, map_size);
>>
>>          /* Create fixed mapping for early FDT parsing */
>>          end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>          uintptr_t va, map_size;
>>          phys_addr_t pa, start, end;
>>          struct memblock_region *reg;
>> +       static struct vm_struct vm_kernel = { 0 };
>>
>>          /* Set mmu_enabled flag */
>>          mmu_enabled = true;
>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>                  for (pa = start; pa < end; pa += map_size) {
>>                          va = (uintptr_t)__va(pa);
>>                          create_pgd_mapping(swapper_pg_dir, va, pa,
>> -                                          map_size, PAGE_KERNEL_EXEC);
>> +                                          map_size, PAGE_KERNEL);
>>                  }
>>          }
>>
>> +       /* Map the kernel */
>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>> +
>> +       /* Reserve the vmalloc area occupied by the kernel */
>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>> +       vm_kernel.phys_addr = load_pa;
>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>> +       vm_kernel.caller = __builtin_return_address(0);
>> +
>> +       vm_area_add_early(&vm_kernel);
>> +
>>          /* Clear fixmap PTE and PMD mappings */
>>          clear_fixmap(FIX_PTE);
>>          clear_fixmap(FIX_PMD);
>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>> index e8e4dcd39fed..35703d5ef5fd 100644
>> --- a/arch/riscv/mm/physaddr.c
>> +++ b/arch/riscv/mm/physaddr.c
>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>
>>   phys_addr_t __phys_addr_symbol(unsigned long x)
>>   {
>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>          unsigned long kernel_end = (unsigned long)_end;
>>
>>          /*
>> --
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
@ 2020-05-26 17:06       ` Alex Ghiti
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-26 17:06 UTC (permalink / raw)
  To: Zong Li
  Cc: Albert Ou, Anup Patel, linux-kernel@vger.kernel.org List,
	Atish Patra, Paul Mackerras, Paul Walmsley, Palmer Dabbelt,
	linux-riscv, linuxppc-dev

Hi Zong,

On 5/26/20 at 5:43 AM, Zong Li wrote:
> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> This is a preparatory patch for relocatable kernel.
>>
>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>> physically at the beginning of the main memory. Therefore, we could use
>> the linear mapping for the kernel mapping.
>>
>> But the relocated kernel base address will be different from PAGE_OFFSET
>> and since in the linear mapping, two different virtual addresses cannot
>> point to the same physical address, the kernel mapping needs to lie outside
>> the linear mapping.
>>
>> In addition, because modules and BPF must be close to the kernel (inside
>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>> 2GB, which leaves room for modules and BPF. The kernel could not be
>> placed at the beginning of the vmalloc zone since other vmalloc
>> allocations from the kernel could take up the whole +-2GB window around
>> the kernel, which would prevent new modules and BPF programs from being
>> loaded.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>   arch/riscv/boot/loader.lds.S     |  3 +-
>>   arch/riscv/include/asm/page.h    | 10 +++++-
>>   arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>   arch/riscv/kernel/head.S         |  3 +-
>>   arch/riscv/kernel/module.c       |  4 +--
>>   arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>   arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>   arch/riscv/mm/physaddr.c         |  2 +-
>>   8 files changed, 87 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>> index 47a5003c2e28..62d94696a19c 100644
>> --- a/arch/riscv/boot/loader.lds.S
>> +++ b/arch/riscv/boot/loader.lds.S
>> @@ -1,13 +1,14 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>
>>   #include <asm/page.h>
>> +#include <asm/pgtable.h>
>>
>>   OUTPUT_ARCH(riscv)
>>   ENTRY(_start)
>>
>>   SECTIONS
>>   {
>> -       . = PAGE_OFFSET;
>> +       . = KERNEL_LINK_ADDR;
>>
>>          .payload : {
>>                  *(.payload)
>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>> index 2d50f76efe48..48bb09b6a9b7 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>
>>   #ifdef CONFIG_MMU
>>   extern unsigned long va_pa_offset;
>> +extern unsigned long va_kernel_pa_offset;
>>   extern unsigned long pfn_base;
>>   #define ARCH_PFN_OFFSET                (pfn_base)
>>   #else
>>   #define va_pa_offset           0
>> +#define va_kernel_pa_offset    0
>>   #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>   #endif /* CONFIG_MMU */
>>
>>   extern unsigned long max_low_pfn;
>>   extern unsigned long min_low_pfn;
>> +extern unsigned long kernel_virt_addr;
>>
>>   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
>> +#define kernel_mapping_va_to_pa(x)     \
>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>> +#define __va_to_pa_nodebug(x)          \
>> +       (((x) >= PAGE_OFFSET) ?         \
>> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>
>>   #ifdef CONFIG_DEBUG_VIRTUAL
>>   extern phys_addr_t __virt_to_phys(unsigned long x);
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 35b60035b6b0..25213cfaf680 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -11,23 +11,29 @@
>>
>>   #include <asm/pgtable-bits.h>
>>
>> -#ifndef __ASSEMBLY__
>> -
>> -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> -#include <asm/page.h>
>> -#include <asm/tlbflush.h>
>> -#include <linux/mm_types.h>
>> -
>> -#ifdef CONFIG_MMU
>> +#ifndef CONFIG_MMU
>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>> +#else
>> +/*
>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>> + * the kernel.
>> + */
>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>
>>   #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>   #define VMALLOC_END      (PAGE_OFFSET - 1)
>>   #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>
>>   #define BPF_JIT_REGION_SIZE    (SZ_128M)
>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
> There seems to be a potential risk here: the BPF region overlaps the
> kernel mapping, so if the kernel is larger than 128MB, the BPF region
> would be fully occupied by the kernel mapping.
>
>> +
>> +#ifdef CONFIG_64BIT
>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>> +#define VMALLOC_MODULE_END     VMALLOC_END
>> +#endif
>>
> Although kernel_virt_addr is a fixed address now, I think it could
> change for a relocatable kernel or KASLR, so if kernel_virt_addr ends
> up more than 2G away from VMALLOC_END, the module region would be too
> big.


Yes, you're right: it is wrong to allow modules to lie outside
the 2G window. Thanks for noticing.


> In addition, the module region could be
> +-2G around the kernel, so we would not be limited to one direction as
> before. It seems to me that the module region could be decided
> at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
> VMALLOC_MODULE_END is "&_start + 2G".


I had tried that, but since we need to make sure the BPF region is
distinct from the module region, the macro definitions become really
cumbersome. I'll give it another try anyway. I also tried to use _end
and _start here but it failed; I have to debug this.


>   I'm not sure whether the BPF region
> has to be 128MB for some particular reason. If not,
> maybe the BPF region could be the same as the module region, to avoid
> it being exhausted by modules.


On the contrary, the BPF region must not be the same as the module
region, since in that case modules could take all the space and
cause BPF to fail.
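
To make the two constraints in this exchange concrete (a sketch only,
under stated assumptions): the BPF region should overlap neither the
kernel image nor the module region. The layout below, where BPF starts
at the end of the kernel and modules take what is left, is just one
hypothetical way to satisfy both; struct va_range, the helper names and
the addresses are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M 0x08000000ULL

/* Half-open virtual address range [start, end). */
struct va_range {
	uint64_t start;
	uint64_t end;
};

static bool ranges_overlap(struct va_range a, struct va_range b)
{
	return a.start < b.end && b.start < a.end;
}

/* BPF region as in the quoted hunk: it starts at the kernel base. */
static struct va_range bpf_at_kernel_base(struct va_range kernel)
{
	return (struct va_range){ kernel.start, kernel.start + SZ_128M };
}

/* Hypothetical alternative: BPF region starts where the kernel ends. */
static struct va_range bpf_after_kernel(struct va_range kernel)
{
	return (struct va_range){ kernel.end, kernel.end + SZ_128M };
}

/* Hypothetical: modules get what is left between BPF and `limit`. */
static struct va_range modules_after_bpf(struct va_range bpf,
					 uint64_t limit)
{
	return (struct va_range){ bpf.end, limit };
}

/* Usage with a made-up 16MB kernel image. */
static void layout_example(void)
{
	struct va_range kernel = {
		0xffffffdf80000000ULL, 0xffffffdf81000000ULL
	};
	struct va_range bpf = bpf_after_kernel(kernel);
	struct va_range mod = modules_after_bpf(bpf, 0xffffffe000000000ULL);

	/* Starting BPF at the kernel base overlaps the image. */
	assert(ranges_overlap(bpf_at_kernel_base(kernel), kernel));
	/* The alternative layout keeps all three ranges disjoint. */
	assert(!ranges_overlap(bpf, kernel));
	assert(!ranges_overlap(bpf, mod));
}
```

Because the ranges are disjoint, module allocations cannot consume the
space set aside for BPF, and the kernel image cannot consume it either.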


Thanks for your review, Zong,


Alex


>
>>   /*
>>    * Roughly size the vmemmap space to be large enough to fit enough
>> @@ -57,9 +63,16 @@
>>   #define FIXADDR_SIZE     PGDIR_SIZE
>>   #endif
>>   #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>> -
>>   #endif
>>
>> +#ifndef __ASSEMBLY__
>> +
>> +/* Page Upper Directory not used in RISC-V */
>> +#include <asm-generic/pgtable-nopud.h>
>> +#include <asm/page.h>
>> +#include <asm/tlbflush.h>
>> +#include <linux/mm_types.h>
>> +
>>   #ifdef CONFIG_64BIT
>>   #include <asm/pgtable-64.h>
>>   #else
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 98a406474e7d..8f5bb7731327 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>   #ifdef CONFIG_MMU
>>   relocate:
>>          /* Relocate return address */
>> -       li a1, PAGE_OFFSET
>> +       la a1, kernel_virt_addr
>> +       REG_L a1, 0(a1)
>>          la a2, _start
>>          sub a1, a1, a2
>>          add ra, ra, a1
>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>> index 8bbe5dbe1341..1a8fbe05accf 100644
>> --- a/arch/riscv/kernel/module.c
>> +++ b/arch/riscv/kernel/module.c
>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>>   }
>>
>>   #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>> -#define VMALLOC_MODULE_START \
>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>   void *module_alloc(unsigned long size)
>>   {
>>          return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>> -                                   VMALLOC_END, GFP_KERNEL,
>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>                                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>                                      __builtin_return_address(0));
>>   }
>> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
>> index 0339b6bbe11a..a9abde62909f 100644
>> --- a/arch/riscv/kernel/vmlinux.lds.S
>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>> @@ -4,7 +4,8 @@
>>    * Copyright (C) 2017 SiFive
>>    */
>>
>> -#define LOAD_OFFSET PAGE_OFFSET
>> +#include <asm/pgtable.h>
>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>   #include <asm/vmlinux.lds.h>
>>   #include <asm/page.h>
>>   #include <asm/cache.h>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 27a334106708..17f108baec4f 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -22,6 +22,9 @@
>>
>>   #include "../kernel/head.h"
>>
>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>> +EXPORT_SYMBOL(kernel_virt_addr);
>> +
>>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>                                                          __page_aligned_bss;
>>   EXPORT_SYMBOL(empty_zero_page);
>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>   }
>>
>>   #ifdef CONFIG_MMU
>> +/* Offset between linear mapping virtual address and kernel load address */
>>   unsigned long va_pa_offset;
>>   EXPORT_SYMBOL(va_pa_offset);
>> +/* Offset between kernel mapping virtual address and kernel load address */
>> +unsigned long va_kernel_pa_offset;
>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>   unsigned long pfn_base;
>>   EXPORT_SYMBOL(pfn_base);
>>
>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>          if (mmu_enabled)
>>                  return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>
>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>          BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>          return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>   }
>> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>>   #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>>   #endif
>>
>> +static uintptr_t load_pa, load_sz;
>> +
>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>> +{
>> +       uintptr_t va, end_va;
>> +
>> +       end_va = kernel_virt_addr + load_sz;
>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>> +               create_pgd_mapping(pgdir, va,
>> +                                  load_pa + (va - kernel_virt_addr),
>> +                                  map_size, PAGE_KERNEL_EXEC);
>> +}
>> +
>>   asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>   {
>>          uintptr_t va, end_va;
>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>          uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>
>> +       load_pa = (uintptr_t)(&_start);
>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>> +
>>          va_pa_offset = PAGE_OFFSET - load_pa;
>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>> +
>>          pfn_base = PFN_DOWN(load_pa);
>>
>>          /*
>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>          create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>                             (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>          /* Setup trampoline PGD and PMD */
>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>                             (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>                             load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>   #else
>>          /* Setup trampoline PGD */
>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>                             load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>   #endif
>>
>>          /*
>> -        * Setup early PGD covering entire kernel which will allows
>> +        * Setup early PGD covering entire kernel which will allow
>>           * us to reach paging_init(). We map all memory banks later
>>           * in setup_vm_final() below.
>>           */
>> -       end_va = PAGE_OFFSET + load_sz;
>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>> -               create_pgd_mapping(early_pg_dir, va,
>> -                                  load_pa + (va - PAGE_OFFSET),
>> -                                  map_size, PAGE_KERNEL_EXEC);
>> +       create_kernel_page_table(early_pg_dir, map_size);
>>
>>          /* Create fixed mapping for early FDT parsing */
>>          end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>          uintptr_t va, map_size;
>>          phys_addr_t pa, start, end;
>>          struct memblock_region *reg;
>> +       static struct vm_struct vm_kernel = { 0 };
>>
>>          /* Set mmu_enabled flag */
>>          mmu_enabled = true;
>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>                  for (pa = start; pa < end; pa += map_size) {
>>                          va = (uintptr_t)__va(pa);
>>                          create_pgd_mapping(swapper_pg_dir, va, pa,
>> -                                          map_size, PAGE_KERNEL_EXEC);
>> +                                          map_size, PAGE_KERNEL);
>>                  }
>>          }
>>
>> +       /* Map the kernel */
>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>> +
>> +       /* Reserve the vmalloc area occupied by the kernel */
>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>> +       vm_kernel.phys_addr = load_pa;
>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>> +       vm_kernel.caller = __builtin_return_address(0);
>> +
>> +       vm_area_add_early(&vm_kernel);
>> +
>>          /* Clear fixmap PTE and PMD mappings */
>>          clear_fixmap(FIX_PTE);
>>          clear_fixmap(FIX_PMD);
>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>> index e8e4dcd39fed..35703d5ef5fd 100644
>> --- a/arch/riscv/mm/physaddr.c
>> +++ b/arch/riscv/mm/physaddr.c
>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>
>>   phys_addr_t __phys_addr_symbol(unsigned long x)
>>   {
>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>          unsigned long kernel_end = (unsigned long)_end;
>>
>>          /*
>> --
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-26 17:06       ` Alex Ghiti
  (?)
@ 2020-05-27  6:05         ` Zong Li
  -1 siblings, 0 replies; 30+ messages in thread
From: Zong Li @ 2020-05-27  6:05 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, linux-kernel@vger.kernel.org List, linuxppc-dev,
	linux-riscv

On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>
> Hi Zong,
>
> Le 5/26/20 à 5:43 AM, Zong Li a écrit :
> > On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >> This is a preparatory patch for relocatable kernel.
> >>
> >> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
> >> physically at the beginning of the main memory. Therefore, we could use
> >> the linear mapping for the kernel mapping.
> >>
> >> But the relocated kernel base address will be different from PAGE_OFFSET
> >> and since in the linear mapping, two different virtual addresses cannot
> >> point to the same physical address, the kernel mapping needs to lie outside
> >> the linear mapping.
> >>
> >> In addition, because modules and BPF must be close to the kernel (inside
> >> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
> >> 2GB, which leaves room for modules and BPF. The kernel could not be
> >> placed at the beginning of the vmalloc zone since other vmalloc
> >> allocations from the kernel could get all the +-2GB window around the
> >> kernel which would prevent new modules and BPF programs from being loaded.
> >>
> >> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> >> ---
> >>   arch/riscv/boot/loader.lds.S     |  3 +-
> >>   arch/riscv/include/asm/page.h    | 10 +++++-
> >>   arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
> >>   arch/riscv/kernel/head.S         |  3 +-
> >>   arch/riscv/kernel/module.c       |  4 +--
> >>   arch/riscv/kernel/vmlinux.lds.S  |  3 +-
> >>   arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
> >>   arch/riscv/mm/physaddr.c         |  2 +-
> >>   8 files changed, 87 insertions(+), 33 deletions(-)
> >>
> >> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> >> index 47a5003c2e28..62d94696a19c 100644
> >> --- a/arch/riscv/boot/loader.lds.S
> >> +++ b/arch/riscv/boot/loader.lds.S
> >> @@ -1,13 +1,14 @@
> >>   /* SPDX-License-Identifier: GPL-2.0 */
> >>
> >>   #include <asm/page.h>
> >> +#include <asm/pgtable.h>
> >>
> >>   OUTPUT_ARCH(riscv)
> >>   ENTRY(_start)
> >>
> >>   SECTIONS
> >>   {
> >> -       . = PAGE_OFFSET;
> >> +       . = KERNEL_LINK_ADDR;
> >>
> >>          .payload : {
> >>                  *(.payload)
> >> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >> index 2d50f76efe48..48bb09b6a9b7 100644
> >> --- a/arch/riscv/include/asm/page.h
> >> +++ b/arch/riscv/include/asm/page.h
> >> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
> >>
> >>   #ifdef CONFIG_MMU
> >>   extern unsigned long va_pa_offset;
> >> +extern unsigned long va_kernel_pa_offset;
> >>   extern unsigned long pfn_base;
> >>   #define ARCH_PFN_OFFSET                (pfn_base)
> >>   #else
> >>   #define va_pa_offset           0
> >> +#define va_kernel_pa_offset    0
> >>   #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
> >>   #endif /* CONFIG_MMU */
> >>
> >>   extern unsigned long max_low_pfn;
> >>   extern unsigned long min_low_pfn;
> >> +extern unsigned long kernel_virt_addr;
> >>
> >>   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
> >> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
> >> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
> >> +#define kernel_mapping_va_to_pa(x)     \
> >> +       ((unsigned long)(x) - va_kernel_pa_offset)
> >> +#define __va_to_pa_nodebug(x)          \
> >> +       (((x) >= PAGE_OFFSET) ?         \
> >> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
> >>
> >>   #ifdef CONFIG_DEBUG_VIRTUAL
> >>   extern phys_addr_t __virt_to_phys(unsigned long x);
> >> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >> index 35b60035b6b0..25213cfaf680 100644
> >> --- a/arch/riscv/include/asm/pgtable.h
> >> +++ b/arch/riscv/include/asm/pgtable.h
> >> @@ -11,23 +11,29 @@
> >>
> >>   #include <asm/pgtable-bits.h>
> >>
> >> -#ifndef __ASSEMBLY__
> >> -
> >> -/* Page Upper Directory not used in RISC-V */
> >> -#include <asm-generic/pgtable-nopud.h>
> >> -#include <asm/page.h>
> >> -#include <asm/tlbflush.h>
> >> -#include <linux/mm_types.h>
> >> -
> >> -#ifdef CONFIG_MMU
> >> +#ifndef CONFIG_MMU
> >> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
> >> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
> >> +#else
> >> +/*
> >> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
> >> + * the kernel.
> >> + */
> >> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
> >> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
> >>
> >>   #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
> >>   #define VMALLOC_END      (PAGE_OFFSET - 1)
> >>   #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
> >>
> >>   #define BPF_JIT_REGION_SIZE    (SZ_128M)
> >> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> >> -#define BPF_JIT_REGION_END     (VMALLOC_END)
> >> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
> >> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
> > There seems to be a potential risk here: the BPF region overlaps
> > the kernel mapping, so if the kernel is bigger than 128MB, the
> > BPF region would be entirely consumed by the kernel mapping.

Does the risk I mentioned above actually exist?
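
To make the concern concrete, here is a small user-space sketch of the
arithmetic. It assumes, as in the patch, that both the kernel mapping and
the BPF region start at kernel_virt_addr and that the reserved kernel
area is sized as (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1). The 2MB PMD_SIZE
and the helper name bpf_region_free() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M               (UINT64_C(1) << 20)
#define PMD_SIZE            (2 * SZ_1M)   /* assumed 2MB PMD mappings */
#define BPF_JIT_REGION_SIZE (128 * SZ_1M)

/*
 * Bytes of the BPF region left once the kernel area, which starts at
 * the same address, has been reserved. load_sz is the image size.
 */
static uint64_t bpf_region_free(uint64_t load_sz)
{
    uint64_t reserved;

    assert(load_sz > 0);

    /* Same rounding as vm_kernel.size in the patch. */
    reserved = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);

    if (reserved >= BPF_JIT_REGION_SIZE)
        return 0;
    return BPF_JIT_REGION_SIZE - reserved;
}
```

Any kernel image shrinks the space left for BPF, and an image of about
128MB or more leaves none at all.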

> >
> >> +
> >> +#ifdef CONFIG_64BIT
> >> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
> >> +#define VMALLOC_MODULE_END     VMALLOC_END
> >> +#endif
> >>
> > Although kernel_virt_addr is a fixed address now, I think it could be
> > changed for the purpose of relocatable or KASLR, so if
> > kernel_virt_addr is moved farther than 2G from VMALLOC_END, the
> > module region would be too big.
>
>
> Yes, you're right: it is wrong to allow modules to lie outside
> the 2G window. Thanks for noticing.
>
>
> > In addition, the module region could be
> > +-2G around the kernel, so we are not limited to one direction as
> > before. It seems to me that the module region could be decided
> > at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
> > VMALLOC_MODULE_END is "&_start + 2G".
>
>
> I had tried that, but since we need to make sure the BPF region is distinct
> from the module region, the macro definitions become really cumbersome.
> I'll give it another try anyway. I also tried to use _end and _start here,
> but it failed; I still have to debug this.
>
>
> > I'm not sure whether the size of the
> > BPF region has to be 128MB for some particular reason; if not,
> > maybe the BPF region could be the same as the module region, to
> > avoid it being exhausted by modules.
>
>
> On the contrary, the BPF region must not be the same as the module region,
> since in that case modules could take all the space and make BPF
> allocations fail.

OK, I got it. Thanks for the explanation.


>
>
> Thanks for your review Zong,
>
>
> Alex
>
>
> >
> >>   /*
> >>    * Roughly size the vmemmap space to be large enough to fit enough
> >> @@ -57,9 +63,16 @@
> >>   #define FIXADDR_SIZE     PGDIR_SIZE
> >>   #endif
> >>   #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
> >> -
> >>   #endif
> >>
> >> +#ifndef __ASSEMBLY__
> >> +
> >> +/* Page Upper Directory not used in RISC-V */
> >> +#include <asm-generic/pgtable-nopud.h>
> >> +#include <asm/page.h>
> >> +#include <asm/tlbflush.h>
> >> +#include <linux/mm_types.h>
> >> +
> >>   #ifdef CONFIG_64BIT
> >>   #include <asm/pgtable-64.h>
> >>   #else
> >> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >> index 98a406474e7d..8f5bb7731327 100644
> >> --- a/arch/riscv/kernel/head.S
> >> +++ b/arch/riscv/kernel/head.S
> >> @@ -49,7 +49,8 @@ ENTRY(_start)
> >>   #ifdef CONFIG_MMU
> >>   relocate:
> >>          /* Relocate return address */
> >> -       li a1, PAGE_OFFSET
> >> +       la a1, kernel_virt_addr
> >> +       REG_L a1, 0(a1)
> >>          la a2, _start
> >>          sub a1, a1, a2
> >>          add ra, ra, a1
> >> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> >> index 8bbe5dbe1341..1a8fbe05accf 100644
> >> --- a/arch/riscv/kernel/module.c
> >> +++ b/arch/riscv/kernel/module.c
> >> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
> >>   }
> >>
> >>   #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> >> -#define VMALLOC_MODULE_START \
> >> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> >>   void *module_alloc(unsigned long size)
> >>   {
> >>          return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> >> -                                   VMALLOC_END, GFP_KERNEL,
> >> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
> >>                                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> >>                                      __builtin_return_address(0));
> >>   }
> >> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> >> index 0339b6bbe11a..a9abde62909f 100644
> >> --- a/arch/riscv/kernel/vmlinux.lds.S
> >> +++ b/arch/riscv/kernel/vmlinux.lds.S
> >> @@ -4,7 +4,8 @@
> >>    * Copyright (C) 2017 SiFive
> >>    */
> >>
> >> -#define LOAD_OFFSET PAGE_OFFSET
> >> +#include <asm/pgtable.h>
> >> +#define LOAD_OFFSET KERNEL_LINK_ADDR
> >>   #include <asm/vmlinux.lds.h>
> >>   #include <asm/page.h>
> >>   #include <asm/cache.h>
> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> index 27a334106708..17f108baec4f 100644
> >> --- a/arch/riscv/mm/init.c
> >> +++ b/arch/riscv/mm/init.c
> >> @@ -22,6 +22,9 @@
> >>
> >>   #include "../kernel/head.h"
> >>
> >> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
> >> +EXPORT_SYMBOL(kernel_virt_addr);
> >> +
> >>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> >>                                                          __page_aligned_bss;
> >>   EXPORT_SYMBOL(empty_zero_page);
> >> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
> >>   }
> >>
> >>   #ifdef CONFIG_MMU
> >> +/* Offset between linear mapping virtual address and kernel load address */
> >>   unsigned long va_pa_offset;
> >>   EXPORT_SYMBOL(va_pa_offset);
> >> +/* Offset between kernel mapping virtual address and kernel load address */
> >> +unsigned long va_kernel_pa_offset;
> >> +EXPORT_SYMBOL(va_kernel_pa_offset);
> >>   unsigned long pfn_base;
> >>   EXPORT_SYMBOL(pfn_base);
> >>
> >> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
> >>          if (mmu_enabled)
> >>                  return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>
> >> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> >> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
> >>          BUG_ON(pmd_num >= NUM_EARLY_PMDS);
> >>          return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
> >>   }
> >> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
> >>   #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
> >>   #endif
> >>
> >> +static uintptr_t load_pa, load_sz;
> >> +
> >> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> >> +{
> >> +       uintptr_t va, end_va;
> >> +
> >> +       end_va = kernel_virt_addr + load_sz;
> >> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
> >> +               create_pgd_mapping(pgdir, va,
> >> +                                  load_pa + (va - kernel_virt_addr),
> >> +                                  map_size, PAGE_KERNEL_EXEC);
> >> +}
> >> +
> >>   asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>   {
> >>          uintptr_t va, end_va;
> >> -       uintptr_t load_pa = (uintptr_t)(&_start);
> >> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
> >>          uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> >>
> >> +       load_pa = (uintptr_t)(&_start);
> >> +       load_sz = (uintptr_t)(&_end) - load_pa;
> >> +
> >>          va_pa_offset = PAGE_OFFSET - load_pa;
> >> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
> >> +
> >>          pfn_base = PFN_DOWN(load_pa);
> >>
> >>          /*
> >> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>          create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> >>                             (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> >>          /* Setup trampoline PGD and PMD */
> >> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> >>                             (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> >> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> >> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
> >>                             load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> >>   #else
> >>          /* Setup trampoline PGD */
> >> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> >>                             load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
> >>   #endif
> >>
> >>          /*
> >> -        * Setup early PGD covering entire kernel which will allows
> >> +        * Setup early PGD covering entire kernel which will allow
> >>           * us to reach paging_init(). We map all memory banks later
> >>           * in setup_vm_final() below.
> >>           */
> >> -       end_va = PAGE_OFFSET + load_sz;
> >> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
> >> -               create_pgd_mapping(early_pg_dir, va,
> >> -                                  load_pa + (va - PAGE_OFFSET),
> >> -                                  map_size, PAGE_KERNEL_EXEC);
> >> +       create_kernel_page_table(early_pg_dir, map_size);
> >>
> >>          /* Create fixed mapping for early FDT parsing */
> >>          end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
> >> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
> >>          uintptr_t va, map_size;
> >>          phys_addr_t pa, start, end;
> >>          struct memblock_region *reg;
> >> +       static struct vm_struct vm_kernel = { 0 };
> >>
> >>          /* Set mmu_enabled flag */
> >>          mmu_enabled = true;
> >> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
> >>                  for (pa = start; pa < end; pa += map_size) {
> >>                          va = (uintptr_t)__va(pa);
> >>                          create_pgd_mapping(swapper_pg_dir, va, pa,
> >> -                                          map_size, PAGE_KERNEL_EXEC);
> >> +                                          map_size, PAGE_KERNEL);
> >>                  }
> >>          }
> >>
> >> +       /* Map the kernel */
> >> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
> >> +
> >> +       /* Reserve the vmalloc area occupied by the kernel */
> >> +       vm_kernel.addr = (void *)kernel_virt_addr;
> >> +       vm_kernel.phys_addr = load_pa;
> >> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
> >> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
> >> +       vm_kernel.caller = __builtin_return_address(0);
> >> +
> >> +       vm_area_add_early(&vm_kernel);
> >> +
> >>          /* Clear fixmap PTE and PMD mappings */
> >>          clear_fixmap(FIX_PTE);
> >>          clear_fixmap(FIX_PMD);
> >> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
> >> index e8e4dcd39fed..35703d5ef5fd 100644
> >> --- a/arch/riscv/mm/physaddr.c
> >> +++ b/arch/riscv/mm/physaddr.c
> >> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
> >>
> >>   phys_addr_t __phys_addr_symbol(unsigned long x)
> >>   {
> >> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
> >> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
> >>          unsigned long kernel_end = (unsigned long)_end;
> >>
> >>          /*
> >> --
> >> 2.20.1
> >>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
@ 2020-05-27  6:05         ` Zong Li
  0 siblings, 0 replies; 30+ messages in thread
From: Zong Li @ 2020-05-27  6:05 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Albert Ou, Benjamin Herrenschmidt, Michael Ellerman, Anup Patel,
	linux-kernel@vger.kernel.org List, Atish Patra, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, linux-riscv, linuxppc-dev

On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>
> Hi Zong,
>
> Le 5/26/20 à 5:43 AM, Zong Li a écrit :
> > On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >> This is a preparatory patch for relocatable kernel.
> >>
> >> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
> >> physically at the beginning of the main memory. Therefore, we could use
> >> the linear mapping for the kernel mapping.
> >>
> >> But the relocated kernel base address will be different from PAGE_OFFSET
> >> and since in the linear mapping, two different virtual addresses cannot
> >> point to the same physical address, the kernel mapping needs to lie outside
> >> the linear mapping.
> >>
> >> In addition, because modules and BPF must be close to the kernel (inside
> >> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
> >> 2GB, which leaves room for modules and BPF. The kernel could not be
> >> placed at the beginning of the vmalloc zone since other vmalloc
> >> allocations from the kernel could get all the +-2GB window around the
> >> kernel which would prevent new modules and BPF programs from being loaded.
> >>
> >> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> >> ---
> >>   arch/riscv/boot/loader.lds.S     |  3 +-
> >>   arch/riscv/include/asm/page.h    | 10 +++++-
> >>   arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
> >>   arch/riscv/kernel/head.S         |  3 +-
> >>   arch/riscv/kernel/module.c       |  4 +--
> >>   arch/riscv/kernel/vmlinux.lds.S  |  3 +-
> >>   arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
> >>   arch/riscv/mm/physaddr.c         |  2 +-
> >>   8 files changed, 87 insertions(+), 33 deletions(-)
> >>
> >> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> >> index 47a5003c2e28..62d94696a19c 100644
> >> --- a/arch/riscv/boot/loader.lds.S
> >> +++ b/arch/riscv/boot/loader.lds.S
> >> @@ -1,13 +1,14 @@
> >>   /* SPDX-License-Identifier: GPL-2.0 */
> >>
> >>   #include <asm/page.h>
> >> +#include <asm/pgtable.h>
> >>
> >>   OUTPUT_ARCH(riscv)
> >>   ENTRY(_start)
> >>
> >>   SECTIONS
> >>   {
> >> -       . = PAGE_OFFSET;
> >> +       . = KERNEL_LINK_ADDR;
> >>
> >>          .payload : {
> >>                  *(.payload)
> >> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >> index 2d50f76efe48..48bb09b6a9b7 100644
> >> --- a/arch/riscv/include/asm/page.h
> >> +++ b/arch/riscv/include/asm/page.h
> >> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
> >>
> >>   #ifdef CONFIG_MMU
> >>   extern unsigned long va_pa_offset;
> >> +extern unsigned long va_kernel_pa_offset;
> >>   extern unsigned long pfn_base;
> >>   #define ARCH_PFN_OFFSET                (pfn_base)
> >>   #else
> >>   #define va_pa_offset           0
> >> +#define va_kernel_pa_offset    0
> >>   #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
> >>   #endif /* CONFIG_MMU */
> >>
> >>   extern unsigned long max_low_pfn;
> >>   extern unsigned long min_low_pfn;
> >> +extern unsigned long kernel_virt_addr;
> >>
> >>   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
> >> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
> >> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
> >> +#define kernel_mapping_va_to_pa(x)     \
> >> +       ((unsigned long)(x) - va_kernel_pa_offset)
> >> +#define __va_to_pa_nodebug(x)          \
> >> +       (((x) >= PAGE_OFFSET) ?         \
> >> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
> >>
> >>   #ifdef CONFIG_DEBUG_VIRTUAL
> >>   extern phys_addr_t __virt_to_phys(unsigned long x);
> >> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >> index 35b60035b6b0..25213cfaf680 100644
> >> --- a/arch/riscv/include/asm/pgtable.h
> >> +++ b/arch/riscv/include/asm/pgtable.h
> >> @@ -11,23 +11,29 @@
> >>
> >>   #include <asm/pgtable-bits.h>
> >>
> >> -#ifndef __ASSEMBLY__
> >> -
> >> -/* Page Upper Directory not used in RISC-V */
> >> -#include <asm-generic/pgtable-nopud.h>
> >> -#include <asm/page.h>
> >> -#include <asm/tlbflush.h>
> >> -#include <linux/mm_types.h>
> >> -
> >> -#ifdef CONFIG_MMU
> >> +#ifndef CONFIG_MMU
> >> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
> >> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
> >> +#else
> >> +/*
> >> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
> >> + * the kernel.
> >> + */
> >> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
> >> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
> >>
> >>   #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
> >>   #define VMALLOC_END      (PAGE_OFFSET - 1)
> >>   #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
> >>
> >>   #define BPF_JIT_REGION_SIZE    (SZ_128M)
> >> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> >> -#define BPF_JIT_REGION_END     (VMALLOC_END)
> >> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
> >> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
> > There seems to be a potential risk here: the BPF region overlaps
> > the kernel mapping, so if the kernel is bigger than 128MB, the
> > BPF region would be entirely consumed by the kernel mapping.

Does the risk I mentioned above actually exist?

> >
> >> +
> >> +#ifdef CONFIG_64BIT
> >> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
> >> +#define VMALLOC_MODULE_END     VMALLOC_END
> >> +#endif
> >>
> > Although kernel_virt_addr is a fixed address now, I think it could be
> > changed for the purpose of relocatable or KASLR, so if
> > kernel_virt_addr is moved farther than 2G from VMALLOC_END, the
> > module region would be too big.
>
>
> Yes, you're right: it is wrong to allow modules to lie outside
> the 2G window. Thanks for noticing.
>
>
> > In addition, the module region could be
> > +-2G around the kernel, so we are not limited to one direction as
> > before. It seems to me that the module region could be decided
> > at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
> > VMALLOC_MODULE_END is "&_start + 2G".
>
>
> I had tried that, but since we need to make sure the BPF region is distinct
> from the module region, the macro definitions become really cumbersome.
> I'll give it another try anyway. I also tried to use _end and _start here,
> but it failed; I still have to debug this.
>
>
> > I'm not sure whether the size of the
> > BPF region has to be 128MB for some particular reason; if not,
> > maybe the BPF region could be the same as the module region, to
> > avoid it being exhausted by modules.
>
>
> On the contrary, the BPF region must not be the same as the module region,
> since in that case modules could take all the space and make BPF
> allocations fail.

OK, I got it. Thanks for the explanation.


>
>
> Thanks for your review Zong,
>
>
> Alex
>
>
> >
> >>   /*
> >>    * Roughly size the vmemmap space to be large enough to fit enough
> >> @@ -57,9 +63,16 @@
> >>   #define FIXADDR_SIZE     PGDIR_SIZE
> >>   #endif
> >>   #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
> >> -
> >>   #endif
> >>
> >> +#ifndef __ASSEMBLY__
> >> +
> >> +/* Page Upper Directory not used in RISC-V */
> >> +#include <asm-generic/pgtable-nopud.h>
> >> +#include <asm/page.h>
> >> +#include <asm/tlbflush.h>
> >> +#include <linux/mm_types.h>
> >> +
> >>   #ifdef CONFIG_64BIT
> >>   #include <asm/pgtable-64.h>
> >>   #else
> >> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >> index 98a406474e7d..8f5bb7731327 100644
> >> --- a/arch/riscv/kernel/head.S
> >> +++ b/arch/riscv/kernel/head.S
> >> @@ -49,7 +49,8 @@ ENTRY(_start)
> >>   #ifdef CONFIG_MMU
> >>   relocate:
> >>          /* Relocate return address */
> >> -       li a1, PAGE_OFFSET
> >> +       la a1, kernel_virt_addr
> >> +       REG_L a1, 0(a1)
> >>          la a2, _start
> >>          sub a1, a1, a2
> >>          add ra, ra, a1
> >> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> >> index 8bbe5dbe1341..1a8fbe05accf 100644
> >> --- a/arch/riscv/kernel/module.c
> >> +++ b/arch/riscv/kernel/module.c
> >> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
> >>   }
> >>
> >>   #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> >> -#define VMALLOC_MODULE_START \
> >> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> >>   void *module_alloc(unsigned long size)
> >>   {
> >>          return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> >> -                                   VMALLOC_END, GFP_KERNEL,
> >> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
> >>                                      PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> >>                                      __builtin_return_address(0));
> >>   }
> >> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> >> index 0339b6bbe11a..a9abde62909f 100644
> >> --- a/arch/riscv/kernel/vmlinux.lds.S
> >> +++ b/arch/riscv/kernel/vmlinux.lds.S
> >> @@ -4,7 +4,8 @@
> >>    * Copyright (C) 2017 SiFive
> >>    */
> >>
> >> -#define LOAD_OFFSET PAGE_OFFSET
> >> +#include <asm/pgtable.h>
> >> +#define LOAD_OFFSET KERNEL_LINK_ADDR
> >>   #include <asm/vmlinux.lds.h>
> >>   #include <asm/page.h>
> >>   #include <asm/cache.h>
> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> index 27a334106708..17f108baec4f 100644
> >> --- a/arch/riscv/mm/init.c
> >> +++ b/arch/riscv/mm/init.c
> >> @@ -22,6 +22,9 @@
> >>
> >>   #include "../kernel/head.h"
> >>
> >> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
> >> +EXPORT_SYMBOL(kernel_virt_addr);
> >> +
> >>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> >>                                                          __page_aligned_bss;
> >>   EXPORT_SYMBOL(empty_zero_page);
> >> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
> >>   }
> >>
> >>   #ifdef CONFIG_MMU
> >> +/* Offset between linear mapping virtual address and kernel load address */
> >>   unsigned long va_pa_offset;
> >>   EXPORT_SYMBOL(va_pa_offset);
> >> +/* Offset between kernel mapping virtual address and kernel load address */
> >> +unsigned long va_kernel_pa_offset;
> >> +EXPORT_SYMBOL(va_kernel_pa_offset);
> >>   unsigned long pfn_base;
> >>   EXPORT_SYMBOL(pfn_base);
> >>
> >> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
> >>          if (mmu_enabled)
> >>                  return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>
> >> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> >> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
> >>          BUG_ON(pmd_num >= NUM_EARLY_PMDS);
> >>          return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
> >>   }
> >> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
> >>   #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
> >>   #endif
> >>
> >> +static uintptr_t load_pa, load_sz;
> >> +
> >> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> >> +{
> >> +       uintptr_t va, end_va;
> >> +
> >> +       end_va = kernel_virt_addr + load_sz;
> >> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
> >> +               create_pgd_mapping(pgdir, va,
> >> +                                  load_pa + (va - kernel_virt_addr),
> >> +                                  map_size, PAGE_KERNEL_EXEC);
> >> +}
> >> +
> >>   asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>   {
> >>          uintptr_t va, end_va;
> >> -       uintptr_t load_pa = (uintptr_t)(&_start);
> >> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
> >>          uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> >>
> >> +       load_pa = (uintptr_t)(&_start);
> >> +       load_sz = (uintptr_t)(&_end) - load_pa;
> >> +
> >>          va_pa_offset = PAGE_OFFSET - load_pa;
> >> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
> >> +
> >>          pfn_base = PFN_DOWN(load_pa);
> >>
> >>          /*
> >> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>          create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> >>                             (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> >>          /* Setup trampoline PGD and PMD */
> >> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> >>                             (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> >> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> >> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
> >>                             load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> >>   #else
> >>          /* Setup trampoline PGD */
> >> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> >>                             load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
> >>   #endif
> >>
> >>          /*
> >> -        * Setup early PGD covering entire kernel which will allows
> >> +        * Setup early PGD covering entire kernel which will allow
> >>           * us to reach paging_init(). We map all memory banks later
> >>           * in setup_vm_final() below.
> >>           */
> >> -       end_va = PAGE_OFFSET + load_sz;
> >> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
> >> -               create_pgd_mapping(early_pg_dir, va,
> >> -                                  load_pa + (va - PAGE_OFFSET),
> >> -                                  map_size, PAGE_KERNEL_EXEC);
> >> +       create_kernel_page_table(early_pg_dir, map_size);
> >>
> >>          /* Create fixed mapping for early FDT parsing */
> >>          end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
> >> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
> >>          uintptr_t va, map_size;
> >>          phys_addr_t pa, start, end;
> >>          struct memblock_region *reg;
> >> +       static struct vm_struct vm_kernel = { 0 };
> >>
> >>          /* Set mmu_enabled flag */
> >>          mmu_enabled = true;
> >> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
> >>                  for (pa = start; pa < end; pa += map_size) {
> >>                          va = (uintptr_t)__va(pa);
> >>                          create_pgd_mapping(swapper_pg_dir, va, pa,
> >> -                                          map_size, PAGE_KERNEL_EXEC);
> >> +                                          map_size, PAGE_KERNEL);
> >>                  }
> >>          }
> >>
> >> +       /* Map the kernel */
> >> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
> >> +
> >> +       /* Reserve the vmalloc area occupied by the kernel */
> >> +       vm_kernel.addr = (void *)kernel_virt_addr;
> >> +       vm_kernel.phys_addr = load_pa;
> >> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
> >> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
> >> +       vm_kernel.caller = __builtin_return_address(0);
> >> +
> >> +       vm_area_add_early(&vm_kernel);
> >> +
> >>          /* Clear fixmap PTE and PMD mappings */
> >>          clear_fixmap(FIX_PTE);
> >>          clear_fixmap(FIX_PMD);
> >> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
> >> index e8e4dcd39fed..35703d5ef5fd 100644
> >> --- a/arch/riscv/mm/physaddr.c
> >> +++ b/arch/riscv/mm/physaddr.c
> >> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
> >>
> >>   phys_addr_t __phys_addr_symbol(unsigned long x)
> >>   {
> >> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
> >> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
> >>          unsigned long kernel_end = (unsigned long)_end;
> >>
> >>          /*
> >> --
> >> 2.20.1
> >>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-27  6:05         ` Zong Li
  (?)
@ 2020-05-27  7:29           ` Alex Ghiti
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-27  7:29 UTC (permalink / raw)
  To: Zong Li
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, linux-kernel@vger.kernel.org List, linuxppc-dev,
	linux-riscv

On 5/27/20 at 2:05 AM, Zong Li wrote:
> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>> Hi Zong,
>>
>> On 5/26/20 at 5:43 AM, Zong Li wrote:
>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>> This is a preparatory patch for relocatable kernel.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>>>> physically at the beginning of the main memory. Therefore, we could use
>>>> the linear mapping for the kernel mapping.
>>>>
>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>> and since in the linear mapping, two different virtual addresses cannot
>>>> point to the same physical address, the kernel mapping needs to lie outside
>>>> the linear mapping.
>>>>
>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>> allocations from the kernel could take up the whole +-2GB window around
>>>> the kernel, which would prevent new modules and BPF programs from being
>>>> loaded.
>>>>
>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>> ---
>>>>    arch/riscv/boot/loader.lds.S     |  3 +-
>>>>    arch/riscv/include/asm/page.h    | 10 +++++-
>>>>    arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>>>    arch/riscv/kernel/head.S         |  3 +-
>>>>    arch/riscv/kernel/module.c       |  4 +--
>>>>    arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>    arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>>    arch/riscv/mm/physaddr.c         |  2 +-
>>>>    8 files changed, 87 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>>    #include <asm/page.h>
>>>> +#include <asm/pgtable.h>
>>>>
>>>>    OUTPUT_ARCH(riscv)
>>>>    ENTRY(_start)
>>>>
>>>>    SECTIONS
>>>>    {
>>>> -       . = PAGE_OFFSET;
>>>> +       . = KERNEL_LINK_ADDR;
>>>>
>>>>           .payload : {
>>>>                   *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>>    #ifdef CONFIG_MMU
>>>>    extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>>    extern unsigned long pfn_base;
>>>>    #define ARCH_PFN_OFFSET                (pfn_base)
>>>>    #else
>>>>    #define va_pa_offset           0
>>>> +#define va_kernel_pa_offset    0
>>>>    #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>>>    #endif /* CONFIG_MMU */
>>>>
>>>>    extern unsigned long max_low_pfn;
>>>>    extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>>    #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x)     \
>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x)          \
>>>> +       (((x) >= PAGE_OFFSET) ?         \
>>>> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>>    #ifdef CONFIG_DEBUG_VIRTUAL
>>>>    extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>>> index 35b60035b6b0..25213cfaf680 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,29 @@
>>>>
>>>>    #include <asm/pgtable-bits.h>
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> -#include <asm/page.h>
>>>> -#include <asm/tlbflush.h>
>>>> -#include <linux/mm_types.h>
>>>> -
>>>> -#ifdef CONFIG_MMU
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>>>> +#else
>>>> +/*
>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>>>> + * the kernel.
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>>>
>>>>    #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>    #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>    #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>>    #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>>>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
>>> There seems to be a potential risk here: the BPF region overlaps the
>>> kernel mapping, so if the kernel is bigger than 128MB, the BPF region
>>> would be entirely consumed by the kernel mapping.
> Is the risk I mentioned real?


Sorry, I forgot to answer this one: I was confident that 128MB was large
enough for both the kernel and BPF. But I see no reason to leave this
risk, so I'll replace kernel_virt_addr with _end so that BPF has its
128MB reserved.
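
To make the overlap concrete, here is a minimal standalone sketch (not
kernel code) that compares the two layouts. Only BPF_JIT_REGION_SIZE,
kernel_virt_addr and _end come from the patch; struct region, the three
helper functions and any addresses fed to them are hypothetical and
exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M			(1ULL << 20)
#define BPF_JIT_REGION_SIZE	(128 * SZ_1M)

/* Half-open virtual address range [start, end). */
struct region {
	uint64_t start;
	uint64_t end;
};

/* Layout in this patch: the BPF region begins at the kernel base. */
static struct region bpf_region_at_base(uint64_t kernel_virt_addr)
{
	struct region r = { kernel_virt_addr,
			    kernel_virt_addr + BPF_JIT_REGION_SIZE };
	return r;
}

/* Layout proposed in the reply: the BPF region begins at the image end. */
static struct region bpf_region_after_image(uint64_t kernel_end)
{
	struct region r = { kernel_end, kernel_end + BPF_JIT_REGION_SIZE };
	return r;
}

/* Bytes of r that are not covered by the kernel image [kstart, kend). */
static uint64_t bytes_left_for_bpf(struct region r, uint64_t kstart,
				   uint64_t kend)
{
	uint64_t lo = r.start > kstart ? r.start : kstart;
	uint64_t hi = r.end < kend ? r.end : kend;
	uint64_t overlap = hi > lo ? hi - lo : 0;

	assert(r.end >= r.start);
	return (r.end - r.start) - overlap;
}
```

With the region starting at the kernel base, a 16MB image leaves 112MB
for BPF and an image of 128MB or more leaves nothing. With the region
starting at the image end, the full 128MB stays available whatever the
image size.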

Thanks !

Alex


>
>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>>>> +#define VMALLOC_MODULE_END     VMALLOC_END
>>>> +#endif
>>>>
>>> Although kernel_virt_addr is a fixed address now, I think it could
>>> change for a relocatable kernel or KASLR, so if kernel_virt_addr ends
>>> up more than 2G away from VMALLOC_END, the module region would be too
>>> big.
>>
>> Yes, you're right: it's wrong to allow modules to lie outside
>> the 2G window. Thanks for noticing.
>>
>>
>>> In addition, the module region could be
>>> +-2G around the kernel, so we are no longer limited to one direction as
>>> before. It seems to me that the module region could be decided
>>> at runtime: for example, VMALLOC_MODULE_START is "&_end - 2G" and
>>> VMALLOC_MODULE_END is "&_start + 2G".
>>
>> I had tried that, but as we need to make sure BPF region is different
>> from the module's
>> that makes the macro definitions really cumbersome. I'll give a try
>> again anyway. And
>> I tried to use _end and _start here but it failed, I have to debug this.
>>
>>
>>> I'm not sure whether the BPF region has to be 128MB for some
>>> particular reason. If not, maybe the BPF region could be the same as
>>> the module region, to avoid it being exhausted by modules.
>>
>> On the contrary, the BPF region must not be the same as the module
>> region, since in that case modules could take all the space and make
>> BPF allocations fail.
> OK, I got it. Thanks for the explanation.
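
The runtime window suggested above ("&_end - 2G" to "&_start + 2G") can
be sketched as follows. This is only an illustration under stated
assumptions: struct va_range and module_window() are hypothetical names,
the clamp to the vmalloc zone is my reading of the discussion, and
carving out a separate BPF region is left aside.

```c
#include <stdint.h>

#define SZ_2G (1ULL << 31)

/* Half-open virtual address range [start, end). */
struct va_range {
        uint64_t start;
        uint64_t end;
};

/*
 * Hypothetical helper: addresses from which every byte of the kernel
 * image [kernel_start, kernel_end) is at most 2GB away, clamped to the
 * vmalloc zone [vmalloc_start, vmalloc_end).
 */
static struct va_range module_window(uint64_t kernel_start,
                                     uint64_t kernel_end,
                                     uint64_t vmalloc_start,
                                     uint64_t vmalloc_end)
{
        struct va_range r;
        /* "&_end - 2G", guarded against unsigned underflow */
        uint64_t lo = kernel_end > SZ_2G ? kernel_end - SZ_2G : 0;
        /* "&_start + 2G", guarded against unsigned overflow */
        uint64_t hi = kernel_start > UINT64_MAX - SZ_2G ?
                      UINT64_MAX : kernel_start + SZ_2G;

        r.start = lo > vmalloc_start ? lo : vmalloc_start;
        r.end = hi < vmalloc_end ? hi : vmalloc_end;
        return r;
}
```

Because both bounds follow the kernel image, the window stays within
reach even if kernel_virt_addr moves, which a fixed VMALLOC_MODULE_END of
VMALLOC_END does not guarantee.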
>
>
>>
>> Thanks for your review Zong,
>>
>>
>> Alex
>>
>>
>>>>    /*
>>>>     * Roughly size the vmemmap space to be large enough to fit enough
>>>> @@ -57,9 +63,16 @@
>>>>    #define FIXADDR_SIZE     PGDIR_SIZE
>>>>    #endif
>>>>    #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>> -
>>>>    #endif
>>>>
>>>> +#ifndef __ASSEMBLY__
>>>> +
>>>> +/* Page Upper Directory not used in RISC-V */
>>>> +#include <asm-generic/pgtable-nopud.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/tlbflush.h>
>>>> +#include <linux/mm_types.h>
>>>> +
>>>>    #ifdef CONFIG_64BIT
>>>>    #include <asm/pgtable-64.h>
>>>>    #else
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index 98a406474e7d..8f5bb7731327 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>    #ifdef CONFIG_MMU
>>>>    relocate:
>>>>           /* Relocate return address */
>>>> -       li a1, PAGE_OFFSET
>>>> +       la a1, kernel_virt_addr
>>>> +       REG_L a1, 0(a1)
>>>>           la a2, _start
>>>>           sub a1, a1, a2
>>>>           add ra, ra, a1
>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>> --- a/arch/riscv/kernel/module.c
>>>> +++ b/arch/riscv/kernel/module.c
>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>>>>    }
>>>>
>>>>    #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>> -#define VMALLOC_MODULE_START \
>>>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>    void *module_alloc(unsigned long size)
>>>>    {
>>>>           return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>> -                                   VMALLOC_END, GFP_KERNEL,
>>>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>>>                                       PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>>                                       __builtin_return_address(0));
>>>>    }
>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>> @@ -4,7 +4,8 @@
>>>>     * Copyright (C) 2017 SiFive
>>>>     */
>>>>
>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>> +#include <asm/pgtable.h>
>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>    #include <asm/vmlinux.lds.h>
>>>>    #include <asm/page.h>
>>>>    #include <asm/cache.h>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index 27a334106708..17f108baec4f 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -22,6 +22,9 @@
>>>>
>>>>    #include "../kernel/head.h"
>>>>
>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>> +
>>>>    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>                                                           __page_aligned_bss;
>>>>    EXPORT_SYMBOL(empty_zero_page);
>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>    }
>>>>
>>>>    #ifdef CONFIG_MMU
>>>> +/* Offset between linear mapping virtual address and kernel load address */
>>>>    unsigned long va_pa_offset;
>>>>    EXPORT_SYMBOL(va_pa_offset);
>>>> +/* Offset between kernel mapping virtual address and kernel load address */
>>>> +unsigned long va_kernel_pa_offset;
>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>    unsigned long pfn_base;
>>>>    EXPORT_SYMBOL(pfn_base);
>>>>
>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>           if (mmu_enabled)
>>>>                   return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>
>>>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>           BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>           return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>    }
>>>> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>>>>    #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>>>>    #endif
>>>>
>>>> +static uintptr_t load_pa, load_sz;
>>>> +
>>>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>>>> +{
>>>> +       uintptr_t va, end_va;
>>>> +
>>>> +       end_va = kernel_virt_addr + load_sz;
>>>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>> +               create_pgd_mapping(pgdir, va,
>>>> +                                  load_pa + (va - kernel_virt_addr),
>>>> +                                  map_size, PAGE_KERNEL_EXEC);
>>>> +}
>>>> +
>>>>    asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>    {
>>>>           uintptr_t va, end_va;
>>>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>>>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>           uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>>>
>>>> +       load_pa = (uintptr_t)(&_start);
>>>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>>>> +
>>>>           va_pa_offset = PAGE_OFFSET - load_pa;
>>>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>> +
>>>>           pfn_base = PFN_DOWN(load_pa);
>>>>
>>>>           /*
>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>           create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>                              (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>           /* Setup trampoline PGD and PMD */
>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                              (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>                              load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>    #else
>>>>           /* Setup trampoline PGD */
>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                              load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>    #endif
>>>>
>>>>           /*
>>>> -        * Setup early PGD covering entire kernel which will allows
>>>> +        * Setup early PGD covering entire kernel which will allow
>>>>            * us to reach paging_init(). We map all memory banks later
>>>>            * in setup_vm_final() below.
>>>>            */
>>>> -       end_va = PAGE_OFFSET + load_sz;
>>>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>> -               create_pgd_mapping(early_pg_dir, va,
>>>> -                                  load_pa + (va - PAGE_OFFSET),
>>>> -                                  map_size, PAGE_KERNEL_EXEC);
>>>> +       create_kernel_page_table(early_pg_dir, map_size);
>>>>
>>>>           /* Create fixed mapping for early FDT parsing */
>>>>           end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>           uintptr_t va, map_size;
>>>>           phys_addr_t pa, start, end;
>>>>           struct memblock_region *reg;
>>>> +       static struct vm_struct vm_kernel = { 0 };
>>>>
>>>>           /* Set mmu_enabled flag */
>>>>           mmu_enabled = true;
>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>                   for (pa = start; pa < end; pa += map_size) {
>>>>                           va = (uintptr_t)__va(pa);
>>>>                           create_pgd_mapping(swapper_pg_dir, va, pa,
>>>> -                                          map_size, PAGE_KERNEL_EXEC);
>>>> +                                          map_size, PAGE_KERNEL);
>>>>                   }
>>>>           }
>>>>
>>>> +       /* Map the kernel */
>>>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>> +
>>>> +       /* Reserve the vmalloc area occupied by the kernel */
>>>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>>>> +       vm_kernel.phys_addr = load_pa;
>>>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>>>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>> +       vm_kernel.caller = __builtin_return_address(0);
>>>> +
>>>> +       vm_area_add_early(&vm_kernel);
>>>> +
>>>>           /* Clear fixmap PTE and PMD mappings */
>>>>           clear_fixmap(FIX_PTE);
>>>>           clear_fixmap(FIX_PMD);
>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>> --- a/arch/riscv/mm/physaddr.c
>>>> +++ b/arch/riscv/mm/physaddr.c
>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>
>>>>    phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>    {
>>>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>           unsigned long kernel_end = (unsigned long)_end;
>>>>
>>>>           /*
>>>> --
>>>> 2.20.1
>>>>
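
For readers following the two offsets set up in setup_vm(), here is a
minimal user-space sketch of how a virtual address is translated once the
kernel has its own mapping, mirroring the comparison against PAGE_OFFSET
in the __va_to_pa_nodebug() hunk of the patch. The struct, the function
names and any concrete addresses used with them are assumptions for
illustration only.

```c
#include <stdint.h>

/* Hypothetical container for the values computed in setup_vm(). */
struct mapping_offsets {
        uint64_t page_offset;         /* start of the linear mapping */
        uint64_t va_pa_offset;        /* PAGE_OFFSET - load_pa */
        uint64_t va_kernel_pa_offset; /* kernel_virt_addr - load_pa */
};

static struct mapping_offsets compute_offsets(uint64_t page_offset,
                                              uint64_t kernel_virt_addr,
                                              uint64_t load_pa)
{
        struct mapping_offsets m;

        m.page_offset = page_offset;
        m.va_pa_offset = page_offset - load_pa;
        m.va_kernel_pa_offset = kernel_virt_addr - load_pa;
        return m;
}

/*
 * Addresses at or above PAGE_OFFSET belong to the linear mapping;
 * anything below is treated as part of the kernel mapping.
 */
static uint64_t va_to_pa(const struct mapping_offsets *m, uint64_t va)
{
        if (va >= m->page_offset)
                return va - m->va_pa_offset;
        return va - m->va_kernel_pa_offset;
}
```

With these offsets, a linear-mapping address and a kernel-mapping address
at the same distance from their respective bases resolve to the same
physical address, which is why the choice of offset has to depend on the
address being translated.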

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
@ 2020-05-27  7:29           ` Alex Ghiti
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-27  7:29 UTC (permalink / raw)
  To: Zong Li
  Cc: Albert Ou, Anup Patel, linux-kernel@vger.kernel.org List,
	Atish Patra, Paul Mackerras, Paul Walmsley, Palmer Dabbelt,
	linux-riscv, linuxppc-dev

On 5/27/20 at 2:05 AM, Zong Li wrote:
> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>> Hi Zong,
>>
>> On 5/26/20 at 5:43 AM, Zong Li wrote:
>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>> This is a preparatory patch for relocatable kernel.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>>>> physically at the beginning of the main memory. Therefore, we could use
>>>> the linear mapping for the kernel mapping.
>>>>
>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>> and since in the linear mapping, two different virtual addresses cannot
>>>> point to the same physical address, the kernel mapping needs to lie outside
>>>> the linear mapping.
>>>>
>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>> allocations from the kernel could get all the +-2GB window around the
>>>> kernel which would prevent new modules and BPF programs from being loaded.
>>>>
>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>> ---
>>>>    arch/riscv/boot/loader.lds.S     |  3 +-
>>>>    arch/riscv/include/asm/page.h    | 10 +++++-
>>>>    arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>>>    arch/riscv/kernel/head.S         |  3 +-
>>>>    arch/riscv/kernel/module.c       |  4 +--
>>>>    arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>    arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>>    arch/riscv/mm/physaddr.c         |  2 +-
>>>>    8 files changed, 87 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>>    #include <asm/page.h>
>>>> +#include <asm/pgtable.h>
>>>>
>>>>    OUTPUT_ARCH(riscv)
>>>>    ENTRY(_start)
>>>>
>>>>    SECTIONS
>>>>    {
>>>> -       . = PAGE_OFFSET;
>>>> +       . = KERNEL_LINK_ADDR;
>>>>
>>>>           .payload : {
>>>>                   *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>>    #ifdef CONFIG_MMU
>>>>    extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>>    extern unsigned long pfn_base;
>>>>    #define ARCH_PFN_OFFSET                (pfn_base)
>>>>    #else
>>>>    #define va_pa_offset           0
>>>> +#define va_kernel_pa_offset    0
>>>>    #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>>>    #endif /* CONFIG_MMU */
>>>>
>>>>    extern unsigned long max_low_pfn;
>>>>    extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>>    #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x)     \
>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x)          \
>>>> +       (((x) >= PAGE_OFFSET) ?         \
>>>> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>>    #ifdef CONFIG_DEBUG_VIRTUAL
>>>>    extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>>> index 35b60035b6b0..25213cfaf680 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,29 @@
>>>>
>>>>    #include <asm/pgtable-bits.h>
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> -#include <asm/page.h>
>>>> -#include <asm/tlbflush.h>
>>>> -#include <linux/mm_types.h>
>>>> -
>>>> -#ifdef CONFIG_MMU
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>>>> +#else
>>>> +/*
>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>>>> + * the kernel.
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>>>
>>>>    #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>    #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>    #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>>    #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>>>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
>>> There seems to be a potential risk here: the BPF region overlaps
>>> the kernel mapping, so if the kernel is bigger than 128MB, the BPF
>>> region would be entirely occupied by the kernel mapping.
> Does the risk I mentioned exist?


Sorry, I forgot to answer this one: I was confident that 128MB was large
enough for both the kernel and BPF. But I see no reason to leave this risk,
so I'll use _end instead of kernel_virt_addr there, so that BPF has its
full 128MB reserved.

Thanks!

Alex
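
To make the overlap concrete, here is a minimal user-space sketch, not the kernel code. It compares the v3 layout (BPF region starting at the kernel base) with a region starting after the end of the kernel image. All `sketch_*` names, the 4KB page size, and the page alignment of the new start are assumptions made for illustration; only the 128MB size comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL          /* assumed page size */
#define SKETCH_BPF_SIZE  (128ULL << 20)   /* mirrors BPF_JIT_REGION_SIZE (SZ_128M) */

struct sketch_region {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* Round addr up to the next page boundary (assumed alignment step). */
static uint64_t sketch_page_align(uint64_t addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* v3 layout: the BPF region starts at the kernel's own base address. */
static struct sketch_region sketch_bpf_v3(uint64_t kernel_start)
{
	struct sketch_region r = { kernel_start, kernel_start + SKETCH_BPF_SIZE };
	return r;
}

/* Proposed layout: the BPF region starts right after the kernel image. */
static struct sketch_region sketch_bpf_after_end(uint64_t kernel_end)
{
	uint64_t start = sketch_page_align(kernel_end);
	struct sketch_region r = { start, start + SKETCH_BPF_SIZE };
	return r;
}

/* Bytes of region r not covered by the kernel image [kernel_start, kernel_end). */
static uint64_t sketch_bpf_free(struct sketch_region r,
				uint64_t kernel_start, uint64_t kernel_end)
{
	uint64_t lo = r.start > kernel_start ? r.start : kernel_start;
	uint64_t hi = r.end < kernel_end ? r.end : kernel_end;
	uint64_t overlap = hi > lo ? hi - lo : 0;

	assert(r.end > r.start);
	return (r.end - r.start) - overlap;
}
```

With a 20MB image the v3 layout leaves 108MB for BPF, and with a 200MB image it leaves nothing, whereas the region placed after the image always keeps its full 128MB.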


>
>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>>>> +#define VMALLOC_MODULE_END     VMALLOC_END
>>>> +#endif
>>>>
>>> Although kernel_virt_addr is a fixed address now, I think it could
>>> change for a relocatable kernel or KASLR, so if kernel_virt_addr
>>> ends up more than 2G away from VMALLOC_END, the module region
>>> would be too big.
>>
>> Yes, you're right, it is wrong to allow modules to lie outside
>> the 2G window, thanks for noticing.
>>
>>
>>> In addition, the module region could span
>>> +-2G around the kernel, so we would not be limited to one direction as
>>> before. It seems to me that the module region could be decided
>>> at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
>>> VMALLOC_MODULE_END is "&_start + 2G".
>>
>> I had tried that, but as we need to make sure the BPF region is
>> different from the modules', that makes the macro definitions really
>> cumbersome. I'll give it another try anyway. I also tried to use _end
>> and _start here but it failed; I have to debug this.
>>
>>
>>>    I'm not sure whether the BPF region has to be
>>> 128MB for some particular reason; if not,
>>> maybe the BPF region could be the same as the module region, to avoid
>>> it being used up by modules.
>>
>> On the contrary, the BPF region must not be the same as the modules',
>> since in that case modules could take all the space and make BPF fail.
> OK, I got it. Thanks for the explanation.
>
>
>>
>> Thanks for your review Zong,
>>
>>
>> Alex
>>
>>
>>>>    /*
>>>>     * Roughly size the vmemmap space to be large enough to fit enough
>>>> @@ -57,9 +63,16 @@
>>>>    #define FIXADDR_SIZE     PGDIR_SIZE
>>>>    #endif
>>>>    #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>> -
>>>>    #endif
>>>>
>>>> +#ifndef __ASSEMBLY__
>>>> +
>>>> +/* Page Upper Directory not used in RISC-V */
>>>> +#include <asm-generic/pgtable-nopud.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/tlbflush.h>
>>>> +#include <linux/mm_types.h>
>>>> +
>>>>    #ifdef CONFIG_64BIT
>>>>    #include <asm/pgtable-64.h>
>>>>    #else
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index 98a406474e7d..8f5bb7731327 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>    #ifdef CONFIG_MMU
>>>>    relocate:
>>>>           /* Relocate return address */
>>>> -       li a1, PAGE_OFFSET
>>>> +       la a1, kernel_virt_addr
>>>> +       REG_L a1, 0(a1)
>>>>           la a2, _start
>>>>           sub a1, a1, a2
>>>>           add ra, ra, a1
>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>> --- a/arch/riscv/kernel/module.c
>>>> +++ b/arch/riscv/kernel/module.c
>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>>>>    }
>>>>
>>>>    #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>> -#define VMALLOC_MODULE_START \
>>>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>    void *module_alloc(unsigned long size)
>>>>    {
>>>>           return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>> -                                   VMALLOC_END, GFP_KERNEL,
>>>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>>>                                       PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>>                                       __builtin_return_address(0));
>>>>    }
>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>> @@ -4,7 +4,8 @@
>>>>     * Copyright (C) 2017 SiFive
>>>>     */
>>>>
>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>> +#include <asm/pgtable.h>
>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>    #include <asm/vmlinux.lds.h>
>>>>    #include <asm/page.h>
>>>>    #include <asm/cache.h>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index 27a334106708..17f108baec4f 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -22,6 +22,9 @@
>>>>
>>>>    #include "../kernel/head.h"
>>>>
>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>> +
>>>>    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>                                                           __page_aligned_bss;
>>>>    EXPORT_SYMBOL(empty_zero_page);
>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>    }
>>>>
>>>>    #ifdef CONFIG_MMU
>>>> +/* Offset between linear mapping virtual address and kernel load address */
>>>>    unsigned long va_pa_offset;
>>>>    EXPORT_SYMBOL(va_pa_offset);
>>>> +/* Offset between kernel mapping virtual address and kernel load address */
>>>> +unsigned long va_kernel_pa_offset;
>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>    unsigned long pfn_base;
>>>>    EXPORT_SYMBOL(pfn_base);
>>>>
>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>           if (mmu_enabled)
>>>>                   return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>
>>>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>           BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>           return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>    }
>>>> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>>>>    #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>>>>    #endif
>>>>
>>>> +static uintptr_t load_pa, load_sz;
>>>> +
>>>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>>>> +{
>>>> +       uintptr_t va, end_va;
>>>> +
>>>> +       end_va = kernel_virt_addr + load_sz;
>>>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>> +               create_pgd_mapping(pgdir, va,
>>>> +                                  load_pa + (va - kernel_virt_addr),
>>>> +                                  map_size, PAGE_KERNEL_EXEC);
>>>> +}
>>>> +
>>>>    asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>    {
>>>>           uintptr_t va, end_va;
>>>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>>>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>           uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>>>
>>>> +       load_pa = (uintptr_t)(&_start);
>>>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>>>> +
>>>>           va_pa_offset = PAGE_OFFSET - load_pa;
>>>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>> +
>>>>           pfn_base = PFN_DOWN(load_pa);
>>>>
>>>>           /*
>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>           create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>                              (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>           /* Setup trampoline PGD and PMD */
>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                              (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>                              load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>    #else
>>>>           /* Setup trampoline PGD */
>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>                              load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>    #endif
>>>>
>>>>           /*
>>>> -        * Setup early PGD covering entire kernel which will allows
>>>> +        * Setup early PGD covering entire kernel which will allow
>>>>            * us to reach paging_init(). We map all memory banks later
>>>>            * in setup_vm_final() below.
>>>>            */
>>>> -       end_va = PAGE_OFFSET + load_sz;
>>>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>> -               create_pgd_mapping(early_pg_dir, va,
>>>> -                                  load_pa + (va - PAGE_OFFSET),
>>>> -                                  map_size, PAGE_KERNEL_EXEC);
>>>> +       create_kernel_page_table(early_pg_dir, map_size);
>>>>
>>>>           /* Create fixed mapping for early FDT parsing */
>>>>           end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>           uintptr_t va, map_size;
>>>>           phys_addr_t pa, start, end;
>>>>           struct memblock_region *reg;
>>>> +       static struct vm_struct vm_kernel = { 0 };
>>>>
>>>>           /* Set mmu_enabled flag */
>>>>           mmu_enabled = true;
>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>                   for (pa = start; pa < end; pa += map_size) {
>>>>                           va = (uintptr_t)__va(pa);
>>>>                           create_pgd_mapping(swapper_pg_dir, va, pa,
>>>> -                                          map_size, PAGE_KERNEL_EXEC);
>>>> +                                          map_size, PAGE_KERNEL);
>>>>                   }
>>>>           }
>>>>
>>>> +       /* Map the kernel */
>>>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>> +
>>>> +       /* Reserve the vmalloc area occupied by the kernel */
>>>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>>>> +       vm_kernel.phys_addr = load_pa;
>>>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>>>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>> +       vm_kernel.caller = __builtin_return_address(0);
>>>> +
>>>> +       vm_area_add_early(&vm_kernel);
>>>> +
>>>>           /* Clear fixmap PTE and PMD mappings */
>>>>           clear_fixmap(FIX_PTE);
>>>>           clear_fixmap(FIX_PMD);
>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>> --- a/arch/riscv/mm/physaddr.c
>>>> +++ b/arch/riscv/mm/physaddr.c
>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>
>>>>    phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>    {
>>>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>           unsigned long kernel_end = (unsigned long)_end;
>>>>
>>>>           /*
>>>> --
>>>> 2.20.1
>>>>


* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-24  8:52 ` [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone Alexandre Ghiti
@ 2020-05-27  7:33     ` kbuild test robot
  2020-05-27  7:33     ` kbuild test robot
  1 sibling, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2020-05-27  7:33 UTC (permalink / raw)
  To: Alexandre Ghiti, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Anup Patel, Atish Patra, Zong Li, linux-kernel, linuxppc-dev,
	linux-riscv
  Cc: kbuild-all, Alexandre Ghiti

[-- Attachment #1: Type: text/plain, Size: 1976 bytes --]

Hi Alexandre,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on linus/master v5.7-rc7 next-20200526]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Alexandre-Ghiti/vmalloc-kernel-mapping-and-relocatable-kernel/20200524-170109
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: riscv-allyesconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> arch/riscv/mm/init.c:383:6: warning: no previous prototype for 'create_kernel_page_table' [-Wmissing-prototypes]
383 | void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
|      ^~~~~~~~~~~~~~~~~~~~~~~~

vim +/create_kernel_page_table +383 arch/riscv/mm/init.c

   382	
 > 383	void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
   384	{
   385		uintptr_t va, end_va;
   386	
   387		end_va = kernel_virt_addr + load_sz;
   388		for (va = kernel_virt_addr; va < end_va; va += map_size)
   389			create_pgd_mapping(pgdir, va,
   390					   load_pa + (va - kernel_virt_addr),
   391					   map_size, PAGE_KERNEL_EXEC);
   392	}
   393	
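
A common way to address this kind of warning, assuming the function is only called from within arch/riscv/mm/init.c as in this patch, is to give it internal linkage with `static`; the alternative is to declare a prototype in a header. The sketch below is a user-space stand-in, not the kernel function: `sketch_count_mappings` is a hypothetical name, and it only counts the mappings the loop would create so that it compiles outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for create_kernel_page_table(): it only counts the
 * mappings the loop would create. Marking it static gives it internal
 * linkage, so -Wmissing-prototypes does not fire.
 */
static unsigned long sketch_count_mappings(uintptr_t virt_addr,
					   uintptr_t load_sz,
					   uintptr_t map_size)
{
	uintptr_t va, end_va;
	unsigned long count = 0;

	assert(map_size != 0);
	end_va = virt_addr + load_sz;
	for (va = virt_addr; va < end_va; va += map_size)
		count++;

	return count;
}
```

In the kernel itself the function would presumably also carry `__init`, since both callers shown in the patch are `__init` functions; that is an assumption, not something this report states.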

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 63405 bytes --]


* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
  2020-05-27  7:29           ` Alex Ghiti
  (?)
@ 2020-05-28 13:07             ` Alex Ghiti
  -1 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-28 13:07 UTC (permalink / raw)
  To: Zong Li
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, linux-kernel@vger.kernel.org List, linuxppc-dev,
	linux-riscv

Hi Zong,

On 5/27/20 at 3:29 AM, Alex Ghiti wrote:
> On 5/27/20 at 2:05 AM, Zong Li wrote:
>> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>>> Hi Zong,
>>>
>>> On 5/26/20 at 5:43 AM, Zong Li wrote:
>>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>>> This is a preparatory patch for relocatable kernel.
>>>>>
>>>>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>>>>> physically at the beginning of the main memory. Therefore, we could use
>>>>> the linear mapping for the kernel mapping.
>>>>>
>>>>> But the relocated kernel base address will be different from PAGE_OFFSET
>>>>> and since in the linear mapping, two different virtual addresses cannot
>>>>> point to the same physical address, the kernel mapping needs to lie outside
>>>>> the linear mapping.
>>>>>
>>>>> In addition, because modules and BPF must be close to the kernel (inside
>>>>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>>>>> 2GB, which leaves room for modules and BPF. The kernel could not be
>>>>> placed at the beginning of the vmalloc zone since other vmalloc
>>>>> allocations from the kernel could get all the +-2GB window around the
>>>>> kernel which would prevent new modules and BPF programs to be loaded.
>>>>>
>>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>>> ---
>>>>>    arch/riscv/boot/loader.lds.S     |  3 +-
>>>>>    arch/riscv/include/asm/page.h    | 10 +++++-
>>>>>    arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>>>>    arch/riscv/kernel/head.S         |  3 +-
>>>>>    arch/riscv/kernel/module.c       |  4 +--
>>>>>    arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>>    arch/riscv/mm/init.c             | 58 +++++++++++++++++++++++++-------
>>>>>    arch/riscv/mm/physaddr.c         |  2 +-
>>>>>    8 files changed, 87 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>>> index 47a5003c2e28..62d94696a19c 100644
>>>>> --- a/arch/riscv/boot/loader.lds.S
>>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>>> @@ -1,13 +1,14 @@
>>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>>
>>>>>    #include <asm/page.h>
>>>>> +#include <asm/pgtable.h>
>>>>>
>>>>>    OUTPUT_ARCH(riscv)
>>>>>    ENTRY(_start)
>>>>>
>>>>>    SECTIONS
>>>>>    {
>>>>> -       . = PAGE_OFFSET;
>>>>> +       . = KERNEL_LINK_ADDR;
>>>>>
>>>>>           .payload : {
>>>>>                   *(.payload)
>>>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>>> --- a/arch/riscv/include/asm/page.h
>>>>> +++ b/arch/riscv/include/asm/page.h
>>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>>    extern unsigned long va_pa_offset;
>>>>> +extern unsigned long va_kernel_pa_offset;
>>>>>    extern unsigned long pfn_base;
>>>>>    #define ARCH_PFN_OFFSET                (pfn_base)
>>>>>    #else
>>>>>    #define va_pa_offset           0
>>>>> +#define va_kernel_pa_offset    0
>>>>>    #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>>>>    #endif /* CONFIG_MMU */
>>>>>
>>>>>    extern unsigned long max_low_pfn;
>>>>>    extern unsigned long min_low_pfn;
>>>>> +extern unsigned long kernel_virt_addr;
>>>>>
>>>>>    #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>>>>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>>>>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - va_pa_offset)
>>>>> +#define kernel_mapping_va_to_pa(x)     \
>>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>>> +#define __va_to_pa_nodebug(x)          \
>>>>> +       (((x) >= PAGE_OFFSET) ?         \
>>>>> +               linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>>
>>>>>    #ifdef CONFIG_DEBUG_VIRTUAL
>>>>>    extern phys_addr_t __virt_to_phys(unsigned long x);
>>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>>>> index 35b60035b6b0..25213cfaf680 100644
>>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>>> @@ -11,23 +11,29 @@
>>>>>
>>>>>    #include <asm/pgtable-bits.h>
>>>>>
>>>>> -#ifndef __ASSEMBLY__
>>>>> -
>>>>> -/* Page Upper Directory not used in RISC-V */
>>>>> -#include <asm-generic/pgtable-nopud.h>
>>>>> -#include <asm/page.h>
>>>>> -#include <asm/tlbflush.h>
>>>>> -#include <linux/mm_types.h>
>>>>> -
>>>>> -#ifdef CONFIG_MMU
>>>>> +#ifndef CONFIG_MMU
>>>>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>>>>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>>>>> +#else
>>>>> +/*
>>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>>>>> + * the kernel.
>>>>> + */
>>>>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>>>>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>>>>
>>>>>    #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>>    #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>>    #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>>
>>>>>    #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>>>>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>>>>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + BPF_JIT_REGION_SIZE)
>>>> There seems to be a potential risk here: the BPF region overlaps
>>>> the kernel mapping, so if the kernel is bigger than 128MB, the BPF
>>>> region would be entirely occupied by the kernel mapping.
>> Does the risk I mentioned exist?
>
>
> Sorry, I forgot to answer this one: I was confident that 128MB was large
> enough for both the kernel and BPF. But I see no reason to leave this risk,
> so I'll use _end instead of kernel_virt_addr there, so that BPF has its
> full 128MB reserved.
>
> Thanks!
>
> Alex
>
>
>>
>>>>> +
>>>>> +#ifdef CONFIG_64BIT
>>>>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>>>>> +#define VMALLOC_MODULE_END     VMALLOC_END
>>>>> +#endif
>>>>>
>>>> Although kernel_virt_addr is a fixed address now, I think it could
>>>> change for a relocatable kernel or KASLR, so if kernel_virt_addr
>>>> ends up more than 2G away from VMALLOC_END, the module region
>>>> would be too big.
>>>
>>> Yes, you're right, it is wrong to allow modules to lie outside
>>> the 2G window, thanks for noticing.
>>>
>>>
>>>> In addition, the module region could span
>>>> +-2G around the kernel, so we would not be limited to one direction as
>>>> before. It seems to me that the module region could be decided
>>>> at runtime, for example, VMALLOC_MODULE_START is "&_end - 2G" and
>>>> VMALLOC_MODULE_END is "&_start + 2G".
>>>
>>> I had tried that, but as we need to make sure the BPF region is
>>> different from the modules', that makes the macro definitions really
>>> cumbersome. I'll give it another try anyway. I also tried to use _end
>>> and _start here but it failed; I have to debug this.


I gave this more thought, and it is actually not possible to use the 2GB
both before and after the kernel: modules can call functions exported by
other modules, so we need to make sure that the "distance" between any two
modules is at most 2GB. I assume BPF comes with the same restriction with
respect to modules, so the kernel, BPF and modules must all live in the
same 2GB region.

I'll come up with a v4 quickly.

Thanks,

Alex
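
To illustrate the constraint, here is a minimal user-space sketch. It assumes the 2GB limit comes from a signed 32-bit pc-relative offset, which is an assumption on my side and only approximates the real reach. The addresses and all `sketch_*` names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_2G (2LL << 30)

/*
 * True when `target` is within a signed 32-bit pc-relative offset of `pc`.
 * Assumption: this models the +-2GB window discussed above; the exact
 * hardware bounds may differ slightly and are ignored here.
 */
static bool sketch_in_reach(int64_t pc, int64_t target)
{
	int64_t off = target - pc;

	return off >= -SKETCH_SZ_2G && off < SKETCH_SZ_2G;
}

/* True when every pair among the n addresses can reach each other. */
static bool sketch_all_pairs_in_reach(const int64_t *addr, int n)
{
	assert(n > 0);
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			if (!sketch_in_reach(addr[i], addr[j]))
				return false;
	return true;
}
```

With one module 1.5GB below the kernel and another 1.5GB above it, each module reaches the kernel but the two modules are 3GB apart and cannot reach each other. Keeping everything inside a single 2GB window makes every pair reachable.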


>>>
>>>
>>>>    I'm not sure whether the BPF region has to be
>>>> 128MB for some particular reason; if not,
>>>> maybe the BPF region could be the same as the module region, to avoid
>>>> it being used up by modules.
>>>
>>> On the contrary, the BPF region must not be the same as the modules',
>>> since in that case modules could take all the space and make BPF fail.
>> OK, I got it. Thanks for the explanation.
>>
>>
>>>
>>> Thanks for your review Zong,
>>>
>>>
>>> Alex
>>>
>>>
>>>>>    /*
>>>>>     * Roughly size the vmemmap space to be large enough to fit enough
>>>>> @@ -57,9 +63,16 @@
>>>>>    #define FIXADDR_SIZE     PGDIR_SIZE
>>>>>    #endif
>>>>>    #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>>> -
>>>>>    #endif
>>>>>
>>>>> +#ifndef __ASSEMBLY__
>>>>> +
>>>>> +/* Page Upper Directory not used in RISC-V */
>>>>> +#include <asm-generic/pgtable-nopud.h>
>>>>> +#include <asm/page.h>
>>>>> +#include <asm/tlbflush.h>
>>>>> +#include <linux/mm_types.h>
>>>>> +
>>>>>    #ifdef CONFIG_64BIT
>>>>>    #include <asm/pgtable-64.h>
>>>>>    #else
>>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>>> index 98a406474e7d..8f5bb7731327 100644
>>>>> --- a/arch/riscv/kernel/head.S
>>>>> +++ b/arch/riscv/kernel/head.S
>>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>>    #ifdef CONFIG_MMU
>>>>>    relocate:
>>>>>           /* Relocate return address */
>>>>> -       li a1, PAGE_OFFSET
>>>>> +       la a1, kernel_virt_addr
>>>>> +       REG_L a1, 0(a1)
>>>>>           la a2, _start
>>>>>           sub a1, a1, a2
>>>>>           add ra, ra, a1
>>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>>> --- a/arch/riscv/kernel/module.c
>>>>> +++ b/arch/riscv/kernel/module.c
>>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>>>>>    }
>>>>>
>>>>>    #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>>> -#define VMALLOC_MODULE_START \
>>>>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>>    void *module_alloc(unsigned long size)
>>>>>    {
>>>>>           return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>>> -                                   VMALLOC_END, GFP_KERNEL,
>>>>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>>>>                                       PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>>>>>                                       __builtin_return_address(0));
>>>>>    }
>>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
>>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>>> @@ -4,7 +4,8 @@
>>>>>     * Copyright (C) 2017 SiFive
>>>>>     */
>>>>>
>>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>>> +#include <asm/pgtable.h>
>>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>>    #include <asm/vmlinux.lds.h>
>>>>>    #include <asm/page.h>
>>>>>    #include <asm/cache.h>
>>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>>> index 27a334106708..17f108baec4f 100644
>>>>> --- a/arch/riscv/mm/init.c
>>>>> +++ b/arch/riscv/mm/init.c
>>>>> @@ -22,6 +22,9 @@
>>>>>
>>>>>    #include "../kernel/head.h"
>>>>>
>>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>>> +
>>>>>    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>>                                                           __page_aligned_bss;
>>>>>    EXPORT_SYMBOL(empty_zero_page);
>>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>>    }
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>> +/* Offset between linear mapping virtual address and kernel load address */
>>>>>    unsigned long va_pa_offset;
>>>>>    EXPORT_SYMBOL(va_pa_offset);
>>>>> +/* Offset between kernel mapping virtual address and kernel load address */
>>>>> +unsigned long va_kernel_pa_offset;
>>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>>    unsigned long pfn_base;
>>>>>    EXPORT_SYMBOL(pfn_base);
>>>>>
>>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>>           if (mmu_enabled)
>>>>>                   return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>>
>>>>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>>           BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>>           return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>>    }
>>>>> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>>>>>    #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>>>>>    #endif
>>>>>
>>>>> +static uintptr_t load_pa, load_sz;
>>>>> +
>>>>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>>>>> +{
>>>>> +       uintptr_t va, end_va;
>>>>> +
>>>>> +       end_va = kernel_virt_addr + load_sz;
>>>>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>>> +               create_pgd_mapping(pgdir, va,
>>>>> +                                  load_pa + (va - kernel_virt_addr),
>>>>> +                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +}
>>>>> +
>>>>>    asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>    {
>>>>>           uintptr_t va, end_va;
>>>>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>>>>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>>           uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>>>>
>>>>> +       load_pa = (uintptr_t)(&_start);
>>>>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>>>>> +
>>>>>           va_pa_offset = PAGE_OFFSET - load_pa;
>>>>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>>> +
>>>>>           pfn_base = PFN_DOWN(load_pa);
>>>>>
>>>>>           /*
>>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>           create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>>                              (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>>           /* Setup trampoline PGD and PMD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>>                              load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #else
>>>>>           /* Setup trampoline PGD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #endif
>>>>>
>>>>>           /*
>>>>> -        * Setup early PGD covering entire kernel which will allows
>>>>> +        * Setup early PGD covering entire kernel which will allow
>>>>>            * us to reach paging_init(). We map all memory banks later
>>>>>            * in setup_vm_final() below.
>>>>>            */
>>>>> -       end_va = PAGE_OFFSET + load_sz;
>>>>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>>> -               create_pgd_mapping(early_pg_dir, va,
>>>>> -                                  load_pa + (va - PAGE_OFFSET),
>>>>> -                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +       create_kernel_page_table(early_pg_dir, map_size);
>>>>>
>>>>>           /* Create fixed mapping for early FDT parsing */
>>>>>           end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>>           uintptr_t va, map_size;
>>>>>           phys_addr_t pa, start, end;
>>>>>           struct memblock_region *reg;
>>>>> +       static struct vm_struct vm_kernel = { 0 };
>>>>>
>>>>>           /* Set mmu_enabled flag */
>>>>>           mmu_enabled = true;
>>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>>                   for (pa = start; pa < end; pa += map_size) {
>>>>>                           va = (uintptr_t)__va(pa);
>>>>> create_pgd_mapping(swapper_pg_dir, va, pa,
>>>>> -                                          map_size, 
>>>>> PAGE_KERNEL_EXEC);
>>>>> +                                          map_size, PAGE_KERNEL);
>>>>>                   }
>>>>>           }
>>>>>
>>>>> +       /* Map the kernel */
>>>>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>>> +
>>>>> +       /* Reserve the vmalloc area occupied by the kernel */
>>>>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>>>>> +       vm_kernel.phys_addr = load_pa;
>>>>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>>>>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>>> +       vm_kernel.caller = __builtin_return_address(0);
>>>>> +
>>>>> +       vm_area_add_early(&vm_kernel);
>>>>> +
>>>>>           /* Clear fixmap PTE and PMD mappings */
>>>>>           clear_fixmap(FIX_PTE);
>>>>>           clear_fixmap(FIX_PMD);
>>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>>> --- a/arch/riscv/mm/physaddr.c
>>>>> +++ b/arch/riscv/mm/physaddr.c
>>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>>
>>>>>    phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>>    {
>>>>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>>           unsigned long kernel_end = (unsigned long)_end;
>>>>>
>>>>>           /*
>>>>> -- 
>>>>> 2.20.1
>>>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
@ 2020-05-28 13:07             ` Alex Ghiti
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-28 13:07 UTC (permalink / raw)
  To: Zong Li
  Cc: Albert Ou, Benjamin Herrenschmidt, Michael Ellerman, Anup Patel,
	linux-kernel@vger.kernel.org List, Atish Patra, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, linux-riscv, linuxppc-dev

Hi Zong,

On 5/27/20 at 3:29 AM, Alex Ghiti wrote:
> On 5/27/20 at 2:05 AM, Zong Li wrote:
>> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>>> Hi Zong,
>>>
>>> On 5/26/20 at 5:43 AM, Zong Li wrote:
>>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>>> This is a preparatory patch for relocatable kernel.
>>>>>
>>>>> The kernel used to be linked at the PAGE_OFFSET address and used to
>>>>> be loaded physically at the beginning of main memory. Therefore, we
>>>>> could use the linear mapping for the kernel mapping.
>>>>>
>>>>> But the relocated kernel base address will be different from
>>>>> PAGE_OFFSET, and since in the linear mapping two different virtual
>>>>> addresses cannot point to the same physical address, the kernel
>>>>> mapping needs to lie outside the linear mapping.
>>>>>
>>>>> In addition, because modules and BPF must be close to the kernel
>>>>> (inside a +-2GB window), the kernel is placed at the end of the
>>>>> vmalloc zone minus 2GB, which leaves room for modules and BPF. The
>>>>> kernel could not be placed at the beginning of the vmalloc zone
>>>>> since other vmalloc allocations from the kernel could take up the
>>>>> whole +-2GB window around the kernel, which would prevent new
>>>>> modules and BPF programs from being loaded.
>>>>>
>>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>>> ---
>>>>>    arch/riscv/boot/loader.lds.S     |  3 +-
>>>>>    arch/riscv/include/asm/page.h    | 10 +++++-
>>>>>    arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>>>>    arch/riscv/kernel/head.S         |  3 +-
>>>>>    arch/riscv/kernel/module.c       |  4 +--
>>>>>    arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>>    arch/riscv/mm/init.c             | 58 
>>>>> +++++++++++++++++++++++++-------
>>>>>    arch/riscv/mm/physaddr.c         |  2 +-
>>>>>    8 files changed, 87 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/boot/loader.lds.S 
>>>>> b/arch/riscv/boot/loader.lds.S
>>>>> index 47a5003c2e28..62d94696a19c 100644
>>>>> --- a/arch/riscv/boot/loader.lds.S
>>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>>> @@ -1,13 +1,14 @@
>>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>>
>>>>>    #include <asm/page.h>
>>>>> +#include <asm/pgtable.h>
>>>>>
>>>>>    OUTPUT_ARCH(riscv)
>>>>>    ENTRY(_start)
>>>>>
>>>>>    SECTIONS
>>>>>    {
>>>>> -       . = PAGE_OFFSET;
>>>>> +       . = KERNEL_LINK_ADDR;
>>>>>
>>>>>           .payload : {
>>>>>                   *(.payload)
>>>>> diff --git a/arch/riscv/include/asm/page.h 
>>>>> b/arch/riscv/include/asm/page.h
>>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>>> --- a/arch/riscv/include/asm/page.h
>>>>> +++ b/arch/riscv/include/asm/page.h
>>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>>    extern unsigned long va_pa_offset;
>>>>> +extern unsigned long va_kernel_pa_offset;
>>>>>    extern unsigned long pfn_base;
>>>>>    #define ARCH_PFN_OFFSET                (pfn_base)
>>>>>    #else
>>>>>    #define va_pa_offset           0
>>>>> +#define va_kernel_pa_offset    0
>>>>>    #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>>>>    #endif /* CONFIG_MMU */
>>>>>
>>>>>    extern unsigned long max_low_pfn;
>>>>>    extern unsigned long min_low_pfn;
>>>>> +extern unsigned long kernel_virt_addr;
>>>>>
>>>>>    #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + 
>>>>> va_pa_offset))
>>>>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>>>>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - 
>>>>> va_pa_offset)
>>>>> +#define kernel_mapping_va_to_pa(x)     \
>>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>>> +#define __va_to_pa_nodebug(x)          \
>>>>> +       (((x) >= PAGE_OFFSET) ?         \
>>>>> +               linear_mapping_va_to_pa(x) : 
>>>>> kernel_mapping_va_to_pa(x))
>>>>>
>>>>>    #ifdef CONFIG_DEBUG_VIRTUAL
>>>>>    extern phys_addr_t __virt_to_phys(unsigned long x);
>>>>> diff --git a/arch/riscv/include/asm/pgtable.h 
>>>>> b/arch/riscv/include/asm/pgtable.h
>>>>> index 35b60035b6b0..25213cfaf680 100644
>>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>>> @@ -11,23 +11,29 @@
>>>>>
>>>>>    #include <asm/pgtable-bits.h>
>>>>>
>>>>> -#ifndef __ASSEMBLY__
>>>>> -
>>>>> -/* Page Upper Directory not used in RISC-V */
>>>>> -#include <asm-generic/pgtable-nopud.h>
>>>>> -#include <asm/page.h>
>>>>> -#include <asm/tlbflush.h>
>>>>> -#include <linux/mm_types.h>
>>>>> -
>>>>> -#ifdef CONFIG_MMU
>>>>> +#ifndef CONFIG_MMU
>>>>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>>>>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>>>>> +#else
>>>>> +/*
>>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range 
>>>>> around
>>>>> + * the kernel.
>>>>> + */
>>>>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>>>>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>>>>
>>>>>    #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>>    #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>>    #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>>
>>>>>    #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>>>>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>>>>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + 
>>>>> BPF_JIT_REGION_SIZE)
>>>> There seems to be a potential risk here: the BPF region overlaps
>>>> the kernel mapping, so if the kernel is bigger than 128MB, the BPF
>>>> region would be entirely taken up by the kernel mapping.
>> Is there a risk here, as I mentioned?
>
>
> Sorry, I forgot to answer this one: I was confident that 128MB was
> large enough for the kernel and BPF. But I see no reason to leave this
> risk, so I'll replace kernel_virt_addr with _end so that BPF has its
> own 128MB reserved.
>
> Thanks !
>
> Alex
>
>
>>
>>>>> +
>>>>> +#ifdef CONFIG_64BIT
>>>>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>>>>> +#define VMALLOC_MODULE_END     VMALLOC_END
>>>>> +#endif
>>>>>
>>>> Although kernel_virt_addr is a fixed address now, I think it could
>>>> change for relocatable kernels or KASLR, so if kernel_virt_addr is
>>>> moved more than 2G away from VMALLOC_END, the module region would
>>>> be too big.
>>>
>>> Yes, you're right, it is wrong to allow modules to lie outside
>>> the 2G window, thanks for noticing.
>>>
>>>
>>>> In addition, the module region could span +-2G around the kernel,
>>>> so we would not be limited to one direction as before. It seems to
>>>> me that the module region could be decided at runtime, for example
>>>> VMALLOC_MODULE_START as "&_end - 2G" and VMALLOC_MODULE_END as
>>>> "&_start + 2G".
>>>
>>> I had tried that, but as we need to make sure the BPF region is
>>> different from the modules', that makes the macro definitions really
>>> cumbersome. I'll give it another try anyway. I also tried to use _end
>>> and _start here but it failed; I have to debug this.


I gave this more thought, and it is actually not possible to use the 2GB
before and after the kernel: modules can call each other's exported
functions, so we need to make sure that the "distance" between any two
modules is at most 2GB. I assume BPF comes with the same restriction with
respect to modules, so the kernel + BPF + modules must all live in the
same 2GB region.
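
The argument above can be checked with a little arithmetic. The sketch
below is only an illustration, not code from the patch: it assumes the
reach of a pc-relative call is a symmetric signed 32-bit byte offset
(a simplification of the real RISC-V auipc+jalr range), and
EXAMPLE_KERNEL_VA, within_reach() and region_is_self_reachable() are
hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2G (UINT64_C(1) << 31)

/* Hypothetical kernel virtual address, used only for illustration. */
#define EXAMPLE_KERNEL_VA UINT64_C(0xffffffdf80000000)

/*
 * True if code at 'from' can reach 'to', assuming the reach is a signed
 * 32-bit byte offset, i.e. the range [-2GB, +2GB).
 */
static bool within_reach(uint64_t from, uint64_t to)
{
	if (to >= from)
		return to - from < SZ_2G;
	return from - to <= SZ_2G;
}

/*
 * True if any two addresses inside [lo, hi) can reach each other.
 * The worst case is a call between the two extremities of the region.
 */
static bool region_is_self_reachable(uint64_t lo, uint64_t hi)
{
	assert(lo < hi);
	return within_reach(lo, hi - 1) && within_reach(hi - 1, lo);
}
```

With these helpers, a region spanning 2GB on each side of the kernel is
not self-reachable (a module at one end cannot call a module at the
other end), whereas a single 2GB region starting at the kernel is.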

I'll come up with a v4 quickly,

Thanks,

Alex


>>>
>>>
>>>> I'm not sure whether the BPF region has to be 128MB for some
>>>> particular reason; if not, maybe the BPF region could be the same
>>>> as the module region, to avoid running out of space.
>>>
>>> On the contrary, the BPF region must not be the same as the modules',
>>> since in that case modules could take all the space and make BPF fail.
>> OK, I got it. Thanks for the explanation.
>>
>>
>>>
>>> Thanks for your review Zong,
>>>
>>>
>>> Alex
>>>
>>>
>>>>>    /*
>>>>>     * Roughly size the vmemmap space to be large enough to fit enough
>>>>> @@ -57,9 +63,16 @@
>>>>>    #define FIXADDR_SIZE     PGDIR_SIZE
>>>>>    #endif
>>>>>    #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>>> -
>>>>>    #endif
>>>>>
>>>>> +#ifndef __ASSEMBLY__
>>>>> +
>>>>> +/* Page Upper Directory not used in RISC-V */
>>>>> +#include <asm-generic/pgtable-nopud.h>
>>>>> +#include <asm/page.h>
>>>>> +#include <asm/tlbflush.h>
>>>>> +#include <linux/mm_types.h>
>>>>> +
>>>>>    #ifdef CONFIG_64BIT
>>>>>    #include <asm/pgtable-64.h>
>>>>>    #else
>>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>>> index 98a406474e7d..8f5bb7731327 100644
>>>>> --- a/arch/riscv/kernel/head.S
>>>>> +++ b/arch/riscv/kernel/head.S
>>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>>    #ifdef CONFIG_MMU
>>>>>    relocate:
>>>>>           /* Relocate return address */
>>>>> -       li a1, PAGE_OFFSET
>>>>> +       la a1, kernel_virt_addr
>>>>> +       REG_L a1, 0(a1)
>>>>>           la a2, _start
>>>>>           sub a1, a1, a2
>>>>>           add ra, ra, a1
>>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>>> --- a/arch/riscv/kernel/module.c
>>>>> +++ b/arch/riscv/kernel/module.c
>>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, 
>>>>> const char *strtab,
>>>>>    }
>>>>>
>>>>>    #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>>> -#define VMALLOC_MODULE_START \
>>>>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>>    void *module_alloc(unsigned long size)
>>>>>    {
>>>>>           return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>>> -                                   VMALLOC_END, GFP_KERNEL,
>>>>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>>>>                                       PAGE_KERNEL_EXEC, 0, 
>>>>> NUMA_NO_NODE,
>>>>> __builtin_return_address(0));
>>>>>    }
>>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S 
>>>>> b/arch/riscv/kernel/vmlinux.lds.S
>>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>>> @@ -4,7 +4,8 @@
>>>>>     * Copyright (C) 2017 SiFive
>>>>>     */
>>>>>
>>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>>> +#include <asm/pgtable.h>
>>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>>    #include <asm/vmlinux.lds.h>
>>>>>    #include <asm/page.h>
>>>>>    #include <asm/cache.h>
>>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>>> index 27a334106708..17f108baec4f 100644
>>>>> --- a/arch/riscv/mm/init.c
>>>>> +++ b/arch/riscv/mm/init.c
>>>>> @@ -22,6 +22,9 @@
>>>>>
>>>>>    #include "../kernel/head.h"
>>>>>
>>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>>> +
>>>>>    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>> __page_aligned_bss;
>>>>>    EXPORT_SYMBOL(empty_zero_page);
>>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>>    }
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>> +/* Offset between linear mapping virtual address and kernel load 
>>>>> address */
>>>>>    unsigned long va_pa_offset;
>>>>>    EXPORT_SYMBOL(va_pa_offset);
>>>>> +/* Offset between kernel mapping virtual address and kernel load 
>>>>> address */
>>>>> +unsigned long va_kernel_pa_offset;
>>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>>    unsigned long pfn_base;
>>>>>    EXPORT_SYMBOL(pfn_base);
>>>>>
>>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>>           if (mmu_enabled)
>>>>>                   return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>>
>>>>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>>           BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>>           return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>>    }
>>>>> @@ -372,14 +379,30 @@ static uintptr_t __init 
>>>>> best_map_size(phys_addr_t base, phys_addr_t size)
>>>>>    #error "setup_vm() is called from head.S before relocate so it 
>>>>> should not use absolute addressing."
>>>>>    #endif
>>>>>
>>>>> +static uintptr_t load_pa, load_sz;
>>>>> +
>>>>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>>>>> +{
>>>>> +       uintptr_t va, end_va;
>>>>> +
>>>>> +       end_va = kernel_virt_addr + load_sz;
>>>>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>>> +               create_pgd_mapping(pgdir, va,
>>>>> +                                  load_pa + (va - kernel_virt_addr),
>>>>> +                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +}
>>>>> +
>>>>>    asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>    {
>>>>>           uintptr_t va, end_va;
>>>>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>>>>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>>           uintptr_t map_size = best_map_size(load_pa, 
>>>>> MAX_EARLY_MAPPING_SIZE);
>>>>>
>>>>> +       load_pa = (uintptr_t)(&_start);
>>>>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>>>>> +
>>>>>           va_pa_offset = PAGE_OFFSET - load_pa;
>>>>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>>> +
>>>>>           pfn_base = PFN_DOWN(load_pa);
>>>>>
>>>>>           /*
>>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t 
>>>>> dtb_pa)
>>>>>           create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>>                              (uintptr_t)fixmap_pte, PMD_SIZE, 
>>>>> PAGE_TABLE);
>>>>>           /* Setup trampoline PGD and PMD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              (uintptr_t)trampoline_pmd, 
>>>>> PGDIR_SIZE, PAGE_TABLE);
>>>>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>>                              load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #else
>>>>>           /* Setup trampoline PGD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #endif
>>>>>
>>>>>           /*
>>>>> -        * Setup early PGD covering entire kernel which will allows
>>>>> +        * Setup early PGD covering entire kernel which will allow
>>>>>            * us to reach paging_init(). We map all memory banks later
>>>>>            * in setup_vm_final() below.
>>>>>            */
>>>>> -       end_va = PAGE_OFFSET + load_sz;
>>>>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>>> -               create_pgd_mapping(early_pg_dir, va,
>>>>> -                                  load_pa + (va - PAGE_OFFSET),
>>>>> -                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +       create_kernel_page_table(early_pg_dir, map_size);
>>>>>
>>>>>           /* Create fixed mapping for early FDT parsing */
>>>>>           end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>>           uintptr_t va, map_size;
>>>>>           phys_addr_t pa, start, end;
>>>>>           struct memblock_region *reg;
>>>>> +       static struct vm_struct vm_kernel = { 0 };
>>>>>
>>>>>           /* Set mmu_enabled flag */
>>>>>           mmu_enabled = true;
>>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>>                   for (pa = start; pa < end; pa += map_size) {
>>>>>                           va = (uintptr_t)__va(pa);
>>>>> create_pgd_mapping(swapper_pg_dir, va, pa,
>>>>> -                                          map_size, 
>>>>> PAGE_KERNEL_EXEC);
>>>>> +                                          map_size, PAGE_KERNEL);
>>>>>                   }
>>>>>           }
>>>>>
>>>>> +       /* Map the kernel */
>>>>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>>> +
>>>>> +       /* Reserve the vmalloc area occupied by the kernel */
>>>>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>>>>> +       vm_kernel.phys_addr = load_pa;
>>>>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>>>>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>>> +       vm_kernel.caller = __builtin_return_address(0);
>>>>> +
>>>>> +       vm_area_add_early(&vm_kernel);
>>>>> +
>>>>>           /* Clear fixmap PTE and PMD mappings */
>>>>>           clear_fixmap(FIX_PTE);
>>>>>           clear_fixmap(FIX_PMD);
>>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>>> --- a/arch/riscv/mm/physaddr.c
>>>>> +++ b/arch/riscv/mm/physaddr.c
>>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>>
>>>>>    phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>>    {
>>>>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>>           unsigned long kernel_end = (unsigned long)_end;
>>>>>
>>>>>           /*
>>>>> -- 
>>>>> 2.20.1
>>>>>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone
@ 2020-05-28 13:07             ` Alex Ghiti
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-05-28 13:07 UTC (permalink / raw)
  To: Zong Li
  Cc: Albert Ou, Anup Patel, linux-kernel@vger.kernel.org List,
	Atish Patra, Paul Mackerras, Paul Walmsley, Palmer Dabbelt,
	linux-riscv, linuxppc-dev

Hi Zong,

On 5/27/20 at 3:29 AM, Alex Ghiti wrote:
> On 5/27/20 at 2:05 AM, Zong Li wrote:
>> On Wed, May 27, 2020 at 1:06 AM Alex Ghiti <alex@ghiti.fr> wrote:
>>> Hi Zong,
>>>
>>> On 5/26/20 at 5:43 AM, Zong Li wrote:
>>>> On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>>> This is a preparatory patch for relocatable kernel.
>>>>>
>>>>> The kernel used to be linked at the PAGE_OFFSET address and used to
>>>>> be loaded physically at the beginning of main memory. Therefore, we
>>>>> could use the linear mapping for the kernel mapping.
>>>>>
>>>>> But the relocated kernel base address will be different from
>>>>> PAGE_OFFSET, and since in the linear mapping two different virtual
>>>>> addresses cannot point to the same physical address, the kernel
>>>>> mapping needs to lie outside the linear mapping.
>>>>>
>>>>> In addition, because modules and BPF must be close to the kernel
>>>>> (inside a +-2GB window), the kernel is placed at the end of the
>>>>> vmalloc zone minus 2GB, which leaves room for modules and BPF. The
>>>>> kernel could not be placed at the beginning of the vmalloc zone
>>>>> since other vmalloc allocations from the kernel could take up the
>>>>> whole +-2GB window around the kernel, which would prevent new
>>>>> modules and BPF programs from being loaded.
>>>>>
>>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>>> ---
>>>>>    arch/riscv/boot/loader.lds.S     |  3 +-
>>>>>    arch/riscv/include/asm/page.h    | 10 +++++-
>>>>>    arch/riscv/include/asm/pgtable.h | 37 +++++++++++++-------
>>>>>    arch/riscv/kernel/head.S         |  3 +-
>>>>>    arch/riscv/kernel/module.c       |  4 +--
>>>>>    arch/riscv/kernel/vmlinux.lds.S  |  3 +-
>>>>>    arch/riscv/mm/init.c             | 58 
>>>>> +++++++++++++++++++++++++-------
>>>>>    arch/riscv/mm/physaddr.c         |  2 +-
>>>>>    8 files changed, 87 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/boot/loader.lds.S 
>>>>> b/arch/riscv/boot/loader.lds.S
>>>>> index 47a5003c2e28..62d94696a19c 100644
>>>>> --- a/arch/riscv/boot/loader.lds.S
>>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>>> @@ -1,13 +1,14 @@
>>>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>>>
>>>>>    #include <asm/page.h>
>>>>> +#include <asm/pgtable.h>
>>>>>
>>>>>    OUTPUT_ARCH(riscv)
>>>>>    ENTRY(_start)
>>>>>
>>>>>    SECTIONS
>>>>>    {
>>>>> -       . = PAGE_OFFSET;
>>>>> +       . = KERNEL_LINK_ADDR;
>>>>>
>>>>>           .payload : {
>>>>>                   *(.payload)
>>>>> diff --git a/arch/riscv/include/asm/page.h 
>>>>> b/arch/riscv/include/asm/page.h
>>>>> index 2d50f76efe48..48bb09b6a9b7 100644
>>>>> --- a/arch/riscv/include/asm/page.h
>>>>> +++ b/arch/riscv/include/asm/page.h
>>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>>    extern unsigned long va_pa_offset;
>>>>> +extern unsigned long va_kernel_pa_offset;
>>>>>    extern unsigned long pfn_base;
>>>>>    #define ARCH_PFN_OFFSET                (pfn_base)
>>>>>    #else
>>>>>    #define va_pa_offset           0
>>>>> +#define va_kernel_pa_offset    0
>>>>>    #define ARCH_PFN_OFFSET                (PAGE_OFFSET >> PAGE_SHIFT)
>>>>>    #endif /* CONFIG_MMU */
>>>>>
>>>>>    extern unsigned long max_low_pfn;
>>>>>    extern unsigned long min_low_pfn;
>>>>> +extern unsigned long kernel_virt_addr;
>>>>>
>>>>>    #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + 
>>>>> va_pa_offset))
>>>>> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>>>>> +#define linear_mapping_va_to_pa(x)     ((unsigned long)(x) - 
>>>>> va_pa_offset)
>>>>> +#define kernel_mapping_va_to_pa(x)     \
>>>>> +       ((unsigned long)(x) - va_kernel_pa_offset)
>>>>> +#define __va_to_pa_nodebug(x)          \
>>>>> +       (((x) >= PAGE_OFFSET) ?         \
>>>>> +               linear_mapping_va_to_pa(x) : 
>>>>> kernel_mapping_va_to_pa(x))
>>>>>
>>>>>    #ifdef CONFIG_DEBUG_VIRTUAL
>>>>>    extern phys_addr_t __virt_to_phys(unsigned long x);
>>>>> diff --git a/arch/riscv/include/asm/pgtable.h 
>>>>> b/arch/riscv/include/asm/pgtable.h
>>>>> index 35b60035b6b0..25213cfaf680 100644
>>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>>> @@ -11,23 +11,29 @@
>>>>>
>>>>>    #include <asm/pgtable-bits.h>
>>>>>
>>>>> -#ifndef __ASSEMBLY__
>>>>> -
>>>>> -/* Page Upper Directory not used in RISC-V */
>>>>> -#include <asm-generic/pgtable-nopud.h>
>>>>> -#include <asm/page.h>
>>>>> -#include <asm/tlbflush.h>
>>>>> -#include <linux/mm_types.h>
>>>>> -
>>>>> -#ifdef CONFIG_MMU
>>>>> +#ifndef CONFIG_MMU
>>>>> +#define KERNEL_VIRT_ADDR       PAGE_OFFSET
>>>>> +#define KERNEL_LINK_ADDR       PAGE_OFFSET
>>>>> +#else
>>>>> +/*
>>>>> + * Leave 2GB for modules and BPF that must lie within a 2GB range 
>>>>> around
>>>>> + * the kernel.
>>>>> + */
>>>>> +#define KERNEL_VIRT_ADDR       (VMALLOC_END - SZ_2G + 1)
>>>>> +#define KERNEL_LINK_ADDR       KERNEL_VIRT_ADDR
>>>>>
>>>>>    #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
>>>>>    #define VMALLOC_END      (PAGE_OFFSET - 1)
>>>>>    #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
>>>>>
>>>>>    #define BPF_JIT_REGION_SIZE    (SZ_128M)
>>>>> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>>> -#define BPF_JIT_REGION_END     (VMALLOC_END)
>>>>> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
>>>>> +#define BPF_JIT_REGION_END     (kernel_virt_addr + 
>>>>> BPF_JIT_REGION_SIZE)
>>>> There seems to be a potential risk here: the BPF region overlaps
>>>> the kernel mapping, so if the kernel is bigger than 128MB, the BPF
>>>> region would be entirely taken up by the kernel mapping.
>> Is there a risk here, as I mentioned?
>
>
> Sorry, I forgot to answer this one: I was confident that 128MB was
> large enough for the kernel and BPF. But I see no reason to leave this
> risk, so I'll replace kernel_virt_addr with _end so that BPF has its
> own 128MB reserved.
>
> Thanks !
>
> Alex
>
>
>>
>>>>> +
>>>>> +#ifdef CONFIG_64BIT
>>>>> +#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
>>>>> +#define VMALLOC_MODULE_END     VMALLOC_END
>>>>> +#endif
>>>>>
>>>> Although kernel_virt_addr is a fixed address now, I think it could
>>>> change for relocatable kernels or KASLR, so if kernel_virt_addr is
>>>> moved more than 2G away from VMALLOC_END, the module region would
>>>> be too big.
>>>
>>> Yes, you're right, it is wrong to allow modules to lie outside
>>> the 2G window, thanks for noticing.
>>>
>>>
>>>> In addition, the module region could span +-2G around the kernel,
>>>> so we would not be limited to one direction as before. It seems to
>>>> me that the module region could be decided at runtime, for example
>>>> VMALLOC_MODULE_START as "&_end - 2G" and VMALLOC_MODULE_END as
>>>> "&_start + 2G".
>>>
>>> I had tried that, but as we need to make sure the BPF region is
>>> different from the modules', that makes the macro definitions really
>>> cumbersome. I'll give it another try anyway. I also tried to use _end
>>> and _start here but it failed; I have to debug this.


I gave this more thought, and it is actually not possible to use the 2GB
before and after the kernel: modules can call each other's exported
functions, so we need to make sure that the "distance" between any two
modules is at most 2GB. I assume BPF comes with the same restriction with
respect to modules, so the kernel + BPF + modules must all live in the
same 2GB region.
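
One possible layout satisfying this constraint can be sketched as
follows. This is only an assumption about what such a layout might look
like, not the v4 patch: the kernel starts at VMALLOC_END - 2GB + 1 as in
this patch, the BPF region starts right after the end of the kernel (as
suggested earlier in the thread) and the modules get the remainder up to
VMALLOC_END. Alignment is ignored, and struct layout, make_layout(),
layout_ok() and EXAMPLE_VMALLOC_END are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M (UINT64_C(128) << 20)
#define SZ_2G   (UINT64_C(1) << 31)

/* Hypothetical inclusive end of the vmalloc zone (PAGE_OFFSET - 1). */
#define EXAMPLE_VMALLOC_END UINT64_C(0xffffffdfffffffff)

/* All regions are half-open intervals [start, end). */
struct layout {
	uint64_t kernel_start, kernel_end;
	uint64_t bpf_start, bpf_end;
	uint64_t module_start, module_end;
};

static struct layout make_layout(uint64_t vmalloc_end, uint64_t kernel_size)
{
	struct layout l;

	/* The kernel and the BPF region must leave room for modules. */
	assert(kernel_size + SZ_128M < SZ_2G);

	l.kernel_start = vmalloc_end - SZ_2G + 1;
	l.kernel_end = l.kernel_start + kernel_size;
	/* BPF starts after the kernel, so it keeps its full 128MB. */
	l.bpf_start = l.kernel_end;
	l.bpf_end = l.bpf_start + SZ_128M;
	/* Modules take what is left, up to the end of the vmalloc zone. */
	l.module_start = l.bpf_end;
	l.module_end = vmalloc_end + 1;
	return l;
}

/* True if the regions do not overlap and all fit in one 2GB window. */
static bool layout_ok(struct layout l)
{
	bool disjoint = l.kernel_end <= l.bpf_start &&
			l.bpf_end <= l.module_start &&
			l.module_start < l.module_end;

	return disjoint && l.module_end - l.kernel_start <= SZ_2G;
}
```

In this sketch the BPF region keeps its 128MB even when the kernel
itself is larger than 128MB, and the whole layout never exceeds 2GB.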

I'll come up with a v4 quickly,

Thanks,

Alex


>>>
>>>
>>>> I'm not sure whether the BPF region has to be 128MB for some
>>>> particular reason; if not, maybe the BPF region could be the same
>>>> as the module region, to avoid running out of space.
>>>
>>> On the contrary, the BPF region must not be the same as the modules',
>>> since in that case modules could take all the space and make BPF fail.
>> OK, I got it. Thanks for the explanation.
>>
>>
>>>
>>> Thanks for your review Zong,
>>>
>>>
>>> Alex
>>>
>>>
>>>>>    /*
>>>>>     * Roughly size the vmemmap space to be large enough to fit enough
>>>>> @@ -57,9 +63,16 @@
>>>>>    #define FIXADDR_SIZE     PGDIR_SIZE
>>>>>    #endif
>>>>>    #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>>>> -
>>>>>    #endif
>>>>>
>>>>> +#ifndef __ASSEMBLY__
>>>>> +
>>>>> +/* Page Upper Directory not used in RISC-V */
>>>>> +#include <asm-generic/pgtable-nopud.h>
>>>>> +#include <asm/page.h>
>>>>> +#include <asm/tlbflush.h>
>>>>> +#include <linux/mm_types.h>
>>>>> +
>>>>>    #ifdef CONFIG_64BIT
>>>>>    #include <asm/pgtable-64.h>
>>>>>    #else
>>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>>> index 98a406474e7d..8f5bb7731327 100644
>>>>> --- a/arch/riscv/kernel/head.S
>>>>> +++ b/arch/riscv/kernel/head.S
>>>>> @@ -49,7 +49,8 @@ ENTRY(_start)
>>>>>    #ifdef CONFIG_MMU
>>>>>    relocate:
>>>>>           /* Relocate return address */
>>>>> -       li a1, PAGE_OFFSET
>>>>> +       la a1, kernel_virt_addr
>>>>> +       REG_L a1, 0(a1)
>>>>>           la a2, _start
>>>>>           sub a1, a1, a2
>>>>>           add ra, ra, a1
>>>>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>>>>> index 8bbe5dbe1341..1a8fbe05accf 100644
>>>>> --- a/arch/riscv/kernel/module.c
>>>>> +++ b/arch/riscv/kernel/module.c
>>>>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, 
>>>>> const char *strtab,
>>>>>    }
>>>>>
>>>>>    #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>>>>> -#define VMALLOC_MODULE_START \
>>>>> -        max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>>>>>    void *module_alloc(unsigned long size)
>>>>>    {
>>>>>           return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>>>>> -                                   VMALLOC_END, GFP_KERNEL,
>>>>> +                                   VMALLOC_MODULE_END, GFP_KERNEL,
>>>>>                                       PAGE_KERNEL_EXEC, 0, 
>>>>> NUMA_NO_NODE,
>>>>> __builtin_return_address(0));
>>>>>    }
>>>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S 
>>>>> b/arch/riscv/kernel/vmlinux.lds.S
>>>>> index 0339b6bbe11a..a9abde62909f 100644
>>>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>>>> @@ -4,7 +4,8 @@
>>>>>     * Copyright (C) 2017 SiFive
>>>>>     */
>>>>>
>>>>> -#define LOAD_OFFSET PAGE_OFFSET
>>>>> +#include <asm/pgtable.h>
>>>>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>>>>>    #include <asm/vmlinux.lds.h>
>>>>>    #include <asm/page.h>
>>>>>    #include <asm/cache.h>
>>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>>> index 27a334106708..17f108baec4f 100644
>>>>> --- a/arch/riscv/mm/init.c
>>>>> +++ b/arch/riscv/mm/init.c
>>>>> @@ -22,6 +22,9 @@
>>>>>
>>>>>    #include "../kernel/head.h"
>>>>>
>>>>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>>> +EXPORT_SYMBOL(kernel_virt_addr);
>>>>> +
>>>>>    unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>> __page_aligned_bss;
>>>>>    EXPORT_SYMBOL(empty_zero_page);
>>>>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>>>>>    }
>>>>>
>>>>>    #ifdef CONFIG_MMU
>>>>> +/* Offset between linear mapping virtual address and kernel load 
>>>>> address */
>>>>>    unsigned long va_pa_offset;
>>>>>    EXPORT_SYMBOL(va_pa_offset);
>>>>> +/* Offset between kernel mapping virtual address and kernel load 
>>>>> address */
>>>>> +unsigned long va_kernel_pa_offset;
>>>>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>>>>>    unsigned long pfn_base;
>>>>>    EXPORT_SYMBOL(pfn_base);
>>>>>
>>>>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>>           if (mmu_enabled)
>>>>>                   return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>>
>>>>> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>>>>> +       pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>>>>>           BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>>>>>           return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>>>>>    }
>>>>> @@ -372,14 +379,30 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>>>>>    #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>>>>>    #endif
>>>>>
>>>>> +static uintptr_t load_pa, load_sz;
>>>>> +
>>>>> +void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>>>>> +{
>>>>> +       uintptr_t va, end_va;
>>>>> +
>>>>> +       end_va = kernel_virt_addr + load_sz;
>>>>> +       for (va = kernel_virt_addr; va < end_va; va += map_size)
>>>>> +               create_pgd_mapping(pgdir, va,
>>>>> +                                  load_pa + (va - kernel_virt_addr),
>>>>> +                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +}
>>>>> +
>>>>>    asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>    {
>>>>>           uintptr_t va, end_va;
>>>>> -       uintptr_t load_pa = (uintptr_t)(&_start);
>>>>> -       uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>>>>>           uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>>>>
>>>>> +       load_pa = (uintptr_t)(&_start);
>>>>> +       load_sz = (uintptr_t)(&_end) - load_pa;
>>>>> +
>>>>>           va_pa_offset = PAGE_OFFSET - load_pa;
>>>>> +       va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>>> +
>>>>>           pfn_base = PFN_DOWN(load_pa);
>>>>>
>>>>>           /*
>>>>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>           create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>>                              (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>>>           /* Setup trampoline PGD and PMD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>>> -       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>>> +       create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>>>>>                              load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #else
>>>>>           /* Setup trampoline PGD */
>>>>> -       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>>> +       create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>>>>>                              load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>>>>>    #endif
>>>>>
>>>>>           /*
>>>>> -        * Setup early PGD covering entire kernel which will allows
>>>>> +        * Setup early PGD covering entire kernel which will allow
>>>>>            * us to reach paging_init(). We map all memory banks later
>>>>>            * in setup_vm_final() below.
>>>>>            */
>>>>> -       end_va = PAGE_OFFSET + load_sz;
>>>>> -       for (va = PAGE_OFFSET; va < end_va; va += map_size)
>>>>> -               create_pgd_mapping(early_pg_dir, va,
>>>>> -                                  load_pa + (va - PAGE_OFFSET),
>>>>> -                                  map_size, PAGE_KERNEL_EXEC);
>>>>> +       create_kernel_page_table(early_pg_dir, map_size);
>>>>>
>>>>>           /* Create fixed mapping for early FDT parsing */
>>>>>           end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>>>>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>>>>>           uintptr_t va, map_size;
>>>>>           phys_addr_t pa, start, end;
>>>>>           struct memblock_region *reg;
>>>>> +       static struct vm_struct vm_kernel = { 0 };
>>>>>
>>>>>           /* Set mmu_enabled flag */
>>>>>           mmu_enabled = true;
>>>>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>>>>>                   for (pa = start; pa < end; pa += map_size) {
>>>>>                           va = (uintptr_t)__va(pa);
>>>>>                           create_pgd_mapping(swapper_pg_dir, va, pa,
>>>>> -                                          map_size, PAGE_KERNEL_EXEC);
>>>>> +                                          map_size, PAGE_KERNEL);
>>>>>                   }
>>>>>           }
>>>>>
>>>>> +       /* Map the kernel */
>>>>> +       create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>>>>> +
>>>>> +       /* Reserve the vmalloc area occupied by the kernel */
>>>>> +       vm_kernel.addr = (void *)kernel_virt_addr;
>>>>> +       vm_kernel.phys_addr = load_pa;
>>>>> +       vm_kernel.size = (load_sz + PMD_SIZE) & ~(PMD_SIZE - 1);
>>>>> +       vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>>>>> +       vm_kernel.caller = __builtin_return_address(0);
>>>>> +
>>>>> +       vm_area_add_early(&vm_kernel);
>>>>> +
>>>>>           /* Clear fixmap PTE and PMD mappings */
>>>>>           clear_fixmap(FIX_PTE);
>>>>>           clear_fixmap(FIX_PMD);
>>>>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>>>>> index e8e4dcd39fed..35703d5ef5fd 100644
>>>>> --- a/arch/riscv/mm/physaddr.c
>>>>> +++ b/arch/riscv/mm/physaddr.c
>>>>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>>>>
>>>>>    phys_addr_t __phys_addr_symbol(unsigned long x)
>>>>>    {
>>>>> -       unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>>>>> +       unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>>>>>           unsigned long kernel_end = (unsigned long)_end;
>>>>>
>>>>>           /*
>>>>> -- 
>>>>> 2.20.1
>>>>>
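
The patch quoted above keeps two separate virtual-to-physical offsets:
va_pa_offset for the linear mapping and va_kernel_pa_offset for the kernel
mapping. A minimal user-space sketch of that arithmetic follows. The struct,
the helper names and any addresses fed to it are assumptions made for
illustration only; this is not the kernel's own code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the offsets computed in setup_vm(). */
struct va_offsets {
	uint64_t kernel_virt_addr;    /* start of the kernel mapping */
	uint64_t va_pa_offset;        /* linear mapping VA minus load PA */
	uint64_t va_kernel_pa_offset; /* kernel mapping VA minus load PA */
};

/* Mirror the two assignments done in setup_vm() in the quoted patch. */
static inline struct va_offsets compute_offsets(uint64_t page_offset,
						uint64_t kernel_virt_addr,
						uint64_t load_pa)
{
	struct va_offsets o;

	o.kernel_virt_addr = kernel_virt_addr;
	o.va_pa_offset = page_offset - load_pa;
	o.va_kernel_pa_offset = kernel_virt_addr - load_pa;
	return o;
}

/* Translate a linear-mapping virtual address to a physical address. */
static inline uint64_t linear_va_to_pa(const struct va_offsets *o, uint64_t va)
{
	return va - o->va_pa_offset;
}

/* Translate a kernel-mapping virtual address to a physical address. */
static inline uint64_t kernel_va_to_pa(const struct va_offsets *o, uint64_t va)
{
	/* Addresses below the kernel mapping belong to another region. */
	assert(va >= o->kernel_virt_addr);
	return va - o->va_kernel_pa_offset;
}
```

Both helpers return the same physical address for the same byte of the
kernel image, since the image is reachable through either mapping; only the
offset that is subtracted differs.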

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE
  2020-05-24  8:52 ` [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE Alexandre Ghiti
  2020-05-26  9:05     ` Zong Li
@ 2020-05-29 12:04     ` Anup Patel
  1 sibling, 0 replies; 30+ messages in thread
From: Anup Patel @ 2020-05-29 12:04 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, Zong Li, linux-kernel@vger.kernel.org List,
	linuxppc-dev, linux-riscv

On Sun, May 24, 2020 at 2:25 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> This config allows compiling the kernel as PIE and relocating it to
> any virtual address at runtime: this paves the way to KASLR and to 4-level
> page table folding at runtime. Runtime relocation is possible since
> relocation metadata is embedded into the kernel.
>
> Note that relocating at runtime introduces an overhead even if the
> kernel is loaded at the same address it was linked at, and that the compiler
> options are those used in arm64, which uses the same RELA relocation
> format.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig              | 12 +++++++
>  arch/riscv/Makefile             |  5 ++-
>  arch/riscv/kernel/vmlinux.lds.S |  6 ++--
>  arch/riscv/mm/Makefile          |  4 +++
>  arch/riscv/mm/init.c            | 63 +++++++++++++++++++++++++++++++++
>  5 files changed, 87 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index a31e1a41913a..93127d5913fe 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -170,6 +170,18 @@ config PGTABLE_LEVELS
>         default 3 if 64BIT
>         default 2
>
> +config RELOCATABLE
> +       bool
> +       depends on MMU
> +       help
> +          This builds a kernel as a Position Independent Executable (PIE),
> +          which retains all relocation metadata required to relocate the
> +          kernel binary at runtime to a different virtual address than the
> +          address it was linked at.
> +          Since RISCV uses the RELA relocation format, this requires a
> +          relocation pass at runtime even if the kernel is loaded at the
> +          same address it was linked at.
> +
>  source "arch/riscv/Kconfig.socs"
>
>  menu "Platform type"
> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
> index fb6e37db836d..1406416ea743 100644
> --- a/arch/riscv/Makefile
> +++ b/arch/riscv/Makefile
> @@ -9,7 +9,10 @@
>  #
>
>  OBJCOPYFLAGS    := -O binary
> -LDFLAGS_vmlinux :=
> +ifeq ($(CONFIG_RELOCATABLE),y)
> +LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro
> +KBUILD_CFLAGS += -fPIE
> +endif
>  ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>         LDFLAGS_vmlinux := --no-relax
>  endif
> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> index a9abde62909f..e8ffba8c2044 100644
> --- a/arch/riscv/kernel/vmlinux.lds.S
> +++ b/arch/riscv/kernel/vmlinux.lds.S
> @@ -85,8 +85,10 @@ SECTIONS
>
>         BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
>
> -       .rel.dyn : {
> -               *(.rel.dyn*)
> +       .rela.dyn : ALIGN(8) {
> +               __rela_dyn_start = .;
> +               *(.rela .rela*)
> +               __rela_dyn_end = .;
>         }
>
>         _end = .;
> diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
> index 363ef01c30b1..dc5cdaa80bc1 100644
> --- a/arch/riscv/mm/Makefile
> +++ b/arch/riscv/mm/Makefile
> @@ -1,6 +1,10 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>
>  CFLAGS_init.o := -mcmodel=medany
> +ifdef CONFIG_RELOCATABLE
> +CFLAGS_init.o += -fno-pie
> +endif
> +
>  ifdef CONFIG_FTRACE
>  CFLAGS_REMOVE_init.o = -pg
>  endif
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 17f108baec4f..7074522d40c6 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -13,6 +13,9 @@
>  #include <linux/of_fdt.h>
>  #include <linux/libfdt.h>
>  #include <linux/set_memory.h>
> +#ifdef CONFIG_RELOCATABLE
> +#include <linux/elf.h>
> +#endif
>
>  #include <asm/fixmap.h>
>  #include <asm/tlbflush.h>
> @@ -379,6 +382,53 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>  #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>  #endif
>
> +#ifdef CONFIG_RELOCATABLE
> +extern unsigned long __rela_dyn_start, __rela_dyn_end;
> +
> +#ifdef CONFIG_64BIT
> +#define Elf_Rela Elf64_Rela
> +#define Elf_Addr Elf64_Addr
> +#else
> +#define Elf_Rela Elf32_Rela
> +#define Elf_Addr Elf32_Addr
> +#endif
> +
> +void __init relocate_kernel(uintptr_t load_pa)
> +{
> +       Elf_Rela *rela = (Elf_Rela *)&__rela_dyn_start;
> +       /*
> +        * This holds the offset between the linked virtual address and the
> +        * relocated virtual address.
> +        */
> +       uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR;
> +       /*
> +        * This holds the offset between kernel linked virtual address and
> +        * physical address.
> +        */
> +       uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa;
> +
> +       for ( ; rela < (Elf_Rela *)&__rela_dyn_end; rela++) {
> +               Elf_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
> +               Elf_Addr relocated_addr = rela->r_addend;
> +
> +               if (rela->r_info != R_RISCV_RELATIVE)
> +                       continue;
> +
> +               /*
> +                * Make sure to not relocate vdso symbols like rt_sigreturn
> +                * which are linked from the address 0 in vmlinux since
> +                * vdso symbol addresses are actually used as an offset from
> +                * mm->context.vdso in VDSO_OFFSET macro.
> +                */
> +               if (relocated_addr >= KERNEL_LINK_ADDR)
> +                       relocated_addr += reloc_offset;
> +
> +               *(Elf_Addr *)addr = relocated_addr;
> +       }
> +}
> +
> +#endif
> +
>  static uintptr_t load_pa, load_sz;
>
>  void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> @@ -405,6 +455,19 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
>         pfn_base = PFN_DOWN(load_pa);
>
> +#ifdef CONFIG_RELOCATABLE
> +#ifdef CONFIG_64BIT
> +       /*
> +        * Early page table uses only one PGDIR, which makes it possible
> +        * to map PGDIR_SIZE aligned on PGDIR_SIZE: if the relocation offset
> +        * makes the kernel cross over a PGDIR_SIZE boundary, raise a bug
> +        * since a part of the kernel would not get mapped.
> +        * This cannot happen on rv32 as we use the entire page directory level.
> +        */
> +       BUG_ON(PGDIR_SIZE - (kernel_virt_addr & (PGDIR_SIZE - 1)) < load_sz);
> +#endif
> +       relocate_kernel(load_pa);
> +#endif
>         /*
>          * Enforce boot alignment requirements of RV32 and
>          * RV64 by only allowing PMD or PGD mappings.
> --
> 2.20.1
>
>

Looks good to me as well.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup
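
For readers following the thread, the relocation pass and the PGDIR boundary
check in the quoted patch can be sketched in user space as below. This is a
simplified sketch under stated assumptions: struct rela64 stands in for
Elf64_Rela, the image is a plain byte buffer indexed by r_offset minus the
link address (the patch computes a physical address instead because the MMU
is still off), and the names apply_relative_relocs() and crosses_pgdir() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for Elf64_Rela (the kernel uses <linux/elf.h>). */
struct rela64 {
	uint64_t r_offset; /* link-time virtual address of the slot to patch */
	uint64_t r_info;   /* relocation type, symbol index assumed to be 0 */
	int64_t  r_addend; /* link-time value to store in the slot */
};

#define R_RISCV_RELATIVE 3 /* relocation type number from the RISC-V psABI */

/*
 * Apply R_RISCV_RELATIVE entries to an in-memory copy of the image.
 * link_addr is the address the image was linked at (image[0]),
 * new_addr is the address it is going to run at.
 */
static inline void apply_relative_relocs(uint8_t *image, uint64_t link_addr,
					 uint64_t new_addr,
					 const struct rela64 *rela,
					 size_t count)
{
	uint64_t reloc_offset = new_addr - link_addr;
	size_t i;

	for (i = 0; i < count; i++) {
		uint64_t value = (uint64_t)rela[i].r_addend;

		/* Any other relocation type is skipped, as in the patch. */
		if (rela[i].r_info != R_RISCV_RELATIVE)
			continue;

		/* Values below the link address (e.g. vdso offsets) stay as-is. */
		if (value >= link_addr)
			value += reloc_offset;

		memcpy(image + (rela[i].r_offset - link_addr), &value,
		       sizeof(value));
	}
}

/*
 * Same condition as the BUG_ON() in setup_vm(): nonzero when the range
 * [virt_addr, virt_addr + load_sz) runs past the end of the
 * pgdir_size-aligned window that contains virt_addr.
 */
static inline int crosses_pgdir(uint64_t virt_addr, uint64_t load_sz,
				uint64_t pgdir_size)
{
	return pgdir_size - (virt_addr & (pgdir_size - 1)) < load_sz;
}
```

Note that even when new_addr equals link_addr the loop still stores every
addend into its slot, which is the overhead the commit message mentions for
the RELA format.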

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 3/3] arch, scripts: Add script to check relocations at compile time
  2020-05-24  8:52 ` [PATCH v3 3/3] arch, scripts: Add script to check relocations at compile time Alexandre Ghiti
  2020-05-29 12:08     ` Anup Patel
@ 2020-05-29 12:08     ` Anup Patel
  0 siblings, 0 replies; 30+ messages in thread
From: Anup Patel @ 2020-05-29 12:08 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel,
	Atish Patra, Zong Li, linux-kernel@vger.kernel.org List,
	linuxppc-dev, linux-riscv

On Sun, May 24, 2020 at 2:26 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Relocating the kernel at runtime is done very early in the boot process, so
> it is not convenient to check for relocations there and react in case a
> relocation was not expected.
>
> The powerpc architecture has a script that checks at compile time
> for such unexpected relocations: extract the common logic to scripts/
> and add arch-specific scripts triggered at postlink.
>
> At the moment, the powerpc and riscv architectures take advantage of this
> compile-time check.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/powerpc/tools/relocs_check.sh | 18 ++-------------
>  arch/riscv/Makefile.postlink       | 36 ++++++++++++++++++++++++++++++
>  arch/riscv/tools/relocs_check.sh   | 26 +++++++++++++++++++++
>  scripts/relocs_check.sh            | 20 +++++++++++++++++
>  4 files changed, 84 insertions(+), 16 deletions(-)
>  create mode 100644 arch/riscv/Makefile.postlink
>  create mode 100755 arch/riscv/tools/relocs_check.sh
>  create mode 100755 scripts/relocs_check.sh

Maybe you should send the change to arch/powerpc/tools/relocs_check.sh
as a separate patch so that it can be picked up by the arch/powerpc maintainers.

>
> diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh
> index 014e00e74d2b..e367895941ae 100755
> --- a/arch/powerpc/tools/relocs_check.sh
> +++ b/arch/powerpc/tools/relocs_check.sh
> @@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
>         exit 1
>  fi
>
> -# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
> -objdump="$1"
> -nm="$2"
> -vmlinux="$3"
> -
> -# Remove from the bad relocations those that match an undefined weak symbol
> -# which will result in an absolute relocation to 0.
> -# Weak unresolved symbols are of that form in nm output:
> -# "                  w _binary__btf_vmlinux_bin_end"
> -undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
> -
>  bad_relocs=$(
> -$objdump -R "$vmlinux" |
> -       # Only look at relocation lines.
> -       grep -E '\<R_' |
> +${srctree}/scripts/relocs_check.sh "$@" |
>         # These relocations are okay
>         # On PPC64:
>         #       R_PPC64_RELATIVE, R_PPC64_NONE
> @@ -43,8 +30,7 @@ R_PPC_ADDR16_LO
>  R_PPC_ADDR16_HI
>  R_PPC_ADDR16_HA
>  R_PPC_RELATIVE
> -R_PPC_NONE' |
> -       ([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat)
> +R_PPC_NONE'
>  )
>
>  if [ -z "$bad_relocs" ]; then
> diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
> new file mode 100644
> index 000000000000..bf2b2bca1845
> --- /dev/null
> +++ b/arch/riscv/Makefile.postlink
> @@ -0,0 +1,36 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# ===========================================================================
> +# Post-link riscv pass
> +# ===========================================================================
> +#
> +# Check that vmlinux relocations look sane
> +
> +PHONY := __archpost
> +__archpost:
> +
> +-include include/config/auto.conf
> +include scripts/Kbuild.include
> +
> +quiet_cmd_relocs_check = CHKREL  $@
> +cmd_relocs_check =                                                     \
> +       $(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
> +
> +# `@true` prevents complaint when there is nothing to be done
> +
> +vmlinux: FORCE
> +       @true
> +ifdef CONFIG_RELOCATABLE
> +       $(call if_changed,relocs_check)
> +endif
> +
> +%.ko: FORCE
> +       @true
> +
> +clean:
> +       @true
> +
> +PHONY += FORCE clean
> +
> +FORCE:
> +
> +.PHONY: $(PHONY)
> diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
> new file mode 100755
> index 000000000000..baeb2e7b2290
> --- /dev/null
> +++ b/arch/riscv/tools/relocs_check.sh
> @@ -0,0 +1,26 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Based on powerpc relocs_check.sh
> +
> +# This script checks the relocations of a vmlinux for "suspicious"
> +# relocations.
> +
> +if [ $# -lt 3 ]; then
> +        echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
> +        exit 1
> +fi
> +
> +bad_relocs=$(
> +${srctree}/scripts/relocs_check.sh "$@" |
> +       # These relocations are okay
> +       #       R_RISCV_RELATIVE
> +       grep -F -w -v 'R_RISCV_RELATIVE'
> +)
> +
> +if [ -z "$bad_relocs" ]; then
> +       exit 0
> +fi
> +
> +num_bad=$(echo "$bad_relocs" | wc -l)
> +echo "WARNING: $num_bad bad relocations"
> +echo "$bad_relocs"
> diff --git a/scripts/relocs_check.sh b/scripts/relocs_check.sh
> new file mode 100755
> index 000000000000..137c660499f3
> --- /dev/null
> +++ b/scripts/relocs_check.sh
> @@ -0,0 +1,20 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +
> +# Get a list of all the relocations, remove from it the relocations
> +# that are known to be legitimate and return this list to arch specific
> +# script that will look for suspicious relocations.
> +
> +objdump="$1"
> +nm="$2"
> +vmlinux="$3"
> +
> +# Remove from the possible bad relocations those that match an undefined
> +# weak symbol which will result in an absolute relocation to 0.
> +# Weak unresolved symbols are of that form in nm output:
> +# "                  w _binary__btf_vmlinux_bin_end"
> +undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
> +
> +$objdump -R "$vmlinux" |
> +       grep -E '\<R_' |
> +       ([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat)
> --
> 2.20.1
>

Otherwise, looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup
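
For readers following the quoted scripts, the following is a minimal,
self-contained sketch of the filtering that scripts/relocs_check.sh and
arch/riscv/tools/relocs_check.sh appear to perform together. It is not part
of the patch. The nm and objdump output is made-up sample data (the
addresses and `some_symbol` are hypothetical), and the function names
sample_nm, sample_relocs and suspicious_relocs are illustrative only; the
real scripts run the tools that Kbuild passes in against vmlinux.

```shell
#!/bin/sh
# Sketch only: canned text replaces the real `$nm "$vmlinux"` and
# `$objdump -R "$vmlinux"` calls so the pipeline can be run anywhere.

# Hypothetical nm output: one undefined weak symbol, one defined symbol.
sample_nm() {
    cat <<'EOF'
                 w _binary__btf_vmlinux_bin_end
ffffffe000001000 T _start
EOF
}

# Hypothetical `objdump -R` output.
sample_relocs() {
    cat <<'EOF'
OFFSET           TYPE              VALUE
ffffffe000a00000 R_RISCV_RELATIVE  *ABS*+0xffffffe000001000
ffffffe000a00008 R_RISCV_64        _binary__btf_vmlinux_bin_end
ffffffe000a00010 R_RISCV_64        some_symbol
EOF
}

suspicious_relocs() {
    # Undefined weak symbols have no address, so "w" is the first field.
    undef_weak_symbols=$(sample_nm | awk '$1 ~ /w/ { print $2 }')

    sample_relocs |
        # Keep only relocation lines (drops the header).
        grep -E '\<R_' |
        # Common script: drop relocations against undefined weak symbols.
        ([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || cat) |
        # riscv script: drop the one relocation type it accepts.
        grep -F -w -v 'R_RISCV_RELATIVE'
}

suspicious_relocs
```

With this sample input, only the R_RISCV_64 relocation against
`some_symbol` should be left, which is the kind of line the riscv script
would report as a bad relocation.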

end of thread, other threads:[~2020-05-29 12:34 UTC | newest]

Thread overview: 30+ messages
2020-05-24  8:52 [PATCH v3 0/3] vmalloc kernel mapping and relocatable kernel Alexandre Ghiti
2020-05-24  8:52 ` [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone Alexandre Ghiti
2020-05-26  9:43   ` Zong Li
2020-05-26 17:06     ` Alex Ghiti
2020-05-27  6:05       ` Zong Li
2020-05-27  7:29         ` Alex Ghiti
2020-05-28 13:07           ` Alex Ghiti
2020-05-27  7:33   ` kbuild test robot
2020-05-24  8:52 ` [PATCH v3 2/3] riscv: Introduce CONFIG_RELOCATABLE Alexandre Ghiti
2020-05-26  9:05   ` Zong Li
2020-05-29 12:04   ` Anup Patel
2020-05-24  8:52 ` [PATCH v3 3/3] arch, scripts: Add script to check relocations at compile time Alexandre Ghiti
2020-05-29 12:08   ` Anup Patel
