* [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
@ 2019-07-16 16:56 Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 1/4] arm64, mm: identity mapped page table Pavel Tatashin
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-16 16:56 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel

Add an identity mapped page table, and keep the MMU enabled while the
kernel is being relocated from sparse pages to its final destination
during kexec.

A more detailed description of the problem I am trying to solve can be
found here:
https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/

This patch series works, in the sense that I can kexec-reboot both in QEMU
and on a physical machine. However, I do not see a performance improvement
during relocation: it is just as slow as it was with the caches disabled.

Am I missing something? Perhaps there is some flag that I should also
enable in the page table? Please send me any suggestions.

Pavel Tatashin (4):
  arm64, mm: identity mapped page table
  arm64, kexec: interface preparation for mmu enabled kexec
  arm64, kexec: add kexec's own identity page table
  arm64: Keep MMU on while kernel is being relocated

 arch/arm64/include/asm/ident_map.h  |  26 ++++++
 arch/arm64/include/asm/kexec.h      |   5 +-
 arch/arm64/kernel/cpu-reset.S       |   8 --
 arch/arm64/kernel/cpu-reset.h       |   7 +-
 arch/arm64/kernel/machine_kexec.c   | 128 +++++++++++++++++++++-------
 arch/arm64/kernel/relocate_kernel.S |  36 +++++---
 arch/arm64/mm/Makefile              |   1 +
 arch/arm64/mm/ident_map.c           |  99 +++++++++++++++++++++
 8 files changed, 255 insertions(+), 55 deletions(-)
 create mode 100644 arch/arm64/include/asm/ident_map.h
 create mode 100644 arch/arm64/mm/ident_map.c

-- 
2.22.0



* [RFC v1 1/4] arm64, mm: identity mapped page table
  2019-07-16 16:56 [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Pavel Tatashin
@ 2019-07-16 16:56 ` Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 2/4] arm64, kexec: interface preparation for mmu enabled kexec Pavel Tatashin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-16 16:56 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel

Create an identity mapped page table that maps virtual to physical
addresses 1:1.

As on x86, this table can be used by kasan, hibernate, and kexec.
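
For illustration, a minimal caller of this interface might look like the
sketch below (not part of this patch; my_alloc_pgt_page() and the
GFP-based allocation strategy are placeholders):

/* Assumes <linux/gfp.h> and <asm/ident_map.h>. */
static void *my_alloc_pgt_page(void *arg)
{
	/* Any page-granular, zeroing allocator will do here. */
	return (void *)get_zeroed_page(GFP_KERNEL);
}

static int example_ident_map(phys_addr_t start, phys_addr_t end)
{
	struct ident_map_info info = {
		.alloc_pgt_page	= my_alloc_pgt_page,
		.alloc_arg	= NULL,
		.page_flags	= PMD_SECT_VALID | PMD_SECT_AF |
				  PMD_ATTRINDX(MT_NORMAL),
		.offset		= 0,		/* identity: VA == PA */
		.pud_pages	= false,	/* PMD-size block mappings */
	};
	void *pgd = my_alloc_pgt_page(NULL);

	if (!pgd)
		return -ENOMEM;

	return ident_map_pgd_populate(&info, __pa(pgd), start, end);
}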

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/include/asm/ident_map.h | 26 ++++++++
 arch/arm64/mm/Makefile             |  1 +
 arch/arm64/mm/ident_map.c          | 99 ++++++++++++++++++++++++++++++
 3 files changed, 126 insertions(+)
 create mode 100644 arch/arm64/include/asm/ident_map.h
 create mode 100644 arch/arm64/mm/ident_map.c

diff --git a/arch/arm64/include/asm/ident_map.h b/arch/arm64/include/asm/ident_map.h
new file mode 100644
index 000000000000..1bb9fcd27368
--- /dev/null
+++ b/arch/arm64/include/asm/ident_map.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
+ */
+
+#ifndef _ASM_IDENT_MAP_H
+#define _ASM_IDENT_MAP_H
+
+#include <linux/types.h>
+#include <asm/pgtable.h>
+
+struct ident_map_info {
+	void * (*alloc_pgt_page)(void *);	/* allocate a page  */
+	void *alloc_arg;			/* arg. for alloc_pgt_page */
+	unsigned long page_flags;		/* PMD or PUD flags */
+	unsigned long offset;			/* ident mapping offset */
+	bool pud_pages;				/* PUD level huge pages */
+};
+
+int ident_map_pgd_populate(struct ident_map_info *info,
+			   phys_addr_t pgd_page,
+			   phys_addr_t addr,
+			   phys_addr_t end);
+
+#endif /* _ASM_IDENT_MAP_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..dfa5a074a360 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -5,6 +5,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   context.o proc.o pageattr.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_ARM64_PTDUMP_CORE)	+= dump.o
+obj-$(CONFIG_KEXEC_CORE)	+= ident_map.o
 obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
diff --git a/arch/arm64/mm/ident_map.c b/arch/arm64/mm/ident_map.c
new file mode 100644
index 000000000000..bcfff5e2573b
--- /dev/null
+++ b/arch/arm64/mm/ident_map.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
+ */
+
+#include <asm/ident_map.h>
+#include <asm/pgalloc.h>
+
+/* Initialize PMD size huge entries in page table */
+static void ident_map_pmd_init(struct ident_map_info *info,
+			       phys_addr_t pmd_page, phys_addr_t addr,
+			       phys_addr_t end)
+{
+	const unsigned long flags = info->page_flags;
+	const unsigned long offset = info->offset;
+	pmd_t *pmdp = (pmd_t *)__va(pmd_page) + pmd_index(addr);
+
+	addr &= PMD_MASK;
+	for (; addr < end; addr += PMD_SIZE, pmdp++) {
+		set_pmd(pmdp, __pmd(__phys_to_pmd_val(addr - offset) | flags));
+	}
+}
+
+/* Initialize PUD size huge entries in page table */
+static void ident_map_pud_init(struct ident_map_info *info,
+			       phys_addr_t pud_page, phys_addr_t addr,
+			       phys_addr_t end)
+{
+	const unsigned long flags = info->page_flags;
+	const unsigned long offset = info->offset;
+	pud_t *pudp = (pud_t *)__va(pud_page) + pud_index(addr);
+
+	addr &= PUD_MASK;
+	for (; addr < end; addr += PUD_SIZE, pudp++) {
+		set_pud(pudp, __pud(__phys_to_pud_val(addr - offset) | flags));
+	}
+}
+
+/* Populate PUD level with PMD entries */
+static int ident_map_pud_populate(struct ident_map_info *info,
+				  phys_addr_t pud_page, phys_addr_t addr,
+				  phys_addr_t end)
+{
+	pud_t *pudp = (pud_t *)__va(pud_page) + pud_index(addr);
+	phys_addr_t pmd_page, next;
+
+	for (; addr < end; addr = next, pudp++) {
+		next = pud_addr_end(addr, end);
+		if (pud_none(*pudp)) {
+			void *pmd = info->alloc_pgt_page(info->alloc_arg);
+
+			if (!pmd)
+				return -ENOMEM;
+
+			clear_page(pmd);
+			__pud_populate(pudp, __pa(pmd), PUD_TYPE_TABLE);
+		}
+		pmd_page = __pud_to_phys(*pudp);
+		ident_map_pmd_init(info, pmd_page, addr, next);
+	}
+
+	return 0;
+}
+
+/* Populate identity mapped page table with physical range [addr, end) */
+int ident_map_pgd_populate(struct ident_map_info *info,
+			   phys_addr_t pgd_page, phys_addr_t addr,
+			   phys_addr_t end)
+{
+	const bool pud_pages = info->pud_pages;
+	pgd_t *pgdp = (pgd_t *)__va(pgd_page) + pgd_index(addr);
+	phys_addr_t pud_page, next;
+
+	for (; addr < end; addr = next, pgdp++) {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp)) {
+			void *pud = info->alloc_pgt_page(info->alloc_arg);
+
+			if (!pud)
+				return -ENOMEM;
+
+			clear_page(pud);
+			__pgd_populate(pgdp, __pa(pud), PUD_TYPE_TABLE);
+		}
+		pud_page = __pgd_to_phys(*pgdp);
+		if (pud_pages) {
+			ident_map_pud_init(info, pud_page, addr, next);
+		} else {
+			int rv = ident_map_pud_populate(info, pud_page, addr,
+				 next);
+
+			if (rv)
+				return rv;
+		}
+	}
+
+	return 0;
+}
-- 
2.22.0



* [RFC v1 2/4] arm64, kexec: interface preparation for mmu enabled kexec
  2019-07-16 16:56 [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 1/4] arm64, mm: identity mapped page table Pavel Tatashin
@ 2019-07-16 16:56 ` Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 3/4] arm64, kexec: add kexec's own identity page table Pavel Tatashin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-16 16:56 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel

Currently, cpu_install_idmap() is used to install the page table during
the kexec switch-over to purgatory. We will soon be using our own page
table, which maps the whole physical range (and possibly more, e.g. if
the new DTB describes a bigger physical range, or if the mem= parameter
limited the physical range in the current kernel).

Make kimage_arch always part of arm64, and add relocate_kern and
kexec_pgtable pointers to this struct, as we won't be able to rely on a
single control page anymore.

Copy the relocation function in machine_kexec_prepare(), and set up the
page table there as well (for now, idmap_pg_dir).

Clean up the call to cpu_soft_restart() by removing the ugly ifdefs:
when kimage->arch.dtb_mem is not set, it is 0 anyway.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/include/asm/kexec.h    |  5 +--
 arch/arm64/kernel/cpu-reset.h     |  7 ++++-
 arch/arm64/kernel/machine_kexec.c | 52 ++++++++++++++-----------------
 3 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 12a561a54128..ef2d2442b890 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,14 +90,15 @@ static inline void crash_prepare_suspend(void) {}
 static inline void crash_post_resume(void) {}
 #endif
 
-#ifdef CONFIG_KEXEC_FILE
 #define ARCH_HAS_KIMAGE_ARCH
-
 struct kimage_arch {
 	void *dtb;
 	unsigned long dtb_mem;
+	void  *relocate_kern;
+	pgd_t *kexec_pgtable;
 };
 
+#ifdef CONFIG_KEXEC_FILE
 extern const struct kexec_file_ops kexec_image_ops;
 
 struct kimage;
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index ed50e9587ad8..c795811587f0 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -14,6 +14,7 @@ void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
 	unsigned long arg0, unsigned long arg1, unsigned long arg2);
 
 static inline void __noreturn cpu_soft_restart(unsigned long entry,
+					       pgd_t *kexec_pgtable,
 					       unsigned long arg0,
 					       unsigned long arg1,
 					       unsigned long arg2)
@@ -24,7 +25,11 @@ static inline void __noreturn cpu_soft_restart(unsigned long entry,
 		is_hyp_mode_available();
 	restart = (void *)__pa_symbol(__cpu_soft_restart);
 
-	cpu_install_idmap();
+	cpu_set_reserved_ttbr0();
+	local_flush_tlb_all();
+	write_sysreg(phys_to_ttbr(virt_to_phys(kexec_pgtable)), ttbr0_el1);
+	isb();
+
 	restart(el2_switch, entry, arg0, arg1, arg2);
 	unreachable();
 }
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 0df8493624e0..f4565eb01d09 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -42,6 +42,8 @@ static void _kexec_image_info(const char *func, int line,
 	pr_debug("    start:       %lx\n", kimage->start);
 	pr_debug("    head:        %lx\n", kimage->head);
 	pr_debug("    nr_segments: %lu\n", kimage->nr_segments);
+	pr_debug("    arch.kexec_pgtable: %p\n", kimage->arch.kexec_pgtable);
+	pr_debug("    arch.relocate_kern: %p\n", kimage->arch.relocate_kern);
 
 	for (i = 0; i < kimage->nr_segments; i++) {
 		pr_debug("      segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu pages\n",
@@ -67,13 +69,24 @@ void machine_kexec_cleanup(struct kimage *kimage)
  */
 int machine_kexec_prepare(struct kimage *kimage)
 {
-	kexec_image_info(kimage);
+	void *reloc_buf = page_address(kimage->control_code_page);
 
 	if (kimage->type != KEXEC_TYPE_CRASH && cpus_are_stuck_in_kernel()) {
 		pr_err("Can't kexec: CPUs are stuck in the kernel.\n");
 		return -EBUSY;
 	}
 
+	/*
+	 * Copy arm64_relocate_new_kernel to the buffer for use after the kernel
+	 * is shut down.
+	 */
+	memcpy(reloc_buf, arm64_relocate_new_kernel,
+	       arm64_relocate_new_kernel_size);
+
+	kimage->arch.relocate_kern = reloc_buf;
+	kimage->arch.kexec_pgtable = lm_alias(idmap_pg_dir);
+	kexec_image_info(kimage);
+
 	return 0;
 }
 
@@ -143,8 +156,6 @@ static void kexec_segment_flush(const struct kimage *kimage)
  */
 void machine_kexec(struct kimage *kimage)
 {
-	phys_addr_t reboot_code_buffer_phys;
-	void *reboot_code_buffer;
 	bool in_kexec_crash = (kimage == kexec_crash_image);
 	bool stuck_cpus = cpus_are_stuck_in_kernel();
 
@@ -155,32 +166,17 @@ void machine_kexec(struct kimage *kimage)
 	WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
 		"Some CPUs may be stale, kdump will be unreliable.\n");
 
-	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
-	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
-
 	kexec_image_info(kimage);
-
-	pr_debug("%s:%d: control_code_page:        %p\n", __func__, __LINE__,
-		kimage->control_code_page);
-	pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
-		&reboot_code_buffer_phys);
-	pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
-		reboot_code_buffer);
 	pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
 		arm64_relocate_new_kernel);
 	pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
 		__func__, __LINE__, arm64_relocate_new_kernel_size,
 		arm64_relocate_new_kernel_size);
 
-	/*
-	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
-	 * after the kernel is shut down.
-	 */
-	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
-		arm64_relocate_new_kernel_size);
 
-	/* Flush the reboot_code_buffer in preparation for its execution. */
-	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+	/* Flush relocate_kern in preparation for its execution. */
+	__flush_dcache_area(kimage->arch.relocate_kern,
+			    arm64_relocate_new_kernel_size);
 
 	/*
 	 * Although we've killed off the secondary CPUs, we don't update
@@ -188,7 +184,7 @@ void machine_kexec(struct kimage *kimage)
 	 * need to avoid flush_icache_range(), which will attempt to IPI
 	 * the offline CPUs. Therefore, we must use the __* variant here.
 	 */
-	__flush_icache_range((uintptr_t)reboot_code_buffer,
+	__flush_icache_range((uintptr_t)kimage->arch.relocate_kern,
 			     arm64_relocate_new_kernel_size);
 
 	/* Flush the kimage list and its buffers. */
@@ -204,7 +200,7 @@ void machine_kexec(struct kimage *kimage)
 
 	/*
 	 * cpu_soft_restart will shutdown the MMU, disable data caches, then
-	 * transfer control to the reboot_code_buffer which contains a copy of
+	 * transfer control to relocate_kern, which contains a copy of
 	 * the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
 	 * uses physical addressing to relocate the new image to its final
 	 * position and transfers control to the image entry point when the
@@ -214,12 +210,10 @@ void machine_kexec(struct kimage *kimage)
 	 * userspace (kexec-tools).
 	 * In kexec_file case, the kernel starts directly without purgatory.
 	 */
-	cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start,
-#ifdef CONFIG_KEXEC_FILE
-						kimage->arch.dtb_mem);
-#else
-						0);
-#endif
+	cpu_soft_restart(__pa(kimage->arch.relocate_kern),
+			 kimage->arch.kexec_pgtable,
+			 kimage->head, kimage->start,
+			 kimage->arch.dtb_mem);
 
 	BUG(); /* Should never get here. */
 }
-- 
2.22.0



* [RFC v1 3/4] arm64, kexec: add kexec's own identity page table
  2019-07-16 16:56 [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 1/4] arm64, mm: identity mapped page table Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 2/4] arm64, kexec: interface preparation for mmu enabled kexec Pavel Tatashin
@ 2019-07-16 16:56 ` Pavel Tatashin
  2019-07-16 16:56 ` [RFC v1 4/4] arm64: Keep MMU on while kernel is being relocated Pavel Tatashin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-16 16:56 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel

Allocate and configure an identity page table to be used for kexec
reboot. Note that, for now, we still have the MMU disabled during the
kernel relocation phase, so this table is used the same way idmap_pg_dir
was used.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/kernel/machine_kexec.c | 78 ++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index f4565eb01d09..60433c264178 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -12,6 +12,7 @@
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
 #include <linux/smp.h>
+#include <linux/memblock.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cpu_ops.h>
@@ -20,6 +21,7 @@
 #include <asm/mmu.h>
 #include <asm/mmu_context.h>
 #include <asm/page.h>
+#include <asm/ident_map.h>
 
 #include "cpu-reset.h"
 
@@ -55,6 +57,77 @@ static void _kexec_image_info(const char *func, int line,
 	}
 }
 
+/* Allocates pages for kexec page table */
+static void *kexec_pgtable_alloc(void *arg)
+{
+	struct kimage *kimage = (struct kimage *)arg;
+	struct page *page = kimage_alloc_control_pages(kimage, 0);
+
+	if (!page)
+		return NULL;
+
+	return page_address(page);
+}
+
+/*
+ * Create identity mapped page table for kexec purposes. The flags that are used
+ * in this page table are the same as what is set in __create_page_tables. The
+ * page table is needed for performance reasons. Without it, kernel relocation
+ * is rather slow, because when the MMU is off, the d-cache is disabled as well.
+ */
+static int
+kexec_create_pgtable(struct kimage *kimage)
+{
+	void *pgd_page = kexec_pgtable_alloc(kimage);
+	phys_addr_t kexec_pgtable;
+	int rv, i;
+	struct memblock_region *reg;
+	struct ident_map_info info = {
+		.alloc_pgt_page	= kexec_pgtable_alloc,
+		.alloc_arg	= kimage,
+		.page_flags	= PMD_SECT_VALID | PMD_SECT_AF | PMD_SECT_S |
+				  PMD_ATTRINDX(MT_NORMAL),
+		.offset		= 0,
+		.pud_pages	= false,
+	};
+
+	if (!pgd_page)
+		return -ENOMEM;
+
+	clear_page(pgd_page);
+	kexec_pgtable = __pa(pgd_page);
+
+	for_each_memblock(memory, reg) {
+		phys_addr_t mstart = reg->base;
+		phys_addr_t mend   = reg->base + reg->size;
+
+		rv = ident_map_pgd_populate(&info, kexec_pgtable, mstart, mend);
+		if (rv)
+			return rv;
+	}
+
+	/*
+	 * It is possible that the new kernel knows of physical addresses that
+	 * this kernel does not: for example, a different device tree might
+	 * describe an extra memory region, or memory could have been reduced
+	 * via the mem= kernel parameter.
+	 * This is why we also unconditionally map the new kernel's segments,
+	 * even though this is most likely redundant.
+	 */
+	for (i = 0; i < kimage->nr_segments; i++) {
+		phys_addr_t mstart = kimage->segment[i].mem;
+		phys_addr_t mend   = mstart + kimage->segment[i].memsz;
+
+		rv = ident_map_pgd_populate(&info, kexec_pgtable, mstart, mend);
+		if (rv)
+			return rv;
+	}
+
+	kimage->arch.kexec_pgtable = pgd_page;
+
+	return 0;
+}
+
 void machine_kexec_cleanup(struct kimage *kimage)
 {
 	/* Empty routine needed to avoid build errors. */
@@ -70,6 +143,7 @@ void machine_kexec_cleanup(struct kimage *kimage)
 int machine_kexec_prepare(struct kimage *kimage)
 {
 	void *reloc_buf = page_address(kimage->control_code_page);
+	int rv;
 
 	if (kimage->type != KEXEC_TYPE_CRASH && cpus_are_stuck_in_kernel()) {
 		pr_err("Can't kexec: CPUs are stuck in the kernel.\n");
@@ -84,10 +158,10 @@ int machine_kexec_prepare(struct kimage *kimage)
 	       arm64_relocate_new_kernel_size);
 
 	kimage->arch.relocate_kern = reloc_buf;
-	kimage->arch.kexec_pgtable = lm_alias(idmap_pg_dir);
+	rv = kexec_create_pgtable(kimage);
 	kexec_image_info(kimage);
 
-	return 0;
+	return rv;
 }
 
 /**
-- 
2.22.0



* [RFC v1 4/4] arm64: Keep MMU on while kernel is being relocated
  2019-07-16 16:56 [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Pavel Tatashin
                   ` (2 preceding siblings ...)
  2019-07-16 16:56 ` [RFC v1 3/4] arm64, kexec: add kexec's own identity page table Pavel Tatashin
@ 2019-07-16 16:56 ` Pavel Tatashin
  2019-07-16 19:14 ` [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Bhupesh Sharma
  2019-07-17 17:51 ` James Morse
  5 siblings, 0 replies; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-16 16:56 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel

It is inefficient to do kernel relocation with the MMU disabled: when
the MMU is disabled, the d-cache must also be disabled.

Now that we have an identity page table, we can disable the MMU after
the relocation has completed.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/kernel/cpu-reset.S       |  8 -------
 arch/arm64/kernel/relocate_kernel.S | 36 ++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 6ea337d464c4..d5cfc17b8e1f 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -30,14 +30,6 @@
  * flat identity mapping.
  */
 ENTRY(__cpu_soft_restart)
-	/* Clear sctlr_el1 flags. */
-	mrs	x12, sctlr_el1
-	ldr	x13, =SCTLR_ELx_FLAGS
-	bic	x12, x12, x13
-	pre_disable_mmu_workaround
-	msr	sctlr_el1, x12
-	isb
-
 	cbz	x0, 1f				// el2_switch?
 	mov	x0, #HVC_SOFT_RESTART
 	hvc	#0				// no return
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index c1d7db71a726..e2724fedd082 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -36,18 +36,6 @@ ENTRY(arm64_relocate_new_kernel)
 	mov	x14, xzr			/* x14 = entry ptr */
 	mov	x13, xzr			/* x13 = copy dest */
 
-	/* Clear the sctlr_el2 flags. */
-	mrs	x0, CurrentEL
-	cmp	x0, #CurrentEL_EL2
-	b.ne	1f
-	mrs	x0, sctlr_el2
-	ldr	x1, =SCTLR_ELx_FLAGS
-	bic	x0, x0, x1
-	pre_disable_mmu_workaround
-	msr	sctlr_el2, x0
-	isb
-1:
-
 	/* Check if the new image needs relocation. */
 	tbnz	x16, IND_DONE_BIT, .Ldone
 
@@ -63,10 +51,10 @@ ENTRY(arm64_relocate_new_kernel)
 	add     x20, x0, #PAGE_SIZE
 	sub     x1, x15, #1
 	bic     x0, x0, x1
-2:	dc      ivac, x0
+1:	dc      ivac, x0
 	add     x0, x0, x15
 	cmp     x0, x20
-	b.lo    2b
+	b.lo    1b
 	dsb     sy
 
 	mov x20, x13
@@ -104,6 +92,26 @@ ENTRY(arm64_relocate_new_kernel)
 	dsb	nsh
 	isb
 
+	/* Clear sctlr_el1 flags. */
+	mrs	x12, sctlr_el1
+	ldr	x13, =SCTLR_ELx_FLAGS
+	bic	x12, x12, x13
+	pre_disable_mmu_workaround
+	msr	sctlr_el1, x12
+	isb
+
+	/* Clear the sctlr_el2 flags. */
+	mrs	x0, CurrentEL
+	cmp	x0, #CurrentEL_EL2
+	b.ne	2f
+	mrs	x0, sctlr_el2
+	ldr	x1, =SCTLR_ELx_FLAGS
+	bic	x0, x0, x1
+	pre_disable_mmu_workaround
+	msr	sctlr_el2, x0
+	isb
+2:
+
 	/* Start new image. */
 	mov	x0, x18
 	mov	x1, xzr
-- 
2.22.0



* Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
  2019-07-16 16:56 [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Pavel Tatashin
                   ` (3 preceding siblings ...)
  2019-07-16 16:56 ` [RFC v1 4/4] arm64: Keep MMU on while kernel is being relocated Pavel Tatashin
@ 2019-07-16 19:14 ` Bhupesh Sharma
  2019-07-16 19:26   ` Pavel Tatashin
  2019-07-17 17:51 ` James Morse
  5 siblings, 1 reply; 10+ messages in thread
From: Bhupesh Sharma @ 2019-07-16 19:14 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: sashal, Jonathan Corbet, Catalin Marinas, Linux Doc Mailing List,
	kexec mailing list, Linux Kernel Mailing List, James Morris,
	Eric Biederman, will, linux-arm-kernel

Hi Pavel,

On Tue, Jul 16, 2019 at 10:26 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> Add an identity mapped page table, and keep the MMU enabled while the
> kernel is being relocated from sparse pages to its final destination
> during kexec.
>
> A more detailed description of the problem I am trying to solve can be
> found here:
> https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/
>
> This patch series works, in the sense that I can kexec-reboot both in QEMU
> and on a physical machine. However, I do not see a performance improvement
> during relocation: it is just as slow as it was with the caches disabled.

Thanks for the patchset, but if the changes still don't positively
impact the kexec-reboot timings, I am not sure what we gain by adding
these to the kernel.

Like I mentioned in the previous threads, we have been carrying some
relevant fixes for the same in Linux distros. I have been trying to
find time to fix them and send them upstream, but I am caught up with
some nasty kexec_file_load() issues on arm64 currently.

So, I will find some time to work on them (may be next week) and will
Cc you when I post them out after some checks on real physical
hardware.

Thanks,
Bhupesh

> Am I missing something? Perhaps there is some flag that I should also
> enable in the page table? Please send me any suggestions.
>
> Pavel Tatashin (4):
>   arm64, mm: identity mapped page table
>   arm64, kexec: interface preparation for mmu enabled kexec
>   arm64, kexec: add kexec's own identity page table
>   arm64: Keep MMU on while kernel is being relocated
>
>  arch/arm64/include/asm/ident_map.h  |  26 ++++++
>  arch/arm64/include/asm/kexec.h      |   5 +-
>  arch/arm64/kernel/cpu-reset.S       |   8 --
>  arch/arm64/kernel/cpu-reset.h       |   7 +-
>  arch/arm64/kernel/machine_kexec.c   | 128 +++++++++++++++++++++-------
>  arch/arm64/kernel/relocate_kernel.S |  36 +++++---
>  arch/arm64/mm/Makefile              |   1 +
>  arch/arm64/mm/ident_map.c           |  99 +++++++++++++++++++++
>  8 files changed, 255 insertions(+), 55 deletions(-)
>  create mode 100644 arch/arm64/include/asm/ident_map.h
>  create mode 100644 arch/arm64/mm/ident_map.c
>
> --
> 2.22.0


* Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
  2019-07-16 19:14 ` [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Bhupesh Sharma
@ 2019-07-16 19:26   ` Pavel Tatashin
  0 siblings, 0 replies; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-16 19:26 UTC (permalink / raw)
  To: Bhupesh Sharma
  Cc: Sasha Levin, Jonathan Corbet, Catalin Marinas,
	Linux Doc Mailing List, kexec mailing list,
	Linux Kernel Mailing List, James Morris, Eric Biederman, will,
	linux-arm-kernel

On Tue, Jul 16, 2019 at 3:14 PM Bhupesh Sharma <bhsharma@redhat.com> wrote:
>
> Hi Pavel,
>
> On Tue, Jul 16, 2019 at 10:26 PM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > Add an identity mapped page table, and keep the MMU enabled while the
> > kernel is being relocated from sparse pages to its final destination
> > during kexec.
> >
> > A more detailed description of the problem I am trying to solve can be
> > found here:
> > https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/
> >
> > This patch series works, in the sense that I can kexec-reboot both in QEMU
> > and on a physical machine. However, I do not see a performance improvement
> > during relocation: it is just as slow as it was with the caches disabled.
>
> Thanks for the patchset, but if the changes still don't positively
> impact the kexec-reboot timings, I am not sure what we gain by adding
> these to the kernel.

Hi Bhupesh,

I am not asking for these to be added to the kernel (hence the RFC); I
am looking for help to figure out why the relocation is still slow. Once
that is understood, I will submit the patches for integration. My
previous patch series fixed the relocation problem by pre-reserving
space, but because the culprit was narrowed down to the disabled caches,
it was decided that a better fix would be to do the relocation with the
MMU still enabled; this is why I created this new series.

>
> Like I mentioned in the previous threads, we have been carrying some
> relevant fixes for the same in Linux distros. I have been trying to
> find time to fix them and send them upstream, but I am caught up with
> some nasty kexec_file_load() issues on arm64 currently.

As I understood it, the fixes were for the slow purgatory checksum
verification, not for the relocation of the kernel. Are you saying
Red Hat is carrying patches that address the slow relocation problem as
well?

Thank you,
Pasha

>
> So, I will find some time to work on them (may be next week) and will
> Cc you when I post them out after some checks on real physical
> hardware.
>
> Thanks,
> Bhupesh
>
> > Am I missing something? Perhaps there is some flag that I should also
> > enable in the page table? Please send me any suggestions.
> >
> > Pavel Tatashin (4):
> >   arm64, mm: identity mapped page table
> >   arm64, kexec: interface preparation for mmu enabled kexec
> >   arm64, kexec: add kexec's own identity page table
> >   arm64: Keep MMU on while kernel is being relocated
> >
> >  arch/arm64/include/asm/ident_map.h  |  26 ++++++
> >  arch/arm64/include/asm/kexec.h      |   5 +-
> >  arch/arm64/kernel/cpu-reset.S       |   8 --
> >  arch/arm64/kernel/cpu-reset.h       |   7 +-
> >  arch/arm64/kernel/machine_kexec.c   | 128 +++++++++++++++++++++-------
> >  arch/arm64/kernel/relocate_kernel.S |  36 +++++---
> >  arch/arm64/mm/Makefile              |   1 +
> >  arch/arm64/mm/ident_map.c           |  99 +++++++++++++++++++++
> >  8 files changed, 255 insertions(+), 55 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/ident_map.h
> >  create mode 100644 arch/arm64/mm/ident_map.c
> >
> > --
> > 2.22.0


* Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
  2019-07-16 16:56 [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Pavel Tatashin
                   ` (4 preceding siblings ...)
  2019-07-16 19:14 ` [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Bhupesh Sharma
@ 2019-07-17 17:51 ` James Morse
  2019-07-17 19:13   ` Pavel Tatashin
  5 siblings, 1 reply; 10+ messages in thread
From: James Morse @ 2019-07-17 17:51 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: sashal, corbet, catalin.marinas, linux-doc, kexec, linux-kernel,
	jmorris, ebiederm, will, linux-arm-kernel

Hi Pavel,

On 16/07/2019 17:56, Pavel Tatashin wrote:
> Add an identity mapped page table, and keep the MMU enabled while the
> kernel is being relocated from sparse pages to its final destination
> during kexec.

The 'tl;dr' version of this: I strongly urge you to start with the hibernate code that
already covers all these known corner cases. x86 was not a good starting point.


After a quick skim:

This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
These regions were described by firmware as having content that was/is written with
different attributes. The attributes must match whenever it is mapped, otherwise we have a
loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
cache whenever it likes.
It may be important that we do not ever map some of these regions, even though its
described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
its own use; it is made secure-only, and any access causes an
external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
'nomap' attribute.

Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
size causing more memory to become cacheable than was intended.
Linux's EFI support rounds the UEFI memory map to the largest supported page size (and
winges about firmware bugs).
If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
that is a bug we should fix.

The only way to do this properly is to copy the linear mapping. The arch code has lots of
complex code to generate it correctly at boot, we do not want to duplicate it.
(this is why hibernate copies the linear mapping)


These patches do not remove the running page tables from TTBR1. As you overwrite the live
page tables you will corrupt the state of the CPU. The page-table walker may access things
that aren't memory, cache memory that shouldn't be cached (see above), and allocate
conflicting entries in the TLB.

You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
helpers have a compile-time VA_BITS, and we support systems where there is no memory below
1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
the idmap will have more page table levels than the page table helpers can build. This is
why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
You already need to copy the linear-map, so using an idmap is extra work. You want to work
with linear-map addresses, you probably need to add the field to the appropriate structure.
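
(For reference, cpu_install_idmap(), one of those special helpers, looks
roughly like the following; this is quoted from memory, so treat the
details as approximate:

static inline void cpu_install_idmap(void)
{
	cpu_set_reserved_ttbr0();	/* point TTBR0 at the zero page */
	local_flush_tlb_all();
	cpu_set_idmap_tcr_t0sz();	/* widen T0SZ to cover the idmap */
	cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
}

Note the T0SZ twiddling, which the mm page table helpers know nothing
about.)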

The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
as there is only one TTBR on v8.0, and you'd need to set up EL2 as it's been torn back to
the hyp-stub. This is the reason hibernate parks EL2 in a holding pen while it rewrites
all of memory, then calls back to fix up EL2. Keeping the rewrite phase at EL1 means it
doesn't need independent tweaking/testing. You need to do something similar, either
calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

You will need to alter the relocation code to do nothing for kdump, as no relocation is
required and building page-tables is extra work where the kernel may croak, preventing us
from reaching kdump.

Finally, having this independent idmap machinery isn't desirable from a maintenance
perspective. Please start with the hibernate code that already solves a very similar
problem, as it already has most of these problems covered.


> This patch series works, in the sense that I can kexec-reboot both in QEMU

I wouldn't expect QEMU's emulation of the MMU and caches to be performance-accurate.


> and on a physical machine. However, I do not see a performance improvement
> during relocation: it is just as slow as it was with the caches disabled.

> Am I missing something? Perhaps there is some flag that I should also
> enable in the page table? Please send me any suggestions.

Some information about the physical machine you tested this on would help.
I'm guessing it's v8.0, and booted at EL2....


Thanks,

James


* Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
  2019-07-17 17:51 ` James Morse
@ 2019-07-17 19:13   ` Pavel Tatashin
  2019-07-26 14:00     ` James Morse
  0 siblings, 1 reply; 10+ messages in thread
From: Pavel Tatashin @ 2019-07-17 19:13 UTC (permalink / raw)
  To: James Morse
  Cc: Sasha Levin, Jonathan Corbet, Catalin Marinas,
	Linux Doc Mailing List, kexec mailing list, LKML, James Morris,
	Eric W. Biederman, will, Linux ARM

Hi James,

Thank you for taking a look at this work.

> After a quick skim:
>
> This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
> These regions were described by firmware as having content that was/is written with
> different attributes. The attributes must match whenever it is mapped, otherwise we have a
> loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
> cache whenever it likes.

> It may be important that we do not ever map some of these regions, even though its
> described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
> its own use; it is made secure-only, and any access causes an
> external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
> the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
> 'nomap' attribute.
>
> Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
> trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
> size causing more memory to become cacheable than was intended.
> Linux's EFI support rounds the UEFI memory map to the largest supported page size (and
> winges about firmware bugs).
> If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
> that is a bug we should fix.

We are allowing this. If you consider it to be a bug, I will fix it,
and that will actually simplify the idmap page table: the user will
receive an error during kexec load if a request is made to load into a
!IORESOURCE_SYSTEM_RAM region.
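
Something along these lines is what I have in mind (an untested sketch;
kexec_segments_in_ram() is a name I am making up, and I still need to
verify that region_intersects() is appropriate for this check):

/* Assumes <linux/ioport.h> and <linux/kexec.h>. */
static int kexec_segments_in_ram(struct kimage *kimage)
{
	unsigned long i;

	for (i = 0; i < kimage->nr_segments; i++) {
		phys_addr_t mstart = kimage->segment[i].mem;
		size_t msize = kimage->segment[i].memsz;

		/* Every segment must sit entirely inside System RAM. */
		if (region_intersects(mstart, msize, IORESOURCE_SYSTEM_RAM,
				      IORES_DESC_NONE) != REGION_INTERSECTS)
			return -EADDRNOTAVAIL;
	}

	return 0;
}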

>
> The only way to do this properly is to copy the linear mapping. The arch code has lots of
> complex code to generate it correctly at boot, we do not want to duplicate it.
> (this is why hibernate copies the linear mapping)

As I understand it, you would like to take a copy of the idmap page
table, and add entries into the new page table only for the segment
sources and destinations?

If so, there is a slight problem: the arch hook machine_kexec_prepare()
is called prior to loading the segments from userland. We can solve this
by adding another hook that is called after kimage_terminate().
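
For example (entirely hypothetical; machine_kexec_post_load() is a name
I am inventing here):

/* kernel/kexec_core.c: weak default that an arch can override. */
int __weak machine_kexec_post_load(struct kimage *image)
{
	return 0;
}

do_kexec_load() would call it right after kimage_terminate(), and the
arm64 override could then walk the loaded segments and build the page
table.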

> These patches do not remove the running page tables from TTBR1. As you overwrite the live
> page tables you will corrupt the state of the CPU. The page-table walker may access things
> that aren't memory, cache memory that shouldn't be cached (see above), and allocate
> conflicting entries in the TLB.

Indeed. However, I was following what is done in create_safe_exec_page():
https://soleen.com/source/xref/linux/arch/arm64/kernel/hibernate.c?r=af873fce#263

ttbr1 is not removed there. Am I missing something, or is it not yet
configured there?

I will set ttbr1 to the zero page.

> You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
> helpers have a compile-time VA_BITS, and we support systems where there is no memory below
> 1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
> the idmap will have more page table levels than the page table helpers can build. This is
> why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
> You already need to copy the linear-map, so using an idmap is extra work. You want to work
> with linear-map addresses, you probably need to add the field to the appropriate structure.

OK, makes sense. I will do it the way hibernate sets up this table. I
was indeed following x86, hoping that it would eventually be possible to
unify the kasan, hibernate, and kexec implementations of this page table.

>
> The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
> as there is only one TTBR on v8.0, and you'd need to setup EL2 as its been torn back to
> the hyp-stub.

As I understand it, on bare metal kexec normally runs at EL1, not EL2.
On my machine, is_kernel_in_hyp_mode() == false in cpu_soft_restart().

> This is the reason hibernate parks EL2 in a holding pen while it rewrites
> all of memory, then calls back to fix up EL2. Keeping the rewrite phase at EL1 means it
> doesn't need independent tweaking/testing. You need to do something similar, either
> calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

OK, I will study how hibernate does this. I was thinking that if we are
running at EL2, we can simply configure TTBR0_EL2 instead of TTBR0_EL1.
But I need to understand this better.

>
> You will need to alter the relocation code to do nothing for kdump, as no relocation is
> required and building page-tables is extra work where the kernel may croak, preventing us
> from reaching kdump.

Yes, I was planning to do nothing for kdump, which involves not
allocating a page table. That is not part of the current patchset, as
the series is not ready yet.

>
> Finally, having this independent idmap machinery isn't desirable from a maintenance
> perspective. Please start with the hibernate code that already solves a very similar
> problem, as it already has most of these problems covered.

OK.

> > This patch series works, in the sense that I can kexec-reboot both in QEMU
>
> I wouldn't expect QEMU's emulation of the MMU and caches to be performance-accurate.

I am not measuring performance in QEMU; I use it for development and
verification only. The performance is measured on real hardware.

>
> > and on a physical machine. However, I do not see a performance improvement
> > during relocation: it is just as slow as it was with the caches disabled.
>
> > Am I missing something? Perhaps there is some flag that I should also
> > enable in the page table? Please send me any suggestions.
>
> Some information about the physical machine you tested this on would help.
> I'm guessing it's v8.0, and booted at EL2....

I am using Broadcom's Stingray SoC. Because is_kernel_in_hyp_mode()
returns false, I believe it is running at EL1. How can I boot it at EL2?

So, I am still confused why I do not see performance improvements
during relocation on this machine. Any theories?

Thank you,
Pasha


* Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
  2019-07-17 19:13   ` Pavel Tatashin
@ 2019-07-26 14:00     ` James Morse
  0 siblings, 0 replies; 10+ messages in thread
From: James Morse @ 2019-07-26 14:00 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Sasha Levin, Jonathan Corbet, Catalin Marinas,
	Linux Doc Mailing List, kexec mailing list, LKML, James Morris,
	Eric W. Biederman, will, Linux ARM

Hi Pavel,

On 17/07/2019 20:13, Pavel Tatashin wrote:
>> After a quick skim:
>>
>> This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
>> These regions were described by firmware as having content that was/is written with
>> different attributes. The attributes must match whenever it is mapped, otherwise we have a
>> loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
>> cache whenever it likes.
> 
>> It may be important that we do not ever map some of these regions, even though its
>> described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
>> its own use; it is made secure-only, and any access causes an
>> external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
>> the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
>> 'nomap' attribute.
>>
>> Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
>> trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
>> size causing more memory to become cacheable than was intended.
>> Linux's EFI support rounds the UEFI memory map to the largest supported page size (and
>> winges about firmware bugs).
>> If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
>> that is a bug we should fix.
> 
> We are allowing this. If you consider it to be a bug, I will fix it,
> and that will actually simplify the idmap page table: the user will
> receive an error during kexec load if a request is made to load into a
> !IORESOURCE_SYSTEM_RAM region.

I consider this a bug, but we can see what others think.
This suggests kexec-tools can open /proc/iomem, find a likely looking gap, and try to load
the new kernel between two platform devices.


>> The only way to do this properly is to copy the linear mapping. The arch code has lots of
>> complex code to generate it correctly at boot, we do not want to duplicate it.
>> (this is why hibernate copies the linear mapping)
> 
> As I understand it, you would like to take a copy of the idmap page
> table, and add entries into the new page table only for the segment
> sources and destinations?

I don't think there is a need to idmap memory at all. We should copy the linear map so you
know you won't overwrite its page tables as part of loading the new kernel.


> If so, there is a slight problem: the arch hook machine_kexec_prepare()
> is called prior to loading the segments from userland. We can solve this
> by adding another hook that is called after kimage_terminate().

Yes, all this would need doing as machine_kexec() runs. We preferably need to allocate
memory in this path, or at least have a bitmap of what we can/can't overwrite.


>> These patches do not remove the running page tables from TTBR1. As you overwrite the live
>> page tables you will corrupt the state of the CPU. The page-table walker may access things
>> that aren't memory, cache memory that shouldn't be cached (see above), and allocate
>> conflicting entries in the TLB.
> 
> Indeed. However, I was following what is done in create_safe_exec_page():
> https://soleen.com/source/xref/linux/arch/arm64/kernel/hibernate.c?r=af873fce#263
> 
> ttbr1 is not removed there. Am I missing something, or is it not yet
> configured there?

Hibernate maps a single executable page in ttbr0_el1 that holds its relocation code.
The relocation code then switches ttbr1_el1 to point to the copy of the linear map. See
the 'break_before_make_ttbr_switch' macro in swsusp_arch_suspend_exit().
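
In C-flavoured pseudo-code the sequence is roughly (a sketch only; the
real thing is the assembly macro, and new_pg_dir here stands for the
copied linear map):

/* Break-before-make: never have two live translations for one VA. */
write_sysreg(phys_to_ttbr(virt_to_phys(empty_zero_page)), ttbr1_el1);
isb();			/* zero page live: nothing is mapped */
__tlbi(vmalle1);	/* discard the stale EL1 TLB entries */
dsb(nsh);
write_sysreg(phys_to_ttbr(virt_to_phys(new_pg_dir)), ttbr1_el1);
isb();			/* the copied linear map is now live */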


> I will set ttbr1 to the zero page.
> 
>> You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
>> helpers have a compile-time VA_BITS, and we support systems where there is no memory below
>> 1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
>> the idmap will have more page table levels than the page table helpers can build. This is
>> why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
>> You already need to copy the linear-map, so using an idmap is extra work. You want to work
>> with linear-map addresses, you probably need to add the field to the appropriate structure.
> 
> OK, makes sense. I will do it the way hibernate sets up this table. I
> was indeed following x86, hoping that it would eventually be possible to
> unify the kasan, hibernate, and kexec implementations of this page table.

Our kasan and hibernate code already went a different way. I doubt we can bring them back
in to look like x86; they have different problems to solve.


>> The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
>> as there is only one TTBR on v8.0, and you'd need to set up EL2 as it's been torn back to
>> the hyp-stub.
> 
> As I understand it, on bare metal kexec normally runs at EL1, not EL2.
> On my machine, is_kernel_in_hyp_mode() == false in cpu_soft_restart().

and is_hyp_mode_available() ?


This depends on which exception level your bootloader started Linux at. You should get a
boot message that tells you:
| CPU: All CPU(s) started at EL2

is_kernel_in_hyp_mode() is for determining if the kernel is running at EL2. This is the
case if you get a message like:
| kvm [1]: VHE mode initialized successfully
VHE is a v8.1 feature that repaints the system-registers so a kernel written to run at EL1
can run almost unmodified at EL2.

We also have is_hyp_mode_available(), which will return true if all the CPUs booted at EL2.
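
Roughly (again quoting from memory):

static inline bool is_hyp_mode_available(void)
{
	/* True only if the boot CPU and all secondaries entered at EL2. */
	return (__boot_cpu_mode[0] == BOOT_CPU_MODE_EL2 &&
		__boot_cpu_mode[1] == BOOT_CPU_MODE_EL2);
}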


kexec either runs its relocation code at EL1, or at EL2 if that is where the first kernel
booted. If you call to EL2, the MMU was already off, as KVM will reset EL2 to the hyp-stub
in response to the reboot notifier's hardware_disable() call.

(kvm_arch_hardware_disable() calls cpu_hyp_reset())


>> This is the reason hibernate parks EL2 in a holding pen while it rewrites
>> all of memory, then calls back to fix up EL2. Keeping the rewrite phase at EL1 means it
>> doesn't need independent tweaking/testing. You need to do something similar, either
>> calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

> OK, I will study how hibernate does this. I was thinking that if we are
> running at EL2, we can simply configure TTBR0_EL2 instead of TTBR0_EL1.
> But I need to understand this better.

Yes, if you've got a VHE system it can skip the jumping around Exception-levels, but we
still need to support the non-VHE systems.


>>> This patch series works, in the sense that I can kexec-reboot both in QEMU
>>> and on a physical machine. However, I do not see a performance improvement
>>> during relocation: it is just as slow as it was with the caches disabled.
>>
>>> Am I missing something? Perhaps there is some flag that I should also
>>> enable in the page table? Please send me any suggestions.
>>
>> Some information about the physical machine you tested this on would help.
>> I'm guessing it's v8.0, and booted at EL2....

> I am using Broadcom's Stingray SoC.

The first hit on Google is [0]. Assuming it's the same SoC, that page says it is
Cortex-A72, which is a v8.0 part; it doesn't have VHE. The kernel will be running at EL1;
if it supports KVM, it must have booted at EL2.


> Because is_kernel_in_hyp_mode() returns false, I believe it is running
> at EL1. How can I boot it at EL2?

This check is for VHE.
|static inline bool is_kernel_in_hyp_mode(void)
| {
|	return read_sysreg(CurrentEL) == CurrentEL_EL2;
| }

The kernel's early startup in head.S detects whether it is running at EL2 or EL1. If it
has VHE, it enables the feature and stays at EL2. Otherwise it installs the 'hyp-stub' and
drops to EL1.

On v8.0 the kernel has to run at EL1 because we need the two ttbr registers
(kernel/user-space); EL2 on these parts only has one. v8.1's VHE adds a second ttbr to
EL2, meaning we can run Linux at EL2 and KVM avoids jumping between exception levels.


You can check for that:
| CPU: All CPU(s) started at EL2
message. If your bootloader is starting the OS at EL1, you will need to speak to the
firmware folk about getting that changed.


> So, I am still confused why I do not see performance improvements
> during relocation on this machine. Any theories?

I assume you started at EL2. You moved EL1's mmu-off code, but cpu_soft_restart() will
call the relocation code at EL2 where the MMU is already off. Because you expected to be
using an idmap, nothing goes wrong, but nothing has changed.


Thanks,

James

[0] https://www.broadcom.com/products/storage/ethernet-storage-adapters-ics/ps1100r

