* [RFC v2 0/8] arm64: MMU enabled kexec relocation
@ 2019-07-31 15:38 Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 1/8] kexec: quiet down kexec reboot Pavel Tatashin
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Changelog from previous RFC:
- Added trans_table support for both hibernate and kexec.
- Fixed a performance issue where enabling the MMU did not yield the
  expected performance improvement.

Bug:
In its current state, this patch series works on kernels booted in EL1
mode, but for some reason, when elevated to EL2 mode, the reboot freezes
in both QEMU and on real hardware.

The freeze happens in:

arch/arm64/kernel/relocate_kernel.S
	turn_on_mmu()

Right after sctlr_el2 is written (i.e. the MMU at EL2 is enabled):

	msr     sctlr_el2, \tmp1

I've been studying all the relevant control registers for EL2, but do not
see what might be causing this hang:

MAIR_EL2 is set to exactly the same value as MAIR_EL1: 0xbbff440c0400

TCR_EL2        0x80843510
Enabled bits:
PS      Physical Address Size (0b100: 44 bits, 16TB)
SH0     Shareability (0b11: Inner Shareable)
ORGN0   Normal memory, Outer Write-Back Read-Allocate Write-Allocate Cacheable
IRGN0   Normal memory, Inner Write-Back Read-Allocate Write-Allocate Cacheable
T0SZ    0b010000 (16, i.e. a 48-bit region)

SCTLR_EL2	0x30e5183f
RES1    : Reserved bits, set to one
M       : MMU enable
A       : Alignment check enable
C       : Data cacheability control
SA      : SP alignment check enable
IESB    : Implicit Error Synchronization event enable
I       : Instruction access cacheability control

TTBR0_EL2      0x1b3069000 (address of trans_table)

Any suggestions about what else might be missing that causes this freeze
when the MMU is enabled at EL2?
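
For reference, here is the sequence turn_on_mmu() performs, written as a
C-style sketch (the write_sysreg()/isb() calls stand in for the actual
msr/isb instructions; the values are the ones dumped above):

	/* Sketch only -- mirrors the assembly in turn_on_mmu(). */
	write_sysreg(0xbbff440c0400UL, mair_el2); /* same value as MAIR_EL1 */
	write_sysreg(0x80843510UL, tcr_el2);      /* PS=44 bits, SH0=IS, WB/WA */
	write_sysreg(0x1b3069000UL, ttbr0_el2);   /* address of trans_table */
	isb();
	write_sysreg(0x30e5183fUL, sctlr_el2);    /* RES1|M|A|C|SA|IESB|I; hangs here */
	isb();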

=====
Here is the current data from the real hardware:
(because of the bug above, I forced EL1 mode by always setting el2_switch
to zero in cpu_soft_restart()):

For this experiment, the size of the kernel plus initramfs is 25M. If the
initramfs were larger, the improvement would be even greater, as the time
spent in relocation is proportional to the amount of memory relocated.

Previously:
kernel shutdown	0.022131328s
relocation	0.440510736s
kernel startup	0.294706768s

Relocation was taking 58.2% of the reboot time.

Now:
kernel shutdown	0.032066576s
relocation	0.022158152s
kernel startup	0.296055880s

Now: relocation takes 6.3% of the reboot time.

Total reboot time is 2.16x faster.
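
For reference, the arithmetic behind these numbers:

	before:  0.022131328 + 0.440510736 + 0.294706768 = 0.757348832s
	after:   0.032066576 + 0.022158152 + 0.296055880 = 0.350280608s
	share:   0.4405 / 0.7573 = 58.2%  ->  0.0222 / 0.3503 = 6.3%
	speedup: 0.757348832 / 0.350280608 = 2.16x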

Previous approaches and discussions
-----------------------------------
https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com
Reserve space for kexec to avoid relocation; this involved changes to
generic code to optimize a problem that exists only on arm64.

https://lore.kernel.org/lkml/20190716165641.6990-1-pasha.tatashin@soleen.com
The first attempt to enable the MMU; it had bugs that prevented the actual
performance improvement, and the page tables unnecessarily configured an
idmap for the whole physical space.

Pavel Tatashin (8):
  kexec: quiet down kexec reboot
  arm64, mm: transitional tables
  arm64: hibernate: switch to transitional page tables.
  kexec: add machine_kexec_post_load()
  arm64, kexec: move relocation function setup and clean up
  arm64, kexec: add expandable argument to relocation function
  arm64, kexec: configure transitional page table for kexec
  arm64, kexec: enable MMU during kexec relocation

 arch/arm64/Kconfig                     |   4 +
 arch/arm64/include/asm/kexec.h         |  24 ++-
 arch/arm64/include/asm/pgtable-hwdef.h |   1 +
 arch/arm64/include/asm/trans_table.h   |  66 ++++++
 arch/arm64/kernel/asm-offsets.c        |  10 +
 arch/arm64/kernel/cpu-reset.S          |   4 +-
 arch/arm64/kernel/cpu-reset.h          |   8 +-
 arch/arm64/kernel/hibernate.c          | 261 ++++++------------------
 arch/arm64/kernel/machine_kexec.c      | 168 ++++++++++++---
 arch/arm64/kernel/relocate_kernel.S    | 238 +++++++++++++++-------
 arch/arm64/mm/Makefile                 |   1 +
 arch/arm64/mm/trans_table.c            | 272 +++++++++++++++++++++++++
 kernel/kexec.c                         |   4 +
 kernel/kexec_core.c                    |   8 +-
 kernel/kexec_file.c                    |   4 +
 kernel/kexec_internal.h                |   2 +
 16 files changed, 756 insertions(+), 319 deletions(-)
 create mode 100644 arch/arm64/include/asm/trans_table.h
 create mode 100644 arch/arm64/mm/trans_table.c

-- 
2.22.0



* [RFC v2 1/8] kexec: quiet down kexec reboot
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 2/8] arm64, mm: transitional tables Pavel Tatashin
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Here is a regular kexec command sequence and output:
=====
$ kexec --reuse-cmdline -i --load Image
$ kexec -e
[  161.342002] kexec_core: Starting new kernel

Welcome to Buildroot
buildroot login:
=====

Even when the "quiet" kernel parameter is specified, "kexec_core: Starting
new kernel" is printed.

This message has KERN_EMERG level, but there is no emergency; it is a
normal kexec operation, so quiet it down to the more appropriate
KERN_NOTICE.

Machines with slow console baud rates benefit from less output.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
---
 kernel/kexec_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index d5870723b8ad..2c5b72863b7b 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1169,7 +1169,7 @@ int kernel_kexec(void)
 		 * CPU hotplug again; so re-enable it here.
 		 */
 		cpu_hotplug_enable();
-		pr_emerg("Starting new kernel\n");
+		pr_notice("Starting new kernel\n");
 		machine_shutdown();
 	}
 
-- 
2.22.0



* [RFC v2 2/8] arm64, mm: transitional tables
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 1/8] kexec: quiet down kexec reboot Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 3/8] arm64: hibernate: switch to transitional page tables Pavel Tatashin
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

There are cases where the normal kernel page tables, i.e. idmap_pg_dir
and swapper_pg_dir, are not sufficient because they may be overwritten.

This happens when we transition from one world to another: for example,
during the kexec kernel relocation transition, and also during the
hibernate kernel restore transition.

In these cases, if the MMU is needed, the page table memory must be
allocated from a safe place. Transitional tables are intended to allow
just that.
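
To illustrate the intended use of the interface, here is a minimal sketch
of a caller (the allocator below is only a placeholder; the real users
follow in the hibernate and kexec patches of this series):

	#include <linux/gfp.h>
	#include <asm/trans_table.h>

	/* Hypothetical allocator: must return exactly one page, or NULL. */
	static void *example_page_alloc(void *arg)
	{
		return (void *)get_zeroed_page((gfp_t)(unsigned long)arg);
	}

	static int example_copy_linear_map(pgd_t **trans_table)
	{
		struct trans_table_info info = {
			.trans_alloc_page	= example_page_alloc,
			.trans_alloc_arg	= (void *)GFP_ATOMIC,
			.trans_flags		= TRANS_MKWRITE,
		};

		/* Copy the kernel linear map into a safely allocated table. */
		return trans_table_create_copy(&info, trans_table,
					       pgd_offset_k(PAGE_OFFSET),
					       PAGE_OFFSET, 0);
	}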

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/Kconfig                     |   4 +
 arch/arm64/include/asm/pgtable-hwdef.h |   1 +
 arch/arm64/include/asm/trans_table.h   |  66 ++++++
 arch/arm64/mm/Makefile                 |   1 +
 arch/arm64/mm/trans_table.c            | 272 +++++++++++++++++++++++++
 5 files changed, 344 insertions(+)
 create mode 100644 arch/arm64/include/asm/trans_table.h
 create mode 100644 arch/arm64/mm/trans_table.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..91a7416ffe4e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -999,6 +999,10 @@ config CRASH_DUMP
 
 	  For more details see Documentation/admin-guide/kdump/kdump.rst
 
+config TRANS_TABLE
+	def_bool y
+	depends on HIBERNATION || KEXEC_CORE
+
 config XEN_DOM0
 	def_bool y
 	depends on XEN
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index db92950bb1a0..dcb4f13c7888 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -110,6 +110,7 @@
 #define PUD_TABLE_BIT		(_AT(pudval_t, 1) << 1)
 #define PUD_TYPE_MASK		(_AT(pudval_t, 3) << 0)
 #define PUD_TYPE_SECT		(_AT(pudval_t, 1) << 0)
+#define PUD_SECT_RDONLY		(_AT(pudval_t, 1) << 7)		/* AP[2] */
 
 /*
  * Level 2 descriptor (PMD).
diff --git a/arch/arm64/include/asm/trans_table.h b/arch/arm64/include/asm/trans_table.h
new file mode 100644
index 000000000000..4d7bd0bf36c0
--- /dev/null
+++ b/arch/arm64/include/asm/trans_table.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
+ */
+
+#ifndef _ASM_TRANS_TABLE_H
+#define _ASM_TRANS_TABLE_H
+
+#include <asm/pgtable-types.h>
+
+/*
+ * trans_alloc_page
+ *	- Allocator that should return exactly one uninitialized page; if
+ *	  this allocator fails, trans_table returns -ENOMEM.
+ *
+ * trans_alloc_arg
+ *	- Passed to trans_alloc_page as an argument
+ *
+ * trans_flags
+ *	- bitmap of flags that control how the page table is filled.
+ *	  TRANS_MKWRITE: during page table copy, make PTE, PMD, and PUD
+ *			 entries writable by removing the RDONLY flag.
+ *	  TRANS_MKVALID: during page table copy, if a PTE is present but
+ *			 not valid, make it valid.
+ *	  TRANS_CHECKPFN: during page table copy, for every PTE check that
+ *			  the PFN it points to is valid; otherwise return
+ *			  -ENXIO.
+ *	  TRANS_FORCEMAP: during page map, if a translation already exists,
+ *			  forcibly overwrite it; otherwise -ENXIO may be
+ *			  returned by trans_table_map_* functions on conflict.
+ */
+
+#define	TRANS_MKWRITE	(1 << 0)
+#define	TRANS_MKVALID	(1 << 1)
+#define	TRANS_CHECKPFN	(1 << 2)
+#define	TRANS_FORCEMAP	(1 << 3)
+
+struct trans_table_info {
+	void * (*trans_alloc_page)(void *);
+	void *trans_alloc_arg;
+	unsigned long trans_flags;
+};
+
+/* Create an empty trans table. */
+int trans_table_create_empty(struct trans_table_info *info,
+			     pgd_t **trans_table);
+
+/*
+ * Create trans table and copy entries from from_table to trans_table in range
+ * [start, end)
+ */
+int trans_table_create_copy(struct trans_table_info *info, pgd_t **trans_table,
+			    pgd_t *from_table, unsigned long start,
+			    unsigned long end);
+
+/*
+ * Add map entry to trans_table for a base-size page at PTE level.
+ * page:	page to be mapped.
+ * dst_addr:	new VA address for the page
+ * pgprot:	protection for the page.
+ */
+int trans_table_map_page(struct trans_table_info *info, pgd_t *trans_table,
+			 void *page, unsigned long dst_addr, pgprot_t pgprot);
+
+#endif /* _ASM_TRANS_TABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..3794fff18659 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -6,6 +6,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_ARM64_PTDUMP_CORE)	+= dump.o
 obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
+obj-$(CONFIG_TRANS_TABLE)	+= trans_table.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 KASAN_SANITIZE_physaddr.o	+= n
diff --git a/arch/arm64/mm/trans_table.c b/arch/arm64/mm/trans_table.c
new file mode 100644
index 000000000000..d5729eb318b7
--- /dev/null
+++ b/arch/arm64/mm/trans_table.c
@@ -0,0 +1,272 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
+ */
+
+/*
+ * Transitional tables are used when the system is transitioning from one
+ * world to another: for example, during hibernate restore and kexec
+ * reboot. During these phases one cannot rely on the regular page tables
+ * not being overwritten.
+ */
+
+#include <asm/trans_table.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+
+static void *trans_alloc(struct trans_table_info *info)
+{
+	void *page = info->trans_alloc_page(info->trans_alloc_arg);
+
+	if (page)
+		clear_page(page);
+
+	return page;
+}
+
+static int trans_table_copy_pte(struct trans_table_info *info, pte_t *dst_ptep,
+				pte_t *src_ptep, unsigned long start,
+				unsigned long end)
+{
+	unsigned long addr = start;
+	int i = pte_index(addr);
+
+	do {
+		pte_t src_pte = READ_ONCE(src_ptep[i]);
+
+		if (pte_none(src_pte))
+			continue;
+		if (info->trans_flags & TRANS_MKWRITE)
+			src_pte = pte_mkwrite(src_pte);
+		if (info->trans_flags & TRANS_MKVALID)
+			src_pte = pte_mkpresent(src_pte);
+		if (info->trans_flags & TRANS_CHECKPFN) {
+			if (!pfn_valid(pte_pfn(src_pte)))
+				return -ENXIO;
+		}
+		set_pte(&dst_ptep[i], src_pte);
+	} while (addr += PAGE_SIZE, i++, addr != end && i < PTRS_PER_PTE);
+
+	return 0;
+}
+
+static int trans_table_copy_pmd(struct trans_table_info *info, pmd_t *dst_pmdp,
+				pmd_t *src_pmdp, unsigned long start,
+				unsigned long end)
+{
+	unsigned long next;
+	unsigned long addr = start;
+	int i = pmd_index(addr);
+	int rc;
+
+	do {
+		pmd_t src_pmd = READ_ONCE(src_pmdp[i]);
+		pmd_t dst_pmd = READ_ONCE(dst_pmdp[i]);
+		pte_t *dst_ptep, *src_ptep;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(src_pmd))
+			continue;
+
+		if (!pmd_table(src_pmd)) {
+			if (info->trans_flags & TRANS_MKWRITE)
+				pmd_val(src_pmd) &= ~PMD_SECT_RDONLY;
+			set_pmd(&dst_pmdp[i], src_pmd);
+			continue;
+		}
+
+		if (pmd_none(dst_pmd)) {
+			pte_t *t = trans_alloc(info);
+
+			if (!t)
+				return -ENOMEM;
+
+			__pmd_populate(&dst_pmdp[i], __pa(t), PTE_TYPE_PAGE);
+			dst_pmd = READ_ONCE(dst_pmdp[i]);
+		}
+
+		src_ptep = __va(pmd_page_paddr(src_pmd));
+		dst_ptep = __va(pmd_page_paddr(dst_pmd));
+
+		rc = trans_table_copy_pte(info, dst_ptep, src_ptep, addr, next);
+		if (rc)
+			return rc;
+	} while (addr = next, i++, addr != end && i < PTRS_PER_PMD);
+
+	return 0;
+}
+
+static int trans_table_copy_pud(struct trans_table_info *info, pud_t *dst_pudp,
+				pud_t *src_pudp, unsigned long start,
+				unsigned long end)
+{
+	unsigned long next;
+	unsigned long addr = start;
+	int i = pud_index(addr);
+	int rc;
+
+	do {
+		pud_t src_pud = READ_ONCE(src_pudp[i]);
+		pud_t dst_pud = READ_ONCE(dst_pudp[i]);
+		pmd_t *dst_pmdp, *src_pmdp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(src_pud))
+			continue;
+
+		if (!pud_table(src_pud)) {
+			if (info->trans_flags & TRANS_MKWRITE)
+				pud_val(src_pud) &= ~PUD_SECT_RDONLY;
+			set_pud(&dst_pudp[i], src_pud);
+			continue;
+		}
+
+		if (pud_none(dst_pud)) {
+			pmd_t *t = trans_alloc(info);
+
+			if (!t)
+				return -ENOMEM;
+
+			__pud_populate(&dst_pudp[i], __pa(t), PMD_TYPE_TABLE);
+			dst_pud = READ_ONCE(dst_pudp[i]);
+		}
+
+		src_pmdp = __va(pud_page_paddr(src_pud));
+		dst_pmdp = __va(pud_page_paddr(dst_pud));
+
+		rc = trans_table_copy_pmd(info, dst_pmdp, src_pmdp, addr, next);
+		if (rc)
+			return rc;
+	} while (addr = next, i++, addr != end && i < PTRS_PER_PUD);
+
+	return 0;
+}
+
+static int trans_table_copy_pgd(struct trans_table_info *info, pgd_t *dst_pgdp,
+				pgd_t *src_pgdp, unsigned long start,
+				unsigned long end)
+{
+	unsigned long next;
+	unsigned long addr = start;
+	int i = pgd_index(addr);
+	int rc;
+
+	do {
+		pgd_t src_pgd;
+		pgd_t dst_pgd;
+		pud_t *dst_pudp, *src_pudp;
+
+		src_pgd = READ_ONCE(src_pgdp[i]);
+		dst_pgd = READ_ONCE(dst_pgdp[i]);
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(src_pgd))
+			continue;
+
+		if (pgd_none(dst_pgd)) {
+			pud_t *t = trans_alloc(info);
+
+			if (!t)
+				return -ENOMEM;
+
+			__pgd_populate(&dst_pgdp[i], __pa(t), PUD_TYPE_TABLE);
+			dst_pgd = READ_ONCE(dst_pgdp[i]);
+		}
+
+		src_pudp = __va(pgd_page_paddr(src_pgd));
+		dst_pudp = __va(pgd_page_paddr(dst_pgd));
+
+		rc = trans_table_copy_pud(info, dst_pudp, src_pudp, addr, next);
+		if (rc)
+			return rc;
+	} while (addr = next, i++, addr != end && i < PTRS_PER_PGD);
+
+	return 0;
+}
+
+int trans_table_create_empty(struct trans_table_info *info, pgd_t **trans_table)
+{
+	pgd_t *dst_pgdp = trans_alloc(info);
+
+	if (!dst_pgdp)
+		return -ENOMEM;
+
+	*trans_table = dst_pgdp;
+
+	return 0;
+}
+
+int trans_table_create_copy(struct trans_table_info *info, pgd_t **trans_table,
+			    pgd_t *from_table, unsigned long start,
+			    unsigned long end)
+{
+	int rc;
+
+	rc = trans_table_create_empty(info, trans_table);
+	if (rc)
+		return rc;
+
+	return trans_table_copy_pgd(info, *trans_table, from_table, start, end);
+}
+
+int trans_table_map_page(struct trans_table_info *info, pgd_t *trans_table,
+			 void *page, unsigned long dst_addr, pgprot_t pgprot)
+{
+	int pgd_idx = pgd_index(dst_addr);
+	int pud_idx = pud_index(dst_addr);
+	int pmd_idx = pmd_index(dst_addr);
+	int pte_idx = pte_index(dst_addr);
+	pgd_t *pgdp = trans_table;
+	pgd_t pgd = READ_ONCE(pgdp[pgd_idx]);
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+
+	if (pgd_none(pgd)) {
+		pud_t *t = trans_alloc(info);
+
+		if (!t)
+			return -ENOMEM;
+
+		__pgd_populate(&pgdp[pgd_idx], __pa(t), PUD_TYPE_TABLE);
+		pgd = READ_ONCE(pgdp[pgd_idx]);
+	}
+
+	pudp = __va(pgd_page_paddr(pgd));
+	pud = READ_ONCE(pudp[pud_idx]);
+	if (pud_sect(pud) && !(info->trans_flags & TRANS_FORCEMAP)) {
+		return -ENXIO;
+	} else if (pud_none(pud) || pud_sect(pud)) {
+		pmd_t *t = trans_alloc(info);
+
+		if (!t)
+			return -ENOMEM;
+
+		__pud_populate(&pudp[pud_idx], __pa(t), PMD_TYPE_TABLE);
+		pud = READ_ONCE(pudp[pud_idx]);
+	}
+
+	pmdp = __va(pud_page_paddr(pud));
+	pmd = READ_ONCE(pmdp[pmd_idx]);
+	if (pmd_sect(pmd) && !(info->trans_flags & TRANS_FORCEMAP)) {
+		return -ENXIO;
+	} else if (pmd_none(pmd) || pmd_sect(pmd)) {
+		pte_t *t = trans_alloc(info);
+
+		if (!t)
+			return -ENOMEM;
+
+		__pmd_populate(&pmdp[pmd_idx], __pa(t), PTE_TYPE_PAGE);
+		pmd = READ_ONCE(pmdp[pmd_idx]);
+	}
+
+	ptep = __va(pmd_page_paddr(pmd));
+	pte = READ_ONCE(ptep[pte_idx]);
+
+	if (!pte_none(pte) && !(info->trans_flags & TRANS_FORCEMAP))
+		return -ENXIO;
+
+	set_pte(&ptep[pte_idx], pfn_pte(virt_to_pfn(page), pgprot));
+
+	return 0;
+}
-- 
2.22.0



* [RFC v2 3/8] arm64: hibernate: switch to transitional page tables.
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 1/8] kexec: quiet down kexec reboot Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 2/8] arm64, mm: transitional tables Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 4/8] kexec: add machine_kexec_post_load() Pavel Tatashin
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Transitional page tables provide the functionality needed to set up the
temporary page tables used during hibernate resume.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/kernel/hibernate.c | 261 ++++++++--------------------------
 1 file changed, 60 insertions(+), 201 deletions(-)

diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 9341fcc6e809..4120b03a02fd 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -16,7 +16,6 @@
 #define pr_fmt(x) "hibernate: " x
 #include <linux/cpu.h>
 #include <linux/kvm_host.h>
-#include <linux/mm.h>
 #include <linux/pm.h>
 #include <linux/sched.h>
 #include <linux/suspend.h>
@@ -31,14 +30,12 @@
 #include <asm/kexec.h>
 #include <asm/memory.h>
 #include <asm/mmu_context.h>
-#include <asm/pgalloc.h>
-#include <asm/pgtable.h>
-#include <asm/pgtable-hwdef.h>
 #include <asm/sections.h>
 #include <asm/smp.h>
 #include <asm/smp_plat.h>
 #include <asm/suspend.h>
 #include <asm/sysreg.h>
+#include <asm/trans_table.h>
 #include <asm/virt.h>
 
 /*
@@ -182,6 +179,12 @@ int arch_hibernation_header_restore(void *addr)
 }
 EXPORT_SYMBOL(arch_hibernation_header_restore);
 
+static void *
+hibernate_page_alloc(void *arg)
+{
+	return (void *)get_safe_page((gfp_t)(unsigned long)arg);
+}
+
 /*
  * Copies length bytes, starting at src_start into an new page,
  * perform cache maintentance, then maps it at the specified address low
@@ -196,57 +199,31 @@ EXPORT_SYMBOL(arch_hibernation_header_restore);
  */
 static int create_safe_exec_page(void *src_start, size_t length,
 				 unsigned long dst_addr,
-				 phys_addr_t *phys_dst_addr,
-				 void *(*allocator)(gfp_t mask),
-				 gfp_t mask)
+				 phys_addr_t *phys_dst_addr)
 {
-	int rc = 0;
-	pgd_t *pgdp;
-	pud_t *pudp;
-	pmd_t *pmdp;
-	pte_t *ptep;
-	unsigned long dst = (unsigned long)allocator(mask);
-
-	if (!dst) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	memcpy((void *)dst, src_start, length);
-	__flush_icache_range(dst, dst + length);
+	struct trans_table_info trans_info = {
+		.trans_alloc_page	= hibernate_page_alloc,
+		.trans_alloc_arg	= (void *)GFP_ATOMIC,
+		.trans_flags		= 0,
+	};
+	void *page = (void *)get_safe_page(GFP_ATOMIC);
+	pgd_t *trans_table;
+	int rc;
+
+	if (!page)
+		return -ENOMEM;
 
-	pgdp = pgd_offset_raw(allocator(mask), dst_addr);
-	if (pgd_none(READ_ONCE(*pgdp))) {
-		pudp = allocator(mask);
-		if (!pudp) {
-			rc = -ENOMEM;
-			goto out;
-		}
-		pgd_populate(&init_mm, pgdp, pudp);
-	}
+	memcpy(page, src_start, length);
+	__flush_icache_range((unsigned long)page, (unsigned long)page + length);
 
-	pudp = pud_offset(pgdp, dst_addr);
-	if (pud_none(READ_ONCE(*pudp))) {
-		pmdp = allocator(mask);
-		if (!pmdp) {
-			rc = -ENOMEM;
-			goto out;
-		}
-		pud_populate(&init_mm, pudp, pmdp);
-	}
-
-	pmdp = pmd_offset(pudp, dst_addr);
-	if (pmd_none(READ_ONCE(*pmdp))) {
-		ptep = allocator(mask);
-		if (!ptep) {
-			rc = -ENOMEM;
-			goto out;
-		}
-		pmd_populate_kernel(&init_mm, pmdp, ptep);
-	}
+	rc = trans_table_create_empty(&trans_info, &trans_table);
+	if (rc)
+		return rc;
 
-	ptep = pte_offset_kernel(pmdp, dst_addr);
-	set_pte(ptep, pfn_pte(virt_to_pfn(dst), PAGE_KERNEL_EXEC));
+	rc = trans_table_map_page(&trans_info, trans_table, page, dst_addr,
+				  PAGE_KERNEL_EXEC);
+	if (rc)
+		return rc;
 
 	/*
 	 * Load our new page tables. A strict BBM approach requires that we
@@ -262,13 +239,12 @@ static int create_safe_exec_page(void *src_start, size_t length,
 	 */
 	cpu_set_reserved_ttbr0();
 	local_flush_tlb_all();
-	write_sysreg(phys_to_ttbr(virt_to_phys(pgdp)), ttbr0_el1);
+	write_sysreg(phys_to_ttbr(virt_to_phys(trans_table)), ttbr0_el1);
 	isb();
 
-	*phys_dst_addr = virt_to_phys((void *)dst);
+	*phys_dst_addr = virt_to_phys(page);
 
-out:
-	return rc;
+	return 0;
 }
 
 #define dcache_clean_range(start, end)	__flush_dcache_area(start, (end - start))
@@ -332,143 +308,6 @@ int swsusp_arch_suspend(void)
 	return ret;
 }
 
-static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
-{
-	pte_t pte = READ_ONCE(*src_ptep);
-
-	if (pte_valid(pte)) {
-		/*
-		 * Resume will overwrite areas that may be marked
-		 * read only (code, rodata). Clear the RDONLY bit from
-		 * the temporary mappings we use during restore.
-		 */
-		set_pte(dst_ptep, pte_mkwrite(pte));
-	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
-		/*
-		 * debug_pagealloc will removed the PTE_VALID bit if
-		 * the page isn't in use by the resume kernel. It may have
-		 * been in use by the original kernel, in which case we need
-		 * to put it back in our copy to do the restore.
-		 *
-		 * Before marking this entry valid, check the pfn should
-		 * be mapped.
-		 */
-		BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
-	}
-}
-
-static int copy_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long start,
-		    unsigned long end)
-{
-	pte_t *src_ptep;
-	pte_t *dst_ptep;
-	unsigned long addr = start;
-
-	dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
-	if (!dst_ptep)
-		return -ENOMEM;
-	pmd_populate_kernel(&init_mm, dst_pmdp, dst_ptep);
-	dst_ptep = pte_offset_kernel(dst_pmdp, start);
-
-	src_ptep = pte_offset_kernel(src_pmdp, start);
-	do {
-		_copy_pte(dst_ptep, src_ptep, addr);
-	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
-
-	return 0;
-}
-
-static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
-		    unsigned long end)
-{
-	pmd_t *src_pmdp;
-	pmd_t *dst_pmdp;
-	unsigned long next;
-	unsigned long addr = start;
-
-	if (pud_none(READ_ONCE(*dst_pudp))) {
-		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
-		if (!dst_pmdp)
-			return -ENOMEM;
-		pud_populate(&init_mm, dst_pudp, dst_pmdp);
-	}
-	dst_pmdp = pmd_offset(dst_pudp, start);
-
-	src_pmdp = pmd_offset(src_pudp, start);
-	do {
-		pmd_t pmd = READ_ONCE(*src_pmdp);
-
-		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
-			continue;
-		if (pmd_table(pmd)) {
-			if (copy_pte(dst_pmdp, src_pmdp, addr, next))
-				return -ENOMEM;
-		} else {
-			set_pmd(dst_pmdp,
-				__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
-		}
-	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
-
-	return 0;
-}
-
-static int copy_pud(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
-		    unsigned long end)
-{
-	pud_t *dst_pudp;
-	pud_t *src_pudp;
-	unsigned long next;
-	unsigned long addr = start;
-
-	if (pgd_none(READ_ONCE(*dst_pgdp))) {
-		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
-		if (!dst_pudp)
-			return -ENOMEM;
-		pgd_populate(&init_mm, dst_pgdp, dst_pudp);
-	}
-	dst_pudp = pud_offset(dst_pgdp, start);
-
-	src_pudp = pud_offset(src_pgdp, start);
-	do {
-		pud_t pud = READ_ONCE(*src_pudp);
-
-		next = pud_addr_end(addr, end);
-		if (pud_none(pud))
-			continue;
-		if (pud_table(pud)) {
-			if (copy_pmd(dst_pudp, src_pudp, addr, next))
-				return -ENOMEM;
-		} else {
-			set_pud(dst_pudp,
-				__pud(pud_val(pud) & ~PMD_SECT_RDONLY));
-		}
-	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
-
-	return 0;
-}
-
-static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
-			    unsigned long end)
-{
-	unsigned long next;
-	unsigned long addr = start;
-	pgd_t *src_pgdp = pgd_offset_k(start);
-
-	dst_pgdp = pgd_offset_raw(dst_pgdp, start);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(READ_ONCE(*src_pgdp)))
-			continue;
-		if (copy_pud(dst_pgdp, src_pgdp, addr, next))
-			return -ENOMEM;
-	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
-
-	return 0;
-}
-
 /*
  * Setup then Resume from the hibernate image using swsusp_arch_suspend_exit().
  *
@@ -484,21 +323,42 @@ int swsusp_arch_resume(void)
 	phys_addr_t phys_hibernate_exit;
 	void __noreturn (*hibernate_exit)(phys_addr_t, phys_addr_t, void *,
 					  void *, phys_addr_t, phys_addr_t);
+	struct trans_table_info trans_info = {
+		.trans_alloc_page	= hibernate_page_alloc,
+		.trans_alloc_arg	= (void *)GFP_ATOMIC,
+		/*
+		 * Resume will overwrite areas that may be marked read only
+		 * (code, rodata). Clear the RDONLY bit from the temporary
+		 * mappings we use during restore.
+		 */
+		.trans_flags		= TRANS_MKWRITE,
+	};
+
+	/*
+	 * debug_pagealloc will remove the PTE_VALID bit if the page isn't in
+	 * use by the resume kernel. It may have been in use by the original
+	 * kernel, in which case we need to put it back in our copy to do the
+	 * restore.
+	 *
+	 * Before marking this entry valid, check that the pfn is valid.
+	 */
+	if (debug_pagealloc_enabled())
+		trans_info.trans_flags |= (TRANS_MKVALID | TRANS_CHECKPFN);
 
 	/*
 	 * Restoring the memory image will overwrite the ttbr1 page tables.
 	 * Create a second copy of just the linear map, and use this when
 	 * restoring.
 	 */
-	tmp_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
-	if (!tmp_pg_dir) {
-		pr_err("Failed to allocate memory for temporary page tables.\n");
-		rc = -ENOMEM;
+	rc = trans_table_create_copy(&trans_info, &tmp_pg_dir,
+				     pgd_offset_k(PAGE_OFFSET), PAGE_OFFSET, 0);
+	if (rc) {
+		if (rc == -ENOMEM)
+			pr_err("Failed to allocate memory for temporary page tables.\n");
+		else if (rc == -ENXIO)
+			pr_err("Tried to set PTE for PFN that does not exist\n");
 		goto out;
 	}
-	rc = copy_page_tables(tmp_pg_dir, PAGE_OFFSET, 0);
-	if (rc)
-		goto out;
 
 	/*
 	 * We need a zero page that is zero before & after resume in order to
@@ -523,8 +383,7 @@ int swsusp_arch_resume(void)
 	 */
 	rc = create_safe_exec_page(__hibernate_exit_text_start, exit_size,
 				   (unsigned long)hibernate_exit,
-				   &phys_hibernate_exit,
-				   (void *)get_safe_page, GFP_ATOMIC);
+				   &phys_hibernate_exit);
 	if (rc) {
 		pr_err("Failed to create safe executable page for hibernate_exit code.\n");
 		goto out;
-- 
2.22.0



* [RFC v2 4/8] kexec: add machine_kexec_post_load()
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (2 preceding siblings ...)
  2019-07-31 15:38 ` [RFC v2 3/8] arm64: hibernate: switch to transitional page tables Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 5/8] arm64, kexec: move relocation function setup and clean up Pavel Tatashin
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

It is the same as machine_kexec_prepare(), but is called after segments
are loaded. This way, an architecture can do processing work with the
relocation segments already in place. One such example is arm64: it has to
have the segments loaded in order to create a page table, but it cannot do
so at kexec reboot time, because by then memory allocations are no longer
possible.
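
For illustration, an architecture-side override would look roughly like
this (a sketch only; arm64's actual implementation is added later in this
series):

	int machine_kexec_post_load(struct kimage *image)
	{
		/*
		 * All segments are loaded at this point, so work that
		 * depends on their final layout (building page tables,
		 * copying the relocation code, ...) can be done here,
		 * while memory allocations are still possible.
		 */
		return 0;
	}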

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 kernel/kexec.c          | 4 ++++
 kernel/kexec_core.c     | 6 ++++++
 kernel/kexec_file.c     | 4 ++++
 kernel/kexec_internal.h | 2 ++
 4 files changed, 16 insertions(+)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1b018f1a6e0d..27b71dc7b35a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -159,6 +159,10 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 
 	kimage_terminate(image);
 
+	ret = machine_kexec_post_load(image);
+	if (ret)
+		goto out;
+
 	/* Install the new kernel and uninstall the old */
 	image = xchg(dest_image, image);
 
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 2c5b72863b7b..8360645d1bbe 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -587,6 +587,12 @@ static void kimage_free_extra_pages(struct kimage *image)
 	kimage_free_page_list(&image->unusable_pages);
 
 }
+
+int __weak machine_kexec_post_load(struct kimage *image)
+{
+	return 0;
+}
+
 void kimage_terminate(struct kimage *image)
 {
 	if (*image->entry != 0)
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index b8cc032d5620..cb531d768114 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -391,6 +391,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 
 	kimage_terminate(image);
 
+	ret = machine_kexec_post_load(image);
+	if (ret)
+		goto out;
+
 	/*
 	 * Free up any temporary buffers allocated which are not needed
 	 * after image has been loaded
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 48aaf2ac0d0d..39d30ccf8d87 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -13,6 +13,8 @@ void kimage_terminate(struct kimage *image);
 int kimage_is_destination_range(struct kimage *image,
 				unsigned long start, unsigned long end);
 
+int machine_kexec_post_load(struct kimage *image);
+
 extern struct mutex kexec_mutex;
 
 #ifdef CONFIG_KEXEC_FILE
-- 
2.22.0



* [RFC v2 5/8] arm64, kexec: move relocation function setup and clean up
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (3 preceding siblings ...)
  2019-07-31 15:38 ` [RFC v2 4/8] kexec: add machine_kexec_post_load() Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 6/8] arm64, kexec: add expandable argument to relocation function Pavel Tatashin
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Currently, the kernel relocation function is configured in
machine_kexec() at the time of kexec reboot, using control_code_page.

This operation, however, is more logically done during kexec load, which
removes it from reboot time. Move the setup of this function to the newly
added machine_kexec_post_load().

In addition, do some cleanup: add info about the relocation function to
kexec_image_info(), and remove extra messages from machine_kexec().

Make dtb_mem always available; if CONFIG_KEXEC_FILE is not configured,
dtb_mem is set to zero anyway.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/include/asm/kexec.h    |  3 +-
 arch/arm64/kernel/machine_kexec.c | 47 +++++++++++--------------------
 2 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 12a561a54128..d15ca1ca1e83 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,14 +90,15 @@ static inline void crash_prepare_suspend(void) {}
 static inline void crash_post_resume(void) {}
 #endif
 
-#ifdef CONFIG_KEXEC_FILE
 #define ARCH_HAS_KIMAGE_ARCH
 
 struct kimage_arch {
 	void *dtb;
 	unsigned long dtb_mem;
+	unsigned long kern_reloc;
 };
 
+#ifdef CONFIG_KEXEC_FILE
 extern const struct kexec_file_ops kexec_image_ops;
 
 struct kimage;
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 0df8493624e0..596c9b9657be 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -42,6 +42,7 @@ static void _kexec_image_info(const char *func, int line,
 	pr_debug("    start:       %lx\n", kimage->start);
 	pr_debug("    head:        %lx\n", kimage->head);
 	pr_debug("    nr_segments: %lu\n", kimage->nr_segments);
+	pr_debug("    kern_reloc: %pa\n", &kimage->arch.kern_reloc);
 
 	for (i = 0; i < kimage->nr_segments; i++) {
 		pr_debug("      segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu pages\n",
@@ -58,6 +59,19 @@ void machine_kexec_cleanup(struct kimage *kimage)
 	/* Empty routine needed to avoid build errors. */
 }
 
+int machine_kexec_post_load(struct kimage *kimage)
+{
+	unsigned long kern_reloc;
+
+	kern_reloc = page_to_phys(kimage->control_code_page);
+	memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
+	       arm64_relocate_new_kernel_size);
+	kimage->arch.kern_reloc = kern_reloc;
+
+	kexec_image_info(kimage);
+	return 0;
+}
+
 /**
  * machine_kexec_prepare - Prepare for a kexec reboot.
  *
@@ -67,8 +81,6 @@ void machine_kexec_cleanup(struct kimage *kimage)
  */
 int machine_kexec_prepare(struct kimage *kimage)
 {
-	kexec_image_info(kimage);
-
 	if (kimage->type != KEXEC_TYPE_CRASH && cpus_are_stuck_in_kernel()) {
 		pr_err("Can't kexec: CPUs are stuck in the kernel.\n");
 		return -EBUSY;
@@ -143,8 +155,7 @@ static void kexec_segment_flush(const struct kimage *kimage)
  */
 void machine_kexec(struct kimage *kimage)
 {
-	phys_addr_t reboot_code_buffer_phys;
-	void *reboot_code_buffer;
+	void *reboot_code_buffer = phys_to_virt(kimage->arch.kern_reloc);
 	bool in_kexec_crash = (kimage == kexec_crash_image);
 	bool stuck_cpus = cpus_are_stuck_in_kernel();
 
@@ -155,30 +166,8 @@ void machine_kexec(struct kimage *kimage)
 	WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
 		"Some CPUs may be stale, kdump will be unreliable.\n");
 
-	reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
-	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
-
 	kexec_image_info(kimage);
 
-	pr_debug("%s:%d: control_code_page:        %p\n", __func__, __LINE__,
-		kimage->control_code_page);
-	pr_debug("%s:%d: reboot_code_buffer_phys:  %pa\n", __func__, __LINE__,
-		&reboot_code_buffer_phys);
-	pr_debug("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
-		reboot_code_buffer);
-	pr_debug("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
-		arm64_relocate_new_kernel);
-	pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
-		__func__, __LINE__, arm64_relocate_new_kernel_size,
-		arm64_relocate_new_kernel_size);
-
-	/*
-	 * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
-	 * after the kernel is shut down.
-	 */
-	memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
-		arm64_relocate_new_kernel_size);
-
 	/* Flush the reboot_code_buffer in preparation for its execution. */
 	__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
 
@@ -214,12 +203,8 @@ void machine_kexec(struct kimage *kimage)
 	 * userspace (kexec-tools).
 	 * In kexec_file case, the kernel starts directly without purgatory.
 	 */
-	cpu_soft_restart(reboot_code_buffer_phys, kimage->head, kimage->start,
-#ifdef CONFIG_KEXEC_FILE
+	cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
 						kimage->arch.dtb_mem);
-#else
-						0);
-#endif
 
 	BUG(); /* Should never get here. */
 }
-- 
2.22.0



* [RFC v2 6/8] arm64, kexec: add expandable argument to relocation function
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (4 preceding siblings ...)
  2019-07-31 15:38 ` [RFC v2 5/8] arm64, kexec: move relocation function setup and clean up Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 7/8] arm64, kexec: configure transitional page table for kexec Pavel Tatashin
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Currently, the kexec relocation function (arm64_relocate_new_kernel)
accepts the following arguments:

head:		start of array that contains relocation information.
entry:		entry point for new kernel or purgatory.
dtb_mem:	first and only argument to entry.

The number of arguments cannot be easily expanded, because this
function is also called from HVC_SOFT_RESTART, which preserves only
three arguments. Also, arm64_relocate_new_kernel is written in assembly
and is called without a stack, so there is no place to stage extra
arguments beyond the free registers.

Soon, we will need to pass more arguments: once we enable the MMU we
will need to pass information about the page tables.

Another benefit of allowing this function to accept more arguments is
that the kernel entry point can actually accept up to 4 arguments
(x0-x3). Currently only one is used, but if in the future we need more
(for example, to pass information about when the previous kernel exited,
to get a precise measurement of the time spent in purgatory), we could
not easily do that if arm64_relocate_new_kernel cannot accept more
arguments.

So, add a new struct, kern_reloc_arg, and place it in a kexec-safe page
(i.e. memory that is not overwritten during relocation). Thus, make
arm64_relocate_new_kernel take only one argument, which contains all the
needed information.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/include/asm/kexec.h      | 18 ++++++
 arch/arm64/kernel/asm-offsets.c     |  9 +++
 arch/arm64/kernel/cpu-reset.S       |  4 +-
 arch/arm64/kernel/cpu-reset.h       |  8 +--
 arch/arm64/kernel/machine_kexec.c   | 29 +++++++++-
 arch/arm64/kernel/relocate_kernel.S | 88 ++++++++++-------------------
 6 files changed, 87 insertions(+), 69 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index d15ca1ca1e83..d5b79d4c7fae 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,12 +90,30 @@ static inline void crash_prepare_suspend(void) {}
 static inline void crash_post_resume(void) {}
 #endif
 
+/*
+ * kern_reloc_arg is passed to the kernel relocation function as its argument.
+ * head		kimage->head, allows traversing the relocation segments.
+ * entry_addr	kimage->start, where to jump from relocation function (new
+ *		kernel, or purgatory entry address).
+ * kern_arg0	first argument to kernel is its dtb address. The other
+ *		arguments are currently unused, and must be set to 0
+ */
+struct kern_reloc_arg {
+	unsigned long	head;
+	unsigned long	entry_addr;
+	unsigned long	kern_arg0;
+	unsigned long	kern_arg1;
+	unsigned long	kern_arg2;
+	unsigned long	kern_arg3;
+};
+
 #define ARCH_HAS_KIMAGE_ARCH
 
 struct kimage_arch {
 	void *dtb;
 	unsigned long dtb_mem;
 	unsigned long kern_reloc;
+	unsigned long kern_reloc_arg;
 };
 
 #ifdef CONFIG_KEXEC_FILE
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 214685760e1c..900394907fd8 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -23,6 +23,7 @@
 #include <asm/suspend.h>
 #include <linux/kbuild.h>
 #include <linux/arm-smccc.h>
+#include <linux/kexec.h>
 
 int main(void)
 {
@@ -126,6 +127,14 @@ int main(void)
 #ifdef CONFIG_ARM_SDE_INTERFACE
   DEFINE(SDEI_EVENT_INTREGS,	offsetof(struct sdei_registered_event, interrupted_regs));
   DEFINE(SDEI_EVENT_PRIORITY,	offsetof(struct sdei_registered_event, priority));
+#endif
+#ifdef CONFIG_KEXEC_CORE
+  DEFINE(KRELOC_HEAD,		offsetof(struct kern_reloc_arg, head));
+  DEFINE(KRELOC_ENTRY_ADDR,	offsetof(struct kern_reloc_arg, entry_addr));
+  DEFINE(KRELOC_KERN_ARG0,	offsetof(struct kern_reloc_arg, kern_arg0));
+  DEFINE(KRELOC_KERN_ARG1,	offsetof(struct kern_reloc_arg, kern_arg1));
+  DEFINE(KRELOC_KERN_ARG2,	offsetof(struct kern_reloc_arg, kern_arg2));
+  DEFINE(KRELOC_KERN_ARG3,	offsetof(struct kern_reloc_arg, kern_arg3));
 #endif
   return 0;
 }
diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 6ea337d464c4..64c78a42919f 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -43,9 +43,7 @@ ENTRY(__cpu_soft_restart)
 	hvc	#0				// no return
 
 1:	mov	x18, x1				// entry
-	mov	x0, x2				// arg0
-	mov	x1, x3				// arg1
-	mov	x2, x4				// arg2
+	mov	x0, x2				// arg
 	br	x18
 ENDPROC(__cpu_soft_restart)
 
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index ed50e9587ad8..7a8720ff186f 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -11,12 +11,10 @@
 #include <asm/virt.h>
 
 void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
-	unsigned long arg0, unsigned long arg1, unsigned long arg2);
+			unsigned long arg);
 
 static inline void __noreturn cpu_soft_restart(unsigned long entry,
-					       unsigned long arg0,
-					       unsigned long arg1,
-					       unsigned long arg2)
+					       unsigned long arg)
 {
 	typeof(__cpu_soft_restart) *restart;
 
@@ -25,7 +23,7 @@ static inline void __noreturn cpu_soft_restart(unsigned long entry,
 	restart = (void *)__pa_symbol(__cpu_soft_restart);
 
 	cpu_install_idmap();
-	restart(el2_switch, entry, arg0, arg1, arg2);
+	restart(el2_switch, entry, arg);
 	unreachable();
 }
 
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 596c9b9657be..d7291a663379 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -43,6 +43,7 @@ static void _kexec_image_info(const char *func, int line,
 	pr_debug("    head:        %lx\n", kimage->head);
 	pr_debug("    nr_segments: %lu\n", kimage->nr_segments);
 	pr_debug("    kern_reloc: %pa\n", &kimage->arch.kern_reloc);
+	pr_debug("    kern_reloc_arg: %pa\n", &kimage->arch.kern_reloc_arg);
 
 	for (i = 0; i < kimage->nr_segments; i++) {
 		pr_debug("      segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu pages\n",
@@ -59,14 +60,39 @@ void machine_kexec_cleanup(struct kimage *kimage)
 	/* Empty routine needed to avoid build errors. */
 }
 
+/* Allocates pages for kexec page table */
+static void *kexec_page_alloc(void *arg)
+{
+	struct kimage *kimage = (struct kimage *)arg;
+	struct page *page = kimage_alloc_control_pages(kimage, 0);
+
+	if (!page)
+		return NULL;
+
+	return page_address(page);
+}
+
+
 int machine_kexec_post_load(struct kimage *kimage)
 {
 	unsigned long kern_reloc;
+	struct kern_reloc_arg *kern_reloc_arg;
 
 	kern_reloc = page_to_phys(kimage->control_code_page);
 	memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
 	       arm64_relocate_new_kernel_size);
+
+	kern_reloc_arg = kexec_page_alloc(kimage);
+	if (!kern_reloc_arg)
+		return -ENOMEM;
+	memset(kern_reloc_arg, 0, sizeof(struct kern_reloc_arg));
+
 	kimage->arch.kern_reloc = kern_reloc;
+	kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
+
+	kern_reloc_arg->head = kimage->head;
+	kern_reloc_arg->entry_addr = kimage->start;
+	kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
 
 	kexec_image_info(kimage);
 	return 0;
@@ -203,8 +229,7 @@ void machine_kexec(struct kimage *kimage)
 	 * userspace (kexec-tools).
 	 * In kexec_file case, the kernel starts directly without purgatory.
 	 */
-	cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
-						kimage->arch.dtb_mem);
+	cpu_soft_restart(kimage->arch.kern_reloc, kimage->arch.kern_reloc_arg);
 
 	BUG(); /* Should never get here. */
 }
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index c1d7db71a726..d352faf7cbe6 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -8,7 +8,7 @@
 
 #include <linux/kexec.h>
 #include <linux/linkage.h>
-
+#include <asm/asm-offsets.h>
 #include <asm/assembler.h>
 #include <asm/kexec.h>
 #include <asm/page.h>
@@ -17,86 +17,58 @@
 /*
  * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
  *
- * The memory that the old kernel occupies may be overwritten when coping the
+ * The memory that the old kernel occupies may be overwritten when copying the
  * new image to its final location.  To assure that the
  * arm64_relocate_new_kernel routine which does that copy is not overwritten,
  * all code and data needed by arm64_relocate_new_kernel must be between the
  * symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end.  The
  * machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
- * control_code_page, a special page which has been set up to be preserved
- * during the copy operation.
+ * safe memory that has been set up to be preserved during the copy operation.
  */
 ENTRY(arm64_relocate_new_kernel)
-
-	/* Setup the list loop variables. */
-	mov	x18, x2				/* x18 = dtb address */
-	mov	x17, x1				/* x17 = kimage_start */
-	mov	x16, x0				/* x16 = kimage_head */
-	raw_dcache_line_size x15, x0		/* x15 = dcache line size */
-	mov	x14, xzr			/* x14 = entry ptr */
-	mov	x13, xzr			/* x13 = copy dest */
-
 	/* Clear the sctlr_el2 flags. */
-	mrs	x0, CurrentEL
-	cmp	x0, #CurrentEL_EL2
+	mrs	x2, CurrentEL
+	cmp	x2, #CurrentEL_EL2
 	b.ne	1f
-	mrs	x0, sctlr_el2
+	mrs	x2, sctlr_el2
 	ldr	x1, =SCTLR_ELx_FLAGS
-	bic	x0, x0, x1
+	bic	x2, x2, x1
 	pre_disable_mmu_workaround
-	msr	sctlr_el2, x0
+	msr	sctlr_el2, x2
 	isb
-1:
-
-	/* Check if the new image needs relocation. */
+1:	/* Check if the new image needs relocation. */
+	ldr	x16, [x0, #KRELOC_HEAD]		/* x16 = kimage_head */
 	tbnz	x16, IND_DONE_BIT, .Ldone
-
+	raw_dcache_line_size x15, x1		/* x15 = dcache line size */
 .Lloop:
 	and	x12, x16, PAGE_MASK		/* x12 = addr */
-
 	/* Test the entry flags. */
 .Ltest_source:
 	tbz	x16, IND_SOURCE_BIT, .Ltest_indirection
 
 	/* Invalidate dest page to PoC. */
-	mov     x0, x13
-	add     x20, x0, #PAGE_SIZE
+	mov     x2, x13
+	add     x20, x2, #PAGE_SIZE
 	sub     x1, x15, #1
-	bic     x0, x0, x1
-2:	dc      ivac, x0
-	add     x0, x0, x15
-	cmp     x0, x20
+	bic     x2, x2, x1
+2:	dc      ivac, x2
+	add     x2, x2, x15
+	cmp     x2, x20
 	b.lo    2b
 	dsb     sy
 
-	mov x20, x13
-	mov x21, x12
-	copy_page x20, x21, x0, x1, x2, x3, x4, x5, x6, x7
-
-	/* dest += PAGE_SIZE */
-	add	x13, x13, PAGE_SIZE
+	copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
 	b	.Lnext
-
 .Ltest_indirection:
 	tbz	x16, IND_INDIRECTION_BIT, .Ltest_destination
-
-	/* ptr = addr */
-	mov	x14, x12
+	mov	x14, x12			/* ptr = addr */
 	b	.Lnext
-
 .Ltest_destination:
 	tbz	x16, IND_DESTINATION_BIT, .Lnext
-
-	/* dest = addr */
-	mov	x13, x12
-
+	mov	x13, x12			/* dest = addr */
 .Lnext:
-	/* entry = *ptr++ */
-	ldr	x16, [x14], #8
-
-	/* while (!(entry & DONE)) */
-	tbz	x16, IND_DONE_BIT, .Lloop
-
+	ldr	x16, [x14], #8			/* entry = *ptr++ */
+	tbz	x16, IND_DONE_BIT, .Lloop	/* while (!(entry & DONE)) */
 .Ldone:
 	/* wait for writes from copy_page to finish */
 	dsb	nsh
@@ -105,18 +77,16 @@ ENTRY(arm64_relocate_new_kernel)
 	isb
 
 	/* Start new image. */
-	mov	x0, x18
-	mov	x1, xzr
-	mov	x2, xzr
-	mov	x3, xzr
-	br	x17
-
-ENDPROC(arm64_relocate_new_kernel)
+	ldr	x4, [x0, #KRELOC_ENTRY_ADDR]	/* x4 = kimage_start */
+	ldr	x3, [x0, #KRELOC_KERN_ARG3]
+	ldr	x2, [x0, #KRELOC_KERN_ARG2]
+	ldr	x1, [x0, #KRELOC_KERN_ARG1]
+	ldr	x0, [x0, #KRELOC_KERN_ARG0]	/* x0 = dtb address */
+	br	x4
+END(arm64_relocate_new_kernel)
 
 .ltorg
-
 .align 3	/* To keep the 64-bit values below naturally aligned. */
-
 .Lcopy_end:
 .org	KEXEC_CONTROL_PAGE_SIZE
 
-- 
2.22.0



* [RFC v2 7/8] arm64, kexec: configure transitional page table for kexec
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (5 preceding siblings ...)
  2019-07-31 15:38 ` [RFC v2 6/8] arm64, kexec: add expandable argument to relocation function Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:38 ` [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation Pavel Tatashin
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Configure a page table located in kexec-safe memory that has
the following mappings:

1. mapping for the text of the relocation function, with executable permission.
2. mapping for the argument of the relocation function.
3. mappings for all source ranges.
4. mappings for all destination ranges.
5. mappings for the array that contains information about sources/destinations.

We could make this page table contain linear addresses, but instead we do
identity maps (va == pa) for every mapping. This is because the relocation
code can be executed at EL2, where ttbr1 might not be available. There is
no way to execute the relocation code at EL1 and escalate to EL2 later,
because the old world is overwritten by then, leaving nowhere to trap to.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/include/asm/kexec.h    |  3 +
 arch/arm64/kernel/asm-offsets.c   |  1 +
 arch/arm64/kernel/machine_kexec.c | 96 ++++++++++++++++++++++++++++++-
 3 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index d5b79d4c7fae..1f226cc76e24 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,6 +97,8 @@ static inline void crash_post_resume(void) {}
  *		kernel, or purgatory entry address).
  * kern_arg0	first argument to kernel is its dtb address. The other
  *		arguments are currently unused, and must be set to 0
+ * trans_table:	idmap for source and destination pages, as well as for
+ *		relocation text.
  */
 struct kern_reloc_arg {
 	unsigned long	head;
@@ -105,6 +107,7 @@ struct kern_reloc_arg {
 	unsigned long	kern_arg1;
 	unsigned long	kern_arg2;
 	unsigned long	kern_arg3;
+	unsigned long	trans_table;
 };
 
 #define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 900394907fd8..002db58b28f3 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -135,6 +135,7 @@ int main(void)
   DEFINE(KRELOC_KERN_ARG1,	offsetof(struct kern_reloc_arg, kern_arg1));
   DEFINE(KRELOC_KERN_ARG2,	offsetof(struct kern_reloc_arg, kern_arg2));
   DEFINE(KRELOC_KERN_ARG3,	offsetof(struct kern_reloc_arg, kern_arg3));
+  DEFINE(KRELOC_TRANS_TABLE,	offsetof(struct kern_reloc_arg, trans_table));
 #endif
   return 0;
 }
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index d7291a663379..402c8fb48f7e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,6 +20,7 @@
 #include <asm/mmu.h>
 #include <asm/mmu_context.h>
 #include <asm/page.h>
+#include <asm/trans_table.h>
 
 #include "cpu-reset.h"
 
@@ -72,11 +73,96 @@ static void *kexec_page_alloc(void *arg)
 	return page_address(page);
 }
 
+/*
+ * idmap every segment that needs to be relocated. We map pages for the
+ * destinations, the sources, and also the array that holds the source and
+ * destination addresses.
+ * Ideally, we could linearly map src and dst addresses, so that in the
+ * relocation routine we would only need to do memcpy(dst, src, len); but
+ * this is not possible, because on armv8.0 EL2 does not have ttbr1, and
+ * thus we might not have enough linear VA range. So, simply idmap here,
+ * which works at both EL1 and EL2. Note: we cannot do the relocation at
+ * EL1 and later upgrade to EL2, because the old world is erased by then,
+ * so there is nowhere to trap to.
+ */
+static int map_segments(struct kimage *kimage, pgd_t *pgdp,
+			struct trans_table_info *info)
+{
+	unsigned long *ptr = 0;
+	unsigned long dest = 0;
+	unsigned long entry, addr;
+	int rc;
+
+	for (entry = kimage->head; !(entry & IND_DONE); entry = *ptr++) {
+		addr = entry & PAGE_MASK;
+
+		switch (entry & IND_FLAGS) {
+		case IND_DESTINATION:
+			dest = addr;
+			break;
+		case IND_INDIRECTION:
+			ptr = __va(addr);
+			rc = trans_table_map_page(info, pgdp, ptr,
+						  addr, PAGE_KERNEL);
+			if (rc)
+				return rc;
+			break;
+		case IND_SOURCE:
+			rc = trans_table_map_page(info, pgdp, __va(addr),
+						  addr, PAGE_KERNEL);
+			if (rc)
+				return rc;
+			rc = trans_table_map_page(info, pgdp, __va(dest),
+						  dest, PAGE_KERNEL);
+			if (rc)
+				return rc;
+			dest += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static int mmu_relocate_setup(struct kimage *kimage, unsigned long kern_reloc,
+			      struct kern_reloc_arg *kern_reloc_arg)
+{
+	struct trans_table_info info = {
+		.trans_alloc_page	= kexec_page_alloc,
+		.trans_alloc_arg	= kimage,
+		.trans_flags		= 0,
+	};
+	pgd_t *trans_table;
+	int rc;
+
+	rc = trans_table_create_empty(&info, &trans_table);
+	if (rc)
+		return rc;
+
+	rc = map_segments(kimage, trans_table, &info);
+	if (rc)
+		return rc;
+
+	/* Map relocation function va == pa */
+	rc = trans_table_map_page(&info, trans_table,  __va(kern_reloc),
+				  kern_reloc, PAGE_KERNEL_EXEC);
+	if (rc)
+		return rc;
+
+	/* Map relocation function argument va == pa */
+	rc = trans_table_map_page(&info, trans_table, kern_reloc_arg,
+				  __pa(kern_reloc_arg), PAGE_KERNEL);
+	if (rc)
+		return rc;
+
+	kern_reloc_arg->trans_table = __pa(trans_table);
+
+	return 0;
+}
 
 int machine_kexec_post_load(struct kimage *kimage)
 {
 	unsigned long kern_reloc;
 	struct kern_reloc_arg *kern_reloc_arg;
+	int rc = 0;
 
 	kern_reloc = page_to_phys(kimage->control_code_page);
 	memcpy(__va(kern_reloc), arm64_relocate_new_kernel,
@@ -94,8 +180,16 @@ int machine_kexec_post_load(struct kimage *kimage)
 	kern_reloc_arg->entry_addr = kimage->start;
 	kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
 
+	/*
+	 * If relocation is not needed, we do not need to enable the MMU in
+	 * the relocation routine; therefore, do not create page tables for
+	 * scenarios such as the crash kernel.
+	 */
+	if (!(kimage->head & IND_DONE))
+		rc = mmu_relocate_setup(kimage, kern_reloc, kern_reloc_arg);
+
 	kexec_image_info(kimage);
-	return 0;
+	return rc;
 }
 
 /**
-- 
2.22.0
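
For readers following map_segments() above: a kimage is described by a
list of tagged entries, each a page-aligned physical address plus IND_*
flag bits. Below is a stand-alone C sketch of the same traversal, with
visit() as a hypothetical callback standing in for the
trans_table_map_page() calls (the IND_* values match <linux/kexec.h>):

#include <stddef.h>
#include <stdint.h>

#define IND_DESTINATION	(1 << 0)	/* following sources copy here */
#define IND_INDIRECTION	(1 << 1)	/* entry list continues in this page */
#define IND_DONE	(1 << 2)	/* end of the entry list */
#define IND_SOURCE	(1 << 3)	/* one page to copy to current dest */
#define IND_FLAGS	(IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)
#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/*
 * head is expected to be an IND_INDIRECTION entry, so ptr is assigned
 * before it is first dereferenced, exactly as in map_segments() above.
 */
static void walk_kimage_entries(uint64_t head,
				uint64_t *(*to_virt)(uint64_t pa),
				void (*visit)(uint64_t pa))
{
	uint64_t *ptr = NULL;
	uint64_t dest = 0, entry, addr;

	for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
		addr = entry & PAGE_MASK;
		switch (entry & IND_FLAGS) {
		case IND_DESTINATION:		/* set the copy target */
			dest = addr;
			break;
		case IND_INDIRECTION:		/* jump to the next entry page */
			ptr = to_virt(addr);
			visit(addr);
			break;
		case IND_SOURCE:		/* one source page */
			visit(addr);
			visit(dest);
			dest += PAGE_SIZE;
		}
	}
}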


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (6 preceding siblings ...)
  2019-07-31 15:38 ` [RFC v2 7/8] arm64, kexec: configure transitional page table for kexec Pavel Tatashin
@ 2019-07-31 15:38 ` Pavel Tatashin
  2019-07-31 15:50   ` Mark Rutland
  2019-07-31 16:32 ` [RFC v2 0/8] arm64: MMU enabled " Mark Rutland
  2019-08-01 13:24 ` Pavel Tatashin
  9 siblings, 1 reply; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 15:38 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, ebiederm, kexec, linux-kernel,
	corbet, catalin.marinas, will, linux-doc, linux-arm-kernel,
	marc.zyngier, james.morse, vladimir.murzin, matthias.bgg,
	bhsharma

Now that we have transitional page tables configured, temporarily enable
the MMU to allow faster relocation of segments to their final destination.

Performance data: for a moderately sized kernel plus initramfs (25M),
relocation took 0.382s; with the MMU enabled it now takes only 0.019s,
a 20x improvement.

Relocation time is proportional to the amount of data relocated, so with
a larger initramfs (say, 100M) it could take over a second.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/kernel/relocate_kernel.S | 192 ++++++++++++++++++++++------
 1 file changed, 154 insertions(+), 38 deletions(-)

diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index d352faf7cbe6..88fc69adb90d 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -4,6 +4,8 @@
  *
  * Copyright (C) Linaro.
  * Copyright (C) Huawei Futurewei Technologies.
+ * Copyright (c) 2019, Microsoft Corporation.
+ * Pavel Tatashin <patatash@linux.microsoft.com>
  */
 
 #include <linux/kexec.h>
@@ -13,6 +15,130 @@
 #include <asm/kexec.h>
 #include <asm/page.h>
 #include <asm/sysreg.h>
+#include <asm/kvm_arm.h>
+
+/*
+ * The following code is adapted from "Bare-metal Boot Code for ARMv8-A
+ * Processors Version 1.0, 5.3.1 Cleaning and invalidating the caches".
+ * http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a
+ */
+.macro dcache_invalidate tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8
+	mov	\tmp0, #0x0			/* tmp0 = Cache level */
+	msr	CSSELR_EL1, \tmp0		/* 0x0 for L1, 0x2 for L2 */
+	mrs	\tmp4, CCSIDR_EL1		/* Read Cache Size ID */
+	and	\tmp1, \tmp4, #0x7
+	add	\tmp1, \tmp1, #0x4		/* tmp1 Cache Line Size */
+	ldr	\tmp3, =0x7fff
+	and	\tmp2, \tmp3, \tmp4, lsr #13	/* tmp2 Cache Set num - 1 */
+	ldr	\tmp3, =0x3ff
+	and	\tmp3, \tmp3, \tmp4, lsr #3	/* tmp3 Cache Assoc. num - 1 */
+	clz	\tmp4, \tmp3			/* tmp4 way pos. in the CISW */
+	mov	\tmp5, #0			/* tmp5 way counter way_loop */
+1: /* way_loop */
+	mov	\tmp6, #0			/* tmp6 set counter set_loop */
+2: /* set_loop */
+	lsl	\tmp7, \tmp5, \tmp4
+	orr	\tmp7, \tmp0, \tmp7		/* Set way */
+	lsl	\tmp8, \tmp6, \tmp1
+	orr	\tmp7, \tmp7, \tmp8		/* Set set */
+	dc	cisw, \tmp7			/* Clean & Inval. cache line */
+	add	\tmp6, \tmp6, #1		/* Increment set counter */
+	cmp	\tmp6, \tmp2			/* Last set reached yet? */
+	ble	2b				/* If not, iterate set_loop, */
+	add	\tmp5, \tmp5, #1		/* else, next way. */
+	cmp	\tmp5, \tmp3			/* Last way reached yet? */
+	ble	1b				/* If not, iterate way_loop. */
+.endm
+
+/*
+ * Invalidate all TLBs: at EL2, invalidate all TLB entries at both EL1
+ * and EL2; at EL1, invalidate all current-VMID TLB entries at EL1.
+ */
+.macro tlb_invalidate tmp
+	mrs	\tmp, CurrentEL
+	cmp	\tmp, #CurrentEL_EL2
+	isb
+	b.ne	1f
+	dsb	sy
+	tlbi	alle2
+	tlbi	alle1
+	dsb	ish
+	isb
+	b	2f
+1:
+	dsb	sy
+	tlbi	vmalle1
+	dsb	ish
+	isb
+2:
+.endm
+
+.macro turn_off_mmu_el	sctlr, tmp1, tmp2
+	mrs	\tmp1, \sctlr
+	ldr	\tmp2, =SCTLR_ELx_FLAGS
+	bic	\tmp1, \tmp1, \tmp2
+	pre_disable_mmu_workaround
+	msr	\sctlr, \tmp1
+	isb
+.endm
+
+.macro turn_off_mmu tmp1, tmp2
+	turn_off_mmu_el	sctlr_el1, \tmp1, \tmp2	/* Turn off MMU at EL1 */
+	mrs	\tmp1, CurrentEL
+	cmp	\tmp1, #CurrentEL_EL2
+	b.ne	1f
+	turn_off_mmu_el	sctlr_el2, \tmp1, \tmp2	/* Turn off MMU at EL2 */
+1:
+.endm
+
+/* Configure TCR_EL2 and MAIR_EL2 */
+.macro tcr_mair_mmu_el2 tmp1, tmp2, tmp3
+	mrs	\tmp1, tcr_el1
+	ldr	\tmp2, =TCR_EL2_MASK
+	and	\tmp1, \tmp1, \tmp2
+	mov	\tmp2, #TCR_EL2_RES1
+	orr	\tmp1, \tmp1, \tmp2
+	ldr	\tmp2, =TCR_T0SZ(VA_BITS)
+	orr	\tmp1, \tmp1, \tmp2
+	tcr_compute_pa_size \tmp1, #TCR_EL2_PS_SHIFT, \tmp2, \tmp3
+	msr	tcr_el2, \tmp1
+	mrs	\tmp1, mair_el1
+	msr	mair_el2, \tmp1
+.endm
+
+.macro turn_on_mmu tmp1, tmp2, tmp3
+	mrs	\tmp1, CurrentEL
+	cmp	\tmp1, #CurrentEL_EL2
+	b.ne	1f
+	tcr_mair_mmu_el2 \tmp1, \tmp2, \tmp3
+	ldr	\tmp1, =(SCTLR_EL2_RES1 | SCTLR_ELx_FLAGS | ENDIAN_SET_EL2)
+	msr	sctlr_el2, \tmp1
+	b	2f
+1:	mrs	\tmp1, sctlr_el1
+	ldr	\tmp2, =SCTLR_ELx_FLAGS
+	orr	\tmp1, \tmp1, \tmp2
+	msr	sctlr_el1, \tmp1
+2:	ic	iallu
+	dsb	nsh
+	isb
+.endm
+
+.macro set_ttbr_el ttbr_reg, trans_table
+	phys_to_ttbr \trans_table, \trans_table
+	msr	\ttbr_reg, \trans_table
+	isb
+.endm
+
+.macro set_ttbr trans_table, tmp
+	mrs	\tmp, CurrentEL
+	cmp	\tmp, #CurrentEL_EL2
+	b.ne	1f
+	set_ttbr_el	ttbr0_el2, \trans_table
+	b	2f
+1:
+	set_ttbr_el	ttbr0_el1, \trans_table
+2:
+.endm
 
 /*
  * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
@@ -24,59 +150,49 @@
  * symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end.  The
  * machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
  * safe memory that has been set up to be preserved during the copy operation.
+ *
+ * This function temporarily enables the MMU if kernel relocation is needed.
+ * This is done for performance reasons: with the MMU (and thus the caches)
+ * enabled, arm64 copies pages much more quickly.
  */
 ENTRY(arm64_relocate_new_kernel)
-	/* Clear the sctlr_el2 flags. */
-	mrs	x2, CurrentEL
-	cmp	x2, #CurrentEL_EL2
-	b.ne	1f
-	mrs	x2, sctlr_el2
-	ldr	x1, =SCTLR_ELx_FLAGS
-	bic	x2, x2, x1
-	pre_disable_mmu_workaround
-	msr	sctlr_el2, x2
-	isb
-1:	/* Check if the new image needs relocation. */
-	ldr	x16, [x0, #KRELOC_HEAD]		/* x16 = kimage_head */
-	tbnz	x16, IND_DONE_BIT, .Ldone
-	raw_dcache_line_size x15, x1		/* x15 = dcache line size */
+	/* The MMU at EL2 might still be on; turn it off for now */
+	turn_off_mmu	x1, x2
+	dcache_invalidate x1, x2, x3, x4, x5, x6, x7, x8, x9
+	tlb_invalidate x1
+
+	/* Check if the new image needs relocation. */
+	ldr	x12, [x0, #KRELOC_HEAD]		/* x12 = kimage_head */
+	tbnz	x12, IND_DONE_BIT, .Ldone
+	ldr	x1, [x0, #KRELOC_TRANS_TABLE]
+	set_ttbr x1, x2
+	turn_on_mmu x1, x2, x3
 .Lloop:
-	and	x12, x16, PAGE_MASK		/* x12 = addr */
+	and	x2, x12, PAGE_MASK		/* x2 = addr */
 	/* Test the entry flags. */
 .Ltest_source:
-	tbz	x16, IND_SOURCE_BIT, .Ltest_indirection
-
-	/* Invalidate dest page to PoC. */
-	mov     x2, x13
-	add     x20, x2, #PAGE_SIZE
-	sub     x1, x15, #1
-	bic     x2, x2, x1
-2:	dc      ivac, x2
-	add     x2, x2, x15
-	cmp     x2, x20
-	b.lo    2b
-	dsb     sy
-
-	copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
+	tbz	x12, IND_SOURCE_BIT, .Ltest_indirection
+	copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
 	b	.Lnext
 .Ltest_indirection:
-	tbz	x16, IND_INDIRECTION_BIT, .Ltest_destination
-	mov	x14, x12			/* ptr = addr */
+	tbz	x12, IND_INDIRECTION_BIT, .Ltest_destination
+	mov	x11, x2				/* x11 = ptr */
 	b	.Lnext
 .Ltest_destination:
-	tbz	x16, IND_DESTINATION_BIT, .Lnext
-	mov	x13, x12			/* dest = addr */
+	tbz	x12, IND_DESTINATION_BIT, .Lnext
+	mov	x1, x2				/* x1 = dest */
 .Lnext:
-	ldr	x16, [x14], #8			/* entry = *ptr++ */
-	tbz	x16, IND_DONE_BIT, .Lloop	/* while (!(entry & DONE)) */
-.Ldone:
+	ldr	x12, [x11], #8			/* x12 = entry = *ptr++ */
+	tbz	x12, IND_DONE_BIT, .Lloop	/* while (!(entry & DONE)) */
 	/* wait for writes from copy_page to finish */
 	dsb	nsh
 	ic	iallu
 	dsb	nsh
 	isb
-
-	/* Start new image. */
+	turn_off_mmu	x1, x2
+	dcache_invalidate x1, x2, x3, x4, x5, x6, x7, x8, x9
+	tlb_invalidate x1
+.Ldone:	/* Start new image. */
 	ldr	x4, [x0, #KRELOC_ENTRY_ADDR]	/* x4 = kimage_start */
 	ldr	x3, [x0, #KRELOC_KERN_ARG3]
 	ldr	x2, [x0, #KRELOC_KERN_ARG2]
-- 
2.22.0
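
In C terms, the EL1 paths of the set_ttbr and turn_on_mmu macros above
perform these ordered steps: TTBR first, then SCTLR, then an I-cache
invalidate. A sketch with a hypothetical enable_mmu_el1() helper and
raw bit masks; the real code uses SCTLR_ELx_FLAGS and phys_to_ttbr,
and must run from identity-mapped assembly rather than C:

/*
 * At EL1 the TCR/MAIR are already configured by the kernel, so only
 * TTBR0 and SCTLR change (the EL2 path above also programs TCR_EL2
 * and MAIR_EL2).
 */
static void enable_mmu_el1(unsigned long trans_table_pa)
{
	unsigned long sctlr;

	/* Point translation at the transitional tables first. */
	asm volatile("msr ttbr0_el1, %0\n\tisb" : : "r" (trans_table_pa));

	/* Then set M (MMU), C (D-cache), I (I-cache) in SCTLR_EL1. */
	asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
	sctlr |= (1UL << 0) | (1UL << 2) | (1UL << 12);
	asm volatile("msr sctlr_el1, %0" : : "r" (sctlr));

	/* Discard stale instruction fetches before executing further. */
	asm volatile("ic iallu\n\tdsb nsh\n\tisb" : : : "memory");
}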


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation
  2019-07-31 15:38 ` [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation Pavel Tatashin
@ 2019-07-31 15:50   ` Mark Rutland
  2019-07-31 16:01     ` Pavel Tatashin
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2019-07-31 15:50 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: jmorris, sashal, ebiederm, kexec, linux-kernel, corbet,
	catalin.marinas, will, linux-doc, linux-arm-kernel, marc.zyngier,
	james.morse, vladimir.murzin, matthias.bgg, bhsharma

On Wed, Jul 31, 2019 at 11:38:57AM -0400, Pavel Tatashin wrote:
> +/*
> + * The following code is adapted from "Bare-metal Boot Code for ARMv8-A
> + * Processors Version 1.0, 5.3.1 Cleaning and invalidating the caches".
> + * http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a
> + */
> +.macro dcache_invalidate tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8
> +	mov	\tmp0, #0x0			/* tmp0 = Cache level */
> +	msr	CSSELR_EL1, \tmp0		/* 0x0 for L1, 0x2 for L2 */
> +	mrs	\tmp4, CCSIDR_EL1		/* Read Cache Size ID */
> +	and	\tmp1, \tmp4, #0x7
> +	add	\tmp1, \tmp1, #0x4		/* tmp1 Cache Line Size */
> +	ldr	\tmp3, =0x7fff
> +	and	\tmp2, \tmp3, \tmp4, lsr #13	/* tmp2 Cache Set num - 1 */
> +	ldr	\tmp3, =0x3ff
> +	and	\tmp3, \tmp3, \tmp4, lsr #3	/* tmp3 Cache Assoc. num - 1 */
> +	clz	\tmp4, \tmp3			/* tmp4 way pos. in the CISW */
> +	mov	\tmp5, #0			/* tmp5 way counter way_loop */
> +1: /* way_loop */
> +	mov	\tmp6, #0			/* tmp6 set counter set_loop */
> +2: /* set_loop */
> +	lsl	\tmp7, \tmp5, \tmp4
> +	orr	\tmp7, \tmp0, \tmp7		/* Set way */
> +	lsl	\tmp8, \tmp6, \tmp1
> +	orr	\tmp7, \tmp7, \tmp8		/* Set set */
> +	dc	cisw, \tmp7			/* Clean & Inval. cache line */
> +	add	\tmp6, \tmp6, #1		/* Increment set counter */
> +	cmp	\tmp6, \tmp2			/* Last set reached yet? */
> +	ble	2b				/* If not, iterate set_loop, */
> +	add	\tmp5, \tmp5, #1		/* else, next way. */
> +	cmp	\tmp5, \tmp3			/* Last way reached yet? */
> +	ble	1b				/* If not, iterate way_loop. */
> +.endm
> +

For various reasons, one cannot safely use Set/Way operations in
portable code. They only make sense for low-level platform-specific
firmware performing power management operations.

If you need to perform D-cache maintenance, you must use the VA
operations to do so.

Thanks,
Mark.
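
For reference, VA-based maintenance over an address range looks roughly
like the following sketch, which assumes a fixed 64-byte D-cache line;
real code must instead read the line size from CTR_EL0, as the
raw_dcache_line_size helper used by the old code does:

static void dcache_clean_inval_range(unsigned long start, unsigned long end)
{
	const unsigned long line = 64;	/* assumed D-cache line size */
	unsigned long addr = start & ~(line - 1);

	/* Clean+invalidate each line by VA to the Point of Coherency. */
	for (; addr < end; addr += line)
		asm volatile("dc civac, %0" : : "r" (addr) : "memory");
	asm volatile("dsb sy" : : : "memory");	/* complete the maintenance */
}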

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation
  2019-07-31 15:50   ` Mark Rutland
@ 2019-07-31 16:01     ` Pavel Tatashin
  0 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 16:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: James Morris, Sasha Levin, Eric W. Biederman, kexec mailing list,
	LKML, Jonathan Corbet, Catalin Marinas, will,
	Linux Doc Mailing List, Linux ARM, Marc Zyngier, James Morse,
	Vladimir Murzin, Matthias Brugger, Bhupesh Sharma

> For various reasons, one cannot safely use Set/Way operations in
> portable code. They only make sense for low-level platform-specific
> firmware performing power management operations.
>
> If you need to perform D-cache maintenance, you must use the VA
> operations to do so.

Hi Mark,

I see, thank you for letting me know. I will do d-cache flushing by VA
in the next iteration. First I need to root cause/fix the bug
described in the cover letter.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v2 0/8] arm64: MMU enabled kexec relocation
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (7 preceding siblings ...)
  2019-07-31 15:38 ` [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation Pavel Tatashin
@ 2019-07-31 16:32 ` Mark Rutland
  2019-07-31 16:40   ` Pavel Tatashin
  2019-08-01 13:24 ` Pavel Tatashin
  9 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2019-07-31 16:32 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: jmorris, sashal, ebiederm, kexec, linux-kernel, corbet,
	catalin.marinas, will, linux-doc, linux-arm-kernel, marc.zyngier,
	james.morse, vladimir.murzin, matthias.bgg, bhsharma

Hi Pavel,

Generally, the cover letter should state up-front what the goal is (or
what problem you're trying to solve). It would be really helpful to have
that so that we understand what you're trying to achieve, and why.

Messing with the MMU is often fraught with danger (and very painful to
debug, as you are now aware), and so far we've tried to minimize the
number of places where we have to do so.

On Wed, Jul 31, 2019 at 11:38:49AM -0400, Pavel Tatashin wrote:
> Changelog from previous RFC:
> - Added trans_table support for both hibernate and kexec.
> - Fixed performance issue, where enabling MMU did not yield the
>   actual performance improvement.
> 
> Bug:
> With the current state, this patch series works on kernels booted with EL1
> mode, but for some reason, when elevated to EL2 mode reboot freezes in
> both QEMU and on real hardware.
> 
> The freeze happens in:
> 
> arch/arm64/kernel/relocate_kernel.S
> 	turn_on_mmu()
> 
> Right after sctlr_el2 is written (MMU on EL2 is enabled)
> 
> 	msr     sctlr_el2, \tmp1
> 
> I've been studying all the relevant control registers for EL2, but do not
> see what might be causing this hang:
> 
> MAIR_EL2 is set to be exactly the same as MAIR_EL1 0xbbff440c0400
> 
> TCR_EL2        0x80843510
> Enabled bits:
> PS      Physical Address Size. (0b100   44 bits, 16TB.)
> SH0     Shareability    11 Inner Shareable
> ORGN0   Normal memory, Outer Write-Back Read-Allocate Write-Allocate Cach.
> IRGN0   Normal memory, Inner Write-Back Read-Allocate Write-Allocate Cach.
> T0SZ    01 0000
> 
> SCTLR_EL2	0x30e5183f
> RES1    : Reserve ones
> M       : MMU enabled
> A       : Align check
> C       : Cacheability control
> SA      : SP Alignment check enable
> IESB    : Implicit Error Synchronization event
> I       : Instruction access Cacheability
> 
> TTBR0_EL2      0x1b3069000 (address of trans_table)
> 
> Any suggestion of what else might be missing that causes this freeze when
> MMU is enabled in EL2?
> 
> =====

> Here is the current data from the real hardware:
> (because of bug, I forced EL1 mode by setting el2_switch always to zero in
> cpu_soft_restart()):
> 
> For this experiment, the size of kernel plus initramfs is 25M. If initramfs
> was larger, than the improvements would be even greater, as time spent in
> relocation is proportional to the size of relocation.
> 
> Previously:
> kernel shutdown	0.022131328s
> relocation	0.440510736s
> kernel startup	0.294706768s

In total this takes ~0.76s...

> 
> Relocation was taking: 58.2% of reboot time
> 
> Now:
> kernel shutdown	0.032066576s
> relocation	0.022158152s
> kernel startup	0.296055880s

... and this takes ~0.35s

So do we really need this complexity for a few blinks of an eye?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v2 0/8] arm64: MMU enabled kexec relocation
  2019-07-31 16:32 ` [RFC v2 0/8] arm64: MMU enabled " Mark Rutland
@ 2019-07-31 16:40   ` Pavel Tatashin
  2019-07-31 16:50     ` Mark Rutland
  0 siblings, 1 reply; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 16:40 UTC (permalink / raw)
  To: Mark Rutland
  Cc: James Morris, Sasha Levin, Eric W. Biederman, kexec mailing list,
	LKML, Jonathan Corbet, Catalin Marinas, will,
	Linux Doc Mailing List, Linux ARM, Marc Zyngier, James Morse,
	Vladimir Murzin, Matthias Brugger, Bhupesh Sharma

On Wed, Jul 31, 2019 at 12:33 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Pavel,
>
> Generally, the cover letter should state up-front what the goal is (or
> what problem you're trying to solve). It would be really helpful to have
> that so that we understand what you're trying to achieve, and why.
>
> Messing with the MMU is often fraught with danger (and very painful to
> debug, as you are now aware), and so far we've tried to minimize the
> number of places where we have to do so.

Hi Mark,

I understand; this is why I first went another route to solve this
problem: pre-reserving contiguous memory and avoiding relocation
entirely (the same as what happens during a crash reboot). But that
solution was not accepted, because it introduces a change to common
code to solve an arm64-specific problem. So James Morse and others
suggested that I look at the root of the problem and enable the MMU
during relocation, by doing what is already done during hibernate
restore.

>
> On Wed, Jul 31, 2019 at 11:38:49AM -0400, Pavel Tatashin wrote:
> > Changelog from previous RFC:
> > - Added trans_table support for both hibernate and kexec.
> > - Fixed performance issue, where enabling MMU did not yield the
> >   actual performance improvement.
> >
> > Bug:
> > With the current state, this patch series works on kernels booted with EL1
> > mode, but for some reason, when elevated to EL2 mode reboot freezes in
> > both QEMU and on real hardware.
> >
> > The freeze happens in:
> >
> > arch/arm64/kernel/relocate_kernel.S
> >       turn_on_mmu()
> >
> > Right after sctlr_el2 is written (MMU on EL2 is enabled)
> >
> >       msr     sctlr_el2, \tmp1
> >
> > I've been studying all the relevant control registers for EL2, but do not
> > see what might be causing this hang:
> >
> > MAIR_EL2 is set to be exactly the same as MAIR_EL1 0xbbff440c0400
> >
> > TCR_EL2        0x80843510
> > Enabled bits:
> > PS      Physical Address Size. (0b100   44 bits, 16TB.)
> > SH0     Shareability    11 Inner Shareable
> > ORGN0   Normal memory, Outer Write-Back Read-Allocate Write-Allocate Cach.
> > IRGN0   Normal memory, Inner Write-Back Read-Allocate Write-Allocate Cach.
> > T0SZ    01 0000
> >
> > SCTLR_EL2     0x30e5183f
> > RES1    : Reserve ones
> > M       : MMU enabled
> > A       : Align check
> > C       : Cacheability control
> > SA      : SP Alignment check enable
> > IESB    : Implicit Error Synchronization event
> > I       : Instruction access Cacheability
> >
> > TTBR0_EL2      0x1b3069000 (address of trans_table)
> >
> > Any suggestion of what else might be missing that causes this freeze when
> > MMU is enabled in EL2?
> >
> > =====
>
> > Here is the current data from the real hardware:
> > (because of bug, I forced EL1 mode by setting el2_switch always to zero in
> > cpu_soft_restart()):
> >
> > For this experiment, the size of kernel plus initramfs is 25M. If initramfs
> > was larger, than the improvements would be even greater, as time spent in
> > relocation is proportional to the size of relocation.
> >
> > Previously:
> > kernel shutdown       0.022131328s
> > relocation    0.440510736s
> > kernel startup        0.294706768s
>
> In total this takes ~0.76s...
>
> >
> > Relocation was taking: 58.2% of reboot time
> >
> > Now:
> > kernel shutdown       0.032066576s
> > relocation    0.022158152s
> > kernel startup        0.296055880s
>
> ... and this takes ~0.35s
>
> So do we really need this complexity for a few blinks of an eye?

Yes, we have an extremely tight reboot budget; 0.35s is not an acceptable waste.

>
> Thanks,
> Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v2 0/8] arm64: MMU enabled kexec relocation
  2019-07-31 16:40   ` Pavel Tatashin
@ 2019-07-31 16:50     ` Mark Rutland
  2019-07-31 17:04       ` Pavel Tatashin
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2019-07-31 16:50 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: James Morris, Sasha Levin, Eric W. Biederman, kexec mailing list,
	LKML, Jonathan Corbet, Catalin Marinas, will,
	Linux Doc Mailing List, Linux ARM, Marc Zyngier, James Morse,
	Vladimir Murzin, Matthias Brugger, Bhupesh Sharma

On Wed, Jul 31, 2019 at 12:40:51PM -0400, Pavel Tatashin wrote:
> On Wed, Jul 31, 2019 at 12:33 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Hi Pavel,
> >
> > Generally, the cover letter should state up-front what the goal is (or
> > what problem you're trying to solve). It would be really helpful to have
> > that so that we understand what you're trying to achieve, and why.

[...]

> > > Here is the current data from the real hardware:
> > > (because of bug, I forced EL1 mode by setting el2_switch always to zero in
> > > cpu_soft_restart()):
> > >
> > > For this experiment, the size of kernel plus initramfs is 25M. If initramfs
> > > was larger, than the improvements would be even greater, as time spent in
> > > relocation is proportional to the size of relocation.
> > >
> > > Previously:
> > > kernel shutdown       0.022131328s
> > > relocation    0.440510736s
> > > kernel startup        0.294706768s
> >
> > In total this takes ~0.76s...
> >
> > >
> > > Relocation was taking: 58.2% of reboot time
> > >
> > > Now:
> > > kernel shutdown       0.032066576s
> > > relocation    0.022158152s
> > > kernel startup        0.296055880s
> >
> > ... and this takes ~0.35s
> >
> > So do we really need this complexity for a few blinks of an eye?
> 
> Yes, we have an extremely tight reboot budget; 0.35s is not an acceptable waste.

Could you please elaborate on your use-case?

Understanding what you're trying to achieve would help us to understand
which solutions make sense.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v2 0/8] arm64: MMU enabled kexec relocation
  2019-07-31 16:50     ` Mark Rutland
@ 2019-07-31 17:04       ` Pavel Tatashin
  0 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-07-31 17:04 UTC (permalink / raw)
  To: Mark Rutland
  Cc: James Morris, Sasha Levin, Eric W. Biederman, kexec mailing list,
	LKML, Jonathan Corbet, Catalin Marinas, will,
	Linux Doc Mailing List, Linux ARM, Marc Zyngier, James Morse,
	Vladimir Murzin, Matthias Brugger, Bhupesh Sharma

On Wed, Jul 31, 2019 at 12:50 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Jul 31, 2019 at 12:40:51PM -0400, Pavel Tatashin wrote:
> > On Wed, Jul 31, 2019 at 12:33 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > Hi Pavel,
> > >
> > > Generally, the cover letter should state up-front what the goal is (or
> > > what problem you're trying to solve). It would be really helpful to have
> > > that so that we understand what you're trying to achieve, and why.
>
> [...]
>
> > > > Here is the current data from the real hardware:
> > > > (because of bug, I forced EL1 mode by setting el2_switch always to zero in
> > > > cpu_soft_restart()):
> > > >
> > > > For this experiment, the size of kernel plus initramfs is 25M. If initramfs
> > > > was larger, than the improvements would be even greater, as time spent in
> > > > relocation is proportional to the size of relocation.
> > > >
> > > > Previously:
> > > > kernel shutdown       0.022131328s
> > > > relocation    0.440510736s
> > > > kernel startup        0.294706768s
> > >
> > > In total this takes ~0.76s...
> > >
> > > >
> > > > Relocation was taking: 58.2% of reboot time
> > > >
> > > > Now:
> > > > kernel shutdown       0.032066576s
> > > > relocation    0.022158152s
> > > > kernel startup        0.296055880s
> > >
> > > ... and this takes ~0.35s
> > >
> > > So do we really need this complexity for a few blinks of an eye?
> >
> > Yes, we have an extremely tight reboot budget; 0.35s is not an acceptable waste.
>
> Could you please elaborate on your use-case?
>
> Understanding what you're trying to achieve would help us to understand
> which solutions make sense.

An extremely high-availability device whose update story uses kexec
for faster kernel updates, and also to preserve some state in memory
without wasting time copying it to and from backing storage. We at
Microsoft will be using a fleet of these devices. The total reboot
budget is less than half a second, of which 0.44s is currently spent
in kexec relocation.

Pasha

>
> Thanks,
> Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v2 0/8] arm64: MMU enabled kexec relocation
  2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
                   ` (8 preceding siblings ...)
  2019-07-31 16:32 ` [RFC v2 0/8] arm64: MMU enabled " Mark Rutland
@ 2019-08-01 13:24 ` Pavel Tatashin
  9 siblings, 0 replies; 16+ messages in thread
From: Pavel Tatashin @ 2019-08-01 13:24 UTC (permalink / raw)
  To: Pavel Tatashin, James Morris, Sasha Levin, Eric W. Biederman,
	kexec mailing list, LKML, Jonathan Corbet, Catalin Marinas, will,
	Linux Doc Mailing List, Linux ARM, Marc Zyngier, James Morse,
	Vladimir Murzin, Matthias Brugger, Bhupesh Sharma

I will send a new version soon, so please do not spend time reviewing
this work. In the new version I will fix the MMU-at-EL2 issue by doing
what we already do in hibernation: drop to EL1 to do the copying, and
escalate back to EL2 to branch to the new kernel. This will also
simplify the copy routine, which can then do a plain linear copy, since
ttbr0 and ttbr1 are both always available this way (see the sketch
after the quoted cover letter below).

Thank you,
Pasha

On Wed, Jul 31, 2019 at 11:38 AM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> Changelog from previous RFC:
> - Added trans_table support for both hibernate and kexec.
> - Fixed performance issue, where enabling MMU did not yield the
>   actual performance improvement.
>
> Bug:
> With the current state, this patch series works on kernels booted with EL1
> mode, but for some reason, when elevated to EL2 mode reboot freezes in
> both QEMU and on real hardware.
>
> The freeze happens in:
>
> arch/arm64/kernel/relocate_kernel.S
>         turn_on_mmu()
>
> Right after sctlr_el2 is written (MMU on EL2 is enabled)
>
>         msr     sctlr_el2, \tmp1
>
> I've been studying all the relevant control registers for EL2, but do not
> see what might be causing this hang:
>
> MAIR_EL2 is set to be exactly the same as MAIR_EL1 0xbbff440c0400
>
> TCR_EL2        0x80843510
> Enabled bits:
> PS      Physical Address Size. (0b100   44 bits, 16TB.)
> SH0     Shareability    11 Inner Shareable
> ORGN0   Normal memory, Outer Write-Back Read-Allocate Write-Allocate Cach.
> IRGN0   Normal memory, Inner Write-Back Read-Allocate Write-Allocate Cach.
> T0SZ    01 0000
>
> SCTLR_EL2       0x30e5183f
> RES1    : Reserve ones
> M       : MMU enabled
> A       : Align check
> C       : Cacheability control
> SA      : SP Alignment check enable
> IESB    : Implicit Error Synchronization event
> I       : Instruction access Cacheability
>
> TTBR0_EL2      0x1b3069000 (address of trans_table)
>
> Any suggestion of what else might be missing that causes this freeze when
> MMU is enabled in EL2?
>
> =====
> Here is the current data from the real hardware:
> (because of bug, I forced EL1 mode by setting el2_switch always to zero in
> cpu_soft_restart()):
>
> For this experiment, the size of kernel plus initramfs is 25M. If initramfs
> was larger, than the improvements would be even greater, as time spent in
> relocation is proportional to the size of relocation.
>
> Previously:
> kernel shutdown 0.022131328s
> relocation      0.440510736s
> kernel startup  0.294706768s
>
> Relocation was taking: 58.2% of reboot time
>
> Now:
> kernel shutdown 0.032066576s
> relocation      0.022158152s
> kernel startup  0.296055880s
>
> Now: Relocation takes 6.3% of reboot time
>
> Total reboot is x2.16 times faster.
>
> Previous approaches and discussions
> -----------------------------------
> https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com
> reserve space for kexec to avoid relocation, involves changes to generic code
> to optimize a problem that exists on arm64 only:
>
> https://lore.kernel.org/lkml/20190716165641.6990-1-pasha.tatashin@soleen.com
> The first attempt to enable MMU, some bugs that prevented performance
> improvement. The page tables unnecessary configured idmap for the whole
> physical space.
>
> Pavel Tatashin (8):
>   kexec: quiet down kexec reboot
>   arm64, mm: transitional tables
>   arm64: hibernate: switch to transtional page tables.
>   kexec: add machine_kexec_post_load()
>   arm64, kexec: move relocation function setup and clean up
>   arm64, kexec: add expandable argument to relocation function
>   arm64, kexec: configure transitional page table for kexec
>   arm64, kexec: enable MMU during kexec relocation
>
>  arch/arm64/Kconfig                     |   4 +
>  arch/arm64/include/asm/kexec.h         |  24 ++-
>  arch/arm64/include/asm/pgtable-hwdef.h |   1 +
>  arch/arm64/include/asm/trans_table.h   |  66 ++++++
>  arch/arm64/kernel/asm-offsets.c        |  10 +
>  arch/arm64/kernel/cpu-reset.S          |   4 +-
>  arch/arm64/kernel/cpu-reset.h          |   8 +-
>  arch/arm64/kernel/hibernate.c          | 261 ++++++------------------
>  arch/arm64/kernel/machine_kexec.c      | 168 ++++++++++++---
>  arch/arm64/kernel/relocate_kernel.S    | 238 +++++++++++++++-------
>  arch/arm64/mm/Makefile                 |   1 +
>  arch/arm64/mm/trans_table.c            | 272 +++++++++++++++++++++++++
>  kernel/kexec.c                         |   4 +
>  kernel/kexec_core.c                    |   8 +-
>  kernel/kexec_file.c                    |   4 +
>  kernel/kexec_internal.h                |   2 +
>  16 files changed, 756 insertions(+), 319 deletions(-)
>  create mode 100644 arch/arm64/include/asm/trans_table.h
>  create mode 100644 arch/arm64/mm/trans_table.c
>
> --
> 2.22.0
>
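
For reference, a minimal sketch of the simplified copy loop the linear
map enables: each IND_SOURCE entry reduces to a single memcpy() through
__va(), with no per-page idmap setup. linear_relocate() is a
hypothetical name, and kernel context (<linux/kexec.h>, <asm/memory.h>,
<linux/string.h>) is assumed:

static void linear_relocate(uint64_t head)
{
	uint64_t *ptr = NULL;
	uint64_t dest = 0, entry, addr;

	for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
		addr = entry & PAGE_MASK;
		switch (entry & IND_FLAGS) {
		case IND_DESTINATION:	/* set the copy target */
			dest = addr;
			break;
		case IND_INDIRECTION:	/* follow to the next entry page */
			ptr = __va(addr);
			break;
		case IND_SOURCE:	/* one memcpy() via the linear map */
			memcpy(__va(dest), __va(addr), PAGE_SIZE);
			dest += PAGE_SIZE;
		}
	}
}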

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-08-01 13:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-31 15:38 [RFC v2 0/8] arm64: MMU enabled kexec relocation Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 1/8] kexec: quiet down kexec reboot Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 2/8] arm64, mm: transitional tables Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 3/8] arm64: hibernate: switch to transtional page tables Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 4/8] kexec: add machine_kexec_post_load() Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 5/8] arm64, kexec: move relocation function setup and clean up Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 6/8] arm64, kexec: add expandable argument to relocation function Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 7/8] arm64, kexec: configure transitional page table for kexec Pavel Tatashin
2019-07-31 15:38 ` [RFC v2 8/8] arm64, kexec: enable MMU during kexec relocation Pavel Tatashin
2019-07-31 15:50   ` Mark Rutland
2019-07-31 16:01     ` Pavel Tatashin
2019-07-31 16:32 ` [RFC v2 0/8] arm64: MMU enabled " Mark Rutland
2019-07-31 16:40   ` Pavel Tatashin
2019-07-31 16:50     ` Mark Rutland
2019-07-31 17:04       ` Pavel Tatashin
2019-08-01 13:24 ` Pavel Tatashin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).