linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory
@ 2018-02-26 18:04 Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 1/5] x86/boot/compressed/64: Describe the logic behind LA57 check Kirill A. Shutemov
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 18:04 UTC (permalink / raw)
  To: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel, Kirill A. Shutemov

Here's a re-split of the patch that prepares trampoline memory but doesn't
actually use it yet. The original patch turned out to be problematic.
Splitting it should help to pin down the issue.

The functionality should match the original patch (although I've moved a
bit more into C).

Borislav, could you check which patch breaks boot for you (if any)?

Kirill A. Shutemov (5):
  x86/boot/compressed/64: Describe the logic behind LA57 check
  x86/boot/compressed/64: Find a place for 32-bit trampoline
  x86/boot/compressed/64: Save and restore trampoline memory
  x86/boot/compressed/64: Set up trampoline memory
  x86/boot/compressed/64: Prepare new top-level page table for
    trampoline

 arch/x86/boot/compressed/head_64.S    |  13 +++-
 arch/x86/boot/compressed/misc.c       |   4 +
 arch/x86/boot/compressed/pgtable.h    |  20 +++++
 arch/x86/boot/compressed/pgtable_64.c | 133 +++++++++++++++++++++++++++++++++-
 4 files changed, 166 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pgtable.h

-- 
2.16.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/5] x86/boot/compressed/64: Describe the logic behind LA57 check
  2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
@ 2018-02-26 18:04 ` Kirill A. Shutemov
  2018-03-12  9:27   ` [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the " tip-bot for Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline Kirill A. Shutemov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 18:04 UTC (permalink / raw)
  To: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel, Kirill A. Shutemov

The patch explains the LA57 check in more detail.
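
For readers following along, the two-part check can be sketched as a pure
function. This is only an illustration, not the code in the patch:
la57_wanted() and its parameters are hypothetical names, the caller is
assumed to have read the CPUID values already, and the bit position assumes
LA57 is reported in bit 16 of ECX for CPUID leaf 7 (subleaf 0).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position of LA57 in CPUID.(EAX=7,ECX=0):ECX. */
#define LA57_ECX_BIT	16

/*
 * Hypothetical helper mirroring the logic of the check:
 *   kernel_5level - stands in for IS_ENABLED(CONFIG_X86_5LEVEL)
 *   max_leaf      - EAX returned by CPUID leaf 0 (highest basic leaf)
 *   leaf7_ecx     - ECX returned by CPUID leaf 7, subleaf 0
 */
static bool la57_wanted(bool kernel_5level, uint32_t max_leaf,
			uint32_t leaf7_ecx)
{
	/* Part 1: the kernel must be built with 5-level paging support. */
	if (!kernel_5level)
		return false;

	/* Part 2a: CPUID leaf 7 must exist before it can be queried. */
	if (max_leaf < 7)
		return false;

	/* Part 2b: the leaf must have the LA57 feature bit set. */
	return (leaf7_ecx >> LA57_ECX_BIT) & 1u;
}
```

Passing the CPUID results in as arguments keeps the sketch independent of
the host CPU; the real code reads them with native_cpuid_eax() and
native_cpuid_ecx().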

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/pgtable_64.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 3f1697fcc7a8..45c76eff2718 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -18,10 +18,22 @@ struct paging_config paging_prepare(void)
 {
 	struct paging_config paging_config = {};
 
-	/* Check if LA57 is desired and supported */
-	if (IS_ENABLED(CONFIG_X86_5LEVEL) && native_cpuid_eax(0) >= 7 &&
-			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
+	/*
+	 * Check if LA57 is desired and supported.
+	 *
+	 * There are two parts to the check:
+	 *   - if the kernel supports 5-level paging: CONFIG_X86_5LEVEL=y
+	 *   - if the machine supports 5-level paging:
+	 *     + CPUID leaf 7 is supported
+	 *     + the leaf has the feature bit set
+	 *
+	 * That's a substitute for boot_cpu_has() in early boot code.
+	 */
+	if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
+			native_cpuid_eax(0) >= 7 &&
+			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
 		paging_config.l5_required = 1;
+	}
 
 	return paging_config;
 }
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline
  2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 1/5] x86/boot/compressed/64: Describe the logic behind LA57 check Kirill A. Shutemov
@ 2018-02-26 18:04 ` Kirill A. Shutemov
  2018-02-26 22:30   ` Borislav Petkov
  2018-03-12  9:28   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 3/5] x86/boot/compressed/64: Save and restore trampoline memory Kirill A. Shutemov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 18:04 UTC (permalink / raw)
  To: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel, Kirill A. Shutemov

If a bootloader enables 64-bit mode with 4-level paging, we might need to
switch over to 5-level paging. Switching requires disabling paging, which
works fine if the kernel itself is loaded below 4G.

But if the bootloader puts the kernel above 4G (not sure if anybody does
this), we would lose control as soon as paging is disabled, because the
code becomes unreachable to the CPU.

To handle the situation, we need a trampoline in lower memory that would
take care of switching on 5-level paging.

This patch finds a spot in low memory for a trampoline.

The heuristic is based on code in reserve_bios_regions().

We find the end of low memory based on the BIOS and EBDA start addresses.
The trampoline is placed just before the end of low memory. This mimics the
approach taken to allocate memory for the real-mode trampoline.
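
The placement heuristic described above can be sketched as a standalone
function. This is a hedged illustration rather than the patch itself:
find_trampoline_start() and SKETCH_PAGE_SIZE are made-up names, and the two
16-bit inputs stand in for the words the real code reads from the BIOS data
area at 0x40e (EBDA segment) and 0x413 (base memory size in KiB).

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096UL
#define TRAMPOLINE_32BIT_SIZE	(2 * SKETCH_PAGE_SIZE)
#define BIOS_START_MIN		0x20000UL	/* 128K, less than this is insane */
#define BIOS_START_MAX		0x9f000UL	/* 640K, absolute maximum */

/*
 * Hypothetical helper: compute where the trampoline would go, given the
 * EBDA segment and the base memory size in KiB.
 */
static unsigned long find_trampoline_start(uint16_t ebda_segment,
					   uint16_t base_mem_kb)
{
	unsigned long ebda_start = (unsigned long)ebda_segment << 4;
	unsigned long bios_start = (unsigned long)base_mem_kb << 10;

	/* Distrust implausible values and fall back to the maximum. */
	if (bios_start < BIOS_START_MIN || bios_start > BIOS_START_MAX)
		bios_start = BIOS_START_MAX;

	/* A sane EBDA below that limit ends low memory even earlier. */
	if (ebda_start > BIOS_START_MIN && ebda_start < bios_start)
		bios_start = ebda_start;

	/* Reserve the area just below the limit, rounded down to 4k. */
	return (bios_start - TRAMPOLINE_32BIT_SIZE) & ~(SKETCH_PAGE_SIZE - 1);
}
```

For example, with both inputs zero (no usable BIOS data) the function falls
back to BIOS_START_MAX and returns 0x9d000; with an EBDA segment of 0x8000
it returns 0x7e000.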

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/misc.c       |  4 ++++
 arch/x86/boot/compressed/pgtable.h    | 11 +++++++++++
 arch/x86/boot/compressed/pgtable_64.c | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+)
 create mode 100644 arch/x86/boot/compressed/pgtable.h

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index b50c42455e25..e58409667b13 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -14,6 +14,7 @@
 
 #include "misc.h"
 #include "error.h"
+#include "pgtable.h"
 #include "../string.h"
 #include "../voffset.h"
 
@@ -372,6 +373,9 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	debug_putaddr(output_len);
 	debug_putaddr(kernel_total_size);
 
+	/* Report address of 32-bit trampoline */
+	debug_putaddr(trampoline_32bit);
+
 	/*
 	 * The memory hole needed for the kernel is the larger of either
 	 * the entire decompressed kernel plus relocation table, or the
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
new file mode 100644
index 000000000000..57722a2fe2a0
--- /dev/null
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -0,0 +1,11 @@
+#ifndef BOOT_COMPRESSED_PAGETABLE_H
+#define BOOT_COMPRESSED_PAGETABLE_H
+
+#define TRAMPOLINE_32BIT_SIZE		(2 * PAGE_SIZE)
+
+#ifndef __ASSEMBLER__
+
+extern unsigned long *trampoline_32bit;
+
+#endif /* __ASSEMBLER__ */
+#endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 45c76eff2718..21d5cc1cd5fa 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -1,4 +1,5 @@
 #include <asm/processor.h>
+#include "pgtable.h"
 
 /*
  * __force_order is used by special_insns.h asm code to force instruction
@@ -9,14 +10,27 @@
  */
 unsigned long __force_order;
 
+#define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
+#define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
+
 struct paging_config {
 	unsigned long trampoline_start;
 	unsigned long l5_required;
 };
 
+/*
+ * Trampoline address will be printed by extract_kernel() for debugging
+ * purposes.
+ *
+ * Avoid putting the pointer into .bss as it will be cleared between
+ * paging_prepare() and extract_kernel().
+ */
+unsigned long *trampoline_32bit __section(.data);
+
 struct paging_config paging_prepare(void)
 {
 	struct paging_config paging_config = {};
+	unsigned long bios_start, ebda_start;
 
 	/*
 	 * Check if LA57 is desired and supported.
@@ -35,5 +49,25 @@ struct paging_config paging_prepare(void)
 		paging_config.l5_required = 1;
 	}
 
+	/*
+	 * Find a suitable spot for the trampoline.
+	 * This code is based on reserve_bios_regions().
+	 */
+
+	ebda_start = *(unsigned short *)0x40e << 4;
+	bios_start = *(unsigned short *)0x413 << 10;
+
+	if (bios_start < BIOS_START_MIN || bios_start > BIOS_START_MAX)
+		bios_start = BIOS_START_MAX;
+
+	if (ebda_start > BIOS_START_MIN && ebda_start < bios_start)
+		bios_start = ebda_start;
+
+	/* Place the trampoline just below the end of low memory, aligned to 4k */
+	paging_config.trampoline_start = bios_start - TRAMPOLINE_32BIT_SIZE;
+	paging_config.trampoline_start = round_down(paging_config.trampoline_start, PAGE_SIZE);
+
+	trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
+
 	return paging_config;
 }
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/5] x86/boot/compressed/64: Save and restore trampoline memory
  2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 1/5] x86/boot/compressed/64: Describe the logic behind LA57 check Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline Kirill A. Shutemov
@ 2018-02-26 18:04 ` Kirill A. Shutemov
  2018-03-12  9:29   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 4/5] x86/boot/compressed/64: Set up " Kirill A. Shutemov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 18:04 UTC (permalink / raw)
  To: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel, Kirill A. Shutemov

The memory area we found for the trampoline shouldn't contain anything
useful, but let's preserve the data anyway, just to be on the safe side.

paging_prepare() saves the data into a buffer.

cleanup_trampoline() restores it once we are done with the trampoline.
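
The save/restore pairing can be sketched in isolation as follows. The
helper names save_trampoline() and restore_trampoline() are hypothetical;
in the patch the two memcpy() calls live directly in paging_prepare() and
cleanup_trampoline().

```c
#include <string.h>

#define SKETCH_TRAMPOLINE_SIZE	(2 * 4096)

/* Buffer that holds the original contents of the borrowed area. */
static unsigned char trampoline_save[SKETCH_TRAMPOLINE_SIZE];

/* Hypothetical helper: snapshot the area before it is overwritten. */
static void save_trampoline(const void *area)
{
	memcpy(trampoline_save, area, SKETCH_TRAMPOLINE_SIZE);
}

/* Hypothetical helper: put the original contents back afterwards. */
static void restore_trampoline(void *area)
{
	memcpy(area, trampoline_save, SKETCH_TRAMPOLINE_SIZE);
}
```

The buffer is exactly as large as the borrowed area, so the restore brings
back every byte regardless of how much of the area the trampoline used.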

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S    | 10 ++++++++++
 arch/x86/boot/compressed/pgtable_64.c | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d598d65db32c..8ba0582c65d5 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -355,6 +355,16 @@ ENTRY(startup_64)
 	lretq
 lvl5:
 
+	/*
+	 * cleanup_trampoline() would restore trampoline memory.
+	 *
+	 * RSI holds real mode data and needs to be preserved across
+	 * this function call.
+	 */
+	pushq	%rsi
+	call	cleanup_trampoline
+	popq	%rsi
+
 	/* Zero EFLAGS */
 	pushq	$0
 	popfq
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 21d5cc1cd5fa..01d08d3e3e43 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -1,5 +1,6 @@
 #include <asm/processor.h>
 #include "pgtable.h"
+#include "../string.h"
 
 /*
  * __force_order is used by special_insns.h asm code to force instruction
@@ -18,6 +19,9 @@ struct paging_config {
 	unsigned long l5_required;
 };
 
+/* Buffer to preserve trampoline memory */
+static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
+
 /*
  * Trampoline address will be printed by extract_kernel() for debugging
  * purposes.
@@ -69,5 +73,14 @@ struct paging_config paging_prepare(void)
 
 	trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
 
+	/* Preserve trampoline memory */
+	memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
+
 	return paging_config;
 }
+
+void cleanup_trampoline(void)
+{
+	/* Restore trampoline memory */
+	memcpy(trampoline_32bit, trampoline_save, TRAMPOLINE_32BIT_SIZE);
+}
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/5] x86/boot/compressed/64: Set up trampoline memory
  2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2018-02-26 18:04 ` [PATCH 3/5] x86/boot/compressed/64: Save and restore trampoline memory Kirill A. Shutemov
@ 2018-02-26 18:04 ` Kirill A. Shutemov
  2018-03-12  9:29   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
  2018-02-26 18:04 ` [PATCH 5/5] x86/boot/compressed/64: Prepare new top-level page table for trampoline Kirill A. Shutemov
  2018-02-26 19:32 ` [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Borislav Petkov
  5 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 18:04 UTC (permalink / raw)
  To: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel, Kirill A. Shutemov

This patch clears the trampoline memory and copies the trampoline code into
place. It's not used yet, though.
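
A sketch of the resulting layout, under the assumption that the area is
addressed through an unsigned long pointer as in the patch:
setup_trampoline() and SKETCH_PAGE_SIZE are illustrative names only. The
point it shows is that byte offsets have to be divided by
sizeof(unsigned long) before being added to such a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE		4096
#define TRAMPOLINE_32BIT_SIZE		(2 * SKETCH_PAGE_SIZE)
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
#define TRAMPOLINE_32BIT_CODE_OFFSET	SKETCH_PAGE_SIZE
#define TRAMPOLINE_32BIT_CODE_SIZE	0x60

/*
 * Hypothetical helper. Layout of the two-page area:
 *   first page  - top-level page table (left zeroed here)
 *   second page - trampoline code at its start, stack growing down
 *                 from the end of the area
 */
static void setup_trampoline(unsigned long *trampoline,
			     const void *code, size_t code_len)
{
	assert(code_len <= TRAMPOLINE_32BIT_CODE_SIZE);

	/* Clear the whole area first. */
	memset(trampoline, 0, TRAMPOLINE_32BIT_SIZE);

	/* Convert the byte offset into a count of unsigned long elements. */
	memcpy(trampoline + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
	       code, code_len);
}
```

Clearing before copying means everything outside the 0x60 bytes of code
reads as zero, including the page that later holds the page table.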

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S    | 3 ++-
 arch/x86/boot/compressed/pgtable.h    | 9 +++++++++
 arch/x86/boot/compressed/pgtable_64.c | 7 +++++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 8ba0582c65d5..c813cb004056 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -501,8 +501,9 @@ relocated:
 	jmp	*%rax
 
 	.code32
+ENTRY(trampoline_32bit_src)
 compatible_mode:
-	/* Setup data and stack segments */
+	/* Set up data and stack segments */
 	movl	$__KERNEL_DS, %eax
 	movl	%eax, %ds
 	movl	%eax, %ss
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 57722a2fe2a0..91f75638f6e6 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -3,9 +3,18 @@
 
 #define TRAMPOLINE_32BIT_SIZE		(2 * PAGE_SIZE)
 
+#define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
+
+#define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x60
+
+#define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
+
 #ifndef __ASSEMBLER__
 
 extern unsigned long *trampoline_32bit;
 
+extern void trampoline_32bit_src(void *return_ptr);
+
 #endif /* __ASSEMBLER__ */
 #endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 01d08d3e3e43..810c2c32d98e 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -76,6 +76,13 @@ struct paging_config paging_prepare(void)
 	/* Preserve trampoline memory */
 	memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
 
+	/* Clear trampoline memory first */
+	memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
+
+	/* Copy trampoline code in place */
+	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
+			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
+
 	return paging_config;
 }
 
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/5] x86/boot/compressed/64: Prepare new top-level page table for trampoline
  2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2018-02-26 18:04 ` [PATCH 4/5] x86/boot/compressed/64: Set up " Kirill A. Shutemov
@ 2018-02-26 18:04 ` Kirill A. Shutemov
  2018-03-12  9:30   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
  2018-02-26 19:32 ` [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Borislav Petkov
  5 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 18:04 UTC (permalink / raw)
  To: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin, Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel, Kirill A. Shutemov

If the trampoline code needs to switch between 4- and 5-level paging
modes, we have to use a page table in trampoline memory.

Having it in trampoline memory guarantees that it's below 4G and we can
point CR3 to it from 32-bit trampoline code.

We only use the page table if the desired paging mode doesn't match the
mode we are in. Otherwise the page table is unused and the trampoline code
doesn't touch CR3.

For the 4- to 5-level paging transition, we set up the current (4-level
paging) CR3 as the first and only entry in a new top-level page table.

For the 5- to 4-level paging transition, we copy the page table pointed to
by the first entry in the current top-level page table and use it as our
new top-level page table.

If the page table is used by the trampoline, we need to copy it to a new
page table outside trampoline memory and update CR3 before restoring
trampoline memory.
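
The two transitions and the later relocation can be modelled with ordinary
arrays. This is a simplified sketch, not the patch: struct sketch_table,
prepare_top_level(), relocate_if_in_trampoline() and SKETCH_TABLE_FLAGS are
all hypothetical, host pointers stand in for physical addresses, and the
flag value is a placeholder for whatever _PAGE_TABLE_NOENC expands to.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_PAGE_MASK	(~(uintptr_t)(SKETCH_PAGE_SIZE - 1))
#define SKETCH_ENTRIES		(SKETCH_PAGE_SIZE / sizeof(uintptr_t))
#define SKETCH_TABLE_FLAGS	((uintptr_t)0x67)	/* placeholder flag bits */

/* One page-aligned page table; entries hold an address plus flag bits. */
struct sketch_table {
	_Alignas(4096) uintptr_t entry[SKETCH_ENTRIES];
};

/* Fill the (already zeroed) trampoline table if a mode switch is needed. */
static void prepare_top_level(struct sketch_table *tramp,
			      const struct sketch_table *cur_top,
			      int l5_required, int l5_enabled)
{
	/* Already in the desired mode: the table stays unused. */
	if (!!l5_required == !!l5_enabled)
		return;

	if (l5_required) {
		/* 4 -> 5: the current top level becomes the only entry. */
		tramp->entry[0] = (uintptr_t)cur_top | SKETCH_TABLE_FLAGS;
	} else {
		/* 5 -> 4: copy the table behind the first entry. */
		const void *src =
			(const void *)(cur_top->entry[0] & SKETCH_PAGE_MASK);
		memcpy(tramp, src, sizeof(*tramp));
	}
}

/* Move the table out of trampoline memory if it is the active one. */
static const struct sketch_table *
relocate_if_in_trampoline(const struct sketch_table *active,
			  const struct sketch_table *tramp,
			  struct sketch_table *safe_copy)
{
	if (active != tramp)
		return active;
	memcpy(safe_copy, tramp, sizeof(*safe_copy));
	return safe_copy;
}
```

The table is copied rather than referenced in the 5- to 4-level case
because the original may live above 4G, where 32-bit code cannot point CR3
at it. The value returned by relocate_if_in_trampoline() plays the role of
the new CR3 that must be loaded before trampoline memory is restored.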

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/pgtable_64.c | 61 +++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 810c2c32d98e..32af1cbcd903 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -22,6 +22,14 @@ struct paging_config {
 /* Buffer to preserve trampoline memory */
 static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
 
+/*
+ * The page table is going to be used instead of page table in the trampoline
+ * memory.
+ *
+ * It must not be in BSS as BSS is cleared after cleanup_trampoline().
+ */
+static char top_pgtable[PAGE_SIZE] __aligned(PAGE_SIZE) __section(.data);
+
 /*
  * Trampoline address will be printed by extract_kernel() for debugging
  * purposes.
@@ -83,11 +91,64 @@ struct paging_config paging_prepare(void)
 	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
 			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
 
+	/*
+	 * The code below prepares page table in trampoline memory.
+	 *
+	 * The new page table will be used by trampoline code for switching
+	 * from 4- to 5-level paging or vice versa.
+	 *
+	 * If switching is not required, the page table is unused: trampoline
+	 * code wouldn't touch CR3.
+	 */
+
+	/*
+	 * We are not going to use the page table in trampoline memory if we
+	 * are already in the desired paging mode.
+	 */
+	if (paging_config.l5_required == !!(native_read_cr4() & X86_CR4_LA57))
+		goto out;
+
+	if (paging_config.l5_required) {
+		/*
+		 * For 4- to 5-level paging transition, set up current CR3 as
+		 * the first and the only entry in a new top-level page table.
+		 */
+		trampoline_32bit[TRAMPOLINE_32BIT_PGTABLE_OFFSET] = __native_read_cr3() | _PAGE_TABLE_NOENC;
+	} else {
+		unsigned long src;
+
+		/*
+		 * For 5- to 4-level paging transition, copy page table pointed
+		 * by first entry in the current top-level page table as our
+		 * new top-level page table.
+		 *
+		 * We cannot just point to the page table from trampoline as it
+		 * may be above 4G.
+		 */
+		src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
+		memcpy(trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long),
+		       (void *)src, PAGE_SIZE);
+	}
+
+out:
 	return paging_config;
 }
 
 void cleanup_trampoline(void)
 {
+	void *trampoline_pgtable;
+
+	trampoline_pgtable = trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET;
+
+	/*
+	 * Move the top level page table out of trampoline memory,
+	 * if it's there.
+	 */
+	if ((void *)__native_read_cr3() == trampoline_pgtable) {
+		memcpy(top_pgtable, trampoline_pgtable, PAGE_SIZE);
+		native_write_cr3((unsigned long)top_pgtable);
+	}
+
 	/* Restore trampoline memory */
 	memcpy(trampoline_32bit, trampoline_save, TRAMPOLINE_32BIT_SIZE);
 }
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory
  2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2018-02-26 18:04 ` [PATCH 5/5] x86/boot/compressed/64: Prepare new top-level page table for trampoline Kirill A. Shutemov
@ 2018-02-26 19:32 ` Borislav Petkov
  2018-02-26 20:55   ` Kirill A. Shutemov
  5 siblings, 1 reply; 28+ messages in thread
From: Borislav Petkov @ 2018-02-26 19:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin,
	Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel

On Mon, Feb 26, 2018 at 09:04:46PM +0300, Kirill A. Shutemov wrote:
> Borislav, could you check which patch breaks boot for you (if any)?

What is that ontop? tip/master from today or?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory
  2018-02-26 19:32 ` [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Borislav Petkov
@ 2018-02-26 20:55   ` Kirill A. Shutemov
  2018-02-27  9:32     ` Borislav Petkov
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-26 20:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Ingo Molnar, x86, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov,
	Andi Kleen, Matthew Wilcox, linux-mm, linux-kernel

On Mon, Feb 26, 2018 at 08:32:44PM +0100, Borislav Petkov wrote:
> On Mon, Feb 26, 2018 at 09:04:46PM +0300, Kirill A. Shutemov wrote:
> > Borislav, could you check which patch breaks boot for you (if any)?
> 
> What is that ontop? tip/master from today or?

I made it on top of tip/x86/mm, but tip/master should be fine too.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline
  2018-02-26 18:04 ` [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline Kirill A. Shutemov
@ 2018-02-26 22:30   ` Borislav Petkov
  2018-02-27  8:14     ` Kirill A. Shutemov
  2018-03-12  9:28   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
  1 sibling, 1 reply; 28+ messages in thread
From: Borislav Petkov @ 2018-02-26 22:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, x86, Thomas Gleixner, H. Peter Anvin,
	Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov, Andi Kleen,
	Matthew Wilcox, linux-mm, linux-kernel

On Mon, Feb 26, 2018 at 09:04:48PM +0300, Kirill A. Shutemov wrote:
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -0,0 +1,11 @@
> +#ifndef BOOT_COMPRESSED_PAGETABLE_H
> +#define BOOT_COMPRESSED_PAGETABLE_H
> +
> +#define TRAMPOLINE_32BIT_SIZE		(2 * PAGE_SIZE)
> +
> +#ifndef __ASSEMBLER__

x86 uses __ASSEMBLY__ everywhere and I see

arch/x86/boot/compressed/Makefile:41:KBUILD_AFLAGS  := $(KBUILD_CFLAGS) -D__ASSEMBLY__

so it should work here too.

Even though __ASSEMBLER__ is gcc predefined.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline
  2018-02-26 22:30   ` Borislav Petkov
@ 2018-02-27  8:14     ` Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-02-27  8:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Ingo Molnar, x86, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov,
	Andi Kleen, Matthew Wilcox, linux-mm, linux-kernel

On Mon, Feb 26, 2018 at 11:30:38PM +0100, Borislav Petkov wrote:
> On Mon, Feb 26, 2018 at 09:04:48PM +0300, Kirill A. Shutemov wrote:
> > +++ b/arch/x86/boot/compressed/pgtable.h
> > @@ -0,0 +1,11 @@
> > +#ifndef BOOT_COMPRESSED_PAGETABLE_H
> > +#define BOOT_COMPRESSED_PAGETABLE_H
> > +
> > +#define TRAMPOLINE_32BIT_SIZE		(2 * PAGE_SIZE)
> > +
> > +#ifndef __ASSEMBLER__
> 
> x86 uses __ASSEMBLY__ everywhere and I see
> 
> arch/x86/boot/compressed/Makefile:41:KBUILD_AFLAGS  := $(KBUILD_CFLAGS) -D__ASSEMBLY__
> 
> so it should work here too.
> 
> Even though __ASSEMBLER__ is gcc predefined.

Okay, I'll fix this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory
  2018-02-26 20:55   ` Kirill A. Shutemov
@ 2018-02-27  9:32     ` Borislav Petkov
  0 siblings, 0 replies; 28+ messages in thread
From: Borislav Petkov @ 2018-02-27  9:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, x86, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds, Andy Lutomirski, Cyrill Gorcunov,
	Andi Kleen, Matthew Wilcox, linux-mm, linux-kernel

On Mon, Feb 26, 2018 at 11:55:27PM +0300, Kirill A. Shutemov wrote:
> On Mon, Feb 26, 2018 at 08:32:44PM +0100, Borislav Petkov wrote:
> > On Mon, Feb 26, 2018 at 09:04:46PM +0300, Kirill A. Shutemov wrote:
> > > Borislav, could you check which patch breaks boot for you (if any)?
> > 
> > What is that ontop? tip/master from today or?
> 
> I made it on top of tip/x86/mm, but tip/master should be fine too.

Ok, those 5 look good ontop of tip/master from last night.

I did a clean build and guest boot with each one applied in succession
just to make sure there's no funky business from the build system.

Tested-by: Borislav Petkov <bp@suse.de>

Thx.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-02-26 18:04 ` [PATCH 1/5] x86/boot/compressed/64: Describe the logic behind LA57 check Kirill A. Shutemov
@ 2018-03-12  9:27   ` tip-bot for Kirill A. Shutemov
  2018-03-12 12:40     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2018-03-12  9:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: kirill.shutemov, gorcunov, keescook, luto, willy, torvalds,
	andy.shevchenko, bp, tglx, linux-kernel, mingo, hpa, ebiederm,
	peterz, jgross

Commit-ID:  a403d798182f4f7be5e9bab56cfa37e9828fd92a
Gitweb:     https://git.kernel.org/tip/a403d798182f4f7be5e9bab56cfa37e9828fd92a
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Mon, 26 Feb 2018 21:04:47 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 12 Mar 2018 09:29:24 +0100

x86/boot/compressed/64: Describe the logic behind the LA57 check

The patch explains the LA57 check in more detail.

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/boot/compressed/pgtable_64.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 3f1697fcc7a8..45c76eff2718 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -18,10 +18,22 @@ struct paging_config paging_prepare(void)
 {
 	struct paging_config paging_config = {};
 
-	/* Check if LA57 is desired and supported */
-	if (IS_ENABLED(CONFIG_X86_5LEVEL) && native_cpuid_eax(0) >= 7 &&
-			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
+	/*
+	 * Check if LA57 is desired and supported.
+	 *
+	 * There are two parts to the check:
+	 *   - if the kernel supports 5-level paging: CONFIG_X86_5LEVEL=y
+	 *   - if the machine supports 5-level paging:
+	 *     + CPUID leaf 7 is supported
+	 *     + the leaf has the feature bit set
+	 *
+	 * That's a substitute for boot_cpu_has() in early boot code.
+	 */
+	if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
+			native_cpuid_eax(0) >= 7 &&
+			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
 		paging_config.l5_required = 1;
+	}
 
 	return paging_config;
 }

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [tip:x86/mm] x86/boot/compressed/64: Find a place for 32-bit trampoline
  2018-02-26 18:04 ` [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline Kirill A. Shutemov
  2018-02-26 22:30   ` Borislav Petkov
@ 2018-03-12  9:28   ` tip-bot for Kirill A. Shutemov
  1 sibling, 0 replies; 28+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2018-03-12  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, ebiederm, bp, gorcunov, willy, peterz, linux-kernel,
	torvalds, mingo, andy.shevchenko, keescook, tglx,
	kirill.shutemov, jgross, hpa

Commit-ID:  3548e131ec6a82208f36e68d31947b0fe244c7a7
Gitweb:     https://git.kernel.org/tip/3548e131ec6a82208f36e68d31947b0fe244c7a7
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Mon, 26 Feb 2018 21:04:48 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 12 Mar 2018 09:37:23 +0100

x86/boot/compressed/64: Find a place for 32-bit trampoline

If a bootloader enables 64-bit mode with 4-level paging, we might need to
switch over to 5-level paging. Switching requires disabling paging, which
works fine if the kernel itself is loaded below 4G.

But if the bootloader puts the kernel above 4G (not sure if anybody does
this), we would lose control as soon as paging is disabled, because the
code becomes unreachable to the CPU.

To handle the situation, we need a trampoline in lower memory that would
take care of switching on 5-level paging.

This patch finds a spot in low memory for a trampoline.

The heuristic is based on code in reserve_bios_regions().

We find the end of low memory based on the BIOS and EBDA start addresses.
The trampoline is placed just before the end of low memory. This mimics the
approach taken to allocate memory for the real-mode trampoline.

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/boot/compressed/misc.c       |  6 ++++++
 arch/x86/boot/compressed/pgtable.h    | 11 +++++++++++
 arch/x86/boot/compressed/pgtable_64.c | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index b50c42455e25..8e4b55dd5df9 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -14,6 +14,7 @@
 
 #include "misc.h"
 #include "error.h"
+#include "pgtable.h"
 #include "../string.h"
 #include "../voffset.h"
 
@@ -372,6 +373,11 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	debug_putaddr(output_len);
 	debug_putaddr(kernel_total_size);
 
+#ifdef CONFIG_X86_64
+	/* Report address of 32-bit trampoline */
+	debug_putaddr(trampoline_32bit);
+#endif
+
 	/*
 	 * The memory hole needed for the kernel is the larger of either
 	 * the entire decompressed kernel plus relocation table, or the
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
new file mode 100644
index 000000000000..57722a2fe2a0
--- /dev/null
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -0,0 +1,11 @@
+#ifndef BOOT_COMPRESSED_PAGETABLE_H
+#define BOOT_COMPRESSED_PAGETABLE_H
+
+#define TRAMPOLINE_32BIT_SIZE		(2 * PAGE_SIZE)
+
+#ifndef __ASSEMBLER__
+
+extern unsigned long *trampoline_32bit;
+
+#endif /* __ASSEMBLER__ */
+#endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 45c76eff2718..21d5cc1cd5fa 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -1,4 +1,5 @@
 #include <asm/processor.h>
+#include "pgtable.h"
 
 /*
  * __force_order is used by special_insns.h asm code to force instruction
@@ -9,14 +10,27 @@
  */
 unsigned long __force_order;
 
+#define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
+#define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
+
 struct paging_config {
 	unsigned long trampoline_start;
 	unsigned long l5_required;
 };
 
+/*
+ * Trampoline address will be printed by extract_kernel() for debugging
+ * purposes.
+ *
+ * Avoid putting the pointer into .bss as it will be cleared between
+ * paging_prepare() and extract_kernel().
+ */
+unsigned long *trampoline_32bit __section(.data);
+
 struct paging_config paging_prepare(void)
 {
 	struct paging_config paging_config = {};
+	unsigned long bios_start, ebda_start;
 
 	/*
 	 * Check if LA57 is desired and supported.
@@ -35,5 +49,25 @@ struct paging_config paging_prepare(void)
 		paging_config.l5_required = 1;
 	}
 
+	/*
+	 * Find a suitable spot for the trampoline.
+	 * This code is based on reserve_bios_regions().
+	 */
+
+	ebda_start = *(unsigned short *)0x40e << 4;
+	bios_start = *(unsigned short *)0x413 << 10;
+
+	if (bios_start < BIOS_START_MIN || bios_start > BIOS_START_MAX)
+		bios_start = BIOS_START_MAX;
+
+	if (ebda_start > BIOS_START_MIN && ebda_start < bios_start)
+		bios_start = ebda_start;
+
+	/* Place the trampoline just below the end of low memory, aligned to 4k */
+	paging_config.trampoline_start = bios_start - TRAMPOLINE_32BIT_SIZE;
+	paging_config.trampoline_start = round_down(paging_config.trampoline_start, PAGE_SIZE);
+
+	trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
+
 	return paging_config;
 }

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [tip:x86/mm] x86/boot/compressed/64: Save and restore trampoline memory
  2018-02-26 18:04 ` [PATCH 3/5] x86/boot/compressed/64: Save and restore trampoline memory Kirill A. Shutemov
@ 2018-03-12  9:29   ` tip-bot for Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2018-03-12  9:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: gorcunov, keescook, mingo, willy, andy.shevchenko, luto,
	kirill.shutemov, ebiederm, peterz, jgross, torvalds, tglx, bp,
	hpa, linux-kernel

Commit-ID:  fb5268354d20b82c12569e325b0d051c09f983f7
Gitweb:     https://git.kernel.org/tip/fb5268354d20b82c12569e325b0d051c09f983f7
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Mon, 26 Feb 2018 21:04:49 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 12 Mar 2018 09:37:25 +0100

x86/boot/compressed/64: Save and restore trampoline memory

The memory area we found for the trampoline shouldn't contain anything
useful, but let's preserve the data anyway, just to be on the safe side.

paging_prepare() saves the data into a buffer.

cleanup_trampoline() restores it once we are done with the
trampoline.
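
The save/restore pairing can be sketched in user space as follows. This
is an illustration only: low_mem is an ordinary array standing in for
the low-memory area, and save_trampoline(), restore_trampoline() and
trampoline_roundtrip() are hypothetical names, not functions from the
patch.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE		4096
#define TRAMPOLINE_32BIT_SIZE	(2 * PAGE_SIZE)

/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];

/* Ordinary array standing in for the low-memory area (an assumption) */
static char low_mem[TRAMPOLINE_32BIT_SIZE];

static void save_trampoline(const void *area)
{
	assert(area != NULL);
	memcpy(trampoline_save, area, TRAMPOLINE_32BIT_SIZE);
}

static void restore_trampoline(void *area)
{
	assert(area != NULL);
	memcpy(area, trampoline_save, TRAMPOLINE_32BIT_SIZE);
}

/* Returns 1 if the original contents survive use of the area */
static int trampoline_roundtrip(char fill)
{
	size_t i;

	memset(low_mem, fill, sizeof(low_mem));
	save_trampoline(low_mem);

	/* Trampoline setup clobbers the area */
	memset(low_mem, 0, sizeof(low_mem));

	restore_trampoline(low_mem);
	for (i = 0; i < sizeof(low_mem); i++)
		if (low_mem[i] != fill)
			return 0;
	return 1;
}
```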

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/boot/compressed/head_64.S    | 10 ++++++++++
 arch/x86/boot/compressed/pgtable_64.c | 13 +++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d598d65db32c..8ba0582c65d5 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -355,6 +355,16 @@ ENTRY(startup_64)
 	lretq
 lvl5:
 
+	/*
+	 * cleanup_trampoline() would restore trampoline memory.
+	 *
+	 * RSI holds real mode data and needs to be preserved across
+	 * this function call.
+	 */
+	pushq	%rsi
+	call	cleanup_trampoline
+	popq	%rsi
+
 	/* Zero EFLAGS */
 	pushq	$0
 	popfq
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 21d5cc1cd5fa..01d08d3e3e43 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -1,5 +1,6 @@
 #include <asm/processor.h>
 #include "pgtable.h"
+#include "../string.h"
 
 /*
  * __force_order is used by special_insns.h asm code to force instruction
@@ -18,6 +19,9 @@ struct paging_config {
 	unsigned long l5_required;
 };
 
+/* Buffer to preserve trampoline memory */
+static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
+
 /*
  * Trampoline address will be printed by extract_kernel() for debugging
  * purposes.
@@ -69,5 +73,14 @@ struct paging_config paging_prepare(void)
 
 	trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
 
+	/* Preserve trampoline memory */
+	memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
+
 	return paging_config;
 }
+
+void cleanup_trampoline(void)
+{
+	/* Restore trampoline memory */
+	memcpy(trampoline_32bit, trampoline_save, TRAMPOLINE_32BIT_SIZE);
+}

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [tip:x86/mm] x86/boot/compressed/64: Set up trampoline memory
  2018-02-26 18:04 ` [PATCH 4/5] x86/boot/compressed/64: Set up " Kirill A. Shutemov
@ 2018-03-12  9:29   ` tip-bot for Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2018-03-12  9:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, torvalds, willy, jgross, hpa, keescook, ebiederm,
	andy.shevchenko, bp, luto, mingo, gorcunov, tglx, linux-kernel,
	kirill.shutemov

Commit-ID:  32fcefa2bfc8961987e91d1daeb00624b4176d2e
Gitweb:     https://git.kernel.org/tip/32fcefa2bfc8961987e91d1daeb00624b4176d2e
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Mon, 26 Feb 2018 21:04:50 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 12 Mar 2018 09:37:25 +0100

x86/boot/compressed/64: Set up trampoline memory

This patch clears trampoline memory and copies the trampoline code into
place. It's not used yet, though.
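
A user-space sketch of the resulting layout, assuming an ordinary buffer
in place of trampoline memory and a block of filler bytes in place of the
real trampoline code. setup_trampoline(), first_nonzero_offset() and
demo_code_offset() are hypothetical names. The point it illustrates is
that the *_OFFSET macros are byte offsets, so they have to be scaled when
added to an unsigned long pointer:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE				4096UL
#define TRAMPOLINE_32BIT_SIZE			(2 * PAGE_SIZE)
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET		0
#define TRAMPOLINE_32BIT_CODE_OFFSET		PAGE_SIZE
#define TRAMPOLINE_32BIT_CODE_SIZE		0x60
#define TRAMPOLINE_32BIT_STACK_END		TRAMPOLINE_32BIT_SIZE

/* Ordinary buffer standing in for trampoline memory (an assumption) */
static unsigned long trampoline_buf[TRAMPOLINE_32BIT_SIZE / sizeof(unsigned long)];

/* Clear the area, then copy the code to the start of the second page */
static void setup_trampoline(unsigned long *trampoline, const void *code)
{
	memset(trampoline, 0, TRAMPOLINE_32BIT_SIZE);
	/* The offset is in bytes, so scale it for unsigned long arithmetic */
	memcpy(trampoline + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
	       code, TRAMPOLINE_32BIT_CODE_SIZE);
}

/* Returns the byte offset of the first nonzero byte, or -1 if none */
static long first_nonzero_offset(const unsigned long *trampoline)
{
	const unsigned char *p = (const unsigned char *)trampoline;
	unsigned long i;

	for (i = 0; i < TRAMPOLINE_32BIT_SIZE; i++)
		if (p[i])
			return (long)i;
	return -1;
}

/* Copies filler "code" in place and reports where it landed */
static long demo_code_offset(void)
{
	unsigned char fake_code[TRAMPOLINE_32BIT_CODE_SIZE];

	memset(fake_code, 0x90, sizeof(fake_code));
	setup_trampoline(trampoline_buf, fake_code);
	/* The page table slot in the first page is still clear */
	assert(trampoline_buf[TRAMPOLINE_32BIT_PGTABLE_OFFSET] == 0);
	return first_nonzero_offset(trampoline_buf);
}
```

The first page is left for the page table, the code sits at the start of
the second page, and the stack grows down from the end of the area.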

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/boot/compressed/head_64.S    | 3 ++-
 arch/x86/boot/compressed/pgtable.h    | 9 +++++++++
 arch/x86/boot/compressed/pgtable_64.c | 7 +++++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 8ba0582c65d5..c813cb004056 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -501,8 +501,9 @@ relocated:
 	jmp	*%rax
 
 	.code32
+ENTRY(trampoline_32bit_src)
 compatible_mode:
-	/* Setup data and stack segments */
+	/* Set up data and stack segments */
 	movl	$__KERNEL_DS, %eax
 	movl	%eax, %ds
 	movl	%eax, %ss
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 57722a2fe2a0..91f75638f6e6 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -3,9 +3,18 @@
 
 #define TRAMPOLINE_32BIT_SIZE		(2 * PAGE_SIZE)
 
+#define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
+
+#define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x60
+
+#define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
+
 #ifndef __ASSEMBLER__
 
 extern unsigned long *trampoline_32bit;
 
+extern void trampoline_32bit_src(void *return_ptr);
+
 #endif /* __ASSEMBLER__ */
 #endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 01d08d3e3e43..810c2c32d98e 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -76,6 +76,13 @@ struct paging_config paging_prepare(void)
 	/* Preserve trampoline memory */
 	memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
 
+	/* Clear trampoline memory first */
+	memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
+
+	/* Copy trampoline code in place */
+	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
+			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
+
 	return paging_config;
 }
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [tip:x86/mm] x86/boot/compressed/64: Prepare new top-level page table for trampoline
  2018-02-26 18:04 ` [PATCH 5/5] x86/boot/compressed/64: Prepare new top-level page table for trampoline Kirill A. Shutemov
@ 2018-03-12  9:30   ` tip-bot for Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2018-03-12  9:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: gorcunov, ebiederm, peterz, mingo, jgross, tglx, hpa, willy,
	torvalds, keescook, kirill.shutemov, linux-kernel, bp, luto,
	andy.shevchenko

Commit-ID:  e9d0e6330eb81ca49bdd8849cc52b3b0f70ed5cb
Gitweb:     https://git.kernel.org/tip/e9d0e6330eb81ca49bdd8849cc52b3b0f70ed5cb
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Mon, 26 Feb 2018 21:04:51 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 12 Mar 2018 09:37:26 +0100

x86/boot/compressed/64: Prepare new top-level page table for trampoline

If the trampoline code needs to switch between 4- and 5-level paging
modes, we have to use a page table in trampoline memory.

Having it in trampoline memory guarantees that it's below 4G and we can
point CR3 to it from 32-bit trampoline code.

We only use the page table if the desired paging mode doesn't match the
mode we are in. Otherwise the page table is unused and trampoline code
wouldn't touch CR3.

For 4- to 5-level paging transition, we set up current (4-level paging)
CR3 as the first and the only entry in a new top-level page table.

For 5- to 4-level paging transition, we copy the page table pointed to by
the first entry in the current top-level page table as our new top-level
page table.

If the page table is used by the trampoline, we need to copy it to a new
page table outside trampoline memory and update CR3 before restoring
trampoline memory.
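
The two transitions can be modelled in user space roughly as below. This
is a sketch under stated assumptions: page tables are plain page-aligned
arrays, CR3 is modelled as the address of the current top-level table,
TABLE_FLAGS is an illustrative stand-in for _PAGE_TABLE_NOENC, and
prepare_pgtable() and demo_transitions() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(uint64_t)(PAGE_SIZE - 1))
#define ENTRIES		(PAGE_SIZE / sizeof(uint64_t))
/* Illustrative present|writable bits; a stand-in for _PAGE_TABLE_NOENC */
#define TABLE_FLAGS	0x3ULL

/* Page-aligned stand-ins for real page tables (assumptions of this model) */
static _Alignas(4096) uint64_t lower_table[ENTRIES];
static _Alignas(4096) uint64_t top_table[ENTRIES];
static _Alignas(4096) uint64_t trampoline_pgtable[ENTRIES];

static uint64_t table_addr(const uint64_t *table)
{
	return (uint64_t)(uintptr_t)table;
}

/* cr3 is modelled as the address of the current top-level table */
static void prepare_pgtable(uint64_t *pgtable, uint64_t cr3,
			    int l5_required, int la57_enabled)
{
	/* Model assumption: no flag bits in the low part of cr3 */
	assert((cr3 & ~PAGE_MASK) == 0);

	/* Already in the desired mode: the table stays unused */
	if (!!l5_required == !!la57_enabled)
		return;

	if (l5_required) {
		/* 4 -> 5: the old top level becomes the only entry */
		pgtable[0] = cr3 | TABLE_FLAGS;
	} else {
		/* 5 -> 4: copy the table behind the first top-level entry */
		const uint64_t *top = (const uint64_t *)(uintptr_t)cr3;
		uint64_t src = top[0] & PAGE_MASK;

		memcpy(pgtable, (const void *)(uintptr_t)src, PAGE_SIZE);
	}
}

/* Returns 1 if both transitions build the expected table */
static int demo_transitions(void)
{
	memset(trampoline_pgtable, 0, PAGE_SIZE);
	prepare_pgtable(trampoline_pgtable, table_addr(lower_table), 1, 0);
	if (trampoline_pgtable[0] != (table_addr(lower_table) | TABLE_FLAGS) ||
	    trampoline_pgtable[1] != 0)
		return 0;

	lower_table[0] = 0x1111;
	lower_table[ENTRIES - 1] = 0x2222;
	top_table[0] = table_addr(lower_table) | TABLE_FLAGS;
	memset(trampoline_pgtable, 0, PAGE_SIZE);
	prepare_pgtable(trampoline_pgtable, table_addr(top_table), 0, 1);
	return trampoline_pgtable[0] == 0x1111 &&
	       trampoline_pgtable[ENTRIES - 1] == 0x2222;
}
```

The copy in the 5- to 4-level case is what keeps the new top-level table
below 4G even when the original one is not.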

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/boot/compressed/pgtable_64.c | 61 +++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 810c2c32d98e..32af1cbcd903 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -22,6 +22,14 @@ struct paging_config {
 /* Buffer to preserve trampoline memory */
 static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
 
+/*
+ * The page table is going to be used instead of page table in the trampoline
+ * memory.
+ *
+ * It must not be in BSS as BSS is cleared after cleanup_trampoline().
+ */
+static char top_pgtable[PAGE_SIZE] __aligned(PAGE_SIZE) __section(.data);
+
 /*
  * Trampoline address will be printed by extract_kernel() for debugging
  * purposes.
@@ -83,11 +91,64 @@ struct paging_config paging_prepare(void)
 	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
 			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
 
+	/*
+	 * The code below prepares page table in trampoline memory.
+	 *
+	 * The new page table will be used by trampoline code for switching
+	 * from 4- to 5-level paging or vice versa.
+	 *
+	 * If switching is not required, the page table is unused: trampoline
+	 * code wouldn't touch CR3.
+	 */
+
+	/*
+	 * We are not going to use the page table in trampoline memory if we
+	 * are already in the desired paging mode.
+	 */
+	if (paging_config.l5_required == !!(native_read_cr4() & X86_CR4_LA57))
+		goto out;
+
+	if (paging_config.l5_required) {
+		/*
+		 * For 4- to 5-level paging transition, set up current CR3 as
+		 * the first and the only entry in a new top-level page table.
+		 */
+		trampoline_32bit[TRAMPOLINE_32BIT_PGTABLE_OFFSET] = __native_read_cr3() | _PAGE_TABLE_NOENC;
+	} else {
+		unsigned long src;
+
+		/*
+		 * For 5- to 4-level paging transition, copy page table pointed
+		 * by first entry in the current top-level page table as our
+		 * new top-level page table.
+		 *
+		 * We cannot just point to the page table from trampoline as it
+		 * may be above 4G.
+		 */
+		src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
+		memcpy(trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long),
+		       (void *)src, PAGE_SIZE);
+	}
+
+out:
 	return paging_config;
 }
 
 void cleanup_trampoline(void)
 {
+	void *trampoline_pgtable;
+
+	trampoline_pgtable = trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET;
+
+	/*
+	 * Move the top level page table out of trampoline memory,
+	 * if it's there.
+	 */
+	if ((void *)__native_read_cr3() == trampoline_pgtable) {
+		memcpy(top_pgtable, trampoline_pgtable, PAGE_SIZE);
+		native_write_cr3((unsigned long)top_pgtable);
+	}
+
 	/* Restore trampoline memory */
 	memcpy(trampoline_32bit, trampoline_save, TRAMPOLINE_32BIT_SIZE);
 }

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12  9:27   ` [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the " tip-bot for Kirill A. Shutemov
@ 2018-03-12 12:40     ` Peter Zijlstra
  2018-03-12 12:43       ` Kirill A. Shutemov
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2018-03-12 12:40 UTC (permalink / raw)
  To: kirill.shutemov, gorcunov, luto, keescook, willy, torvalds, tglx,
	bp, andy.shevchenko, linux-kernel, hpa, mingo, ebiederm, jgross
  Cc: linux-tip-commits

On Mon, Mar 12, 2018 at 02:27:58AM -0700, tip-bot for Kirill A. Shutemov wrote:
> +	/*
> +	 * Check if LA57 is desired and supported.
> +	 *
> +	 * There are two parts to the check:
> +	 *   - if the kernel supports 5-level paging: CONFIG_X86_5LEVEL=y
> +	 *   - if the machine supports 5-level paging:
> +	 *     + CPUID leaf 7 is supported
> +	 *     + the leaf has the feature bit set
> +	 *
> +	 * That's substitute for boot_cpu_has() in early boot code.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
> +			native_cpuid_eax(0) >= 7 &&
> +			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
>  		paging_config.l5_required = 1;
> +	}

Should this not also include something like: machine actually has
sufficient memory for it to make sense to use l5 ?
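
For reference, the quoted check can be restated as a pure function. This
is only a sketch: it assumes the caller supplies the values CPUID would
return (the maximum basic leaf from leaf 0 EAX, and ECX of leaf 7) rather
than executing the instruction, and l5_required() and
l5_required_examples() are hypothetical names. X86_FEATURE_LA57 & 31
works out to bit 16 of that ECX value.

```c
#include <assert.h>
#include <stdint.h>

/* X86_FEATURE_LA57 & 31: bit 16 of CPUID.(EAX=7,ECX=0):ECX */
#define LA57_ECX_BIT	16

/*
 * Hypothetical pure restatement of the check: the caller supplies the
 * values that CPUID would return instead of executing the instruction.
 */
static int l5_required(int config_5level, uint32_t max_basic_leaf,
		       uint32_t leaf7_ecx)
{
	return config_5level &&
	       max_basic_leaf >= 7 &&
	       (leaf7_ecx & (UINT32_C(1) << LA57_ECX_BIT)) != 0;
}

static void l5_required_examples(void)
{
	assert(l5_required(1, 0x16, 1u << 16) == 1);	/* all conditions met */
	assert(l5_required(0, 0x16, 1u << 16) == 0);	/* kernel built without 5-level */
	assert(l5_required(1, 6, 1u << 16) == 0);	/* CPUID leaf 7 not available */
	assert(l5_required(1, 0x16, 0) == 0);		/* feature bit clear */
}
```

Nothing in the check looks at the amount of installed memory.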

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 12:40     ` Peter Zijlstra
@ 2018-03-12 12:43       ` Kirill A. Shutemov
  2018-03-12 13:10         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-03-12 12:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kirill.shutemov, gorcunov, luto, keescook, willy, torvalds, tglx,
	bp, andy.shevchenko, linux-kernel, hpa, mingo, ebiederm, jgross,
	linux-tip-commits

On Mon, Mar 12, 2018 at 01:40:27PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 02:27:58AM -0700, tip-bot for Kirill A. Shutemov wrote:
> > +	/*
> > +	 * Check if LA57 is desired and supported.
> > +	 *
> > +	 * There are two parts to the check:
> > +	 *   - if the kernel supports 5-level paging: CONFIG_X86_5LEVEL=y
> > +	 *   - if the machine supports 5-level paging:
> > +	 *     + CPUID leaf 7 is supported
> > +	 *     + the leaf has the feature bit set
> > +	 *
> > +	 * That's substitute for boot_cpu_has() in early boot code.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
> > +			native_cpuid_eax(0) >= 7 &&
> > +			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
> >  		paging_config.l5_required = 1;
> > +	}
> 
> Should this not also include something like: machine actually has
> sufficient memory for it to make sense to use l5 ?

Define "sufficient". :)

The amount of physical memory is not the only reason to have 5-level
paging enabled. You may need 5-level paging to get access to wider virtual
address space to map something not backed by local physical memory
(consider RDMA).

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 12:43       ` Kirill A. Shutemov
@ 2018-03-12 13:10         ` Peter Zijlstra
  2018-03-12 14:04           ` Kirill A. Shutemov
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2018-03-12 13:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kirill.shutemov, gorcunov, luto, keescook, willy, torvalds, tglx,
	bp, andy.shevchenko, linux-kernel, hpa, mingo, ebiederm, jgross,
	linux-tip-commits

On Mon, Mar 12, 2018 at 03:43:37PM +0300, Kirill A. Shutemov wrote:
> On Mon, Mar 12, 2018 at 01:40:27PM +0100, Peter Zijlstra wrote:
> > On Mon, Mar 12, 2018 at 02:27:58AM -0700, tip-bot for Kirill A. Shutemov wrote:
> > > +	/*
> > > +	 * Check if LA57 is desired and supported.
> > > +	 *
> > > +	 * There are two parts to the check:
> > > +	 *   - if the kernel supports 5-level paging: CONFIG_X86_5LEVEL=y
> > > +	 *   - if the machine supports 5-level paging:
> > > +	 *     + CPUID leaf 7 is supported
> > > +	 *     + the leaf has the feature bit set
> > > +	 *
> > > +	 * That's substitute for boot_cpu_has() in early boot code.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
> > > +			native_cpuid_eax(0) >= 7 &&
> > > +			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
> > >  		paging_config.l5_required = 1;
> > > +	}
> > 
> > Should this not also include something like: machine actually has
> > sufficient memory for it to make sense to use l5 ?
> 
> Define "sufficient". :)
> 
> The amount of physical memory is not the only reason to have 5-level
> paging enabled. You may need 5-level paging to get access to wider virtual
> address space to map something not backed by local physical memory
> (consider RDMA).

Special needs can always use special knobs :-) But I was thinking
something like >2/3 46 bits or so switching to 5L. My main concern is
the increased worst case TLB miss cost on machines that really don't
need 5L paging (like my desktop, which I suspect will not exceed the
multi terabyte of memory class for a while yet).

We can of course bike shed / benchmark this once my desktop refresh
sports this feature, but ISTR this being one of the very first things
Ingo mentioned when we started this whole 5L thing.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 13:10         ` Peter Zijlstra
@ 2018-03-12 14:04           ` Kirill A. Shutemov
  2018-03-12 14:32             ` Ingo Molnar
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-03-12 14:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kirill.shutemov, gorcunov, luto, keescook, willy, torvalds, tglx,
	bp, andy.shevchenko, linux-kernel, hpa, mingo, ebiederm, jgross,
	linux-tip-commits

On Mon, Mar 12, 2018 at 02:10:55PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 03:43:37PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Mar 12, 2018 at 01:40:27PM +0100, Peter Zijlstra wrote:
> > > On Mon, Mar 12, 2018 at 02:27:58AM -0700, tip-bot for Kirill A. Shutemov wrote:
> > > > +	/*
> > > > +	 * Check if LA57 is desired and supported.
> > > > +	 *
> > > > +	 * There are two parts to the check:
> > > > +	 *   - if the kernel supports 5-level paging: CONFIG_X86_5LEVEL=y
> > > > +	 *   - if the machine supports 5-level paging:
> > > > +	 *     + CPUID leaf 7 is supported
> > > > +	 *     + the leaf has the feature bit set
> > > > +	 *
> > > > +	 * That's substitute for boot_cpu_has() in early boot code.
> > > > +	 */
> > > > +	if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
> > > > +			native_cpuid_eax(0) >= 7 &&
> > > > +			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
> > > >  		paging_config.l5_required = 1;
> > > > +	}
> > > 
> > > Should this not also include something like: machine actually has
> > > sufficient memory for it to make sense to use l5 ?
> > 
> > Define "sufficient". :)
> > 
> > The amount of physical memory is not the only reason to have 5-level
> > paging enabled. You may need 5-level paging to get access to wider virtual
> > address space to map something not backed by local physical memory
> > (consider RDMA).
> 
> Special needs can always use special knobs :-) But I was thinking
> something like >2/3 46 bits or so switching to 5L.

42TiB or so?

This basically means that 5-level paging will not get run on vast majority
of *capable* hardware. That's not good from testing POV.
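
The arithmetic behind that figure, as a small sketch. It assumes ">2/3 46
bits" means two thirds of a 46-bit (64TiB) physical address space;
two_thirds_of() and whole_tib() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define TIB	(UINT64_C(1) << 40)

/* Two thirds of an address space that is phys_bits wide, in bytes */
static uint64_t two_thirds_of(unsigned int phys_bits)
{
	assert(phys_bits < 63);		/* keep the multiplication in range */
	return (UINT64_C(1) << phys_bits) * 2 / 3;
}

/* Whole TiB contained in a byte count, rounded down */
static uint64_t whole_tib(uint64_t bytes)
{
	return bytes / TIB;
}
```

whole_tib(two_thirds_of(46)) comes out at 42.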

> My main concern is the increased worst case TLB miss cost on machines
> that really don't need 5L paging (like my desktop, which I suspect will
> not exceed the multi terabyte of memory class for a while yet).

The microarchitecture was adjusted to accommodate the increased TLB
pressure. You shouldn't see the difference unless you actively use
increased virtual address space.

> We can of course bike shed / benchmark this once my desktop refresh
> sports this feature, but ISTR this being one of the very first things
> Ingo mentioned when we started this whole 5L thing.

I would rather not fix the problem that may not actually exist. :)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 14:04           ` Kirill A. Shutemov
@ 2018-03-12 14:32             ` Ingo Molnar
  2018-03-12 14:50               ` Kirill A. Shutemov
  2018-03-12 14:52               ` Cyrill Gorcunov
  0 siblings, 2 replies; 28+ messages in thread
From: Ingo Molnar @ 2018-03-12 14:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Zijlstra, kirill.shutemov, gorcunov, luto, keescook, willy,
	torvalds, tglx, bp, andy.shevchenko, linux-kernel, hpa, ebiederm,
	jgross, linux-tip-commits


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> > We can of course bike shed / benchmark this once my desktop refresh
> > sports this feature, but ISTR this being one of the very first things
> > Ingo mentioned when we started this whole 5L thing.
> 
> I would rather not fix the problem that may not actually exist. :)

That 5 level pagetables involve more overhead is a real problem.

By default we should only enable 5-level paging if memory mappings exist in
the memory map that require the extended physical memory space.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 14:32             ` Ingo Molnar
@ 2018-03-12 14:50               ` Kirill A. Shutemov
  2018-03-12 16:42                 ` Linus Torvalds
  2018-03-12 14:52               ` Cyrill Gorcunov
  1 sibling, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2018-03-12 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Peter Zijlstra, gorcunov, luto, keescook,
	willy, torvalds, tglx, bp, andy.shevchenko, linux-kernel, hpa,
	ebiederm, jgross, linux-tip-commits, Dave Hansen

On Mon, Mar 12, 2018 at 02:32:12PM +0000, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > > We can of course bike shed / benchmark this once my desktop refresh
> > > sports this feature, but ISTR this being one of the very first things
> > > Ingo mentioned when we started this whole 5L thing.
> > 
> > I would rather not fix the problem that may not actually exist. :)
> 
> That 5 level pagetables involve more overhead is a real problem.

As I mentioned before, microarchitecture changes take care of the
additional overhead: the size of the intermediate TLB was increased, which
should make the difference between 4- and 5-level paging negligible.

> By default we should only enable 5-level paging if memory mappings exist in
> the memory map that require the extended physical memory space.

I disagree that we should decide usefulness of the 5-level paging based on
size of physical memory on the machine.

Consider use case when you have 100TiB database file. It's pretty
reasonable to mmap() such file at once even if you don't have 100TiB of
physical memory to back it up. 1/100 of the file size may still work
fairly well.

Virtual address space is useful on its own and we shouldn't take the
value from the user just because he doesn't have tens of terabytes of
memory.
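
For scale, a sketch of how much virtual address space each paging depth
covers. It assumes 4KiB pages and 9 address bits per page table level;
va_bits() and va_space_tib() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* 4 KiB pages */
#define BITS_PER_LEVEL	9	/* 512 entries per table */
#define TIB		(UINT64_C(1) << 40)

/* Width of a virtual address for a given number of paging levels */
static unsigned int va_bits(unsigned int levels)
{
	return PAGE_SHIFT + BITS_PER_LEVEL * levels;
}

/* Size in TiB of the whole virtual address space for a paging depth */
static uint64_t va_space_tib(unsigned int levels)
{
	unsigned int bits = va_bits(levels);

	assert(bits > 40 && bits < 64);
	return (UINT64_C(1) << bits) / TIB;
}
```

That is 48 bits (256TiB) with 4-level paging and 57 bits (128PiB) with
5-level paging, with user space conventionally getting half of each.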

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 14:32             ` Ingo Molnar
  2018-03-12 14:50               ` Kirill A. Shutemov
@ 2018-03-12 14:52               ` Cyrill Gorcunov
  1 sibling, 0 replies; 28+ messages in thread
From: Cyrill Gorcunov @ 2018-03-12 14:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Peter Zijlstra, kirill.shutemov, luto,
	keescook, willy, torvalds, tglx, bp, andy.shevchenko,
	linux-kernel, hpa, ebiederm, jgross, linux-tip-commits

On Mon, Mar 12, 2018 at 03:32:12PM +0100, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > > We can of course bike shed / benchmark this once my desktop refresh
> > > sports this feature, but ISTR this being one of the very first things
> > > Ingo mentioned when we started this whole 5L thing.
> > 
> > I would rather not fix the problem that may not actually exist. :)
> 
> That 5 level pagetables involve more overhead is a real problem.
> 
> By default we should only enable 5-level paging if memory mappings exist in
> the memory map that require the extended physical memory space.

Does it mean that if a machine supports 5lvl but has physical memory
installed fitting the 4lvl space, and has memory hotplug supported,
adding more memory won't have effect until next reboot?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 14:50               ` Kirill A. Shutemov
@ 2018-03-12 16:42                 ` Linus Torvalds
  2018-03-12 17:06                   ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2018-03-12 16:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Kirill A. Shutemov, Peter Zijlstra, Cyrill Gorcunov,
	Andy Lutomirski, Kees Cook, Matthew Wilcox, Thomas Gleixner,
	Borislav Petkov, Andy Shevchenko, Linux Kernel Mailing List,
	Peter Anvin, Eric W. Biederman, Jürgen Groß,
	linux-tip-commits, Dave Hansen

On Mon, Mar 12, 2018 at 7:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> I disagree that we should decide usefulness of the 5-level paging based on
> size of physical memory on the machine.
>
> Consider use case when you have 100TiB database file. It's pretty
> reasonable to mmap() such file at once even if you don't have 100TiB of
> physical memory to back it up. 1/100 of the file size may still work
> fairly well.
>
> Virtual address space is useful on its own and we shouldn't take the
> value from the user just because he doesn't have tens of terabytes of
> memory.

Absolutely.

Also, I'd suggest enabling 5-level paging as aggressively as possible
by default (ie whenever the hardware supports it), just for test
coverage.

Maybe in a year or two, when we actually have a fair amount of
coverage, we'll then say "ok, this just hurts normal workstations that
have the capability but only have ridiculously small fraction of
memory", and at that point say that unless you have a ton of RAM we'll
default to 4-level.

               Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 16:42                 ` Linus Torvalds
@ 2018-03-12 17:06                   ` Andy Lutomirski
  2018-03-12 17:12                     ` Linus Torvalds
  2018-03-12 17:21                     ` Dave Hansen
  0 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2018-03-12 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Ingo Molnar, Kirill A. Shutemov,
	Peter Zijlstra, Cyrill Gorcunov, Kees Cook, Matthew Wilcox,
	Thomas Gleixner, Borislav Petkov, Andy Shevchenko,
	Linux Kernel Mailing List, Peter Anvin, Eric W. Biederman,
	Jürgen Groß,
	linux-tip-commits, Dave Hansen

On Mon, Mar 12, 2018 at 4:42 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 12, 2018 at 7:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>>
>> I disagree that we should decide usefulness of the 5-level paging based on
>> size of physical memory on the machine.
>>
>> Consider use case when you have 100TiB database file. It's pretty
>> reasonable to mmap() such file at once even if you don't have 100TiB of
>> physical memory to back it up. 1/100 of the file size may still work
>> fairly well.
>>
>> Virtual address space is useful on its own and we shouldn't take the
>> value from the user just because he doesn't have tens of terabytes of
>> memory.
>
> Absolutely.
>
> Also, I'd suggest enabling 5-level paging as aggressively as possible
> by default (ie whenever the hardware supports it), just for test
> coverage.
>
> Maybe in a year or two, when we actually have a fair amount of
> coverage, we'll then say "ok, this just hurts normal workstations that
> have the capability but only have a ridiculously small fraction of
> memory", and at that point say that unless you have a ton of RAM we'll
> default to 4-level.
>

I'd be surprised if there's a noticeable performance hit on anything
except the micro-est of benchmarks.  We're talking one extra
intermediate paging structure cache entry in use, maybe a few data
cache lines, and (wild guess) 0 extra cycles on a TLB miss in the
normal case.  This is because the walks are almost never going to
start at the root.

The real hit will be the extra page table for every task.
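
To make the "one extra level" concrete, here is a minimal sketch of how a
virtual address splits into per-level table indices. It assumes 4 KiB
pages and 9 index bits per level (512 eight-byte entries per table); the
helper names va_bits() and table_index() are made up for illustration and
are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12  /* assumption: 4 KiB pages */
#define BITS_PER_LEVEL   9  /* 512 eight-byte entries per 4 KiB table */

/* Virtual address bits translated by a walk with the given level count. */
static unsigned int va_bits(unsigned int levels)
{
	assert(levels == 4 || levels == 5);
	return PAGE_SHIFT + BITS_PER_LEVEL * levels;
}

/*
 * Index used at a given level of the walk for address va.
 * Level 1 is the lowest table (the one pointing at the 4 KiB page);
 * level 5 only exists when 5-level paging is enabled.
 */
static unsigned int table_index(uint64_t va, unsigned int level)
{
	assert(level >= 1 && level <= 5);
	return (va >> (PAGE_SHIFT + BITS_PER_LEVEL * (level - 1))) & 0x1ff;
}
```

With these definitions va_bits(4) is 48 and va_bits(5) is 57, so the fifth
level contributes exactly one more 9-bit index (address bits 48..56) and
one more table to keep per address space.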

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 17:06                   ` Andy Lutomirski
@ 2018-03-12 17:12                     ` Linus Torvalds
  2018-03-12 17:41                       ` Ingo Molnar
  2018-03-12 17:21                     ` Dave Hansen
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2018-03-12 17:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Ingo Molnar, Kirill A. Shutemov,
	Peter Zijlstra, Cyrill Gorcunov, Kees Cook, Matthew Wilcox,
	Thomas Gleixner, Borislav Petkov, Andy Shevchenko,
	Linux Kernel Mailing List, Peter Anvin, Eric W. Biederman,
	Jürgen Groß,
	linux-tip-commits, Dave Hansen

On Mon, Mar 12, 2018 at 10:06 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> I'd be surprised if there's a noticeable performance hit on anything
> except the micro-est of benchmarks.  We're talking one extra
> intermediate paging structure cache entry in use, maybe a few data
> cache lines, and (wild guess) 0 extra cycles on a TLB miss in the
> normal case.  This is because the walks are almost never going to
> start at the root.

Probably. But VM people may disagree if they already have high TLB miss costs.

> The real hit will be the extra page table for every task.

.. and it's unclear how noticeable that might be. It's not like it's
per-thread, only per process, and very few people have so many
processes that a page per process matters.
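
As a rough back-of-the-envelope sketch of that cost, assuming exactly one
additional 4 KiB table page per process (the figure used in this thread,
not a measurement) and counting processes rather than threads, since
threads share one address space. extra_table_bytes() is a hypothetical
helper, not a kernel interface.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_PAGE_SIZE 4096ULL  /* assumption: one 4 KiB page per table */

/*
 * Extra memory spent on the additional top-level table, assuming one
 * extra table page per process (i.e. per address space, not per thread).
 */
static uint64_t extra_table_bytes(uint64_t nr_processes)
{
	/* Guard against overflow in the multiplication below. */
	assert(nr_processes <= UINT64_MAX / TABLE_PAGE_SIZE);
	return nr_processes * TABLE_PAGE_SIZE;
}
```

Under that assumption, 1000 processes cost 4096000 bytes, a little under
4 MiB in total.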

But regardless, I think we're better off with a "wait and see" approach.

IOW, try to use 5-level whenever possible for now, and _if_ somebody
actually can show that 4-level page tables perform better or have some
other advantage, we can then try to be clever later when it's all
tested and it's just an optimization, not a "that code won't even run
normally and gets basically zero coverage".

                Linus

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 17:06                   ` Andy Lutomirski
  2018-03-12 17:12                     ` Linus Torvalds
@ 2018-03-12 17:21                     ` Dave Hansen
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Hansen @ 2018-03-12 17:21 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Kirill A. Shutemov, Ingo Molnar, Kirill A. Shutemov,
	Peter Zijlstra, Cyrill Gorcunov, Kees Cook, Matthew Wilcox,
	Thomas Gleixner, Borislav Petkov, Andy Shevchenko,
	Linux Kernel Mailing List, Peter Anvin, Eric W. Biederman,
	Jürgen Groß,
	linux-tip-commits

On 03/12/2018 10:06 AM, Andy Lutomirski wrote:
> I'd be surprised if there's a noticeable performance hit on anything
> except the micro-est of benchmarks.  We're talking one extra
> intermediate paging structure cache entry in use, maybe a few data
> cache lines, and (wild guess) 0 extra cycles on a TLB miss in the
> normal case.  This is because the walks are almost never going to
> start at the root.

The hardware guys are keenly aware of the concerns about the extra
latency that the extra level might cause us.  I frankly expect that
we'll see the overhead in *software* via get_user_pages() and friends
before we ever see a practical bump in TLB fill latency.

I'm also super in favor of enabling LA57 everywhere that we can, up
front, and only selectively disabling it if it has real-world problems.
It makes our lives (as Intel software people) massively easier because
we don't have to go tell everyone how to turn it on in the first place
to test it.

* Re: [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the LA57 check
  2018-03-12 17:12                     ` Linus Torvalds
@ 2018-03-12 17:41                       ` Ingo Molnar
  0 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2018-03-12 17:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Cyrill Gorcunov, Kees Cook, Matthew Wilcox,
	Thomas Gleixner, Borislav Petkov, Andy Shevchenko,
	Linux Kernel Mailing List, Peter Anvin, Eric W. Biederman,
	Jürgen Groß,
	linux-tip-commits, Dave Hansen


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But regardless, I think we're better off with a "wait and see" approach.
> 
> IOW, try to use 5-level whenever possible for now, and _if_ somebody actually 
> can show that 4-level page tables perform better or have some other advantage, 
> we can then try to be clever later when it's all tested and it's just an 
> optimization, not a "that code won't even run normally and gets basically zero 
> coverage".

Ok, fair enough - and the testing argument makes sense as well.

Thanks,

	Ingo

Thread overview: 28+ messages
2018-02-26 18:04 [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Kirill A. Shutemov
2018-02-26 18:04 ` [PATCH 1/5] x86/boot/compressed/64: Describe the logic behind LA57 check Kirill A. Shutemov
2018-03-12  9:27   ` [tip:x86/mm] x86/boot/compressed/64: Describe the logic behind the " tip-bot for Kirill A. Shutemov
2018-03-12 12:40     ` Peter Zijlstra
2018-03-12 12:43       ` Kirill A. Shutemov
2018-03-12 13:10         ` Peter Zijlstra
2018-03-12 14:04           ` Kirill A. Shutemov
2018-03-12 14:32             ` Ingo Molnar
2018-03-12 14:50               ` Kirill A. Shutemov
2018-03-12 16:42                 ` Linus Torvalds
2018-03-12 17:06                   ` Andy Lutomirski
2018-03-12 17:12                     ` Linus Torvalds
2018-03-12 17:41                       ` Ingo Molnar
2018-03-12 17:21                     ` Dave Hansen
2018-03-12 14:52               ` Cyrill Gorcunov
2018-02-26 18:04 ` [PATCH 2/5] x86/boot/compressed/64: Find a place for 32-bit trampoline Kirill A. Shutemov
2018-02-26 22:30   ` Borislav Petkov
2018-02-27  8:14     ` Kirill A. Shutemov
2018-03-12  9:28   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
2018-02-26 18:04 ` [PATCH 3/5] x86/boot/compressed/64: Save and restore trampoline memory Kirill A. Shutemov
2018-03-12  9:29   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
2018-02-26 18:04 ` [PATCH 4/5] x86/boot/compressed/64: Set up " Kirill A. Shutemov
2018-03-12  9:29   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
2018-02-26 18:04 ` [PATCH 5/5] x86/boot/compressed/64: Prepare new top-level page table for trampoline Kirill A. Shutemov
2018-03-12  9:30   ` [tip:x86/mm] " tip-bot for Kirill A. Shutemov
2018-02-26 19:32 ` [PATCH 0/5] x86/boot/compressed/64: Prepare trampoline memory Borislav Petkov
2018-02-26 20:55   ` Kirill A. Shutemov
2018-02-27  9:32     ` Borislav Petkov
